Ray Boyce and I first met E.F. (Ted) Codd at a symposium he organized at the IBM T.J. Watson Research Center in Yorktown Heights, New York, in 1972. Ray and I were both recent hires at the Watson Center. I had recently completed my PhD at Stanford University, and Ray had completed his at Purdue University. We were members of a recently reorganized IBM group that was looking for a mission. At that time, Ted Codd was a computer scientist at IBM's San Jose Research Laboratory and was proposing a new way of organizing data that he called the "relational data model."
One of the most important research areas in computer science in the early 1970s was the development of systems and languages for handling what computer scientists call persistent data. This term denotes data that remains in a computer system indefinitely, until it is explicitly deleted. Systems for managing persistent data were spreading quickly in the business world. A database management language proposed by the Codasyl Data Base Task Group (DBTG) [1] was receiving a lot of attention. Ray and I spent some time studying this language, learning concepts such as "currency indicators" and "set occurrence selection." With a little practice, we learned how to represent a database query in the form of a program that navigated through a network of pointers to find the desired information.
Designing a Relational Language
For Ray and me, our exposure to the relational data model at Codd's research symposium was a revelation. For the first time, we could see how a query that would require a complex program in the DBTG language could be reduced to a few simple lines using one of Codd's relational languages. It became a game for the two of us to invent queries and challenge each other to express them in various query languages.
One of the queries that came out of this game was as follows: "Find names of employees who earn more than their managers." The query was based on a three-column employee table. Each row of the table represented an employee and contained a name, a salary, and the name of the employee's manager. (This is a simple example. In a real application, employees would be identified by some unique identifier such as an employee number.) Table 1 shows the structure of the table with four example rows.
The third row of the table indicates that Baker's salary is $50,000 and Baker's manager is Smith. The first row indicates that Smith's salary is $45,000, so Baker earns more than his manager. Similarly, Nelson's salary is $55,000, but Nelson's manager is Baker, who earns $50,000, so Nelson also earns more than his manager. The result of the query, based on these four sample rows, is Baker and Nelson.
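The walkthrough above can be checked with a minimal Python sketch. It uses only the three rows whose contents are stated in the text (the fourth row of Table 1 is not described here, so it is omitted), and it assumes that Smith's manager is unknown. The names `EMPLOYEES` and `earns_more_than_manager` are hypothetical, introduced only for this illustration.

```python
# Rows recoverable from the text, as (name, salary, manager).
# Assumption: Smith's manager is not given in the text, so it is None here.
# The fourth row of Table 1 is not described and is left out.
EMPLOYEES = [
    ("Smith", 45000, None),
    ("Baker", 50000, "Smith"),
    ("Nelson", 55000, "Baker"),
]


def earns_more_than_manager(rows):
    """Return names of employees whose salary exceeds their manager's salary."""
    # Look up each person's salary by name (names act as keys in this toy table).
    salary_by_name = {name: salary for name, salary, _ in rows}
    return [
        name
        for name, salary, manager in rows
        # Skip employees whose manager does not appear in the table.
        if manager in salary_by_name and salary > salary_by_name[manager]
    ]


result = earns_more_than_manager(EMPLOYEES)
print(result)  # ['Baker', 'Nelson']
```

The dictionary lookup plays the role of matching each employee row with its manager's row, which is the step that each query language in the game had to express in its own way.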
In his research papers, Codd introduced two relational query languages, called Relational Algebra [2] and Relational Calculus (also known as the Data Sublanguage Alpha [3]). Relational Algebra consists of several operators, usually represented by symbols such as those in Figure 1. Using these operators, the query about well-paid employees could be represented as in Figure 2a.
Codd's Relational Calculus was based on a notation used in formal logic, using an existential quantifier ∃ (meaning "there exists") and a universal quantifier ∀ (meaning "for all"). Similar to Relational Algebra, Relational Calculus could represent the well-paid employee query compactly (see Figure 2b).
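Figures 2a and 2b are not reproduced in this text. As a rough illustration only, and not necessarily in the notation of the original figures, the well-paid employee query might be written in textbook style as follows. The relation name EMP, the attribute names name, salary, and manager, and the rename operator ρ are all assumptions made for this sketch.

```latex
% Relational algebra (textbook style): join EMP with itself, matching each
% employee e to the manager row m, keep pairs where e earns more, project the name.
\pi_{e.\mathit{name}}\Bigl(
  \sigma_{e.\mathit{salary} > m.\mathit{salary}}\bigl(
    \rho_{e}(\mathrm{EMP}) \bowtie_{e.\mathit{manager} = m.\mathit{name}} \rho_{m}(\mathrm{EMP})
  \bigr)
\Bigr)

% Tuple relational calculus (textbook style): the same query using the
% existential quantifier to assert that a matching manager row exists.
\{\, e.\mathit{name} \mid \mathrm{EMP}(e) \land
   \exists m\,\bigl(\mathrm{EMP}(m) \land m.\mathit{name} = e.\mathit{manager}
   \land e.\mathit{salary} > m.\mathit{salary}\bigr) \,\}
```

Both forms fit on a line or two, which is the compactness the text describes, but both also depend on symbols and subscripts that are awkward to type at a keyboard.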
Ray and I were impressed by how compactly Codd's languages could represent complex queries. However, at the same time, we believed that it should be possible to design a relational language that would be more accessible to users without formal training in mathematics or computer programming. We believed that barriers to widespread acceptance of Codd's languages existed on two levels. The first barrier came from the mathematical notation, which was hard to enter at a keyboard. This barrier was superficial and could be easily dealt with by replacing symbols with keywords—for example, replacing π with "project" and ∀ with "for all...