Relational Model

The Relational Model was the first theoretically founded and well thought out Data Model, proposed by Ef Codd in 1970. It has been the foundation of most database software and theoretical database research ever since.

It is ironic, however, that largely because of historical circumstances, its faithful implementations haven't yet succeeded in the marketplace. Early on, computers were thought not powerful enough to support it, and later on users got used to the shortcuts and the compatibility with previous implementations was a good enough excuse. While research in database theory built upon the foundation of the relational model, the DBMS industry has yet to faithfully implement the ideas that Codd laid out in the 70's.

Some of the foundation for Codd's relational model (1970) was laid by research from David Childs of the University of Michigan. Childs published two papers in 1968 about set-theoretic data structures that supported operations such as intersection, union, domain, and range and that enabled applications to query data without knowing its physical structure. The first citation by Codd in his 1970 paper was "Feasibility of a set-theoretic data structure : a general structure based on a reconstituted definition of relation", a paper Childs presented at the 1968 IFIP Congress. See Extended Set Theory.

Briefly, the relational model structures the logical view of data around two mathematical constructs: domains (i.e., data types) and relations. The name relational comes from "relation" as known and widely used in mathematics (math n00bs, see simple.wikipedia.org ), although in database theory the definition of relation is slightly extended.

A domain is simply a set of values, together with its associated operators. It is equivalent to the notion of a type in programming languages.

A relation over the domains D1, D2, ..., Dn is simply a subset of the cartesian product; the usual notation is R "included in" D1 x D2 x ... x Dn. An element of the cartesian set is called a tuple. A database is a a collection of "relation valued" variables (aka Rel Vars, variables whose value at any point in time is a relation), together with the set of integrity constraints that the data must satisfy.

In order to facilitate programming, a named perspective has been introduced. Each domain that defines a relation is associated with a string label (that will be called column name). A column is then the association between a column name and a domain. A relation header is then a set of columns. A tuple becomes then the mapping between each column in the relation header and a value. And a relation is a set of tuples, all corresponding to the relation header. Because column names are unique in a "relation header", the positional ordering in the mathematical definition becomes inessential, and we can therefore identify each data value in a tuple by its column name. This is essentially a programming convenience and the two definitions are essentially equivalent.

Besides the structure of data, the relational model also defines the means for data manipulation (relational algebra or relational calculus) and the means for specifying and enforcing data integrity (integrity constraints).

That's the basics of the relational model. Despite its apparent simplicity, the relational model is very rich and powerful, and is a wonderful tool for doing real software engineering as well as theoretical research.


A Relational Database as implemented today (with tables, rows, SQL as query language) is much more complicated and less powerful than what a database should be in the Relational Model. Tables and rows aren't equivalent to relations and tuples, because SQL doesn't support user-defined data types and because tables are bags, not sets. What is good enough varies with the complexity of the problem you are facing, and for some problems, the implementation of the relational model by current SQL DBMSes becomes really annoying. It is, unfortunately, one of the many cases of Software Engineering Vs Computer Science.

The drawbacks of "bags" have been exaggerated by those overly enamored by purity. While I agree that query languages and RDBMS should gently discourage their use, they shouldn't forbid them outright. Related: Bag Versus Set Controversy Roadmap.

{The benefits of "bags" have not been convincingly portrayed, and the defence of "bags" appears to be limited to a single participant on the C2 wiki. The downsides of bags are considerable, and are understood, recognised, and accepted throughout the database community. Whether there should be NULLs or not, or how to handle update-able views, or what degree of normalisation is an acceptable minimum -- these are all subject to debate amongst both database practitioners and theoreticians. But whether duplicate rows should be allowed or not? The answer is obvious, resounding, and universal: Not.}

Almost all the existing RDBMS vendors have also "voted" to allow bags. That's at least a dozen companies/organizations. In that sense it can be considered the road-tested option of the two, which should give it some weight. I know some of you feel the case against bags is very strong. However, I've yet to see a convincing practical argument, especially when it comes to interfacing to existing systems, which are not going away anytime soon even if you want them to really badly. -t

{Beware of naively conflating historical conceptual mistakes, and performance hacks imposed by old hardware limitations -- plus an awkward "need" to provide legacy support for those within a vendor's line of products -- with any genuine requirement for "bags". There are programming languages that still support GOTOs, for essentially the same reason.}

There are also people who agree that GOTO's are on occasion the best solution to a problem, regardless of legacy support issues. And further, performance issues can still be a bottleneck on today's equipment. Customers always want "more".

{In those rare cases when GOTOs seem appropriate, GOTOs are never a best solution in general, they are only a local "best" when a limited language provides nothing better. Performance issues related to using bags inside DBMSs are virtually unknown on modern hardware of any kind, including highly-restricted mobile and embedded devices. A device that is so constrained as to actually require "bags" in order to meet performance minima will inevitably have so many other limitations that anything mistakenly still called a "DBMS" in that context will be virtually unrecognisable as such.}

Objective techniques to measure whether goto's are or can be "good" have yet to be found. As far as an illustrative performance case where bags are more efficient, consider the case of a event logger table. Non-keyed inserts will generally be faster than keyed inserts because processing the key takes resources and is difficult to split to multiple processors (unique numbering requires central coordination unless pre-assigned ranges are used, making them time-ignorant). But we are digressing off-topic here. There are already plenty of bag-versus-set topics. We should try to link to them rather than copy their concepts here. -t

{Objective techniques to measure whether the smell of poop on your shoe is or can be "unpleasant" have yet to be found, too. That doesn't mean it's any less obviously unpleasant. As for your event-logger table, yes, it might be that inserting into a "bag" is faster than inserting into a Rel Var, but by the same argument it would probably be faster to turn off ACID compliance or not use a DBMS at all. Performance, aside from that implied by algorithmic complexity, is never a reasonable basis for an argument about theoretical matters. Feel free to move this to one of the bag-versus-set topics.}

Your "obvious" has something to be desired. You've been poor at communicating why it's "obvious" to you, and thus remains Walled Garden evidence for you and you alone. As best I can tell, they appear tied to your personal esthetic but impractical obsession with conceptual "purity". And one person's shit is another's fertilizer. Choice can be a good thing.

{It's obvious to almost the entire database community. It's you who seems stuck in a Walled Garden.}

[I can tell you that he isn't the only one. Bags are an elegant way to implement the vertices of a sparse-matrix implementation for a weighted graph, for example -- a data structure par excellence for relational modelling. So, sush now, an apologize already. (At issue, is really whether databases themselves with their preimposed taxonomies are right for relational modelling.) --Mark Janssen]

not be appropriate for any graph representation -- or you're using some other model to implement relations which is not necessarily the Relational Model. As for databases being "right for relational modelling" or not, I'm not clear what you mean by "preimposed taxonomies". Indeed, I'm not clear what you mean at all.>

I could again say the same thing about DB implimentors. Enough Argument From Authority. Show the reasoning in clear steps, not gab. This ain't the goddam dark-ages anymore.


Goto Discussion:

The "Structured Programming" articles and notes written decades ago showed that goto's are mostly bad ("mostly", because even Knuth makes an exception, and many programmers use Halt(), Break(), Exit(), Return() which are higher level forms of Goto). Does anyone read or study the history of programming and the progress we made decades ago with structured programming? If you have read anything about structured programming you would know that there IS in fact objective evidence against goto's and there is logical reasoning against goto's. The latest kids on the block and programmers haven't even read "structured programming" articles and notes - this is a serious problem.

I think Top's continual demand for "objective evidence" is kind of like the Creationist Christian who demands more and more evidence for evolution, even though there is plenty of evidence for evolution. If we provide the Creationist Christian evidence, that christian shouts "Book Stop, I'm not reading more books or articles from Dawkins or Hitchens on evolution, sorry". There is plenty of evidence out there that goto's are bad (but semi-okay for local jumps, like exit() and break() or return(). Have you read Donald Knuth's "structured programming with Goto's" and have you read anything on structured programming before Knuth wrote that? Please don't yell Book Stop because in order for you to actually find the objective evidence, you'll have to read some things that PROVIDE the evidence (not necessarily books, some are articles, some are essays, some are notes).

Are you looking for scientific statistics and data on Goto's? Sort of like how a creationist christian will argue that evolution still doesn't have enough stats and science behind it?

Also, an Edit Hint: should Goto discussion be moved to some page?


Advantages of the relational model:

It is extensively studied, proven in practice, and based on a formal theoretical model. Almost all of the things that are known about it are actually proven as mathematical theorems. The data manipulation paradigm is based on first order logic and is in full support of Database Is Representer Of Facts.

It offers an abstracted view of data. It was among the first major application of abstraction as a way to manage software complexity. It basically abstracts the physical structure of data storage, from the logical structure of data.

It offers a declarative interface (relational calculus) for the specification of data manipulation, that is actually translated to an efficient (sometimes the most efficient) implementation, given a physical data layout and within reasonable heuristic limits.

Drawbacks:

It's never been fully, faithfully implemented. This is by far its biggest handicap.

In spite its simplicity, it's likely you'll find lots of developers, architects, DBAs, book authors, committees who have no clue, but pretend that they have. After all, it would be quite embarrassing for someone to admit that he doesn't know what the Relational Model is.

Much of this can perhaps be blamed on existing implementations and not the theory or concept. See Sql Flaws.

Number 1) and 2) usually generate a vicious circle, because DBMS vendors react to what the market demands and spend money and time implementing purportedly useful extension, which are in fact not only less useful than having a true implementation of the relational model, but they are actually harmful. These extensions tend to be generically called Object/Relational features. The most glaring example is Oracle RDBMS, which introduced "objects", "references" and "collections", together with other essential accompanying features like the cool sounding operator IS_DANGLING. IS_DANGLING is supposed to rhyme with data integrity. Project managers, CTOs and other staff can read some very nice brochures, PDFs and even be entertained with cool sales movies on O/R features. IS_DANGLING hasn't made it to any marketing presentation as yet.


References:

Ef Codd's initial paper is available at www.acm.org

For a substantial exposition of the relational model and the reasons it is so powerful and valuable for the modern software technology, see An Introduction To Database Systems.

For the more mathematically inclined people, Foundations Of Databases is an absolutely delightful reading, although the connection between the theory and the practical engineering values is to some extent left as an exercise for the reader.

For the layman software engineer who won't spend money on database theory, www.brcommunity.com [Broken Link] and the subsequent articles should be convincing enough to make him buy a book (free subscription to the site is required, but won't hurt).

Any standard text book on databases (like Elmasri And Navathe, O'Neil and O'Neil, Ullman). Each of these book tend to elaborate more on some aspects, while taking some shortcuts on others.

For funny examples from really unexpected sources, you can read books like "Oracle XXX, The Complete Reference" where XXX in (7,8,8i), right from Oracle Press.

The most spread misconception about the relational model is that "relational" comes from primary keys and foreign keys that represent a "relation" or "relationships", "relating" rows in different tables. It is not the case; relation is a well-established mathematical concept and the relational model builds upon mathematical properties of relations.


See also Db Debunk



See original on c2.com