Foreword: Appolonius' ship is a beautiful ship. Over the years it has been repaired so many times that there is not a single piece of the original materials remaining. The question is, therefore, is it really still Appolonius' ship?
This analogy is full of ship :-)
Consider the Rdb Tuple (in the sense of a Relational Database). It is an atomic lump of data like a row in a database, or an object, or an element of an array, etc. Basically, it is a single identifiable piece of information.
The question is how do we identify the information? What are the options? What are the implications?
Firstly, what do we mean by identity? There are two varieties of identity. There is internal identity, and there is external identity. Internal identity is logical identity. External identity is physical identity. Internal identity is independent of context. External identity is not.
A data tuple has a logical identity according to its value(s). We can copy it to different places. We can duplicate it as many times as we like. Yet at all times we can identify it, by its value. Since we identify it by its value, we can say that this identity is internal. Furthermore we can say that this identity is logical. It is independent of the physical context or representation. If it is duplicated then it is still the same object. This identity is the one we mean if we consider that the new ship is indeed still Appolonius' ship.
There are a class of objects, known as Value Objects, with similar properties.
A data tuple also has an identity according to its physical representation. We cannot copy this it to different places. We cannot duplicate it. Its identity is bound to the particular physical location of the tuple. This is true irrespective of the value of the tuple. Thus this identity is physical identity. Since the identity is independent of the value of the identity, it is also external identity. This identity is the one we mean if we consider that the new ship is no longer Appolonius' ship.
But even this is squishy. The "particular physical location" of a given tuple may change at any time, at the whim of the RDBMS. Generally, one doesn't care about where the canonical set of bits are; that's an Implementation Detail.
If we have a computing environment with a single, globally addressable space, and it is true that a tuple need never be copied, or moved, and still be considered the same object, then simple physical external identity is enough. The bonus is that since the object is in one place, access may be controlled.
If we have an environment in which these constraints are not true, then physical identity is not enough in itself. Virtually all production environments are of this latter type; you may have a database server, some number of application servers, and a greater number of clients -- all with their own address space.
Instead a scheme which uses logical identity is indicated. The minimum requirement for logical identity is an explicit identifying key which is immutable and associated with the tuple we wish to identify, for as long as we wish to identify it. The penalty is that the object cannot be reliably shared as there is no single point of access control.
This is, of course, how the Relational Model implements "identity", such as it is. Every tuple within a Relational Database has a unique (Relational Variable, Candidate Key) pair; if table X has a tuple with Candidate Key Y; there will be no other tuples with those properties.
Pseudo-physical schemes explicitly decorate an object with a pseudo-address such as a GUID, and clients work using a convention which emulates the properties of a physical space. This is necessarily complicated and error-prone and still requires that there is some single physical point (somewhere) for access control.
Understanding identity (for all values, not just object references) is just a matter of understanding how to define an appropriate identity relation. Equal Rights For Functional Objects describes how to do that. The issue is much simpler than is commonly believed; there's no need to have interminable philosophical discussions about it. Basically, if two values are identical then they are necessarily observationally equivalent. Identity should be the finest implementable equivalence relation on values that satisfies this rule. So if two objects can change independently, references to them cannot be identical. For immutable objects, OTOH, identity should be determined by intentional equality. -- David Sarah Hopwood
Like you said, understanding identity is not big deal. But this observation only shifts the problem elsewhere, without resolving much. Namely if we take the ability to test identity of object references for granted, is it useful to actually deploy it? When and where?
Suppose you want to be able to test whether two objects are "the same", in some loose sense, given references to them. (This also includes situations like using objects as keys in a mapping.) Then, in any particular case, it will usually be obvious whether directly using the identity operator will do what you want, or whether you need to explicitly code a different operation to test for "sameness". In general, using the identity operator won't work iff the identity of the objects does not canonically correspond to whatever notion you want to use in determining sameness. (If they do correspond, consider also whether this is a maintainable property of the design or just an artefact of the current implementation.) >> artefact implementation
For example if you have a structure A having a relationship with a structure B, should that relationship be reflected by a reference to B as part of structure A, or should it be implemented as an identifying value for B stored in structure A? That's what this page is mostly about.
Mu Answer. First ask whether the relationship should be implemented as something stored in structure A at all. Maybe it should, but another possibility would be to implement the relationship as a separate relation instance (tuple, in the Relational Model) that refers to both A and B. In any case, you still have to refer to an instance of B, so let's look at how to do that.
Clearly, an object reference to B is an example of an "identifying value for B". Assuming you're using a language and persistence mechanism with first-class support for references, there are few reasons not to use references as identifying values. But check that you really do want to refer to B, and not to "the object with property P, which currently happens to be B". In the latter case, the reference should be to P, and you need to recurse (from "Mu" above) to establish how to get from P to B. (Maybe you will end up adding shortcut links later, but don't do Premature Optimization.)
Congratulations, you just reiterated the same position that I claimed both on this page and other related pages. I'm not sure whether your arguments are much more persuading, indeed this can be subject to interminable philosophical discussion. Basically it's about folks with strong OO convictions, who will not readily accept "the object with property P that currently happens to be B". And part of this blame is that a good part of OO thinking has been hijacked by philosophical thought.
We have examples like Bertrand Meyer's very fine discussion on the topic where he mentions the question whether you can ever swim in the same river twice, or whether you stay the same although the atoms in your body are replaced by physical processes. Very touchy piece of philosophy, but utterly irrelevant for software engineers. So we have yet to find the most eloquent argument that will convince the smuggest smug object weenie to make the psychological move from "pointer to B" to "the identifying property P". After all, almost the whole OO literature is filled with repetitive examples of how one "manages" relationships as collections of pointers, and I have yet to find one piece of popular OO literature that says otherwise.
Please feedback here:
The strong conclusion of Object Identity should be that there's no real alternative to logical object identity.
Not even GUIDs and global address spaces. And here comes the real life problem: many people and especially OO fans, when dealing with relational databases are happy to [deal?] through a GUID/OID/ID/surrogate key in every table, and declare themselves victorious over the Object Identity problem. Most of the times this approach is an Anti Pattern, but anyways, it has been published in books, architectural blueprints, guidelines and all over the place, because Authors Dont Read. And what OO advocates should be reading is Fundamentals Of Object Oriented Databases.
Many funny example are seen in books and in published papers. For example, zip codes table like ZIP_CODES (ZIP_ID, ZIP_CODE). Or LANGUAGES(LANGUAGE_ID, LANGUAGE_ISO_CODE, LANGUAGE_NAME). Why do you ever need an OIDs for such information? The other terrible thing to do is to attach OIDs to relationship tables for example SUPPLIER_PRODUCT( OID, SUPPLIER_ID, PRODUCT_ID).
What really makes these OIDs thingies very tempting are the rudimentary implementations of relational databases. Because the only possible and justifiable reason for these OIDs thingies in relational database schema, is that primary keys and foreign keys are badly implemented at the physical level by database engines. An OID can be usually implemented as 4 bytes integer, and this pays very well in the efficiency of joins, checking foreign key constraints, key lookups and all these essential things in a system. The problem is that physical layout concerns surface in the logical model, thus selling short the promise of the relational model. There's absolutely no reason why OID thingies shouldn't be handled automatically at the physical level by the database engine.
For example, if you create a natural primary key for a object like ADDRESS, you're going to get a huge performance penalty in almost all database systems. The only primary key for the ADDRESS should be STREET_ADDRESS/ZIP_CODE/COUNTRY, and a database engine should be smart enough to realize that these 3 fields are quite bulky in physical terms. So internally the system should put physical pointers wherever is necessary to enforce foreign keys constraint, and use a smart hashing to enforce the primary keys. If he's not smart enough to do this automatically, at least it should have a manual option with the CREATE TABLE. But with current database systems to use these three attributes as primary key/foreign key is almost pure suicide. So here goes our logical identity down the drains. Well, it's nice in theory and what you should never forget to do, is at least to enforce it as a unique constraint. And don't use OIDs for zip codes, languages, currencies, accounts and all these things that have perfectly implementable natural keys.
Nevermind, the DB vendors are very busy implementing the latest nonsense in the hype magazines, like storing data in "native-XML", or putting "full-fledged" object systems inside the database so that the marketing department [?]. My favorite example is Oracle where you get objects and object references, but with them you also get the suggestively named operator IS_DANGLING(ref).
Glad you dropped in Costin. Quite a lot here. I won't argue about definitions too much as I hope my meanings are clear even if my terminology is a little off. Identity is an architecturally relevant subject IMO. A purely logical definition of tuple may be correct, but I hope it is clear that as soon as it is represented, it has a physical identity. It must be written somewhere. This location is what is generally used when people talk about object identity.
With respect to the non-portability of object identity, this is a subtlety of implementation. Objects are effectively decorated with an OID. This makes the identity logical (part of the data), but it is really pseudo-physical (an address). Thus it doesn't save us from the issues of duplication and equality. One thing I'm trying to explore is why these things are a bad idea; as well as the fundamental metaphysical phenomena which cause these issues to arise. I will look at the references you indicated. Cheers -- Richard Henderson.
Re: The only primary key for the ADDRESS should [ideally] be STREET_ADDRESS/ZIP_CODE/COUNTRY
I disagree. Addresses can change, but still be the same location. For example, if they build a house nextstore and there is not enough address space, then you may be assigned "23 and 3/4" instead of just "23". Or, the street may be renamed for a famous person. It is generally not prudent to use attribute values for keys in my opinion except in very narrow circumstances. Thus a generated addressID or contactID is nice to have.
What is so bad from a software point of view if a street changes its name and all its houses now have a new identity?
[There are issues of efficiency. There is a combinatorial cost factor between changing a street name and having all views of houses automatically 'capture' this change, and of changing every house on the street. Not only do you need to update a factor more objects in the house table itself, but, supposing you were shipping orders to these houses then those orders would also need updated... and same with every other table using this immutable address. Storage and duplication isn't an issue... any RDBMS is free to intern and garbage-collect large values. It's the impact of change that is relevant from a software point of view.]
[Frankly, software shouldn't concern itself all that much with 'external' object identity. Attempting to model the external world in terms of objects is only useful insofar as you are then able to run a good simulation or prediction engine. I.e. go ahead and do it for physics or chemistry, but not for businesses or politics. For the systems where we don't have a good object-based simulation model, it is better if software is designed in terms of 'data collected, inferred, and assumed' instead of objects. Optimizing the data representation for certain mutations can occur based on the pattern of actual mutations seen in practice. Issues of 'object identity' can be left to the dining philosophers - it's a Difference That Makes No Difference.]
Object identity is 'natural' in object space, since it is the defining property which creates the space by distinguishing between objects. So no decoration is needed... until portability outside object space becomes required. This could be portability either to a relational database, or to some different object space.
Portability requires an identification space *external* to the spaces it connects. Allocating numbers using a database seems good to me... not so keen on GUID generators. Number spaces can involve the database without belonging to the database's object tables; using separate allocator tables, and possibly client range caching.
I've got some discussion of object identity issues in an article on persistence layer architecture.
There are many aspects to Object Identity.
'''value identity''' (logical identity mentioned above) - an object is identified by its value, like 01Jan2002. '''global identity''' (e.g. GUID) '''hierarchical global identity''' (e.g. www.c2.com/cgi/wiki?ObjectIdentity) '''dictionary, list, and variable storage''' (e.g. customer.name, customers[3], x := 7) '''intersections''' (e.g. tom jordan + joe smith + dave jones) '''machine id vs. human readable id''' '''rdbms rows''' '''relationships with attributes''' (e.g. tom ''has wife'' becky ''for ten years'') '''etc.'''
I am working on designing a simple OO language. What I need is almost a hierarchical global identity system (plus value id'd objects). So it is very tempting to use hierarchical global identity. A mathematically "clean" solution. But this doesn't handle intersections or temporary variables very well, and makes machine addressing problematic. And I want my language to handle people (e.g. John Smith).
So I gave up on a "theoretical" solution, and reverted to a "best practices" solution - I looked at every good object identity technique I could find in all the different languages I knew about, and mixed them together.
What I came up with (so far) is the concept of "names". A storage space can have a name. An object can have a name. Names can be JohnSmith or XJW3974TZ. Plus there are value identified objects. I haven't figured out intersections yet. But I might not really need them. -- Stan Silver
Been a while since I looked in here. Good to see people are thinking along the same lines as me. On global identity. After much thought on how one might approach identity I also started with the idea of a GUID, but that doesn't allow an identity to be augmented or chained. Furthermore an object may be identified differently by multiple systems etc. So I thought that an object's identity needs to be regarded separately from its content if we are to gain this sort of flexibility. An envelope-letter idiom is thus appropriate which may be expressed as a decorator in the object-world or whatever.
One problem is that the object's identity can become rather unwieldy, and tends to pollute local object-spaces. So instead I think it is better to have a separation of global and local object identities. From here I prefer the word 'shared' to 'global' as it is more correct. The scheme I envisage involves object domains. An object's original domain is its source domain. This domain represents the master-copy of the object. Write-transactions will need to be applied in this domain if ACID properties are desired.
Systems which share objects will need to understand domains, presumably as a part of an object's functional interface e.g. shared objects may inherit a 'sharedObject' interface. I imagine such systems will have a component which handles shared objects. The shared-object interface will interact with this component, passing an implicit parameter of the local object-identity. This general scheme generates an explicit micro-architectural element to handle the particular architectural problem of object-identity boundaries. It maps easily to existing identity schemes e.g. registries. -- Richard Henderson.
It seems to me that Object Identity gets complicated mostly because we allow objects to change. For example, uniqueness is only required when an object needs to change. An Immutable Object can be freely replicated because it can't change. I think that when we assert that identity is wrapped up in the Primary Key (for the relational folks) or Memory Address (for the Object Oriented folks), we confuse the name with the entity being named. This is common and pernicious, and still wrong.
So the approach I've come to prefer is something more like Envy Developer, Lotus Notes, and Kala: the notion of "Write Once" objects (almost like Functional Programming).
The premise is that any object that has a name is also immutable and persistent. The environment promises that that object will always be available with that name in its current state. When I change it, I make a copy that contains the new values and that has a new name. There is probably a container of some sort (its own different object) that denotes the relationship between the new and old objects. In Envy, think versions and editions. Instead of attempting to manage the change process through transactions, the system sends sets of changed objects together - a Unit Of Work. In a distributed network, this can be likened to a wave-front propagating through the network. The system can maintain as many copies of any object as it likes, with as many aliases as are convenient. The only uniqueness constraint is that within any transitive closure, each alias resolves to one and only one object. In this kind of environment, "Object Identity" is preserved, but softly. A Big Idea for me was to realize that ANY externalized name for an object in such an environment is specific to a namespace - whether a Primary Key in a relational database, an index in an object table, and address in memory, whatever. The Object Identity subsumes all the names for the object, and an important role of the environment is to ensure that that subsumption follows known and predictable rules.
Read-only objects do not require any more than logical identity i.e. they are 'value objects'. They are in the minority, however. Sharing of mutable data requires a pseudo-physical identity beyond any one value. Metaphorically, the labelled box which contains the value. Versioning, and its general case of time-series data is a useful approach to soft transactionality via optimistic concurrency. See Time Series In Sql for a rather short discussion.
Metaphorically the labelled box which contains the value.
Being the purist that I am, in my view this "labelled box" itself is a versioned, read-only object. I posit that an object is only mutable within a local computation, during which that object has a local ID. At the conclusion of that computation, any new values are hardened and distributed if they are to be available elsewhere.
Meanwhile, Brian Cantwell-Smith wrote a paper years ago about reflection in lisp in which he identifies three domains: Notation, Representation, and Meaning. An "object" is by construction an entity that exists in each of the three, and also transcends all three. For example, there are notations like "4" "Four" and "IV". Those notations can translated into a representation of "four" in some system (which presumably manipulates it). Associated with that symbol (if the system is reasonable) is the platonic idea of "fourness". Each of those three entities is itself an object, and there is yet another that ties them all together. The object that represents the numeral "4" itself has a notation, a symbol that represents it, and a meaning.
I think it's a mistake to elevate any one aspect of this to primary importance while minimizing the others - and Cantwell-Smith's approach (which I attempt to reflect in the environments I build) has been for me a very helpful way to conceptualize Object Identity.
Okay. How would you handle an account balance, say? Multiple clients have access. Updates are conditional on the existing value e.g. a negative value is not allowed.
I don't think I understand your question. I receive a versioned read-only "Account" object upon request from some service. That account object has a behavior whose selector is "balance". What's the problem?
The problem is one of sharing. An account balance may be updated by a number of independent processes. Those processes need to synchronize somewhere to avoid an invalid update. An object's identity needs a concrete singularity when dealing with these issues.
I don't know what you mean by "a concrete singularity", and I don't understand your last sentence at all.
I suspect we are in cognitive dissonance here. A pure value, without identity, is equivalent to a value-object. Any copy of it will do, with the proviso that it cannot change, or else it becomes something else. A specific identity for something is established when it is not the content of something which identifies it, but the space it occupies. It has a physical presence which gives it continuity, even if it changes in some respect.
In computing the classic example is an 'account balance'. An account balance's value changes as we deposit cash into the account, and remove cash from it. Therefore the identity of the account balance is stable despite it changing value. When the balance may be modified by a number of unsynchronized processes, there needs to be some sort of access control. This relies on what we have logically established as the identity of the object, which needs to have some kind of pseudo-physical uniqueness which I'm describing as 'concrete singularity'. Is that nay clearer?
Purely 'functional' languages seem to have a bit of a problem dealing with this aspect of computation. I suspect the 'monad' which gets mentioned occasionally fills the need, but I'm not sure.
Yes, that helps, and yes, I think the "monad" (I learned the term from Kala) fills the need. I think of this as the container, or envelope. The difference (in comparison to "regular" code) is that the container also has an identity, so that the reference itself is reified (and therefore has an version). An "Account" object has a version and provides the ability to answer a "balance" (I prefer to hide the specifics of whether the balance is itself an object or is instead a possibly cached computation). An Account object may also provide for some or all of its clients behavior that allows its state to change, such as "deposit" or "withdraw". In such a case, the Account is "softened" - a copy of the current version is created, but such that its contents can be modified. This softened object is not available to any other computations. When the appropriate change in value is finished, the soft Account object is "hardened" and assigned a version. Each "hard" object has its own version identifier, different from whatever version identifier the Account object had when the computation began. A different behavior of the Account (or its container) manages which of the hardened versions of the Account is the "current" account. Reconciliation (and synchronization) of multiple parallel changes is handled by this second behavior. Please note that each of those changes is retained as its own hardened version; the reconciler might need to walk those and construct its own merged version of the changes. I hope this helps.
Okay, so we all end up at the same place, in the end :). Thanks for the explanation.
The OO way where object identity is reference identity is the right way to go. It is the general case which relational databases are hacked to reproduce by the construction of artificial object primary keys.
Take for example a bunch of students. A "student ID" is not part of being a student. It's better to simply refer to each student by reference and not have any student IDs. In the special case where these references are distributed outside of computer systems and into human minds, an Auto Generated Key provides a solution. But see Auto Keys Versus Domain Keys.
As someone who works in a college, I'd like to interject that quite the contrary, having a student ID is very much a part of being a student. It's written on your course notes, on your invoice, on your student ID card, and you have to produce it when you sit exams. Since the student records computer system is only part of the whole college experience, it wouldn't make any sense for that computer system to not have a "student ID" field - it wouldn't be true to the reality that it's modelling. Score one for the relational team and zero for the object oriented team. Sorry.
Occasionally, it's necessary to refer to objects by value. To make two or more distinct object be "identical" due to their value. Generally, this is only the case with distributed systems. And in that context, you have several different ways that you can go.
If you genuinely have to identify objects by value, putting an artificial object ID number in the object is still the wrong way to go. The OO way to solve the problem is to keep the object ID numbers outside of the objects. This is superior because it lets you manage the namespace of these objects completely independently of the objects. So you don't need to create separate "student ID", "teacher ID" and "institution ID" when a single OID independent of the objects themselves will do.
In most OO systems the Object Identity is coupled with some discoverable attribute of the object - most commonly it's address in memory. Few if any object systems keep any sort of external table associating Object Identitys with the objects to which they point.
Any halfway functional OO distribution framework has to keep such a table.
"discoverable attribute of the object" can't be very meaningful when it encompasses both pass by value and pass by reference.
Re: "Take for example a bunch of students. A "student ID" is not part of being a student. It's better to simply refer to each student by reference and not have any student IDs."
But how do you refer to a student by reference? Unlike objects in memory which have unique physical addresses (assuming the blasted things don't move around on you), students -- and humans in general - don't have a natural Primary Key -- at least one that's convenient for data processing. (One could, I suppose, use a DNA sequence or a fingerprint analysis, but that would be cumbersome, difficult, and expensive). So you got to make one up.
Note that you're no longer talking about objects in a computer system. You're now talking about distributed objects, with the added proviso that they be human-readable.
From an administration, processing, and debugging standpoint, it is really nice to have some kind of unique, readable key that does not change over time. Further, without unique ID's, fraud and embarrassing mixups are more likely. Would you like somebody else with a similar name to get your report card? Social Security Numbers don't work with foreign students, by the way. Unique, stable identifiers just make office life simpler.
29/4/03 - Refactored to be more general (and shorter). Re-emphasized that some minimal physical identity is always necessary in a shared environment. Considering summarizing responses. Hmmmm.
See original on c2.com