Task #561
Updated by Aaron Marcuse-Kubitza almost 12 years ago
h3. Rationale
* this makes it possible to append data from multiple sources without having pkey collisions
h3. Forming the ID value
* for datasource-specific data, use the datasource name + _intrinsic_ identifiers such as a barcode or author plot code
** _avoid_ using the datasource's pkey if it's autogenerated, because such a value is arbitrary and not related to the data itself
** use a standard path syntax:
*** "LDAP dn":http://www.ietf.org/rfc/rfc2253.txt
*** "URN":http://www.ietf.org/rfc/rfc3406.txt
*** "XPath":http://www.w3schools.com/xpath/xpath_syntax.asp
* for hierarchical names, create a path from the names
** include only as many parent nodes as are needed to guarantee global uniqueness
*** e.g. for taxonomic names, the family is globally unique, so higher taxa do not need to be included
* note that the ID field does not need to be the _only_ way to uniquely identify a record; it just needs to be a way that is a _single text value_
* the value should be a human-readable text string rather than the cryptographic hash of such a string, to make it visually obvious what the ID refers to
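
The rules above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the @/@ separator, the percent-encoding of segments, and the datasource names in the usage note are all assumptions made for the example.

```python
from urllib.parse import quote


def make_record_id(datasource, *parts):
    """Join the datasource name and intrinsic identifiers into one readable text ID.

    Hypothetical helper: the "/" separator and percent-encoding are assumptions,
    chosen so that a "/" inside an identifier cannot be mistaken for a separator.
    """
    if not datasource or not parts:
        raise ValueError("need a datasource name and at least one intrinsic identifier")
    segments = [datasource, *parts]
    return "/".join(quote(str(segment), safe="") for segment in segments)


def trim_to_unique_ancestor(lineage, unique_rank="family"):
    """Keep only the names from the first globally unique rank downward.

    `lineage` is a list of (rank, name) pairs ordered from highest to lowest rank.
    If the unique rank is absent, the whole lineage is kept.
    """
    ranks = [rank for rank, _ in lineage]
    start = ranks.index(unique_rank) if unique_rank in ranks else 0
    return [name for _, name in lineage[start:]]
```

For instance, @make_record_id("example_herbarium", "barcode", "00012345")@ gives @example_herbarium/barcode/00012345@, and a lineage of kingdom, order, family, genus and species is trimmed to start at the family before being joined into a path.
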
h3. Performance considerations
* to avoid affecting row lookup performance when using arbitrary-length strings as pkeys, a hash index should be added on the ID field
** this is in addition to the usual btree index, which is still needed for merge joins and retrieving the data in sorted order
** looking up a string through a hash index is roughly equivalent to looking up just the hash value, because the hash is usually cached
** a hash value is roughly equivalent to an autogenerated integer, because it is a similar number of bytes long
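
The size argument can be illustrated with a small sketch. It uses CRC-32 from Python's @zlib@ module purely as a stand-in for a fixed-width hash; the hash function a database actually uses for its hash index is different, and the IDs below are made up for the example.

```python
import zlib


def hash_key(record_id):
    """Map an arbitrary-length text ID to a fixed-width 32-bit value.

    CRC-32 is an assumption used only to show the fixed width; it is not
    the hash function of any particular database.
    """
    return zlib.crc32(record_id.encode("utf-8"))


short_id = "example_herbarium/barcode/00012345"
long_id = "example_plots/" + "/".join(["Fagaceae", "Quercus", "alba"] * 10)

for record_id in (short_id, long_id):
    text_bytes = len(record_id.encode("utf-8"))
    hash_bytes = len(hash_key(record_id).to_bytes(4, "big"))
    # The text length varies, but the hash always fits in 4 bytes,
    # the same width as a typical 32-bit autogenerated integer key.
    print(text_bytes, "bytes as text,", hash_bytes, "bytes as hash")
```

However long the ID string grows, the value compared during a hash lookup stays the same width, which is the basis for the comparison with an autogenerated integer.
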