Task #561

Updated by Aaron Marcuse-Kubitza about 11 years ago

h3. Rationale 

 * a single text ID field makes it possible to append data from multiple sources without pkey collisions 

 h3. Forming the ID value 

 * for datasource-specific data, use the datasource name + _intrinsic_ identifiers such as a barcode or author plot code 
 ** _avoid_ using the datasource's pkey if it's autogenerated, because such a value is arbitrary and unrelated to the data itself 
 ** use a standard path syntax: 
 *** "LDAP dn":http://www.ietf.org/rfc/rfc2253.txt 
 *** "URN":http://www.ietf.org/rfc/rfc3406.txt 
 *** "XPath":http://www.w3schools.com/xpath/xpath_syntax.asp 
 * for hierarchical names, create a path from the names 
 ** include only as many parent nodes as are needed to guarantee global uniqueness 
 *** e.g. for taxonomic names, the family is globally unique, so higher taxa do not need to be included 
 * note that the ID field does not need to be the _only_ way to uniquely identify a record; it just needs to be a way that is a _single text value_ 
 * the value should be a human-readable text string rather than the cryptographic hash of such a string, to make it visually obvious what the ID refers to 
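
The rules above can be sketched in Python. This is only an illustration under stated assumptions: the slash-separated path with backslash escaping is a simplified stand-in for the path syntaxes linked above (not an implementation of any of them), and the helper names @make_id@ and @taxon_path@, the datasource name @example_source@, and the sample identifiers are all hypothetical.

```python
from __future__ import annotations


def escape(part: str) -> str:
    """Escape the escape character and the separator inside one path component."""
    return part.replace("\\", "\\\\").replace("/", "\\/")


def make_id(datasource: str, *parts: str) -> str:
    """Join the datasource name and intrinsic identifiers into one text value."""
    if not datasource or not parts or any(not p for p in parts):
        raise ValueError("datasource and every identifier part must be non-empty")
    return "/".join(escape(p) for p in (datasource, *parts))


def taxon_path(ranks: list[tuple[str, str]], unique_rank: str = "family") -> list[str]:
    """Keep names from the globally unique rank downward, dropping higher taxa."""
    names = [name for _, name in ranks]
    for i, (rank, _) in enumerate(ranks):
        if rank == unique_rank:
            return names[i:]
    # Assumption: if the unique rank is absent, fall back to the full path.
    return names


# Datasource-specific record identified by an intrinsic barcode (hypothetical value).
specimen_id = make_id("example_source", "00012345")

# Hierarchical name: only the family and below are needed for uniqueness.
ranks = [
    ("kingdom", "Plantae"),
    ("order", "Fagales"),
    ("family", "Fagaceae"),
    ("genus", "Quercus"),
    ("species", "alba"),
]
taxon_id = make_id("example_source", *taxon_path(ranks))

print(specimen_id)  # example_source/00012345
print(taxon_id)     # example_source/Fagaceae/Quercus/alba
```

The result stays a single human-readable string, so it is visually obvious what each ID refers to, and escaping the separator keeps identifiers that themselves contain a slash (such as an author plot code) from being mistaken for extra path levels.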

 h3. Performance considerations 

 * to avoid affecting row lookup performance when using arbitrary-length strings as pkeys, a hash index should be added on the ID field 
 ** this is in addition to the usual btree index, which is still needed for merge joins and retrieving the data in sorted order 
 ** looking up a row by a hashed string costs roughly the same as looking it up by the hash value alone, because the string's hash is usually cached 
 ** a hash value is in turn roughly equivalent to an autogenerated integer, because it is a similar number of bytes long
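
As a rough in-memory analogy for keeping both indexes, the sketch below pairs a Python @dict@ (standing in for the hash index) with a sorted list searched by @bisect@ (standing in for the btree index). It is not a model of how any database implements its indexes; the class name @IdIndex@ and the sample IDs are hypothetical.

```python
from __future__ import annotations

import bisect


class IdIndex:
    """Toy index over text IDs: a dict for point lookups, a sorted list for ordered access."""

    def __init__(self) -> None:
        self._by_hash: dict[str, dict] = {}  # plays the role of the hash index
        self._sorted_ids: list[str] = []     # plays the role of the btree index

    def insert(self, record_id: str, row: dict) -> None:
        if record_id in self._by_hash:
            raise KeyError(f"duplicate ID: {record_id}")
        self._by_hash[record_id] = row
        bisect.insort(self._sorted_ids, record_id)

    def get(self, record_id: str) -> dict:
        """Point lookup: cost does not grow with the sort position of the ID."""
        return self._by_hash[record_id]

    def id_range(self, lo: str, hi: str) -> list[str]:
        """Ordered retrieval: all IDs with lo <= id < hi, in sorted order."""
        start = bisect.bisect_left(self._sorted_ids, lo)
        end = bisect.bisect_left(self._sorted_ids, hi)
        return self._sorted_ids[start:end]


index = IdIndex()
index.insert("src_b/002", {"name": "second source, record 2"})
index.insert("src_a/010", {"name": "first source, record 10"})
index.insert("src_a/001", {"name": "first source, record 1"})

row = index.get("src_a/010")
# "0" sorts immediately after "/", so this range covers every ID under src_a/.
src_a_ids = index.id_range("src_a/", "src_a0")
print(src_a_ids)  # ['src_a/001', 'src_a/010']
```

The point lookup never touches the sorted structure, while the range query never touches the dict, which mirrors why the two indexes serve different access patterns.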
