Task #561
Updated by Aaron Marcuse-Kubitza almost 12 years ago
h3. Rationale

* this makes it possible to append data from multiple sources without having pkey collisions

h3. Forming the ID value

* for datasource-specific data, use the datasource name + _intrinsic_ identifiers such as a barcode or author plot code
** _avoid_ using the datasource's pkey if it's autogenerated, because this is arbitrary and not related to the data itself
* for hierarchical names, create a path from the names
** include only as many parent nodes as are needed to guarantee global uniqueness
*** e.g. for taxonomic names, the family is globally unique, so higher taxa do not need to be included
* note that the ID field does not need to be the _only_ way to uniquely identify a record; it just needs to be a way that is a _single text value_
* the value should be a human-readable text string rather than the cryptographic hash of such a string, to make it visually obvious what the ID refers to

h3. Performance considerations

* to avoid affecting row lookup performance when using arbitrary-length strings as pkeys, a hash index should be added on the ID field
** this is in addition to the usual btree index, which is still needed for merge joins and retrieving the data in sorted order
** a hashed string is roughly equivalent to using just the hash value because the hash is usually cached
** a hash value is roughly equivalent to an autogenerated integer, because it is a similar number of bytes long
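
The rules for forming the ID value could be sketched roughly as follows. This is only an illustration, not part of the task itself: the function names, the @.@ and @/@ separators, the datasource name @example_source@, and the barcode value are all assumptions made up for the example.

```python
def datasource_id(datasource, *intrinsic_parts, sep="."):
    """Join a datasource name and intrinsic identifiers into one text ID.

    Hypothetical helper: the separator and the function name are assumptions.
    Pass intrinsic identifiers (barcode, author plot code), not an
    autogenerated pkey.
    """
    if not intrinsic_parts:
        raise ValueError("at least one intrinsic identifier is required")
    cleaned = [str(p).strip() for p in (datasource, *intrinsic_parts)]
    if not all(cleaned):
        raise ValueError("every ID component must be non-empty")
    return sep.join(cleaned)


def path_id(names, unique_from, sep="/"):
    """Build a path ID from a root-to-leaf list of hierarchical names.

    unique_from is the index of the highest node that is already globally
    unique (e.g. the family for taxonomic names); nodes above it are dropped.
    """
    if not 0 <= unique_from < len(names):
        raise ValueError("unique_from must index an element of names")
    return sep.join(names[unique_from:])


# Datasource-specific record: datasource name + intrinsic identifier
specimen_id = datasource_id("example_source", "BC-00123")

# Hierarchical name: start the path at the family, omit higher taxa
taxon_id = path_id(["Plantae", "Poales", "Poaceae", "Poa", "Poa annua"], unique_from=2)

print(specimen_id)  # example_source.BC-00123
print(taxon_id)     # Poaceae/Poa/Poa annua
```

Both results are single human-readable text values, so they can serve as the ID field directly, without hashing them first.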