Task #561
make VegBIEN ID fields plain-text instead of numeric
Status:
open
Start date:
02/14/2013
Due date:
% Done:
0%
Estimated time:
Activity type:
Description
Rationale
- plain-text IDs make it possible to append data from multiple sources without primary key (pkey) collisions
Forming the ID value
- for datasource-specific data, use the datasource name + intrinsic identifiers such as a barcode or author plot code
- for hierarchical names, create a path from the names
  - include only as many parent nodes as are needed to guarantee global uniqueness
    - e.g. for taxonomic names, the family is globally unique, so higher taxa do not need to be included
- note that the ID field does not need to be the only way to uniquely identify a record; it just needs to identify the record with a single text value
- the value should be a human-readable text string rather than the cryptographic hash of such a string, to make it visually obvious what the ID refers to
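
The rules above can be sketched in Python as follows. This is only an illustration, not the VegBIEN implementation: the function names, the "/" separator, the datasource name "ExampleHerbarium", and the barcode value are all hypothetical, and the sketch assumes that no name contains the separator.

```python
def datasource_id(datasource, *intrinsic_ids, sep="/"):
    """Build an ID from the datasource name plus intrinsic identifiers
    (e.g. a barcode or author plot code). Hypothetical helper."""
    return sep.join(str(part) for part in (datasource, *intrinsic_ids))


def hierarchical_id(ranked_names, unique_rank, sep="/"):
    """Build a path ID from (rank, name) pairs ordered root to leaf.

    Walks upward from the leaf and stops at the first node whose rank is
    assumed to be globally unique, so higher nodes are left out. If that
    rank is absent, the full path is used. Hypothetical helper.
    """
    kept = []
    for rank, name in reversed(ranked_names):
        kept.append(name)
        if rank == unique_rank:
            break
    return sep.join(reversed(kept))


# Made-up datasource-specific record
print(datasource_id("ExampleHerbarium", "00012345"))
# ExampleHerbarium/00012345

# Taxonomic name: the family is treated as globally unique
taxon = [
    ("kingdom", "Plantae"),
    ("order", "Fagales"),
    ("family", "Fagaceae"),
    ("genus", "Quercus"),
    ("species", "Quercus alba"),
]
print(hierarchical_id(taxon, unique_rank="family"))
# Fagaceae/Quercus/Quercus alba
```

Both results stay human-readable, so it is visually obvious what each ID refers to.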
Performance considerations
- to avoid degrading row lookup performance when arbitrary-length strings are used as pkeys, a hash index should be added on the ID field
  - this is in addition to the usual btree index, which is still needed for merge joins and for retrieving the data in sorted order
- looking up a string through a hash index costs roughly the same as looking up just its hash value, because the hash is usually cached
- a hash value in turn costs roughly the same as an autogenerated integer, because it is a similar number of bytes long
Updated by Aaron Marcuse-Kubitza almost 11 years ago
- Subject changed from make VegBIEN ID fields globally unique instead of autogenerated to make VegBIEN ID fields plain-text instead of numeric