Task #561: make VegBIEN ID fields plain-text instead of numeric - BIEN 3 - NCEAS Projects

open

Status: New
Priority: Normal
Assignee:
Start date: 02/14/2013
Due date:
% Done:
Estimated time:
Activity type:

Description

this makes it possible to append data from multiple sources without having pkey collisions

for datasource-specific data, build the ID from the datasource name plus an intrinsic identifier, such as a barcode or author plot code
- avoid using the datasource's pkey if it's autogenerated, because such a key is arbitrary and not related to the data itself
- use a standard path syntax, such as:
  - LDAP DN (distinguished name)
  - URN
  - XPath
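
a minimal sketch of building such an ID in Python, assuming a URN-style layout. The `urn:vegbien:` prefix, the `make_id` name, and the datasource/identifier values are hypothetical and used only for illustration. Each component is percent-encoded so that a `:` or `/` inside a barcode or plot code cannot be mistaken for a separator:

```python
from urllib.parse import quote

# Hypothetical prefix for illustration; not a registered URN namespace.
ID_PREFIX = "urn:vegbien:"

def make_id(datasource, id_type, value):
    """Build a plain-text ID from a datasource name and an intrinsic identifier."""
    parts = (datasource, id_type, value)
    if not all(parts):
        raise ValueError("datasource, id_type and value must all be non-empty")
    # safe="" percent-encodes every reserved character, including ':' and '/'.
    return ID_PREFIX + ":".join(quote(part, safe="") for part in parts)

print(make_id("NY", "barcode", "00123456"))  # urn:vegbien:NY:barcode:00123456
print(make_id("A:B", "plot", "P/1"))         # urn:vegbien:A%3AB:plot:P%2F1
```

because the datasource name is part of the ID, two sources that happen to use the same barcode or plot code still get different IDs.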
for hierarchical names, create a path from the names
- include only as many parent nodes as are needed to guarantee global uniqueness
  - e.g. for taxonomic names, the family is globally unique, so higher taxa do not need to be included
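
a small sketch of the hierarchical case, assuming the lineage is available as a rank-to-name mapping and that `/` never occurs inside a taxon name. The `RANK_ORDER` list, the `taxon_path` function, and the example lineage are hypothetical. Ranks above the first globally unique rank (family by default, per the note above) are dropped:

```python
# Ranks from most general to most specific (simplified, hypothetical list).
RANK_ORDER = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def taxon_path(lineage, unique_rank="family"):
    """Join the names from unique_rank downward into a path-style ID."""
    start = RANK_ORDER.index(unique_rank)
    # Keep only ranks at or below unique_rank that the lineage actually has.
    names = [lineage[rank] for rank in RANK_ORDER[start:] if rank in lineage]
    if not names:
        raise ValueError("lineage has no name at or below rank %r" % unique_rank)
    return "/".join(names)

lineage = {
    "kingdom": "Plantae",
    "order": "Fagales",
    "family": "Fagaceae",
    "genus": "Quercus",
    "species": "alba",
}
print(taxon_path(lineage))  # Fagaceae/Quercus/alba
```

the kingdom and order are left out because they add length without adding uniqueness.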
note that the ID field does not need to be the only way to uniquely identify a record; it just needs to identify the record with a single text value
the value should be a human-readable text string rather than a cryptographic hash of such a string, to make it visually obvious what the ID refers to

to avoid affecting row lookup performance when using arbitrary-length strings as pkeys, a hash index should be added on the ID field
- this is in addition to the usual btree index, which is still needed for merge joins and retrieving the data in sorted order
- looking up a string through its hash costs roughly the same as looking up the hash value itself, because the hash is usually cached rather than recomputed
- a hash value is roughly equivalent to an autogenerated integer, because it is a similar number of bytes long
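
to illustrate the size argument, here is a rough sketch that hashes text IDs of very different lengths to a fixed 32-bit value. CRC-32 from Python's `zlib` is only a stand-in chosen for illustration; it is an assumption and not the hash function a database would use. The example IDs and the `hash32` name are hypothetical:

```python
import struct
import zlib

def hash32(text_id):
    """Stand-in 32-bit hash of a text ID (CRC-32, for illustration only)."""
    return zlib.crc32(text_id.encode("utf-8"))

ids = [
    "urn:vegbien:NY:barcode:00123456",
    "urn:vegbien:SourceB:plot:" + "very-long-author-plot-code-" * 10,
]
for text_id in ids:
    # Pack as an unsigned 32-bit integer: always 4 bytes, like a typical integer key.
    packed = struct.pack(">I", hash32(text_id))
    print(len(text_id.encode("utf-8")), "bytes of text ->", len(packed), "bytes of hash")
```

whatever the length of the ID string, the value being compared is the same size as a 4-byte integer key.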
