Task #561

open

make VegBIEN ID fields plain-text instead of numeric

Added by Aaron Marcuse-Kubitza almost 12 years ago. Updated almost 11 years ago.

Status: New
Priority: Normal
Start date: 02/14/2013
Due date:
% Done: 0%
Estimated time:
Activity type:

Description

Rationale

  • this makes it possible to append data from multiple sources without having pkey collisions
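
A minimal sketch of the collision problem, assuming two hypothetical datasources ("A" and "B") whose rows carry an autogenerated numeric pkey and a barcode. The row contents, the `append_all` helper, and the "/" separator are illustrative assumptions, not part of VegBIEN:

```python
# Hypothetical rows: each datasource autogenerates its own numeric pkey,
# so the same pkey values appear in both.
source_a = [{"pkey": 1, "barcode": "NY0001"}, {"pkey": 2, "barcode": "NY0002"}]
source_b = [{"pkey": 1, "barcode": "MO0451"}, {"pkey": 2, "barcode": "MO0452"}]


def append_all(sources, key_of):
    """Append rows from every source into one table keyed by key_of(name, row).

    Returns the combined table and the list of keys that collided.
    """
    table = {}
    collisions = []
    for name, rows in sources.items():
        for row in rows:
            key = key_of(name, row)
            if key in table:
                collisions.append(key)
            else:
                table[key] = row
    return table, collisions


sources = {"A": source_a, "B": source_b}

# Numeric pkeys: the rows from B collide with the rows from A.
numeric_table, numeric_collisions = append_all(
    sources, lambda name, row: row["pkey"]
)

# Plain-text IDs built from datasource name + barcode: no collisions.
text_table, text_collisions = append_all(
    sources, lambda name, row: f"{name}/{row['barcode']}"
)
```

With numeric keys, only two of the four rows survive the append; with text keys, all four do.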

Forming the ID value

  • for datasource-specific data, use the datasource name + intrinsic identifiers such as a barcode or author plot code
    • avoid using the datasource's pkey if it's autogenerated, because such a value is arbitrary and not related to the data itself
    • use a standard path syntax:
  • for hierarchical names, create a path from the names
    • include only as many parent nodes as are needed to guarantee global uniqueness
      • e.g. for taxonomic names, the family is globally unique, so higher taxa do not need to be included
  • note that the ID field does not need to be the only way to uniquely identify a record; it just needs to be one that is a single text value
  • the value should be a human-readable text string rather than a cryptographic hash of such a string, to make it visually obvious what the ID refers to
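
The rules above could be sketched roughly as follows. The task does not say which path syntax to use, so the "/" separator is an assumption, as are the function names, the datasource name "SourceA", and the example barcode and lineage:

```python
def make_id(datasource, *parts, sep="/"):
    """Join a datasource name and intrinsic identifiers into one text ID.

    The separator is an assumed choice; components that are empty or
    contain it are rejected so the resulting path stays unambiguous.
    """
    pieces = [datasource, *parts]
    for piece in pieces:
        if not piece:
            raise ValueError("empty ID component")
        if sep in piece:
            raise ValueError(f"component {piece!r} contains {sep!r}")
    return sep.join(pieces)


def hierarchical_id(lineage, unique_rank, sep="/"):
    """Build a path from a root-to-leaf lineage of (rank, name) pairs.

    Only nodes from unique_rank downward are kept, i.e. just enough
    parents to guarantee global uniqueness.
    """
    ranks = [rank for rank, _ in lineage]
    start = ranks.index(unique_rank)  # raises ValueError if the rank is absent
    return sep.join(name for _, name in lineage[start:])


# Datasource-specific record identified by a barcode (hypothetical values).
specimen_id = make_id("SourceA", "NY00012345")

# Taxonomic name: the family is taken as globally unique, so the order
# and anything above it are left out of the path.
lineage = [
    ("order", "Fabales"),
    ("family", "Fabaceae"),
    ("genus", "Acacia"),
    ("species", "nilotica"),
]
taxon_id = hierarchical_id(lineage, unique_rank="family")
```

Here `specimen_id` is "SourceA/NY00012345" and `taxon_id` is "Fabaceae/Acacia/nilotica"; both remain readable, which is the point of not hashing them.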

Performance considerations

  • to avoid affecting row lookup performance when using arbitrary-length strings as pkeys, a hash index should be added on the ID field
    • this is in addition to the usual btree index, which is still needed for merge joins and retrieving the data in sorted order
    • looking up a hashed string costs roughly the same as looking up the hash value alone, because the hash is usually cached
    • a hash value, in turn, is roughly equivalent to an autogenerated integer, because it is a similar number of bytes long
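
To illustrate the size argument, here is a small sketch showing that a 32-bit hash of a text ID always occupies 4 bytes, however long the ID is. CRC-32 is only a stand-in chosen because it ships with Python; it is an assumption and not necessarily the hash function a database's hash index would use. The example IDs are hypothetical:

```python
import struct
import zlib


def hash32(text_id):
    """Return an unsigned 32-bit hash of a text ID (CRC-32 as a stand-in)."""
    return zlib.crc32(text_id.encode("utf-8"))


example_ids = [
    "SourceA/NY00012345",
    "Fabaceae/Acacia/nilotica",
    "SourceB/" + "a-very-long-author-plot-code-" * 10,
]

# Width in bytes of each ID as text versus its hash packed as a 4-byte integer.
text_widths = [len(text_id.encode("utf-8")) for text_id in example_ids]
hash_widths = [len(struct.pack(">I", hash32(text_id))) for text_id in example_ids]
```

Every entry in `hash_widths` is 4, the same width as a 32-bit integer key, while `text_widths` grows with the length of the ID.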