Project

General

Profile

stage II vs. stage IV validations

  • prediction: all validations can be implemented as either stage II, III, or IV validations, but with varying levels of difficulty
  • prediction: stage III will always be the most difficult stage to implement a validation at, due to the inability to reuse previously-calculated values
+ −
− +
stage II stage IV
schema dependence + pro: schema-independent − con:
• needs to be programmed separately for each schema
• requires an output schema with the derived column formulas in it
create time + pro: derived columns will be created without needing to go through all of column-based import − con: getting the derived columns requires column-based import (or an equivalent substitute).
because column-based import takes a long time, this adds a lot of time to the current import process before the derived columns are available.
however, this problem would be mitigated by switching to natural pkeys, which do not require column-based import.
availability + pro: already in the staging tables − con: not until the normalized DB
caching + pro: generated only once for each datasource, then left in place − con: generated for all datasources each time column-based import is run
input columns + pro: formulas can use any columns in the staging table. − con: it is tricky for formulas to use columns from a cluster of normalized tables2, instead of just columns inserted directly into the output table.
however, maybe the subset used by the derived column should always be included in the output table anyway, to provide the values the derived column was generated from?
dependencies − con: requires stage I renaming of staging table columns to VegCore.
this in turn requires disambiguating input columns that are mapped to the same VegCore column1 (collisions).
+ pro: does not require stage I renaming
skills to maintain − con: both psql and bash, as well as understanding how to use our idempotent wrapper functions (util schema > functions with "idempotent" in the Comment column) + pro: just PL/pgSQL triggers
level of automation − con: a script must be run to populate them + pro: the schema populates them automatically, behind the scenes
supported import methods − con: only works with new-style import + pro: works with any import method to the schema

1 e.g. SpeciesLink map.csv > sort by VegCore column > rows with /_alt/# in Filter column

2 e.g. in normalized VegCore, taxon_name.unique_name cannot be populated from taxon_path.family/genus/specific_epithet until taxon_path has been inserted (after taxon_name due to the fkey). the population would occur in a taxon_path trigger that cross-updates taxon_name to the concatenation of the ranks if it is still NULL.
in general, a cross-update can overwrite a parent table's field or set-default it without needing a flag field, but inserting a value into the middle of a COALESCE() list requires the parent table to maintain an additional flag to indicate whether the derived field was populated from values with higher or lower precedence than the child table's values (i.e. inserting into the middle requires cooperation from the parent table).