Project

General

Profile

New-style import

*stage I: source-specific*

  1. source-specific loading
    • source-specific derived columns
  2. map to denormalized VegCore schema
    1. rename the staging table columns to VegCore
    2. denormalize where needed
    3. materialize the denormalized view of the staging tables (with a standard table name)
    4. add NULL columns for all VegCore terms that are not present in the datasource
    5. remove source-specific columns
    6. re-order columns to match the order in the denormalized VegCore schema
  3. stage I (source-specific) validations
    • FIA filtering (see esp. 4b, which has the SQL expression combining the conditions)

*stage II: staging table augmenting*

  1. stage II validations: addition of source-general derived columns to the staging tables
    • can be populated with a DB trigger during the COPY FROM import (which fortunately does run the INSERT triggers, unlike ALTER COLUMN TYPE USING
    • addition of DB triggers occurs before the table is loaded
    • a constraint may be added7 to verify that each derived column was populated properly.
      note that this requires storing the formula twice, but the 2nd copy could be autogenerated from the 1st.
    • when a new derived column is added, the function that adds the trigger can also check if the table is already populated, and if so, run an UPDATE that gets the existing rows up to date.
      alternatively, all triggers except the new one could be disabled and an UPDATE performed on all rows to run the new trigger (followed by a VACUUM to remove the dead rows, which ALTER COLUMN avoids).

7 a constraint is more reliable than a trigger that always runs, because a trigger can be temporarily disabled (or have never been run for some rows), so without the constraint, one cannot be sure that the field was correctly populated for all rows. the caveat is that the formula will be run twice (and in the codebase twice), once to populate the column (a BEFORE trigger) and once to verify the constraint (a CHECK constraint, or an AFTER trigger that is declared a constraint trigger).
.

*stage III: normalization*

  1. normalization only, no validations in this stage
    • reduces the complexity of the normalization XPaths, which currently store both mappings and derived column formulas
  2. once complete, add fkeys from staging tables to VegBIEN output tables
    • fkeys should be marked ON UPDATE RESTRICT ON DELETE RESTRICT so that attempting to delete rows in the import does not delete rows in the staging table.
      this also ensures the current import can't be deleted until a new import has been created and the fkeys transferred to the new import.
    • to delete a datasource's rows from the VegBIEN output tables, remove the RESTRICT fkeys and then garbage collect the corresponding output tables1

1 try-DELETE every row, and ignore errors if a RESTRICT fkey prevents the deletion.
alternatively, a faster method is to subtract each referencing table's rows from the output table's rows (using an anti-join) and delete the remainder.
the referencing tables can be determined by doing a test-TRUNCATE CASCADE of the output table and seeing which tables the TRUNCATE cascades to.
.

refactor-in-place method

*stage IV: normalized database augmenting*

  1. database triggers to populate source-general derived columns during the normalization INSERTs
    • this validation stage is required for tables that are dynamically updated (e.g. TNRS)
  2. asynchronous validations
    1. add fkeys from core VegBIEN tables to validation results tables2
      • e.g. VegBIEN taxonomic tables->TNRS; VegBIEN place tables->geoscrubbing results

2 this is from a to-do item in the Stage IV validations slide notes


See also