New-style import¶
- summarized in *bien3_architecture.pptx*
- Table of contents
- New-style import
*stage I: source-specific*¶
- source-specific loading
- source-specific derived columns
- map to denormalized VegCore schema
- rename the staging table columns to VegCore
- denormalize where needed
- materialize the denormalized view of the staging tables (with a standard table name)
- add
NULL
columns for all VegCore terms that are not present in the datasource - remove source-specific columns
- re-order columns to match the order in the denormalized VegCore schema
- stage I (source-specific) validations
- FIA filtering (see esp. 4b, which has the SQL expression combining the conditions)
*stage II: staging table augmenting*¶
- stage II validations: addition of source-general derived columns to the staging tables
- can be populated with a DB trigger during the
COPY FROM
import (which fortunately does run theINSERT
triggers, unlike ALTER COLUMN TYPE USING - addition of DB triggers occurs before the table is loaded
- a constraint may be added7 to verify that each derived column was populated properly.
note that this requires storing the formula twice, but the 2nd copy could be autogenerated from the 1st. - when a new derived column is added, the function that adds the trigger can also check if the table is already populated, and if so, run an
UPDATE
that gets the existing rows up to date.
alternatively, all triggers except the new one could be disabled and anUPDATE
performed on all rows to run the new trigger (followed by aVACUUM
to remove the dead rows, whichALTER COLUMN
avoids).
- can be populated with a DB trigger during the
7 a constraint is more reliable than a trigger that always runs, because a trigger can be temporarily disabled (or have never been run for some rows), so without the constraint, one cannot be sure that the field was correctly populated for all rows. the caveat is that the formula will be run twice (and in the codebase twice), once to populate the column (a BEFORE
trigger) and once to verify the constraint (a CHECK
constraint, or an AFTER
trigger that is declared a constraint trigger).
.
*stage III: normalization*¶
- normalization only, no validations in this stage
- reduces the complexity of the normalization XPaths, which currently store both mappings and derived column formulas
- once complete, add fkeys from staging tables to VegBIEN output tables
- fkeys should be marked
ON UPDATE RESTRICT ON DELETE RESTRICT
so that attempting to delete rows in the import does not delete rows in the staging table.
this also ensures the current import can't be deleted until a new import has been created and the fkeys transferred to the new import. - to delete a datasource's rows from the VegBIEN output tables, remove the
RESTRICT
fkeys and then garbage collect the corresponding output tables1
- fkeys should be marked
1 try-DELETE
every row, and ignore errors if a RESTRICT
fkey prevents the deletion.
alternatively, a faster method is to subtract each referencing table's rows from the output table's rows (using an anti-join) and delete the remainder.
the referencing tables can be determined by doing a test-TRUNCATE CASCADE
of the output table and seeing which tables the TRUNCATE
cascades to.
.
refactor-in-place method¶
*stage IV: normalized database augmenting*¶
- database triggers to populate source-general derived columns during the normalization
INSERTs
- this validation stage is required for tables that are dynamically updated (e.g. TNRS)
- asynchronous validations
- add fkeys from core VegBIEN tables to validation results tables2
- e.g. VegBIEN taxonomic tables->TNRS; VegBIEN place tables->geoscrubbing results
- add fkeys from core VegBIEN tables to validation results tables2
2 this is from a to-do item in the Stage IV validations slide notes
See also¶