staging table column renaming

part of switching to new-style import

stage I: source-specific

Perform stage I renaming of staging table columns to VegCore

this not only enables the stage II validations, but also allows generating stage I derived columns without the entire XPath infrastructure
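
a minimal sketch of such a rename (the table and source column names here are hypothetical; the targets are VegCore-style terms):

  -- hypothetical staging table; actual schema, table, and column names vary by datasource
  ALTER TABLE "ACAD".specimens RENAME COLUMN "ScientificName" TO "scientificName";
  ALTER TABLE "ACAD".specimens RENAME COLUMN "Collector"      TO "recordedBy";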

93% complete: 38 of 41 datasources switched (remaining: NVS, SALVIAS, TEAM)

  1. change all metadata values to inlined constant columns
  2. use same mappings/VegCore-VegBIEN.csv for all datasources to ensure that all potential derived columns are included in the mapping to VegBIEN, not just the datasource's input columns
  3. add new-style import to datasources without any explicit[1] collisions:
    NCBI, TNRS, herbaria, FIA, GBIF, HIBG, IRMNG, IUCN, JBM, MO, MT, QFA, TEAM, TRT, TRTE, UBC, VASCAN, WIN, XAL, bien2_traits, bien_web, newWorld, test_taxonomic_names (22 of 23)
  4. translate single-column filters[2] to postprocessing derived columns (starting with ones that don't also have multi-column filter collisions), using the steps at Adding new-style import to a datasource:
    geoscrub, ACAD, ARIZ, BRIT, CVS, NCU, NVS, NY, REMIB, SALVIAS, U, UNCC, VegBank (11 of 13)
  5. remove multi-column filter collisions[3] (involves adding stage I derived columns; see the sketch after this list):
    BRIT, CTFS, CVS, HVAA, NCU, NVS, NY, SALVIAS, SpeciesLink, TEX, U, UNCC, VegBank (11 of 13)
  6. remove filter-less collisions (involves adding stage I derived columns): (applicable datasources will be found as new-style import is added to them)
    CTFS, CVS, SpeciesLink, HVAA, TEX (5 of >=5)
  7. translate one-to-many mappings to stage I derived columns: (applicable datasources will be found as new-style import is added to them)
    Madidi, CVS, CTFS, VegBank (4 of >=4)
  8. add new-style import to remaining datasources:
    geoscrub, ACAD, ARIZ, BRIT, CTFS, CVS, HVAA, Madidi, NCU, NVS, NY, REMIB, SALVIAS, SpeciesLink, TEX, U, UNCC, VegBank (16 of 18)
  9. reimport renamed datasources into VegBIEN:
    NCBI, TNRS, geoscrub, herbaria, ACAD, ARIZ, BRIT, CTFS, CVS, FIA, GBIF, HIBG, HVAA, IRMNG, IUCN, JBM, MO, MT, Madidi, NCU, NVS, NY, QFA, REMIB, SALVIAS, SpeciesLink, TEAM, TEX, TRT, TRTE, U, UBC, UNCC, VASCAN, VegBank, WIN, XAL, bien2_traits, bien_web, newWorld, test_taxonomic_names (38 of 41)
    (note that some datasources were new-style from the start)
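
a rough illustration of steps 1 and 5 above, using hypothetical table and column names (the actual renames and derivations are defined per-datasource, e.g. in its map.csv and postprocessing, rather than by hand-written SQL like this):

  -- step 1: a metadata value inlined as a constant column
  ALTER TABLE "SALVIAS".plots ADD COLUMN "institutionCode" text;
  UPDATE "SALVIAS".plots SET "institutionCode" = 'SALVIAS';

  -- step 5: a stage I derived column that merges two colliding source columns,
  -- so that only one column maps to the VegCore term
  ALTER TABLE "SALVIAS".plots ADD COLUMN "elevationInMeters" double precision;
  UPDATE "SALVIAS".plots SET "elevationInMeters" = COALESCE(elev_m, elev_ft * 0.3048);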

[1] i.e. collisions whose presence is indicated by a filter combining the colliding columns

[2] these are different from the multi-column filters that cause collisions

[3] found by searching all map.csv for /_(?!map|nullIf) and excluding matches in archived tables

FIA filtering

stage II: staging table augmenting

  1. move stage III (and IV?) validations to this stage
  2. add validations for new derived columns (see the sketch after this list)
  3. automate geoscrubbing as part of the import pipeline
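
a sketch of what a validation on a new derived column could look like (hypothetical names; shown as a CHECK constraint, which is only one possible mechanism):

  -- hypothetical range check on a derived column
  ALTER TABLE "SALVIAS".plots
    ADD CONSTRAINT "elevationInMeters_range"
    CHECK ("elevationInMeters" IS NULL OR "elevationInMeters" BETWEEN -500 AND 9000);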

stage III: normalization

  1. clean up mappings once stage III validations have been moved
  2. if switching to normalized VegCore, implement refactor-in-place
    1. start by generating the most important tables from denormalized VegCore, using successive DISTINCT ONs (see the sketch after this list):
      taxon_determination, taxon_occurrence, event, place, ...
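
a rough sketch of one such DISTINCT ON step, assuming a hypothetical denormalized source table and illustrative identifying columns:

  -- place: one row per distinct place, pulled out of the denormalized table
  CREATE TABLE place AS
  SELECT DISTINCT ON (country, "stateProvince", county)
         country, "stateProvince", county, "decimalLatitude", "decimalLongitude"
  FROM   denormalized_vegcore
  ORDER  BY country, "stateProvince", county;

taxon_determination, taxon_occurrence, and event would be generated the same way, each with DISTINCT ON over its own identifying columns.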

stage IV: normalized database augmenting

  1. add fkey from VegBIEN taxonomic tables to TNRS results table
    • initially disabled, and then enabled after all VegBIEN taxonomic names are scrubbed
      in PostgreSQL, this can be done by turning on inheritance from a table that has the fkey, once the names are scrubbed.
      alternatively, the fkey could be deferred until the end of the transaction (see the sketch after this list).
      however, a rollback (e.g. if the deferred check fails) would also undo any changes made to other tables, such as the TNRS results table.
      this problem could be reduced by only inserting new names into the DB a few at a time, so that a rollback only affects a few names.
    • what happens when TNRS is offline? can no new names be added to the DB?
  2. add fkey from VegBIEN location tables to geoscrubbing results table
  3. add database triggers to populate derived columns which are not added in stage II
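
a minimal sketch of the deferred-fkey alternative under item 1, with hypothetical table and column names standing in for a VegBIEN taxonomic table and the TNRS results table (assumes the referenced column is unique):

  ALTER TABLE taxon_determination
    ADD CONSTRAINT taxon_determination_tnrs_fkey
    FOREIGN KEY (scrubbed_name) REFERENCES tnrs_results (input_name)
    DEFERRABLE INITIALLY DEFERRED;
  -- the fkey is checked only at COMMIT, so within a transaction names can be inserted
  -- before their TNRS results exist; if any name is still unmatched at COMMIT, the
  -- whole transaction rolls back (hence the few-at-a-time caveat above)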
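
and for item 3, a derived column could be populated by an ordinary trigger; a sketch with hypothetical table and column names:

  CREATE FUNCTION set_scientific_name_lower() RETURNS trigger AS $$
  BEGIN
    -- keep the derived column in sync with its source column on every write
    NEW."scientificName_lower" := lower(NEW."scientificName");
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER scientific_name_lower
    BEFORE INSERT OR UPDATE ON taxon_determination
    FOR EACH ROW EXECUTE PROCEDURE set_scientific_name_lower();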