Import steps

  • Steps marked planned: remain to be done (they were underlined in the original formatting)
  • The normalize step is the most complicated. We are trying to replace the XML normalization algorithm with a much simpler insert-ignore approach.

export .sql file from live DB connection (idempotent)

translate DB engines' SQL to PostgreSQL (idempotent)
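
The translation step can be sketched as text rewriting of the dump file. This is only a minimal illustration of the idea, not the pipeline's actual translator (which would have to handle far more dialect differences); the function name and the specific MySQL-to-PostgreSQL rules shown are assumptions:

```python
import re

def mysql_to_postgres(sql: str) -> str:
    """Tiny sketch of engine-SQL translation (MySQL -> PostgreSQL).

    Hypothetical: the real translation step is much more thorough; this
    only shows the flavor of the rewriting involved.
    """
    sql = sql.replace("`", '"')  # backtick-quoted identifiers -> SQL-standard double quotes
    sql = re.sub(r"\bAUTO_INCREMENT\b", "", sql, flags=re.I)  # PostgreSQL uses serial types instead
    sql = re.sub(r"\bTINYINT\(\d+\)", "smallint", sql, flags=re.I)
    sql = re.sub(r"\bDATETIME\b", "timestamp", sql, flags=re.I)
    return sql
```

Because the rewrite is a pure function of its input, re-running it on the same dump yields the same output, which is what makes the step idempotent.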

standardize CSVs to PostgreSQL-compatible format (idempotent)

  • Excel dialect; no ragged rows
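
A standardization pass along these lines can be written with the stdlib `csv` module. This is a hedged sketch, not the pipeline's actual standardizer: the function name is hypothetical, and the policy of padding short rows and truncating long ones to the header's width is an assumption about how ragged rows should be repaired:

```python
import csv, io

def standardize_csv(src_text: str, delimiter: str = ",") -> str:
    """Rewrite a CSV as strict Excel dialect with no ragged rows.

    Hypothetical sketch: rows shorter than the header are padded with
    empty strings and longer rows are truncated, so every output row
    has the header's column count.
    """
    rows = list(csv.reader(io.StringIO(src_text), delimiter=delimiter))
    if not rows:
        return ""
    width = len(rows[0])  # the header defines the column count
    out = io.StringIO()
    writer = csv.writer(out, dialect="excel")
    for row in rows:
        writer.writerow((row + [""] * width)[:width])
    return out.getvalue()
```

Rewriting the file in place with this function is idempotent: a second pass over already-standardized output changes nothing.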

install staging tables (idempotent)

  • README sections: Datasource setup > Install the staging tables
  1. run each SQL script (idempotent)
  2. create tables from CSV headers (idempotent)
  3. COPY FROM each standardized CSV (idempotent)
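
Sub-steps 2 and 3 together can be sketched as: derive an all-text table definition from the CSV header, then bulk-load the rows. Here sqlite3 stands in for PostgreSQL (the real step would `CREATE TABLE` and then `COPY FROM`); the function name is hypothetical, and dropping the table first is one simple way to make the combined step idempotent:

```python
import csv, io, sqlite3

def install_staging_table(conn, table, csv_text):
    """Create a staging table from the CSV header and load the rows.

    Hypothetical sketch with sqlite3 standing in for PostgreSQL.
    DROP TABLE IF EXISTS makes the step idempotent (re-runnable).
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = ", ".join(f'"{c}" text' for c in header)  # staging columns are all text
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')  # idempotence: start fresh each run
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" * len(header))
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
```

In PostgreSQL the load itself would be `COPY "table" FROM STDIN WITH (FORMAT csv, HEADER)` rather than row-by-row inserts.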

map to VegCore (idempotent)

rename staging table columns according to the VegCore mapping (idempotent)

  • also needed for datasource-general postprocessing step
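
The rename step can be made idempotent by skipping columns that have already been renamed. This sketch uses sqlite3 as a stand-in for PostgreSQL (whose equivalent statement is `ALTER TABLE … RENAME COLUMN … TO …`); the mapping entries and the function name are hypothetical examples, not the real VegCore mapping:

```python
import sqlite3

# Hypothetical fragment of a VegCore mapping: source column -> VegCore column.
VEGCORE_MAP = {"sci_name": "scientificName", "lat": "decimalLatitude"}

def rename_to_vegcore(conn, table, mapping=VEGCORE_MAP):
    """Rename staging-table columns to their VegCore names.

    Idempotent sketch: columns that are absent or already renamed are
    skipped, so re-running the step is a no-op.
    """
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    for src, dst in mapping.items():
        if src in existing and dst not in existing:
            conn.execute(f'ALTER TABLE "{table}" RENAME COLUMN "{src}" TO "{dst}"')
```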

datasource-specific postprocessing (idempotent)

planned: datasource-general postprocessing (idempotent)

  1. populate ID fields
  2. populate analytical derived fields
    1. run TNRS and left join TNRS fields
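
The "left join TNRS fields" sub-step might look like the following, with sqlite3 standing in for PostgreSQL. The table and column names, and the TNRS result rows, are fabricated for illustration only; the point is that a LEFT JOIN keeps every occurrence row, with NULL in the TNRS-derived fields where the name resolution produced no match:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- staging table with verbatim names (hypothetical columns)
CREATE TABLE occurrence (id INTEGER, scientificName TEXT);
INSERT INTO occurrence VALUES (1, 'Pinus ponderosa'), (2, 'Unknownus plantus');
-- the kind of output a TNRS run would produce (fabricated rows)
CREATE TABLE tnrs (verbatim TEXT, accepted TEXT);
INSERT INTO tnrs VALUES ('Pinus ponderosa', 'Pinus ponderosa');
""")
# LEFT JOIN keeps every occurrence; unmatched names get NULL derived fields
rows = conn.execute("""
    SELECT o.id, o.scientificName, t.accepted
    FROM occurrence o LEFT JOIN tnrs t ON t.verbatim = o.scientificName
    ORDER BY o.id
""").fetchall()
```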

planned: create occurrence_all view which joins the datasource's tables together (idempotent)

planned: create occurrence_VegCore view which includes just the VegCore columns in occurrence_all (idempotent)

planned: create public.occurrence_VegCore view which appends each datasource's occurrence_VegCore view (idempotent)
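
The planned append-all-datasources view is a UNION ALL over the per-datasource views. A minimal sketch, again with sqlite3 standing in for PostgreSQL and with hypothetical datasource/table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical per-datasource relations, already reduced to VegCore columns
CREATE TABLE ds1_occ (scientificName TEXT); INSERT INTO ds1_occ VALUES ('A');
CREATE TABLE ds2_occ (scientificName TEXT); INSERT INTO ds2_occ VALUES ('B');
-- analogue of public.occurrence_VegCore: append each datasource's view
CREATE VIEW occurrence_VegCore AS
    SELECT 'ds1' AS datasource, scientificName FROM ds1_occ
    UNION ALL
    SELECT 'ds2', scientificName FROM ds2_occ;
""")
```

Because `CREATE VIEW` stores only the query, re-creating the view (e.g. with `CREATE OR REPLACE VIEW` in PostgreSQL) is naturally idempotent.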

normalize (idempotent)

either of the following:

run VegBank XML normalization algorithm (a.k.a. column-based import/row-based import) (idempotent)

  • this is the very complicated step we are trying to replace with the steps marked planned: above
  • it includes several files/functions which are very long, marked (HUGE)
  • takes a very long time

planned: try inserting each input table into each output table (idempotent)

  • ignore duplicates
  • may be able to use the existing import mechanism, but it must skip the mapping to VegBIEN because the columns have already been renamed
  • runtime is much shorter than the XML normalization algorithm's
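
The insert-ignore idea can be demonstrated in a few lines. Here sqlite's `INSERT OR IGNORE` stands in for PostgreSQL's `INSERT … ON CONFLICT DO NOTHING`; the table and function names are hypothetical. Either form makes the step idempotent, since re-running it inserts nothing new:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxon (name TEXT PRIMARY KEY)")

def normalize_insert_ignore(conn, names):
    """Insert each input row into the output table, ignoring duplicates.

    Hypothetical sketch: the unique constraint defines what counts as a
    duplicate, and conflicting inserts are silently skipped.
    """
    conn.executemany("INSERT OR IGNORE INTO taxon VALUES (?)",
                     [(n,) for n in names])

normalize_insert_ignore(conn, ["Pinus", "Quercus", "Pinus"])
normalize_insert_ignore(conn, ["Pinus"])  # re-run: no duplicates added
```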