Project

General

Profile

Refactor-in-place (RFIP)

To import into the VegCore schema instead of VegBIEN, we would actually be better off with a faster approach which does not require column-based import

  1. DISTINCT ON (uniquify) the rows in the flattened VegCore table to get each smaller table ("refactor-in-place"), since the columns have already been renamed and just the normalized structure needs to be added
    • the effect is similar to continents separating: "Pangaea" (full_occurrence) splits apart into several "continents" (each color-coded category), which then split off "islands" (individual tables)
  2. link source-specific tables to VegCore template tables using PostgreSQL inheritance (a form of table partitioning)
    • this ensures that all VegCore fields are present in the source-specific table (some will be NULL)
    • querying the VegCore template table will then also query the source-specific tables that have been linked to it
    • source-specific columns shared by multiple datasources can be queried all at once by creating a custom subclass of a VegCore table which includes the additional columns, and linking the datasource tables to it
  • With the refactor-in-place approach, the time to switch to normalized VegCore will be proportional to the number of normalized tables2 rather than the number of columns to map3, so it makes sense to normalize the flat VegCore incrementally with the most important tables first (place, event, taxon_occurrence, taxon_determination, ...).
  • This approach is only possible with globally unique4 primary keys, because the refactored tables must be appendable. If you instead use serial integer pkeys, there will be inter-datasource collisions unless the data is inserted into a common output table5 to generate the keys (which column-based import does).
  • One major advantage is that each column does not need to be explicitly mapped to the destination schema. Source columns with no counterpart in VegCore will just be left as-is, and remain present in the final table8.

It would actually be interesting (and educational) to watch the refactoring happen in real-time, as the columns are renamed and the normalized tables appear. One could create a time-lapse screen-capture video of a datasource's ERD changing, and speed it up to the pace of a YouTube video (which could be used to advertise the database and import scripts).

2 each normalized table represents one DISTINCT ON to implement

3 column-based import uses one or more mappings for each column

4 they can be autogenerated within the datasource, but must contain a globally-unique prefix for each datasource

5 or at least use a shared serial sequence to generate the keys

8 this is especially helpful for ensuring that all the VegBank tables and columns are present in the VegCore/VegBIEN DB