- Table of contents
- Import steps
- export .sql file from live DB connection
- translate DB engines' SQL to PostgreSQL
- standardize CSVs to PostgreSQL-compatible format
- install staging tables
- map to VegCore
- rename staging table columns according to the VegCore mapping
- datasource-specific postprocessing
- planned: datasource-general postprocessing
- planned: create occurrence_all view which joins the datasource's tables together
- planned: create occurrence_VegCore view which includes just the VegCore columns in occurrence_all
- planned: create public.occurrence_VegCore view which appends each datasource's occurrence_VegCore view
- normalize
Import steps¶
- General import steps
- Specific import steps
- Adding ACAD--a Darwin Core datasource
- Adding Cyrille traits--a traits datasource
- Adding Madidi--a flat-file plots datasource
- Mapping a new table in VegBank--a SQL plots datasource
- Refreshing ACAD--a Darwin Core datasource
- Refreshing CVS--an MS Access plots datasource
- Refreshing VegBank--a SQL plots datasource
- Underlined steps remain to be done
- The normalize step is the most complicated. We are trying to replace the XML normalization algorithm with a much simpler insert-ignore.
export .sql file from live DB connection ¶
- README sections: Datasource setup > MySQL inputs > .sql exports
- examples:
inputs/SALVIAS/_MySQL/salvias_plots.{schema,data}.sql.make
- templates:
lib/MySQL.{schema,data}.sql.make
- scripts:
bin/my2pg_export
;bin/mysql_bien
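- a minimal sketch of the kind of export the .sql.make files drive, assuming mysqldump is on the PATH and using placeholder host/user/database names; the real bin/my2pg_export and bin/mysql_bien wrappers add their own connection handling and options:

```python
#!/usr/bin/env python3
"""Sketch: dump a live MySQL DB into separate schema and data .sql files."""
import subprocess

HOST, USER, DB = "db.example.org", "bien", "salvias_plots"  # placeholders

def dump(extra_args, out_path):
    # --no-data -> schema only; --no-create-info -> data only; -p prompts for a password
    cmd = ["mysqldump", "-h", HOST, "-u", USER, "-p"] + extra_args + [DB]
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

dump(["--no-data"], DB + ".schema.sql")       # analogous to *.schema.sql.make
dump(["--no-create-info"], DB + ".data.sql")  # analogous to *.data.sql.make
```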
translate DB engines' SQL to PostgreSQL ¶
- README sections: Datasource setup > Install the staging tables
- make targets:
inputs/input.Makefile
>`%.sql: _MySQL/%.sql`
,`%.sql: %.sql.make`
- scripts:
bin/my2pg
;bin/my2pg.data
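- the translation is essentially textual rewriting; below is an illustrative sketch of a few common MySQL-to-PostgreSQL rewrites (identifier quoting, type names, table options), not the actual rule set of bin/my2pg / bin/my2pg.data:

```python
import re

def my2pg_sketch(sql):
    """Illustrative subset of MySQL -> PostgreSQL SQL rewrites."""
    sql = sql.replace("`", '"')                          # backticks -> standard identifier quoting
    sql = re.sub(r"\bENGINE=\w+", "", sql, flags=re.I)   # drop MySQL table options
    sql = re.sub(r"\bAUTO_INCREMENT\b", "", sql, flags=re.I)
    sql = re.sub(r"\bUNSIGNED\b", "", sql, flags=re.I)
    sql = re.sub(r"\btinyint\(\d+\)", "smallint", sql, flags=re.I)
    sql = re.sub(r"\bint\(\d+\)", "integer", sql, flags=re.I)
    return sql

print(my2pg_sketch(
    "CREATE TABLE `plot` (`plot_id` int(11) UNSIGNED AUTO_INCREMENT) ENGINE=InnoDB;"))
```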
standardize CSVs to PostgreSQL-compatible format ¶
- Excel dialect; no ragged rows
- README sections: Datasource setup > Install the staging tables
- make targets:
inputs/input.Makefile
>`%/install: _always`
containing $(import_install_)
- scripts:
bin/csv2db
- utils:
lib/csvs.py
>reader_and_header()
,InputRewriter
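- a rough standalone approximation of this step: rewrite an arbitrary-dialect CSV into the Excel dialect, padding every row to the header width (in the codebase this is done streaming by reader_and_header() and InputRewriter in lib/csvs.py, wired up by bin/csv2db):

```python
import csv, sys

def standardize_csv(in_path, out_path):
    """Rewrite a CSV in Excel dialect, with every row padded/trimmed to the header width."""
    with open(in_path, newline="") as f:
        dialect = csv.Sniffer().sniff(f.read(64 * 1024))  # guess the input dialect
        f.seek(0)
        rows = csv.reader(f, dialect)
        header = next(rows)
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out, dialect=csv.excel)
            writer.writerow(header)
            for row in rows:
                # no ragged rows: pad short rows, trim extra cells
                writer.writerow((row + [""] * len(header))[:len(header)])

if __name__ == "__main__":
    standardize_csv(sys.argv[1], sys.argv[2])
```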
install staging tables ¶
- README sections: Datasource setup > Install the staging tables
- run each SQL script (idempotent)
- make targets:
inputs/input.Makefile
>sql/install
- scripts:
inputs/*/*.sql
- create tables from CSV headers (idempotent)
- make targets:
inputs/input.Makefile
>`%/install: _always`
containing $(import_install_)
- scripts:
bin/csv2db
- utils:
lib/sql_io.py
>import_csv()
- COPY FROM each standardized CSV (idempotent)
- utils:
lib/sql_io.py
>append_csv()
,cleanup_table()
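- roughly what import_csv()/append_csv() amount to, assuming psycopg2 as the driver and all-text staging columns (both assumptions; lib/sql_io.py uses the project's own DB layer and adds cleanup via cleanup_table()):

```python
import csv
import psycopg2  # assumed driver, for illustration only

def install_staging_table(conn, schema, table, csv_path):
    """Create an all-text staging table from the CSV header, then COPY the rows in.
    Re-runnable: the table is created only if missing and truncated before each load."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    cols = ", ".join('"%s" text' % c for c in header)
    with conn, conn.cursor() as cur:
        cur.execute('CREATE TABLE IF NOT EXISTS "%s"."%s" (%s)' % (schema, table, cols))
        cur.execute('TRUNCATE "%s"."%s"' % (schema, table))
        with open(csv_path, newline="") as f:
            cur.copy_expert(
                'COPY "%s"."%s" FROM STDIN WITH (FORMAT csv, HEADER true)'
                % (schema, table), f)
```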
map to VegCore ¶
- README sections: Datasource setup > Map each table's columns; Maintenance > VegCore data dictionary
- make targets:
inputs/input.Makefile
>%/.map.csv.last_cleanup
;mappings/Makefile
> targets containing VegCore.htm
- scripts:
bin/canon
;bin/translate_ci
;bin/redmine_synonyms
- mappings:
inputs/*/*/map.csv
;mappings/VegCore.thesaurus.csv
;mappings/VegCore.vocab.csv
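- map.csv's exact layout isn't shown here; assuming its first two columns pair a source column with its VegCore term, and that VegCore.vocab.csv lists the accepted terms in its first column, a rough consistency check in the spirit of bin/canon / bin/translate_ci (which, going by its name, matches case-insensitively) would be:

```python
import csv

def unmapped_terms(map_csv, vocab_csv):
    """Return the VegCore terms used in map.csv that are absent from the vocabulary
    (case-insensitive comparison)."""
    with open(vocab_csv, newline="") as f:
        vocab = {row[0].strip().lower() for row in csv.reader(f) if row}
    unmapped = []
    with open(map_csv, newline="") as f:
        for row in list(csv.reader(f))[1:]:  # skip the header row
            if len(row) >= 2 and row[1] and row[1].strip().lower() not in vocab:
                unmapped.append(row[1])
    return unmapped
```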
rename staging table columns according to the VegCore mapping ¶
- also needed for the datasource-general postprocessing step
- scripts:
lib/runscripts/table.run
>map_table()
- mappings:
inputs/*/*/map.csv
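- map_table() in lib/runscripts/table.run performs the rename in the database; a sketch of the same idea, under the same map.csv assumption as above (source column in the first field, VegCore term in the second), that just emits the ALTER TABLE statements:

```python
import csv

def rename_sql(schema, table, map_csv):
    """Emit ALTER TABLE ... RENAME COLUMN statements derived from a map.csv."""
    stmts = []
    with open(map_csv, newline="") as f:
        for row in list(csv.reader(f))[1:]:  # skip the header row
            if len(row) >= 2 and row[1] and row[1] != row[0]:
                stmts.append('ALTER TABLE "%s"."%s" RENAME COLUMN "%s" TO "%s";'
                             % (schema, table, row[0], row[1]))
    return "\n".join(stmts)
```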
datasource-specific postprocessing ¶
- make targets:
inputs/input.Makefile
>%.sql/run
,postprocess
,%/install
($(selfMake) $*/postprocess.sql/run)
- scripts:
inputs/*/*/postprocess.sql
;inputs/*/*/import
;inputs/*/import
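- in essence, the %.sql/run and postprocess targets run the datasource's postprocess.sql against its staging schema; a sketch using plain psql (DB name and file path are placeholders; the real targets go through $(selfMake) and the project's wrappers):

```python
import subprocess

def run_postprocess(db, sql_path):
    """Run one postprocess.sql in a single transaction, stopping on the first error.
    Safe to re-run as long as the script itself is written idempotently."""
    subprocess.run(
        ["psql", "--single-transaction", "-v", "ON_ERROR_STOP=1",
         "-d", db, "-f", sql_path],
        check=True)

# run_postprocess("vegbien", "inputs/<datasource>/<table>/postprocess.sql")
```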
planned: datasource-general postprocessing ¶
- needed for adding derived columns
- scripts:
schemas/VegCore/mk_derived
;lib/import.sh
>mk_derived()
;inputs/*/*/import
>mk_derived
- populate ID fields
- populate analytical derived fields
- run TNRS and left join TNRS fields
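- a sketch of the kind of statement mk_derived() would generate: add a derived column if missing, then populate it from existing VegCore fields. The column names below (scientificName, genus, specificEpithet) are an illustrative guess, not taken from schemas/VegCore/mk_derived:

```python
def mk_derived_sql(schema, table):
    """Illustrative derived-column population (ADD COLUMN IF NOT EXISTS needs
    PostgreSQL 9.6+)."""
    qualified = '"%s"."%s"' % (schema, table)
    return "\n".join([
        'ALTER TABLE %s ADD COLUMN IF NOT EXISTS "scientificName" text;' % qualified,
        'UPDATE %s SET "scientificName" = concat_ws(\' \', "genus", "specificEpithet")'
        ' WHERE "scientificName" IS NULL;' % qualified,
    ])
```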
planned: create occurrence_all view which joins the datasource's tables together ¶
- scripts:
inputs/*/occurrence_all/import
- the column names must all be globally unique for this to work (see the sketch below)
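- a sketch of the view this import script would create, assuming the datasource's tables can be chained with LEFT JOIN ... USING on shared key columns; the example schema, table, and key names are hypothetical, not taken from any datasource's actual schema:

```python
def occurrence_all_sql(schema, tables_and_keys):
    """Build CREATE VIEW occurrence_all as a left-join chain over the datasource's
    tables. tables_and_keys: [(first_table, None), (table, join_column), ...].
    SELECT * only works if the non-key column names are globally unique."""
    first = tables_and_keys[0][0]
    parts = ['"%s"."%s"' % (schema, first)]
    for table, key in tables_and_keys[1:]:
        parts.append('LEFT JOIN "%s"."%s" USING ("%s")' % (schema, table, key))
    return ('CREATE OR REPLACE VIEW "%s".occurrence_all AS\nSELECT *\nFROM %s;'
            % (schema, "\n".join(parts)))

# hypothetical example: plot-level table joined to its organism observations
print(occurrence_all_sql("datasource", [("plots", None), ("organisms", "plotID")]))
```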
planned: create occurrence_VegCore view which includes just the VegCore columns in occurrence_all ¶
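- a sketch of that per-datasource view; the schema name and VegCore column list are placeholders:

```python
def occurrence_vegcore_sql(schema, vegcore_cols):
    """Per-datasource view: just the VegCore columns of that datasource's occurrence_all."""
    cols = ", ".join('"%s"' % c for c in vegcore_cols)
    return ('CREATE OR REPLACE VIEW "%s"."occurrence_VegCore" AS SELECT %s '
            'FROM "%s".occurrence_all;' % (schema, cols, schema))
```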
planned: create public.occurrence_VegCore view which appends each datasource's occurrence_VegCore view ¶
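- and a sketch of the appending view, assuming every per-datasource occurrence_VegCore view exposes the same columns in the same order:

```python
def public_occurrence_vegcore_sql(schemas):
    """Cross-datasource view: append (UNION ALL) each datasource's occurrence_VegCore."""
    parts = ['SELECT * FROM "%s"."occurrence_VegCore"' % s for s in schemas]
    return ('CREATE OR REPLACE VIEW public."occurrence_VegCore" AS\n'
            + "\nUNION ALL\n".join(parts) + ";")
```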
normalize ¶
either of the following:
run VegBank XML normalization algorithm (a.k.a. column-based import/row-based import) ¶
- this is the very complicated step we are trying to replace with the underlined steps
- it includes several files/functions which are very long, marked (HUGE)
- takes a very long time
- README sections: Single datasource import, Full database import (HUGE)
- scripts:
bin/import_all
;bin/map (HUGE)
- make targets:
inputs/input.Makefile
>%/import
using$(import)
- utils:
lib/db_xml.py
>put_table()
,put() (HUGE)
;lib/sql_io.py
>put_table() (HUGE)
- mappings:
mappings/VegCore-VegBIEN.csv (HUGE)
;inputs/*/*/VegBIEN.csv
;inputs/*/*/test.xml.ref
planned: try inserting each input table into each output table ¶
- ignore duplicates
- may be able to use existing import mechanism, but it must ignore the mapping to VegBIEN because the columns have already been renamed
- the runtime is much less than that of XML normalization
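- a sketch of the planned insert-ignore using PostgreSQL's ON CONFLICT DO NOTHING (9.5+); the table and column names are placeholders for whichever staging and output tables line up:

```python
def insert_ignore_sql(in_table, out_table, columns):
    """Insert every input row into the output table, silently skipping rows that
    collide with an existing unique constraint (the 'ignore duplicates' part)."""
    col_list = ", ".join('"%s"' % c for c in columns)
    return ('INSERT INTO %s (%s)\nSELECT %s FROM %s\nON CONFLICT DO NOTHING;'
            % (out_table, col_list, col_list, in_table))

# hypothetical usage with placeholder names:
# print(insert_ignore_sql('"<datasource>"."<table>"', 'public."<output_table>"',
#                         ["<VegCore column>", "<VegCore column>"]))
```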