set up datasource folder¶
from README.TXT > Datasource setup
- add folder for datasource in inputs/$datasrc/:
    make inputs/$datasrc/add
    "cp" -f inputs/.NCBI/{Makefile,run,table.run} inputs/$datasrc/ # add new-style import files
    mkdir inputs/$datasrc/_src/
- place extract in inputs/$datasrc/_src/
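A quick sanity check can confirm the folder is complete before moving on to the mapping steps. The sketch below assumes only the layout that the commands above produce (Makefile, run, table.run, and a _src/ subdir). The function name check_datasrc_layout is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper (not a repository command): verify that a datasource
# folder contains the files copied from inputs/.NCBI/ and the _src/ subdir.
check_datasrc_layout() {
    dir=$1
    missing=0
    for f in Makefile run table.run; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing file: $dir/$f" >&2
            missing=1
        fi
    done
    if [ ! -d "$dir/_src" ]; then
        echo "missing dir: $dir/_src/" >&2
        missing=1
    fi
    # exit status 0 only when nothing is missing
    return $missing
}
```

Possible usage: check_datasrc_layout inputs/$datasrc && echo "layout OK"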
map to VegCore¶
from README.TXT > Datasource setup
map metadata¶
- add subdir:
    echo Source >>inputs/$datasrc/import_order.txt
    "cp" -f inputs/.NCBI/Source/{run,data.csv} inputs/$datasrc/Source/ # add new-style import files
    inputs/$datasrc/Source/run # create map spreadsheet
- fill out metadata in inputs/$datasrc/Source/map.csv
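Because >> appends unconditionally, repeating the echo step adds a second identical line to import_order.txt. A minimal sketch of a guarded append follows, assuming duplicate entries are unwanted (the source does not say how they are handled). add_to_import_order is a hypothetical helper name, not a repository command.

```shell
#!/bin/sh
# Hypothetical helper: append an entry to an import order file only if an
# identical line is not already present. Creates the file if it is missing.
add_to_import_order() {
    file=$1
    entry=$2
    # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
    if ! grep -qxF -- "$entry" "$file" 2>/dev/null; then
        echo "$entry" >>"$file"
    fi
}
```

Possible usage: add_to_import_order inputs/$datasrc/import_order.txt Source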
map each table¶
- add subdir:
    make inputs/$datasrc/$table/add
    "cp" -f inputs/.NCBI/nodes/run inputs/$datasrc/$table/ # add new-style import files
    echo $table >>inputs/$datasrc/import_order.txt
- extract flat files from the compressed extract, if applicable
- translate flat files to a supported format (CSV/TSV):
  - .xls:
      (cd inputs/$datasrc/_src/; /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xls)
- place extracted flat file(s) for the table in the table subdir
  - note that if the dataset consists only of flat files and all the flat files are used by a table subdir, the _src/ subdir will end up empty after the flat files have been moved
- rename files so their names don't contain any spaces
- if the header is repeated in each segment, standardize it:
  - check the headers:
      inputs/$datasrc/$table/run check_headers
  - if there is a "header mismatched" error, fix the header in the corresponding segment file
  - repeat until no errors
- install the staging table and create the map spreadsheet:
    inputs/$datasrc/$table/run
- for plots datasources, prevent column collisions upon left-join:
  - in inputs/$datasrc/$table/map.csv:
    - replace text ,* with text ,*$table-- where $table should be replaced with the actual value of that var
- map the columns in inputs/$datasrc/$table/map.csv to terms in the VegCore data dictionary
- rename staging table columns to VegCore:
    inputs/$datasrc/$table/run
- if any datasource-specific postprocessing is needed:
  - add postprocess.sql:
      "cp" -f inputs/VegBank/plot/postprocess.sql inputs/$datasrc/$table/
  - in postprocess.sql, modify the existing code to fit the datasource. remove any template code that doesn't apply.
  - run the postprocessing:
      inputs/$datasrc/$table/run
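Two of the manual steps above, renaming files that contain spaces and prefixing the mapped column names for plots datasources, can be scripted. The functions below are a sketch under stated assumptions, not repository tooling: the names rename_without_spaces and prefix_map_columns are hypothetical, underscores are an arbitrary choice of replacement character, and the table name is assumed to contain no characters special to sed (/, & or \).

```shell
#!/bin/sh
# Hypothetical helper: replace spaces with underscores in the names of
# regular files directly inside the given directory.
rename_without_spaces() {
    for src in "$1"/*" "*; do
        [ -f "$src" ] || continue          # also skips an unmatched glob
        base=${src##*/}
        new=$(printf '%s' "$base" | tr ' ' '_')
        if [ -e "$1/$new" ]; then
            echo "skipping $src: $1/$new already exists" >&2
            continue
        fi
        mv -- "$src" "$1/$new"
    done
}

# Hypothetical helper: replace every ",*" in a map spreadsheet with
# ",*<table>--". Running it twice prefixes twice, so run it only once.
prefix_map_columns() {
    map=$1
    table=$2
    sed "s/,\*/,*${table}--/g" "$map" >"$map.tmp" && mv "$map.tmp" "$map"
}
```

Possible usage: rename_without_spaces inputs/$datasrc/$table, then prefix_map_columns inputs/$datasrc/$table/map.csv $table after the map spreadsheet has been created.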
ensure all pre-import actions have been performed¶
inputs/$datasrc/run # runtime: 1 min ("1m0.511s")