set up datasource folder¶
from README.TXT > Datasource setup
- add folder for datasource in inputs/$datasrc/ :
make inputs/$datasrc/add
"cp" -f inputs/.NCBI/{Makefile,run,table.run} inputs/$datasrc/ # add new-style import files
mkdir inputs/$datasrc/_src/
- place extract in inputs/$datasrc/_src/
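The commands throughout this page assume the $datasrc shell variable has been set. A minimal sketch of the resulting folder layout, using a hypothetical datasource name ("FIA" is a placeholder, and mkdir stands in here for the repo's make target):

```shell
# hypothetical value: "FIA" stands in for a real datasource name
datasrc=FIA
# the setup step above produces this layout (sketched with mkdir
# in place of `make inputs/$datasrc/add`)
mkdir -p "inputs/$datasrc/_src"
ls -d "inputs/$datasrc/_src"
```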
map to VegCore¶
from README.TXT > Datasource setup
map metadata¶
- add subdir:
echo Source >>inputs/$datasrc/import_order.txt
"cp" -f inputs/.NCBI/Source/{run,data.csv} inputs/$datasrc/Source/ # add new-style import files
inputs/$datasrc/Source/run # create map spreadsheet
- fill out metadata in inputs/$datasrc/Source/map.csv
map each table¶
- add subdir:
make inputs/$datasrc/$table/add
"cp" -f inputs/.NCBI/nodes/run inputs/$datasrc/$table/ # add new-style import files
echo $table >>inputs/$datasrc/import_order.txt
- extract flat files from the compressed extract, if applicable
- translate flat files to a supported format (CSV/TSV):
- .xls:
(cd inputs/$datasrc/_src/; /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xls)
- place extracted flat file(s) for the table in the table subdir
- note that if the dataset consists only of flat files and all the flat files are used by a table subdir, the _src/ subdir will end up empty after the flat files have been moved
- rename files so their names don't contain any spaces
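The renaming can be scripted; a sketch using bash parameter expansion, with underscores chosen arbitrarily as the replacement character (the demo directory and filename are illustrations, not part of the original instructions):

```shell
# demo setup: a scratch dir with a space-containing filename (illustration only)
mkdir -p rename_demo && cd rename_demo
touch "plot data.csv"
# replace every space in each filename with an underscore
for f in *\ *; do
    [ -e "$f" ] || continue   # skip if no filenames contain spaces
    mv -- "$f" "${f// /_}"
done
ls
cd ..
```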
- if the header is repeated in each segment, standardize it:
- check the headers:
inputs/$datasrc/$table/run check_headers
- if there is a "header mismatched" error, fix the header in the corresponding segment file
- repeat until no errors
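A mismatched header can also be spotted by hand: print each segment's first line and look for more than one unique value. A sketch, assuming CSV segment files (the demo directory and its contents are illustrations, not part of the original instructions):

```shell
# demo setup: two segment files, one with a deviating header (illustration only)
mkdir -p header_demo && cd header_demo
printf 'id,name\n1,a\n' > seg1.csv
printf 'ID,Name\n2,b\n' > seg2.csv
# print each segment's header once; more than one unique line means a mismatch
for f in *.csv; do head -n1 "$f"; done | sort -u
cd ..
```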
- install the staging table and create the map spreadsheet:
inputs/$datasrc/$table/run
- for plots datasources, prevent column collisions upon left-join: in inputs/$datasrc/$table/map.csv , replace text ,* with ,*$table-- , where $table should be replaced with the actual value of that var
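The replacement can be done with sed. A sketch, assuming a hypothetical table named "plots" (the map.csv file name and the ,* pattern come from the step above; the demo directory and the sed invocation itself are illustrations, not from the original instructions):

```shell
table=plots   # hypothetical table name
# demo setup: a scratch dir with a one-row map.csv (illustration only)
mkdir -p sed_demo && cd sed_demo
printf 'plotCode,*\n' > map.csv
# prefix the wildcard output column with the table name to avoid collisions
sed -i "s/,\*/,*$table--/" map.csv
cat map.csv
cd ..
```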
- map the columns in inputs/$datasrc/$table/map.csv to terms in the VegCore data dictionary
- rename staging table columns to VegCore:
inputs/$datasrc/$table/run
- if any datasource-specific postprocessing is needed:
- add postprocess.sql :
"cp" -f inputs/VegBank/plot/postprocess.sql inputs/$datasrc/$table/
- in postprocess.sql , modify the existing code to fit the datasource; remove any template code that doesn't apply
- run the postprocessing:
inputs/$datasrc/$table/run
ensure all pre-import actions have been performed¶
inputs/$datasrc/run # runtime: 1 min ("1m0.511s")
→ see sample call graph for VegBank
left-join tables¶
not applicable to this datasource
check in new datasource¶
from README.TXT > Datasource setup
- commit & upload:
make inputs/$datasrc/add # place files under version control
svn di inputs/$datasrc/*/test.xml.ref # make sure the test outputs are correct
svn st # make sure all non-data files are under version control
svn ci -m "added inputs/$datasrc/"
make inputs/upload # check that changes are as expected
make inputs/upload live=1
- if you ran the mapping steps on the local machine, sync to vegbiendev:
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- download:
svn up
make inputs/download # check that changes are as expected
make inputs/download live=1
- set vars as above
- perform pre-import actions:
inputs/$datasrc/run
import to VegBIEN¶
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- set vars as above
- run column-based import: (from README.TXT > Single datasource refresh)
make inputs/$datasrc/reimport_scrub by_col=1 &
tail -150 inputs/$datasrc/*/logs/public.log.sql # view progress
- see README.TXT > Single datasource refresh > steps after reimport_scrub