Adding ACAD: a Darwin Core datasource¶
what is needed from the user¶
- extract (so we can go back to the raw data if needed)
- extracted flat file(s) that should be imported
- mappings to VegCore for any non-DwC columns
- code to create any source-specific derived columns
steps¶
underlined steps require user input (the other steps can be automated)
note that this is not necessarily the process that was used to add ACAD, but the one that would be used to add it again
connect to vegbiendev¶
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
alternatively, use a blank VM or the local machine
obtain extract¶
set vars¶
- start a subshell so the vars only affect these commands:
$0 # starts a nested copy of the current shell; the vars are discarded when you exit it
- set vars:
datasrc=ACAD table=Specimen
set up datasource folder¶
from README.TXT > Datasource setup
- add folder for datasource in inputs/$datasrc/:
make inputs/$datasrc/add
"cp" -f inputs/.NCBI/{Makefile,run,table.run} inputs/$datasrc/ # add new-style import files
mkdir inputs/$datasrc/_src/
- place extract in inputs/$datasrc/_src/
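for example, if the archive was downloaded to ~/Downloads/ (the filename here is hypothetical):
mv ~/Downloads/ACAD_DwC_archive.zip inputs/$datasrc/_src/ # use the provider's actual filename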
map to VegCore¶
from README.TXT > Datasource setup
map metadata¶
- add subdir:
echo Source >>inputs/$datasrc/import_order.txt
"cp" -f inputs/.NCBI/Source/{run,data.csv} inputs/$datasrc/Source/ # add new-style import files
inputs/$datasrc/Source/run # create map spreadsheet
- fill out metadata in inputs/$datasrc/Source/map.csv
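map.csv is a plain CSV, so any spreadsheet or text editor works; a sketch using the same LibreOffice binary invoked elsewhere in this workflow:
/Applications/LibreOffice.app/Contents/MacOS/soffice --calc inputs/$datasrc/Source/map.csv # open in LibreOffice Calc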
map each table¶
- add subdir:
make inputs/$datasrc/$table/add
"cp" -f inputs/.NCBI/nodes/run inputs/$datasrc/$table/ # add new-style import files
echo $table >>inputs/$datasrc/import_order.txt
- extract flat files from the compressed extract, if applicable
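for example, for a zip extract (archive name hypothetical, matching the earlier step):
(cd inputs/$datasrc/_src/; unzip ACAD_DwC_archive.zip) # use tar -xzf instead for a .tar.gz extract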
- translate flat files to a supported format (CSV/TSV). for .xls files:
(cd inputs/$datasrc/_src/; /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xls)
- place extracted flat file(s) for the table in the table subdir
- note that if the dataset consists only of flat files and all the flat files are used by a table subdir, the _src/ subdir will end up empty after the flat files have been moved
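a minimal sketch of that move, assuming the extracted flat files are CSVs:
mv inputs/$datasrc/_src/*.csv inputs/$datasrc/$table/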
- rename files so their names don't contain any spaces
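one way to do this in bash, replacing each space with an underscore:
shopt -s nullglob # skip the loop cleanly if no filenames contain spaces
for f in inputs/$datasrc/$table/*" "*; do mv "$f" "${f// /_}"; done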
- if the header is repeated in each segment, standardize it:
- check the headers:
inputs/$datasrc/$table/run check_headers
- if there is a "header mismatched" error, fix the header in the corresponding segment file
- repeat until no errors
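one way to fix a mismatched segment (segment filenames hypothetical): reuse the header line from a known-good segment and keep only the data rows of the bad one:
cd inputs/$datasrc/$table/
{ head -1 good_segment.csv; tail -n +2 bad_segment.csv; } >bad_segment.fixed.csv
mv bad_segment.fixed.csv bad_segment.csv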
- install the staging table and create the map spreadsheet:
inputs/$datasrc/$table/run
- for plots datasources, prevent column collisions upon left-join: in inputs/$datasrc/$table/map.csv, replace the text
,*
with the text
,*$table--
where $table should be replaced with the actual value of that var
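a sed sketch of that replacement (assumes GNU sed; the shell expands $table inside the double quotes):
sed -i "s/,\*/,*$table--/" inputs/$datasrc/$table/map.csv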
- map the columns in inputs/$datasrc/$table/map.csv to terms in the VegCore data dictionary
- rename staging table columns to VegCore:
inputs/$datasrc/$table/run
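for illustration (these rows are hypothetical, and the exact map.csv layout is whatever the run script generated): a standard DwC column such as decimalLatitude can map to the VegCore term of the same name, while a nonstandard column has to be pointed at its VegCore equivalent explicitly:
decimalLatitude,decimalLatitude
LatitudeDD,decimalLatitude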
- if any datasource-specific postprocessing is needed:
- add postprocess.sql:
"cp" -f inputs/VegBank/plot/postprocess.sql inputs/$datasrc/$table/
- in postprocess.sql, modify the existing code to fit the datasource. remove any template code that doesn't apply.
- run the postprocessing:
inputs/$datasrc/$table/run
ensure all pre-import actions have been performed¶
inputs/$datasrc/run # runtime: 1 min ("1m0.511s")
→ see sample call graph for VegBank
left-join tables¶
not applicable to this datasource
check in new datasource¶
from README.TXT > Datasource setup
- commit & upload:
make inputs/$datasrc/add # place files under version control
svn di inputs/$datasrc/*/test.xml.ref # make sure the test outputs are correct
svn st # make sure all non-data files are under version control
svn ci -m "added inputs/$datasrc/"
make inputs/upload # check that changes are as expected
make inputs/upload live=1
- if you ran the mapping steps on the local machine, sync to vegbiendev:
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- download:
svn up
make inputs/download # check that changes are as expected
make inputs/download live=1
- set vars as above
- perform pre-import actions:
inputs/$datasrc/run
import to VegBIEN¶
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- set vars as above
- run column-based import: (from README.TXT > Single datasource refresh)
make inputs/$datasrc/reimport_scrub by_col=1 &
tail -150 inputs/$datasrc/*/logs/public.log.sql # view progress
- see README.TXT > Single datasource refresh > steps after reimport_scrub
runtimes¶
reimport_scrub: 15 min ("14:45.73")