Adding Madidi--a flat-file plots datasource¶
what is needed from the user¶
- extract (so we can go back to the raw data if needed)
- extracted flat file(s) for each table
- mappings to VegCore
- code to create any source-specific derived columns
- left-join order of tables
- the sort-by column for the left-joined view
steps¶
underlined steps require user input (other steps can be automated)
note that this is not necessarily the process that was used to add Madidi, but the one that would be used to add it again
connect to vegbiendev¶
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
alternatively, use a blank VM or the local machine
obtain extract¶
- Peter Jorgensen e-mailed me this on 2013-1-14
set vars¶
- start a subshell so the vars only affect these commands:
$0
- set vars:
datasrc=Madidi table=Location # table=LocationObservation # table=IndividualObservation
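e.g. the complete pattern (exiting the subshell discards the vars; the "exit" here is illustrative):
$0
datasrc=Madidi table=Location
# ...run the remaining steps...
exit # discard the vars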
set up datasource folder¶
from README.TXT > Datasource setup
- add folder for datasource in inputs/$datasrc/ :
make inputs/$datasrc/add
"cp" -f inputs/.NCBI/{Makefile,run,table.run} inputs/$datasrc/ # add new-style import files
mkdir inputs/$datasrc/_src/
- place extract in inputs/$datasrc/_src/
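for example, if the e-mailed extract was saved in your home dir (the filename below is hypothetical):
mv ~/madidi_extract.zip inputs/$datasrc/_src/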
map to VegCore¶
from README.TXT > Datasource setup
map metadata¶
- add subdir:
mkdir inputs/$datasrc/Source/ # create the subdir first so the cp below has a target
echo Source >>inputs/$datasrc/import_order.txt
"cp" -f inputs/.NCBI/Source/{run,data.csv} inputs/$datasrc/Source/ # add new-style import files
inputs/$datasrc/Source/run # create map spreadsheet
- fill out metadata in inputs/$datasrc/Source/map.csv
map each table¶
- add subdir:
make inputs/$datasrc/$table/add
"cp" -f inputs/.NCBI/nodes/run inputs/$datasrc/$table/ # add new-style import files
echo $table >>inputs/$datasrc/import_order.txt
- extract flat files from the compressed extract, if applicable
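e.g. for a .zip extract (the archive name is hypothetical):
(cd inputs/$datasrc/_src/; unzip madidi_extract.zip)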
- translate flat files to a supported format (CSV/TSV). e.g. for .xls files:
(cd inputs/$datasrc/_src/; /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xls)
- place extracted flat file(s) for the table in the table subdir
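for example, assuming the table's flat files are the converted CSVs (adjust the glob to the actual filenames):
mv inputs/$datasrc/_src/*.csv inputs/$datasrc/$table/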
- note that if the dataset consists only of flat files and all the flat files are used by a table subdir, the _src/ subdir will end up empty after the flat files have been moved
- rename files so their names don't contain any spaces
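a minimal bash sketch that replaces spaces with underscores (review the resulting names before relying on them):
for f in inputs/$datasrc/$table/*' '*; do mv "$f" "${f// /_}"; done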
- if the header is repeated in each segment, standardize it:
- check the headers:
inputs/$datasrc/$table/run check_headers
- if there is a "header mismatched" error, fix the header in the corresponding segment file
- repeat until no errors
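to inspect a segment's header when fixing a mismatch (the segment filename is hypothetical):
head -1 inputs/$datasrc/$table/segment_2.csv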
- install the staging table and create the map spreadsheet:
inputs/$datasrc/$table/run
- for plots datasources, prevent column collisions upon left-join: in inputs/$datasrc/$table/map.csv, replace text ",*" with text ",*$table--", where $table should be replaced with the actual value of that var
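a sed sketch of this replacement (assumes GNU sed and that the subshell's $table var is set):
sed -i "s/,\*/,*$table--/g" inputs/$datasrc/$table/map.csv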
- map the columns in inputs/$datasrc/$table/map.csv to terms in the VegCore data dictionary
- rename staging table columns to VegCore:
inputs/$datasrc/$table/run
- if any datasource-specific postprocessing is needed:
- add postprocess.sql:
"cp" -f inputs/VegBank/plot/postprocess.sql inputs/$datasrc/$table/
- in postprocess.sql, modify the existing code to fit the datasource. remove any template code that doesn't apply.
- run the postprocessing:
inputs/$datasrc/$table/run
ensure all pre-import actions have been performed¶
inputs/$datasrc/run # runtime: 1 min ("1m0.511s")
→ see sample call graph for VegBank
left-join tables¶
from Left-joining a datasource > prevent column collisions by prepending the table name to each column name
- add subdir:
make inputs/$datasrc/'taxon_observation.**'/add
"cp" -f inputs/VegBank/'taxon_observation.**'/{run,postprocess.sql} inputs/$datasrc/'taxon_observation.**'/ # add new-style import files
echo 'taxon_observation.**' >>inputs/$datasrc/import_order.txt
- edit postprocess.sql:
- change the table names to those of the datasource
- set the sort_col (the first column in the view) to be a joined table's pkey
- install the view:
inputs/$datasrc/'taxon_observation.**'/run # fix bugs and repeat until it has a successful exit status
- prevent joined tables from also being imported (after the left-join above is successful)
# create a blank file named _no_import in each table subdir
for table in $(grep -vF -e Source -e 'taxon_observation.**' inputs/$datasrc/import_order.txt); do
"cp" -f inputs/FIA/COND/_no_import inputs/$datasrc/$table/
done
check in new datasource¶
from README.TXT > Datasource setup
- commit & upload:
make inputs/$datasrc/add # place files under version control
svn di inputs/$datasrc/*/test.xml.ref # make sure the test outputs are correct
svn st # make sure all non-data files are under version control
svn ci -m "added inputs/$datasrc/"
make inputs/upload # check that changes are as expected
make inputs/upload live=1
- if you ran the mapping steps on the local machine, sync to vegbiendev:
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- download:
svn up
make inputs/download # check that changes are as expected
make inputs/download live=1
- set vars as above
- perform pre-import actions:
inputs/$datasrc/run
import to VegBIEN¶
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- set vars as above
- run column-based import (from README.TXT > Single datasource refresh):
make inputs/$datasrc/reimport_scrub by_col=1 &
tail -150 inputs/$datasrc/*/logs/public.log.sql # view progress
- see README.TXT > Single datasource refresh > steps after reimport_scrub
runtimes¶
reimport_scrub: 1.5 h ("1:23:45.782916 sec")