Adding Madidi--a flat-file plots datasource¶
what is needed from the user¶
- extract (so we can go back to the raw data if needed)
- extracted flat file(s) for each table
- mappings to VegCore
- code to create any source-specific derived columns
- left-join order of tables
- the sort-by column for the left-joined view
steps¶
underlined steps require user input (other steps can be automated)
note that this is not necessarily the process that was used to add Madidi, but the one that would be used to add it again
connect to vegbiendev¶
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
alternatively, use a blank VM or the local machine
obtain extract¶
- Peter Jorgensen e-mailed me this on 2013-1-14
set vars¶
- start a subshell so the vars only affect these commands:
$0
- set vars:
datasrc=Madidi table=Location # table=LocationObservation # table=IndividualObservation
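e.g. the complete pattern (exiting the subshell discards the vars; the "exit" here is illustrative):
$0
datasrc=Madidi table=Location
# ...run the remaining steps...
exit # discard the vars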
set up datasource folder¶
from README.TXT > Datasource setup
- add folder for datasource in inputs/$datasrc/ :
make inputs/$datasrc/add
"cp" -f inputs/.NCBI/{Makefile,run,table.run} inputs/$datasrc/ # add new-style import files
mkdir inputs/$datasrc/_src/
- place extract in inputs/$datasrc/_src/
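for example, if the e-mailed extract was saved in your home dir (the filename below is hypothetical):
mv ~/madidi_extract.zip inputs/$datasrc/_src/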
map to VegCore¶
from README.TXT > Datasource setup
map metadata¶
- add subdir:
mkdir inputs/$datasrc/Source/ # create the subdir first so the cp below has a target
echo Source >>inputs/$datasrc/import_order.txt
"cp" -f inputs/.NCBI/Source/{run,data.csv} inputs/$datasrc/Source/ # add new-style import files
inputs/$datasrc/Source/run # create map spreadsheet
- fill out metadata in inputs/$datasrc/Source/map.csv
map each table¶
- add subdir:
make inputs/$datasrc/$table/add
"cp" -f inputs/.NCBI/nodes/run inputs/$datasrc/$table/ # add new-style import files
echo $table >>inputs/$datasrc/import_order.txt
- extract flat files from the compressed extract, if applicable
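e.g. for a .zip extract (the archive name is hypothetical):
(cd inputs/$datasrc/_src/; unzip madidi_extract.zip)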
- translate flat files to a supported format (CSV/TSV). e.g. for .xls files:
(cd inputs/$datasrc/_src/; /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xls)
- place extracted flat file(s) for the table in the table subdir
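for example, assuming the table's flat files are the converted CSVs (adjust the glob to the actual filenames):
mv inputs/$datasrc/_src/*.csv inputs/$datasrc/$table/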
- note that if the dataset consists only of flat files and all the flat files are used by a table subdir, the _src/ subdir will end up empty after the flat files have been moved
- rename files so their names don't contain any spaces
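a minimal bash sketch that replaces spaces with underscores (review the resulting names before relying on them):
for f in inputs/$datasrc/$table/*' '*; do mv "$f" "${f// /_}"; done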
- if the header is repeated in each segment, standardize it:
- check the headers:
inputs/$datasrc/$table/run check_headers
- if there is a "header mismatched" error, fix the header in the corresponding segment file
- repeat until no errors
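to inspect a segment's header when fixing a mismatch (the segment filename is hypothetical):
head -1 inputs/$datasrc/$table/segment_2.csv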
- install the staging table and create the map spreadsheet:
inputs/$datasrc/$table/run
- for plots datasources, prevent column collisions upon left-join: in inputs/$datasrc/$table/map.csv, replace text ",*" with text ",*$table--", where $table should be replaced with the actual value of that var
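a sed sketch of this replacement (assumes GNU sed and that the subshell's $table var is set):
sed -i "s/,\*/,*$table--/g" inputs/$datasrc/$table/map.csv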
- map the columns in inputs/$datasrc/$table/map.csv to terms in the VegCore data dictionary
- rename staging table columns to VegCore:
inputs/$datasrc/$table/run
- if any datasource-specific postprocessing is needed:
- add postprocess.sql:
"cp" -f inputs/VegBank/plot/postprocess.sql inputs/$datasrc/$table/
- in postprocess.sql, modify the existing code to fit the datasource. remove any template code that doesn't apply.
- run the postprocessing:
inputs/$datasrc/$table/run
ensure all pre-import actions have been performed¶
inputs/$datasrc/run # runtime: 1 min ("1m0.511s")
→ see sample call graph for VegBank
left-join tables¶
from Left-joining a datasource > prevent column collisions by prepending the table name to each column name
- add subdir:
make inputs/$datasrc/'taxon_observation.**'/add
"cp" -f inputs/VegBank/'taxon_observation.**'/{run,postprocess.sql} inputs/$datasrc/'taxon_observation.**'/ # add new-style import files
echo 'taxon_observation.**' >>inputs/$datasrc/import_order.txt
- edit postprocess.sql:
- change the table names to those of the datasource
- set the sort_col (the first column in the view) to be a joined table's pkey
- install the view:
inputs/$datasrc/'taxon_observation.**'/run # fix bugs and repeat until it has a successful exit status
- prevent joined tables from also being imported (after the left-join above is successful)
# create a blank file named _no_import in each table subdir
for table in $(grep -vF -e Source -e 'taxon_observation.**' inputs/$datasrc/import_order.txt); do
"cp" -f inputs/FIA/COND/_no_import inputs/$datasrc/$table/
done
check in new datasource¶
from README.TXT > Datasource setup
- commit & upload:
make inputs/$datasrc/add # place files under version control
svn di inputs/$datasrc/*/test.xml.ref # make sure the test outputs are correct
svn st # make sure all non-data files are under version control
svn ci -m "added inputs/$datasrc/"
make inputs/upload # check that changes are as expected
make inputs/upload live=1
- if you ran the mapping steps on the local machine, sync to vegbiendev:
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- download:
svn up
make inputs/download # check that changes are as expected
make inputs/download live=1
- set vars as above
- perform pre-import actions:
inputs/$datasrc/run
import to VegBIEN¶
- log in to vegbiendev:
ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
- set vars as above
- run column-based import (from README.TXT > Single datasource refresh):
make inputs/$datasrc/reimport_scrub by_col=1 &
tail -150 inputs/$datasrc/*/logs/public.log.sql # view progress
- see README.TXT > Single datasource refresh > steps after reimport_scrub
runtimes¶
reimport_scrub: 1.5 h ("1:23:45.782916 sec")