Adding Cyrille_traits--a traits datasource

what is needed from the user

  1. the original extract (so we can go back to the raw data if needed)
  2. extracted flat file(s) that should be imported
  3. mappings to VegCore
  4. code to create any source-specific derived columns

steps

underlined steps: user input needed (other steps can be automated)

1. connect to vegbiendev

ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i

alternatively, use a blank VM or the local machine

2. obtain extract

  • from Dropbox folder BIEN_traits_violle_Fall2014/

3. set vars

  1. start a subshell so the vars only affect these commands:
    $0
    
  2. set vars:
    datasrc=Cyrille_traits table=trait_observation
    

4. set up datasource folder

from README.TXT > Datasource setup

  1. add folder for datasource in inputs/$datasrc/:
    make inputs/$datasrc/add
    "cp" -f inputs/.NCBI/{Makefile,run,table.run} inputs/$datasrc/ # add new-style import files
    mkdir inputs/$datasrc/_src/
    
  2. place extract in inputs/$datasrc/_src/
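    • for example, if the Dropbox folder from step 2 was synced to ~/Dropbox/ (an assumed location; adjust the source path to wherever your copy of the extract actually is):
      cp -p ~/Dropbox/BIEN_traits_violle_Fall2014/* inputs/$datasrc/_src/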

5. map to VegCore

from README.TXT > Datasource setup

1. map metadata

  1. add subdir:
    echo Source >>inputs/$datasrc/import_order.txt
    "cp" -f inputs/.NCBI/Source/{run,data.csv} inputs/$datasrc/Source/ # add new-style import files
    inputs/$datasrc/Source/run # create map spreadsheet
    
  2. fill out metadata in inputs/$datasrc/Source/map.csv
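    • the map spreadsheet is plain CSV, so it can be filled out in any spreadsheet program or text editor; to review what it contains first:
      cat inputs/$datasrc/Source/map.csv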

2. map each table

  1. add subdir:
    make inputs/$datasrc/$table/add
    "cp" -f inputs/.NCBI/nodes/run inputs/$datasrc/$table/ # add new-style import files
    echo $table >>inputs/$datasrc/import_order.txt
    
  2. extract flat files from the compressed extract, if applicable (a combined example sketch for steps 2, 4, and 5 appears after this list)
  3. translate flat files to a supported format (CSV/TSV):
    • .xls:
      (cd inputs/$datasrc/_src/; /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to csv *.xls)
      
  4. place extracted flat file(s) for the table in the table subdir
    • note that if the dataset consists only of flat files and all the flat files are used by a table subdir, the _src/ subdir will end up empty after the flat files have been moved
  5. rename files so their names don't contain any spaces
  6. if the header is repeated in each segment, standardize it:
    1. check the headers:
      inputs/$datasrc/$table/run check_headers
      
    2. if there is a "header mismatched" error, fix the header in the corresponding segment file
    3. repeat until no errors
  7. install the staging table and create the map spreadsheet:
    inputs/$datasrc/$table/run
    
  8. for plots datasources, prevent column collisions upon left-join:
    in inputs/$datasrc/$table/map.csv, replace the text ,* with the text ,*$table-- (where $table should be replaced with the actual value of that var); see the example sketch after this list
  9. map the columns in inputs/$datasrc/$table/map.csv to terms in the VegCore data dictionary
  10. rename staging table columns to VegCore:
    inputs/$datasrc/$table/run
    
  11. if any datasource-specific postprocessing is needed:
    1. add postprocess.sql:
      "cp" -f inputs/VegBank/plot/postprocess.sql inputs/$datasrc/$table/
      
    2. in postprocess.sql, modify the existing code to fit the datasource. remove any template code that doesn't apply.
    3. run the postprocessing:
      inputs/$datasrc/$table/run
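
  example sketch for steps 2, 4, and 5 of the list above (unpack the extract, place the flat file(s), remove spaces from filenames); the archive and file names used here are hypothetical placeholders, so adjust them to match the actual extract:
    (cd inputs/$datasrc/_src/ && unzip extract.zip) # step 2: unpack the compressed extract
    mv inputs/$datasrc/_src/traits.csv inputs/$datasrc/$table/ # step 4: move the table's flat file(s) into the table subdir
    for f in inputs/$datasrc/$table/*" "*; do [ -e "$f" ] && mv "$f" "${f// /_}"; done # step 5: replace any spaces in filenames with _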
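
  example sketch for the find-and-replace in step 8 of the list above (not needed for this traits datasource, but shown for completeness); assumes GNU sed (on BSD/macOS sed, use -i '' instead), and a text editor's find-and-replace works just as well:
    sed -i "s/,\*/,*$table--/g" inputs/$datasrc/$table/map.csv # replace ,* with ,*$table-- throughout map.csv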
      

3. ensure all pre-import actions have been performed

inputs/$datasrc/run # runtime: 1 min ("1m0.511s")

→ see sample call graph for VegBank  

6. left-join tables

not applicable to this datasource

7. check in new datasource

from README.TXT > Datasource setup

  1. commit & upload:
    make inputs/$datasrc/add # place files under version control
    svn di inputs/$datasrc/*/test.xml.ref # make sure the test outputs are correct
    svn st # make sure all non-data files are under version control
    svn ci -m "added inputs/$datasrc/" 
    make inputs/upload # check that changes are as expected
    make inputs/upload live=1
    
  2. if you ran the mapping steps on the local machine, sync to vegbiendev:
    1. log in to vegbiendev:
      ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
      
    2. download:
      svn up
      make inputs/download # check that changes are as expected
      make inputs/download live=1
      
    3. set vars as above
    4. perform pre-import actions:
      inputs/$datasrc/run
      

8. import to VegBIEN

  1. log in to vegbiendev:
    ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
    
  2. set vars as above
  3. run column-based import (from README.TXT > Single datasource refresh):
    make inputs/$datasrc/reimport_scrub by_col=1 &
    tail -150 inputs/$datasrc/*/logs/public.log.sql # view progress
    
  4. see README.TXT > Single datasource refresh > steps after reimport_scrub