2011 working group Th BIEN Implementation

BIEN 3.0 requirements outline

  • people, prioritization

Core DB

  • design model (over next 3 weeks)
  • 1. redesign VegBank
    • people to consult: Bob Peet, Mike Lee, Nick Spencer, Brad Boyle, Steve Dolins
    • prepare preliminary create table and loading scripts (Aaron)
    • check that all elements supported in and from DwC (by end of Nov)
    • identify issues (by middle of Nov)
    • agree on revised data model (by early Dec)
    • generate and test database (Aaron)
  • redesign DwC or redesign VegBank to contain DwC
  • critical data in every herbarium database
  • extend DwC for BIEN
  • BIEN superset of VegX
  • worth extending VegX
  • cultivated missing from DwC
    • will increase quality of data
  • every herbarium has specimen desc: need explicit place in DwC
  • need cultivated and specimen fields from herbaria
  • 2. BIEN extension of DwC
    • people: Brad Boyle, Peter J., Barbara T.
    • revised DwC-BIEN (by early Dec)
  • 3. redesign VegX
    • people: Nick Spencer, also Brad Boyle
    • iterative process
      • issues in a test load -> solve and retry
    • preliminary slight revision based on issues raised at BIEN mtg (end of Jan 2012)
    • evaluate mapping to DwC, VegBank
  • mapping pipeline:
    • specimen data DwC -> VegBIEN (Aaron, Mike Lee, Brad Boyle)
      • by Mar/Apr
    • VegX -> VegBIEN (Aaron, Mike Lee, Nick Spencer)
      • create VegX -> VegBIEN script: widely reusable tool

Load data

  • Jan-Mar
  • plot data->VegX mapping tool (Aaron w/ Nick Spencer)
    • ask Mark, Jim about timing
    • NVS mapping tool
    • VegBranch
    • Access-based tools break when MS changes Access design b/c not backwards compatible 2003->2007->2010
      • all in Visual Basic
    • mapping process faster if build tool?
    • write paper on mapping tool (Brad Boyle w/ Aaron)
  • load plot data sources by Jun
  • specimen data->DwC (Aaron w/ Brad Boyle)
  • sequencing issue
  • mapping (nontrivial), coding up of pipeline (more involved than mapping)
  • FTP site for DwC site
  • make pipeline wherever possible rather than asking people for documents

Testing of loaded core db

  • by Jun/Jul

2. Get data

  • obtain old BIEN 2 data vs get new data (by end of Jan)
  • Brad Boyle providing data to Aaron
  • then create data source -> VegX scripts from existing loading scripts
  • sci prefer tweaks to BIEN 2 db
    • can work with what's there
  • use new data where possible; otherwise reuse original BIEN 2 data
  • geographic data
  • USDA data drills down into states, labels exotics
    • polygons are in GBIF as separate occurrences
  • get all missing metadata
    • VegBank, CVS (Bob Peet)
    • CTFS (Rick Condit)
    • FIA (Brad Boyle)
  • need new metadata in VegBank: observationType (a species occurrence)
  • Bob will track list of sources for occurrences

New data

  • by end of April
  • identify new sources
    • people: Brad Boyle, Brian Enquist, B. M., Barbara T., Bob Peet, Peter J., ...
  • search GIVD for additional new world databases (Bob Peet)

Source -> VegX loading scripts

  • Brad, Nick, Aaron develop loading scripts to VegX
  • start with DwC (work with Brad)
  • see metadata checklist at bottom of Brad's requirements doc
    • don't import a source until have all the metadata for that source
  • CSV dumps available
  • map to schema
  • how many sources of that data to coordinate
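
The rule above ("don't import a source until have all the metadata") could be enforced with a small gate in the loading scripts. This is a minimal sketch only: the actual checklist lives in Brad's requirements doc and is not reproduced here, so the field names in REQUIRED_METADATA are hypothetical placeholders.

```python
# Hypothetical checklist items; the real list is in the requirements doc.
REQUIRED_METADATA = ("source_name", "data_owner", "date_obtained")


def missing_metadata(source_metadata, required=REQUIRED_METADATA):
    """Return the checklist items that are absent or blank for this source."""
    return [
        key
        for key in required
        if not str(source_metadata.get(key) or "").strip()
    ]


def ready_to_import(source_metadata):
    """A source may be imported only when no checklist item is missing."""
    return not missing_metadata(source_metadata)


# Made-up example source with one blank item.
example_source = {"source_name": "example plots", "data_owner": "", "date_obtained": "2011-11-01"}
example_missing = missing_metadata(example_source)
```

The list of missing items can go straight into the request sent to the data provider.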

Open source VegX mapping tool from NVS

  • by Shirley (Nick Spencer's programmer)
  • by Jun 2012
  • mapping tool is a longer term goal
  • capture schema-schema mapping
  • Shirley will create new mappings for new use cases to VegX
  • NZ can't offer long-term support for mapping tool
    • but if open source it, then users can fix it
  • multiple groups funding mapping tool: does any group have a problem w/ sharing the tool?
  • choose platform for scripts
  • open source license for tool
  • convert to a language that can run under Linux
  • VegX is funnel to import data
  • mapping tool belongs in BioNC/informatics: it's a paper
  • journals happy to hear about tools to integrate tools with VegX
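
One way to "capture schema-schema mapping" is to keep the mapping as data rather than burying it in loading code, so the same script can be reused for other sources. The sketch below is an illustration, not the NVS tool's design: the source terms are real Darwin Core terms, but the target column names and the sample record are invented.

```python
# Source (DwC) term -> target column. Target names are hypothetical.
DWC_TO_TARGET = {
    "scientificName": "taxon_name",
    "stateProvince": "state_province",
    "decimalLatitude": "latitude",
    "decimalLongitude": "longitude",
}


def map_record(record, mapping):
    """Rename fields per the mapping.

    Returns (mapped, unmapped): unmapped holds source fields that have no
    entry in the mapping, so they are reported instead of silently dropped.
    """
    mapped, unmapped = {}, {}
    for field, value in record.items():
        if field in mapping:
            mapped[mapping[field]] = value
        else:
            unmapped[field] = value
    return mapped, unmapped


# Made-up record for illustration.
example = {
    "scientificName": "Quercus alba",
    "stateProvince": "North Carolina",
    "recordedBy": "A. Collector",
}
mapped, unmapped = map_record(example, DWC_TO_TARGET)
```

Returning the unmapped fields separately makes gaps in a mapping visible during a test load, which fits the "issues in a test load -> solve and retry" process.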

Validation

  • should faithfully represent goals of validation scripts
  • where to apply them?
  • separate validation step to get names from TNRS and added to field in table
  • new schema will be staging database
  • don't do taxon scrubbing yet
  • load into staging table and look for scrambled, corrupted data
  • don't validate mult. times
  • sometimes validation steps have errors, so keep original data for comparison
  • the process of normalization reveals flaws in the data and makes validation possible
  • don't just dump everything into flat file
  • add results to a log file so can bounce back to data provider
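
The points above (validate once in staging, keep the original data, log results for the data provider) might look roughly like the following. It is a sketch under assumptions: the row fields, the two checks, and the logger name are made up for illustration and are not the BIEN validation scripts.

```python
import logging

logger = logging.getLogger("staging_validation")  # hypothetical logger name


def has_taxon(row):
    """Pass if the row has a non-blank taxon field."""
    return bool((row.get("taxon") or "").strip())


def latitude_in_range(row):
    """Pass if latitude parses as a number between -90 and 90."""
    try:
        return -90.0 <= float(row.get("latitude")) <= 90.0
    except (TypeError, ValueError):
        return False


def validate_staged_rows(rows, checks, log):
    """Run every check once per row, leaving the original rows unmodified.

    Returns (passed, failed); failed holds (row_number, original_row,
    names_of_failed_checks) so results can be compared with the raw data.
    """
    passed, failed = [], []
    for number, row in enumerate(rows, start=1):
        # Each check gets a copy, so a buggy check cannot alter the original.
        problems = [name for name, check in checks if not check(dict(row))]
        if problems:
            failed.append((number, row, problems))
            log.warning("row %d failed %s; original=%r", number, ", ".join(problems), row)
        else:
            passed.append(row)
    return passed, failed


checks = [("has_taxon", has_taxon), ("latitude_in_range", latitude_in_range)]
# Made-up staging rows; the second has a scrambled latitude.
rows = [
    {"taxon": "Quercus alba", "latitude": "35.9"},
    {"taxon": "Pinus taeda", "latitude": "935.9"},
]
passed, failed = validate_staged_rows(rows, checks, logger)
```

Attaching a logging.FileHandler to the logger would send the warnings to a log file that can be passed back to the data provider.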

Publications

  • paper comparing BIEN 2, 3 (Steve Dolins, Brad Boyle?)
  • BIEN 3 white paper
  • overarching plot data model?
    • VegX paper enough?
  • VegX mapping tool paper (Brad Boyle, Aaron, Nick, ...)
  • BIEN science whitepaper (Brian Enquist)

Check-ins

  • bimonthly planning meetings/conference calls/web conferences
  • organized by Mark
  • Mark, Jim oversee devel process

Data infrastructure requirements

  • automate as many steps as possible:
    • data acquisition
    • validation pipeline
    • publishing data products
  • after 1 year, database shouldn't require maintenance

Data end product requirements

Misc

  • 2nd, 3rd polit div: filter for cultivated specimens: state, whether plant present in it
    • upper/lower political div not always filled in
      • don't get with DwC
    • flat file
    • TROPICOS: 1 line/specimen w/ locality info
    • matrix of political division x taxon w/ whether present in division
    • determining absence, not presence, of species w/ TROPICOS data
    • 12 million records
    • checklist for species, where
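
The division x taxon matrix could be built from a one-line-per-specimen flat file along these lines. This is a sketch with assumed field names ("taxon", "division") and made-up records. With about 12 million records a dense matrix would be mostly empty, so it stores a set of divisions per taxon instead. Records with a blank division are counted rather than guessed at, since that field is not always filled in.

```python
from collections import defaultdict


def presence_by_division(records):
    """Build taxon -> set of political divisions with at least one record.

    Returns (presence, skipped); skipped counts records that lack a taxon
    or a division and therefore cannot be placed in the matrix.
    """
    presence = defaultdict(set)
    skipped = 0
    for record in records:
        taxon = (record.get("taxon") or "").strip()
        division = (record.get("division") or "").strip()
        if not taxon or not division:
            skipped += 1
            continue
        presence[taxon].add(division)
    return presence, skipped


def is_recorded(presence, taxon, division):
    """True if at least one record places the taxon in the division."""
    return division in presence.get(taxon, set())


# Made-up records for illustration.
records = [
    {"taxon": "Quercus alba", "division": "North Carolina"},
    {"taxon": "Quercus alba", "division": "Virginia"},
    {"taxon": "Pinus taeda", "division": ""},
]
presence, skipped = presence_by_division(records)
```

A False result from is_recorded only means no record was found for that division; whether that counts as absence is a separate decision.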

BIEN workflow

  • XML schema validation before importing into schema db
    • e.g. is it valid DwC?
  • data -> DwC -> not valid -> don't accept data, send back to data provider
  • scripts write to log file
  • in BIEN 2, taxonomy, geography is raw data of all unique fields in TaxonDimension
  • meeting again Fr morning 9:30am
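
The accept/reject step in the workflow above can be sketched as a gate that runs before import. Python's standard library does not do XSD validation, so this example only checks that the XML is well formed and that a few required elements are present; it is a stand-in, and full schema validation would need a third-party library such as lxml. The required terms chosen here are an assumption, not a BIEN rule.

```python
import xml.etree.ElementTree as ET

DWC_NS = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core terms namespace
REQUIRED_TERMS = ("scientificName", "catalogNumber")  # assumed, for illustration


def check_submission(xml_text):
    """Return (accepted, messages) for one submitted XML document.

    messages lists the reasons for rejection, ready to be written to the
    log file and sent back to the data provider.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, [f"not well-formed XML: {err}"]
    messages = []
    for term in REQUIRED_TERMS:
        # Search all descendants for the namespaced element.
        if root.find(f".//{{{DWC_NS}}}{term}") is None:
            messages.append(f"missing dwc:{term}")
    return (not messages), messages


# Made-up documents for illustration.
good_xml = (
    f'<records xmlns:dwc="{DWC_NS}"><record>'
    "<dwc:scientificName>Quercus alba</dwc:scientificName>"
    "<dwc:catalogNumber>12345</dwc:catalogNumber>"
    "</record></records>"
)
bad_xml = (
    f'<records xmlns:dwc="{DWC_NS}"><record>'
    "<dwc:scientificName>Quercus alba</dwc:scientificName>"
    "</record></records>"
)
good_result = check_submission(good_xml)
bad_result = check_submission(bad_xml)
```

Only documents for which accepted is True would go on to the import step; the rest go back to the provider with their messages.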