2011 working group, Thursday: BIEN Implementation
BIEN 3.0 requirements outline
- people and prioritization
Core DB
- design the data model (over the next 3 weeks)
- 1. redesign VegBank
- people to consult: Bob Peet, Mike Lee, Nick Spencer, Brad Boyle, Steve Dolins
- prepare preliminary CREATE TABLE and loading scripts (Aaron)
- check that all elements are supported in and from DwC (by end of Nov)
- identify issues (by middle of Nov)
- agree on revised data model (by early Dec)
- generate and test database (Aaron)
- redesign DwC or redesign VegBank to contain DwC
- critical data in every herbarium database
- extend DwC for BIEN
- BIEN superset of VegX
- worth extending VegX
- cultivated missing from DwC
- will increase quality of data
- every herbarium has a specimen description: it needs an explicit place in DwC
- need cultivated and specimen fields from herbaria
- 2. BIEN extension of DwC
- people: Brad Boyle, Peter J., Barbara T.
- revised DwC-BIEN (by early Dec)
- 1. redesign VegX
- people: Nick Spencer, also Brad Boyle
- iterative process
- issues found in a test load -> solve and retry
- preliminary slight revision based on issues raised at the BIEN meeting (end of Jan 2012)
- evaluate mapping to DwC and VegBank
- mapping pipeline:
- specimen data DwC -> VegBIEN (Aaron, Mike Lee, Brad Boyle); see the mapping sketch after this list
- by Mar/Apr
- VegX -> VegBIEN (Aaron, Mike Lee, Nick Spencer)
- create VegX -> VegBIEN script: widely reusable tool
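The specimen data DwC -> VegBIEN step above is, at its core, a column-mapping pass. A minimal sketch of what such a script could look like, assuming hypothetical VegBIEN column names (the real correspondence comes out of the schema redesign above); the DwC terms are standard Darwin Core:

```python
import csv

# Hypothetical mapping from Darwin Core terms to VegBIEN column names;
# the real correspondence comes out of the schema redesign above.
DWC_TO_VEGBIEN = {
    "institutionCode": "datasource_name",
    "catalogNumber": "catalog_number",
    "scientificName": "verbatim_taxon_name",
    "decimalLatitude": "latitude",
    "decimalLongitude": "longitude",
    "country": "country",
    "stateProvince": "state_province",
}

def map_dwc_row(dwc_row):
    """Translate one Darwin Core record (a dict) into a VegBIEN-shaped dict."""
    return {vegbien: dwc_row.get(dwc, "") for dwc, vegbien in DWC_TO_VEGBIEN.items()}

# Example: stream a DwC CSV dump and emit VegBIEN-shaped rows.
with open("specimens_dwc.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(map_dwc_row(row))
```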
Load data
- Jan-Mar
- plot data->VegX mapping tool (Aaron w/ Nick Spencer)
- ask Mark, Jim about timing
- NVS mapping tool
- VegBranch
- Access-based tools break when MS changes the Access design b/c versions are not backwards compatible (2003 -> 2007 -> 2010)
- all in Visual Basic
- would the mapping process be faster if we built a tool?
- write paper on mapping tool (Brad Boyle w/ Aaron)
- load plot data sources by Jun
- specimen data->DwC (Aaron w/ Brad Boyle)
- sequencing issue
- mapping (nontrivial); coding up the pipeline (more involved than the mapping)
- FTP site for the DwC data
- build a pipeline wherever possible rather than asking people for documents (see the fetch sketch below)
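A minimal sketch of such an automated fetch step, using Python's standard ftplib; the host, directory, and file pattern are placeholders since the actual FTP site is still to be set up:

```python
from ftplib import FTP
from pathlib import Path

# Hypothetical host and directory; the actual FTP site for the DwC data
# is still to be decided (see the note above).
FTP_HOST = "ftp.example.org"
REMOTE_DIR = "/dwc"
LOCAL_DIR = Path("incoming_dwc")
LOCAL_DIR.mkdir(exist_ok=True)

with FTP(FTP_HOST) as ftp:
    ftp.login()                        # anonymous login
    ftp.cwd(REMOTE_DIR)
    for name in ftp.nlst():            # list remote files
        if name.lower().endswith(".csv"):
            with open(LOCAL_DIR / name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
```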
Testing of loaded core DB
- by Jun/Jul
2. Get data
- obtain old BIEN 2 data vs get new data (by end of Jan)
- Brad Boyle providing data to Aaron
- then create data source -> VegX scripts from existing loading scripts
- scientists prefer tweaks to the BIEN 2 DB
- can work with what's there
- use new data where possible; otherwise reuse original BIEN 2 data
- geographic data
- USDA data drills down into states, labels exotics
- polygons are in GBIF as separate occurrences
- get all missing metadata
- VegBank, CVS (Bob Peet)
- CTFS (Rick Condit)
- FIA (Brad Boyle)
- need new metadata in VegBank: observationType (a species occurrence)
- Bob will track list of sources for occurrences
New data
- by end of April
- identify new sources
- people: Brad Boyle, Brian Enquist, B. M., Barbara T., Bob Peet, Peter J., ...
- search GIVD for additional New World databases (Bob Peet)
Source -> VegX loading scripts
- Brad, Nick, Aaron develop loading scripts to VegX
- start with DwC (work with Brad)
- see metadata checklist at bottom of Brad's requirements doc
- don't import a source until all of its metadata is in hand (see the checklist sketch after this list)
- CSV dumps available
- map to schema
- how many sources of that data are there to coordinate?
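A minimal sketch of the metadata gate described above; the checklist fields are placeholders standing in for the real checklist at the bottom of Brad's requirements doc:

```python
# Placeholder checklist fields; the real checklist is at the bottom of
# Brad's requirements doc and should replace this list.
REQUIRED_METADATA = ["source_name", "contact", "citation", "license", "date_received"]

def missing_metadata(metadata):
    """Return the checklist fields that are absent or empty for a source."""
    return [field for field in REQUIRED_METADATA if not metadata.get(field)]

def load_source(metadata, csv_path):
    """Refuse to load a CSV dump until every checklist item is filled in."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(f"refusing to load {csv_path}: missing metadata {missing}")
    # ... map the CSV dump to VegX here ...
```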
Open source VegX mapping tool from NVS
- by Shirley (Nick Spencer's programmer)
- by Jun 2012
- mapping tool is a longer term goal
- capture schema-schema mapping
- Shirley will create new mappings for new use cases to VegX
- NZ can't offer long-term support for mapping tool
- but if we open-source it, then users can fix it
- multiple groups are funding the mapping tool: does any group have a problem w/ sharing the tool?
- choose platform for scripts
- open source license for tool
- convert to a language that can run under Linux
- VegX is funnel to import data
- mapping tool belongs in BioNC/informatics: it's a paper
- journals are happy to hear about tools that integrate with VegX
Validation
- should faithfully represent goals of validation scripts
- where to apply them?
- separate validation step to get names from TNRS and add them to a field in the table
- new schema will be staging database
- don't do taxon scrubbing yet
- load into staging table and look for scrambled, corrupted data
- don't validate multiple times
- sometimes validation steps have errors, so keep original data for comparison
- the process of normalization reveals flaws in the data and makes validation possible
- don't just dump everything into flat file
- add results to a log file so issues can be bounced back to the data provider (see the sketch below)
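A minimal sketch of this kind of staging check, assuming hypothetical column and file names; it only flags problems in a log and never alters the original rows:

```python
import csv
import logging

# Write problems to a log file so they can be bounced back to the data
# provider; the file name and column names below are placeholders.
logging.basicConfig(filename="staging_validation.log", level=logging.WARNING,
                    format="%(asctime)s %(message)s")

def check_row(row_num, row):
    """Flag obviously scrambled values; the original row is never modified."""
    problems = []
    lat, lon = row.get("latitude", ""), row.get("longitude", "")
    try:
        if lat and not -90 <= float(lat) <= 90:
            problems.append(f"latitude out of range: {lat}")
        if lon and not -180 <= float(lon) <= 180:
            problems.append(f"longitude out of range: {lon}")
    except ValueError:
        problems.append(f"non-numeric coordinates: {lat!r}, {lon!r}")
    if not row.get("scientificName", "").strip():
        problems.append("missing scientific name")
    for problem in problems:
        logging.warning("row %d: %s", row_num, problem)
    return not problems

with open("staging_dump.csv", newline="", encoding="utf-8") as f:
    bad = sum(not check_row(i, row) for i, row in enumerate(csv.DictReader(f), start=1))
    print(f"{bad} rows flagged; see staging_validation.log")
```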
Publications
- paper comparing BIEN 2, 3 (Steve Dolins, Brad Boyle?)
- BIEN 3 white paper
- overarching plot data model?
- VegX paper enough?
- VegX mapping tool paper (Brad Boyle, Aaron, Nick, ...)
- BIEN science whitepaper (Brian Enquist)
Check-ins
- bimonthly planning meetings/conference calls/web conferences
- organized by Mark
- Mark and Jim oversee the development process
Data infrastructure requirements
- automate as many steps as possible:
- data acquisition
- validation pipeline
- publishing data products
- after 1 year, database shouldn't require maintenance
Data end product requirements
Misc
- 2nd and 3rd political divisions: filter for cultivated specimens: state, and whether the plant is present in it
- upper/lower political divisions not always filled in
- these don't come with DwC
- flat file
- TROPICOS: 1 line per specimen w/ locality info
- matrix of political division x taxon w/ whether the taxon is present in the division (see the sketch below)
- determining absence, not presence, of species w/ TROPICOS data
- 12 million records
- a checklist of which species occur where
- upper/lower political divisions not always filled in
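A minimal sketch of building such a presence matrix from specimen records; the column names are placeholders, not the actual TROPICOS export layout, and it can only show presence, so absence has to be inferred, as noted above:

```python
import csv
from collections import defaultdict

# Build a political-division x taxon presence matrix from specimen records;
# the column names are placeholders, not the actual TROPICOS export layout.
presence = defaultdict(set)            # (country, state) -> taxa recorded there

with open("specimens.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        division = (row.get("country", ""), row.get("stateProvince", ""))
        taxon = row.get("scientificName", "")
        if taxon and any(division):
            presence[division].add(taxon)

def recorded(division, taxon):
    """True if the taxon was recorded in the division; a missing record only
    suggests absence, as noted above for the checklist-style data."""
    return taxon in presence.get(division, set())
```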
BIEN workflow
- XML schema validation before importing into the schema DB (see the sketch at the end of this section)
- e.g. is it valid DwC?
- data -> DwC -> not valid -> don't accept data, send back to data provider
- scripts write to log file
- in BIEN 2, taxonomy and geography are raw data from all unique fields in TaxonDimension
- meeting again Friday morning, 9:30am
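For the schema-validation step at the top of this section, a minimal sketch using the third-party lxml library; the schema and document paths are placeholders, and validation errors are written to a log so they can go back to the data provider:

```python
from lxml import etree   # third-party; assumed available

# Hypothetical schema and document paths; invalid documents are rejected
# before import and the errors are logged for the data provider.
SCHEMA_PATH = "vegx.xsd"
DOC_PATH = "incoming/plot_data.xml"

schema = etree.XMLSchema(etree.parse(SCHEMA_PATH))
doc = etree.parse(DOC_PATH)

if schema.validate(doc):
    print("schema-valid: proceed with import into the staging DB")
else:
    with open("schema_validation.log", "a", encoding="utf-8") as log:
        for err in schema.error_log:
            log.write(f"{DOC_PATH}:{err.line}: {err.message}\n")
    print("rejected: send schema_validation.log back to the data provider")
```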