2011 working group, Thursday: BIEN Implementation
BIEN 3.0 requirements outline
- people and prioritization
Core DB
- design the data model (over the next 3 weeks)
- 1. redesign VegBank
- people to consult: Bob Peet, Mike Lee, Nick Spencer, Brad Boyle, Steve Dolins
- prepare preliminary CREATE TABLE and loading scripts (Aaron)
- check that all elements are supported in and from DwC (by end of Nov)
- identify issues (by middle of Nov)
- agree on revised data model (by early Dec)
- generate and test database (Aaron)
- redesign DwC or redesign VegBank to contain DwC
- critical data in every herbarium database
- extend DwC for BIEN
- BIEN superset of VegX
- worth extending VegX
- cultivated missing from DwC
- will increase quality of data
- every herbarium has a specimen description: it needs an explicit place in DwC
- need cultivated and specimen fields from herbaria
- 2. BIEN extension of DwC
- people: Brad Boyle, Peter J., Barbara T.
- revised DwC-BIEN (by early Dec)
- 1. redesign VegX
- people: Nick Spencer, also Brad Boyle
- iterative process
- issues found in a test load -> solve and retry
- preliminary slight revision based on issues raised at the BIEN meeting (end of Jan 2012)
- evaluate mapping to DwC and VegBank
- mapping pipeline:
- specimen data DwC -> VegBIEN (Aaron, Mike Lee, Brad Boyle); see the mapping sketch after this list
- by Mar/Apr
- VegX -> VegBIEN (Aaron, Mike Lee, Nick Spencer)
- create VegX -> VegBIEN script: widely reusable tool
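The specimen data DwC -> VegBIEN step above is, at its core, a column-mapping pass. A minimal sketch of what such a script could look like, assuming hypothetical VegBIEN column names (the real correspondence comes out of the schema redesign above); the DwC terms are standard Darwin Core:

```python
import csv

# Hypothetical mapping from Darwin Core terms to VegBIEN column names;
# the real correspondence comes out of the schema redesign above.
DWC_TO_VEGBIEN = {
    "institutionCode": "datasource_name",
    "catalogNumber": "catalog_number",
    "scientificName": "verbatim_taxon_name",
    "decimalLatitude": "latitude",
    "decimalLongitude": "longitude",
    "country": "country",
    "stateProvince": "state_province",
}

def map_dwc_row(dwc_row):
    """Translate one Darwin Core record (a dict) into a VegBIEN-shaped dict."""
    return {vegbien: dwc_row.get(dwc, "") for dwc, vegbien in DWC_TO_VEGBIEN.items()}

# Example: stream a DwC CSV dump and emit VegBIEN-shaped rows.
with open("specimens_dwc.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(map_dwc_row(row))
```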
Load data
- Jan-Mar
- plot data->VegX mapping tool (Aaron w/ Nick Spencer)
- ask Mark, Jim about timing
- NVS mapping tool
- VegBranch
- Access-based tools break when MS changes the Access design b/c versions are not backwards compatible (2003 -> 2007 -> 2010)
- all in Visual Basic
- would the mapping process be faster if we built a tool?
- write paper on mapping tool (Brad Boyle w/ Aaron)
- load plot data sources by Jun
- specimen data->DwC (Aaron w/ Brad Boyle)
- sequencing issue
- mapping (nontrivial); coding up the pipeline (more involved than the mapping)
- FTP site for the DwC data
- build a pipeline wherever possible rather than asking people for documents (see the fetch sketch below)
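A minimal sketch of such an automated fetch step, using Python's standard ftplib; the host, directory, and file pattern are placeholders since the actual FTP site is still to be set up:

```python
from ftplib import FTP
from pathlib import Path

# Hypothetical host and directory; the actual FTP site for the DwC data
# is still to be decided (see the note above).
FTP_HOST = "ftp.example.org"
REMOTE_DIR = "/dwc"
LOCAL_DIR = Path("incoming_dwc")
LOCAL_DIR.mkdir(exist_ok=True)

with FTP(FTP_HOST) as ftp:
    ftp.login()                        # anonymous login
    ftp.cwd(REMOTE_DIR)
    for name in ftp.nlst():            # list remote files
        if name.lower().endswith(".csv"):
            with open(LOCAL_DIR / name, "wb") as out:
                ftp.retrbinary(f"RETR {name}", out.write)
```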
Testing of loaded core DB
- by Jun/Jul
2. Get data
- obtain old BIEN 2 data vs get new data (by end of Jan)
- Brad Boyle providing data to Aaron
- then create data source -> VegX scripts from existing loading scripts
- scientists prefer tweaks to the BIEN 2 DB
- can work with what's there
- use new data where possible; otherwise reuse original BIEN 2 data
- geographic data
- USDA data drills down into states, labels exotics
- polygons are in GBIF as separate occurrences
- get all missing metadata
- VegBank, CVS (Bob Peet)
- CTFS (Rick Condit)
- FIA (Brad Boyle)
- need new metadata in VegBank: observationType (a species occurrence)
- Bob will track list of sources for occurrences
New data
- by end of April
- identify new sources
- people: Brad Boyle, Brian Enquist, B. M., Barbara T., Bob Peet, Peter J., ...
- search GIVD for additional New World databases (Bob Peet)
Source -> VegX loading scripts
- Brad, Nick, Aaron develop loading scripts to VegX
- start with DwC (work with Brad)
- see metadata checklist at bottom of Brad's requirements doc
- don't import a source until all of its metadata is in hand (see the checklist sketch after this list)
- CSV dumps available
- map to schema
- how many sources of that data are there to coordinate?
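A minimal sketch of the metadata gate described above; the checklist fields are placeholders standing in for the real checklist at the bottom of Brad's requirements doc:

```python
# Placeholder checklist fields; the real checklist is at the bottom of
# Brad's requirements doc and should replace this list.
REQUIRED_METADATA = ["source_name", "contact", "citation", "license", "date_received"]

def missing_metadata(metadata):
    """Return the checklist fields that are absent or empty for a source."""
    return [field for field in REQUIRED_METADATA if not metadata.get(field)]

def load_source(metadata, csv_path):
    """Refuse to load a CSV dump until every checklist item is filled in."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(f"refusing to load {csv_path}: missing metadata {missing}")
    # ... map the CSV dump to VegX here ...
```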
Open source VegX mapping tool from NVS
- by Shirley (Nick Spencer's programmer)
- by Jun 2012
- mapping tool is a longer term goal
- capture schema-schema mapping
- Shirley will create new mappings for new use cases to VegX
- NZ can't offer long-term support for mapping tool
- but if we open-source it, then users can fix it
- multiple groups are funding the mapping tool: does any group have a problem w/ sharing the tool?
- choose platform for scripts
- open source license for tool
- convert to a language that can run under Linux
- VegX is funnel to import data
- mapping tool belongs in BioNC/informatics: it's a paper
- journals are happy to hear about tools that integrate with VegX
Validation
- should faithfully represent goals of validation scripts
- where to apply them?
- separate validation step to get names from TNRS and add them to a field in the table
- new schema will be staging database
- don't do taxon scrubbing yet
- load into staging table and look for scrambled, corrupted data
- don't validate multiple times
- sometimes validation steps have errors, so keep original data for comparison
- the process of normalization reveals flaws in the data and makes validation possible
- don't just dump everything into flat file
- add results to a log file so issues can be bounced back to the data provider (see the sketch below)
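A minimal sketch of this kind of staging check, assuming hypothetical column and file names; it only flags problems in a log and never alters the original rows:

```python
import csv
import logging

# Write problems to a log file so they can be bounced back to the data
# provider; the file name and column names below are placeholders.
logging.basicConfig(filename="staging_validation.log", level=logging.WARNING,
                    format="%(asctime)s %(message)s")

def check_row(row_num, row):
    """Flag obviously scrambled values; the original row is never modified."""
    problems = []
    lat, lon = row.get("latitude", ""), row.get("longitude", "")
    try:
        if lat and not -90 <= float(lat) <= 90:
            problems.append(f"latitude out of range: {lat}")
        if lon and not -180 <= float(lon) <= 180:
            problems.append(f"longitude out of range: {lon}")
    except ValueError:
        problems.append(f"non-numeric coordinates: {lat!r}, {lon!r}")
    if not row.get("scientificName", "").strip():
        problems.append("missing scientific name")
    for problem in problems:
        logging.warning("row %d: %s", row_num, problem)
    return not problems

with open("staging_dump.csv", newline="", encoding="utf-8") as f:
    bad = sum(not check_row(i, row) for i, row in enumerate(csv.DictReader(f), start=1))
    print(f"{bad} rows flagged; see staging_validation.log")
```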
Publications
- paper comparing BIEN 2, 3 (Steve Dolins, Brad Boyle?)
- BIEN 3 white paper
- overarching plot data model?
- VegX paper enough?
- VegX mapping tool paper (Brad Boyle, Aaron, Nick, ...)
- BIEN science whitepaper (Brian Enquist)
Check-ins
- bimonthly planning meetings/conference calls/web conferences
- organized by Mark
- Mark and Jim oversee the development process
Data infrastructure requirements
- automate as many steps as possible:
- data acquisition
- validation pipeline
- publishing data products
- after 1 year, database shouldn't require maintenance
Data end product requirements
Misc
- 2nd and 3rd political divisions: filter for cultivated specimens: state, and whether the plant is present in it
- upper/lower political divisions not always filled in
- these don't come with DwC
- flat file
- TROPICOS: 1 line per specimen w/ locality info
- matrix of political division x taxon w/ whether the taxon is present in the division (see the sketch below)
- determining absence, not presence, of species w/ TROPICOS data
- 12 million records
- a checklist of which species occur where
- upper/lower political divisions not always filled in
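A minimal sketch of building such a presence matrix from specimen records; the column names are placeholders, not the actual TROPICOS export layout, and it can only show presence, so absence has to be inferred, as noted above:

```python
import csv
from collections import defaultdict

# Build a political-division x taxon presence matrix from specimen records;
# the column names are placeholders, not the actual TROPICOS export layout.
presence = defaultdict(set)            # (country, state) -> taxa recorded there

with open("specimens.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        division = (row.get("country", ""), row.get("stateProvince", ""))
        taxon = row.get("scientificName", "")
        if taxon and any(division):
            presence[division].add(taxon)

def recorded(division, taxon):
    """True if the taxon was recorded in the division; a missing record only
    suggests absence, as noted above for the checklist-style data."""
    return taxon in presence.get(division, set())
```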
BIEN workflow
- XML schema validation before importing into the schema DB (see the sketch at the end of this section)
- e.g. is it valid DwC?
- data -> DwC -> not valid -> don't accept data, send back to data provider
- scripts write to log file
- in BIEN 2, taxonomy and geography are raw data from all unique fields in TaxonDimension
- meeting again Friday morning, 9:30am
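For the schema-validation step at the top of this section, a minimal sketch using the third-party lxml library; the schema and document paths are placeholders, and validation errors are written to a log so they can go back to the data provider:

```python
from lxml import etree   # third-party; assumed available

# Hypothetical schema and document paths; invalid documents are rejected
# before import and the errors are logged for the data provider.
SCHEMA_PATH = "vegx.xsd"
DOC_PATH = "incoming/plot_data.xml"

schema = etree.XMLSchema(etree.parse(SCHEMA_PATH))
doc = etree.parse(DOC_PATH)

if schema.validate(doc):
    print("schema-valid: proceed with import into the staging DB")
else:
    with open("schema_validation.log", "a", encoding="utf-8") as log:
        for err in schema.error_log:
            log.write(f"{DOC_PATH}:{err.line}: {err.message}\n")
    print("rejected: send schema_validation.log back to the data provider")
```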