2011 working group Mo BIEN workflow¶
- in depth look at workflow diagram
- range of input sources: provider
- infrastructure around BIEN
- diagram derived from last year's working group
- summarize processes in square boxes
- ingest data -> staging area (db or network location)
- staging data gets taxon scrubbed and geovalidated
- automated pipeline
- integrate data with existing datasets
- connect data to external data sources
- e.g. trait data
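The ingest → stage → scrub/geovalidate → integrate flow above can be sketched as a minimal staged pipeline. All function and field names here are illustrative assumptions, not the actual BIEN codebase:

```python
# Minimal sketch of the staged ingest pipeline described above.
# All names are hypothetical; the real BIEN pipeline is not shown in these notes.

def ingest(raw_records):
    """Load provider records into the staging area unchanged."""
    return [dict(rec, _stage="staged") for rec in raw_records]

def scrub_taxa(records, resolve_name):
    """Taxon-scrub each record via a name-resolution callable (e.g. a TNRS lookup)."""
    for rec in records:
        rec["accepted_name"] = resolve_name(rec["verbatim_name"])
    return records

def geovalidate(records):
    """Flag records whose coordinates fall outside valid lat/long ranges."""
    for rec in records:
        lat, lon = rec.get("lat"), rec.get("lon")
        rec["geo_valid"] = (
            lat is not None and lon is not None
            and -90 <= lat <= 90 and -180 <= lon <= 180
        )
    return records

def integrate(records, core_db):
    """Append only geovalidated records to the integrated dataset."""
    core_db.extend(r for r in records if r["geo_valid"])
    return core_db
```

Chained together: `integrate(geovalidate(scrub_taxa(ingest(raw), resolver)), core)` — the point being that each stage is separable, so staged data stays traceable as it came in.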
- requirements for BIEN
- repeatable, reliable, robust resource
- traceable, granular data as it came in
- reuse data and check reliability and provenance
- over time, review taxonomic names
- rework synonymy as needed, or on a recurring basis
- native traits feeding into confederated db?
- analytical db
- degree of integration
- separate analysis data from databank data
- view of data at a particular time
- user/web interface: narrower set of requirements
- fundamental requirements
- need to communicate back to data provider
- report back to data providers who has used their information and what they used it for
- annotations on the data
- find holes in data when using it -> need a mechanism to flow that back to providers
- process boxes
- ingestion: different mechanisms
- automatically harvest data
- like GBIF harvester scripts
- IPT
- data providers need the capability to handle inputs, which they often don't have
- overkill?: how often do we want to refresh BIEN
- pipeline for smaller data providers
- tools to allow mapping of data to spreadsheets
- VegX captures ontology, but not straightforward to enter or export data
- standardized formats waiting to be used
- TurboVeg plot data input (don't need VegX)
- take spreadsheet and input data
- European format data from TurboVeg
- NPS plot data standard
- need fair amount of knowledge about source data
- when get data, what will happen on inside
- automated harvesting, system of pushing data through to staging database
- standard data format: VegX/DwC
- validate data for structural correctness: plots, taxa
- migrate data into staging area
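The "validate for structural correctness" step above could look like the check below: confirm required elements (plots, taxon names) exist before migrating a document into staging. The element names are simplified stand-ins, not the real VegX schema paths:

```python
import xml.etree.ElementTree as ET

# Sketch of a structural-correctness check before migration to staging:
# the document must be well-formed XML and contain the required elements.
# Element paths are simplified placeholders, not actual VegX paths.
REQUIRED_PATHS = ["plots/plot", "taxonNames/taxonName"]

def validate_structure(xml_text):
    """Return a list of structural errors; an empty list means the doc passes."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    errors = []
    for path in REQUIRED_PATHS:
        if root.find(path) is None:
            errors.append(f"missing required element: {path}")
    return errors
```

A document that fails any check would be rejected before it reaches the staging database, so only structurally sound data gets migrated.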
- taking closer look at breakout session
- in order to ingest data, we have std data formats, but also ad-hoc datasets and historical data
- describe mapping from source to destination data
- mapping tool for VegX->VegBank
- flatten data
- XML, XSD (schema lang), XSLT (transform XML to another form)
- flatten transformation
- high-level elements in VegX: plot, observation, taxon, stratum, individual etc.
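The flatten transformation over those nested elements (plot → observation → taxon, etc.) can be illustrated with a recursive walk that emits one row per leaf, each row inheriting its ancestors' attributes. The actual tool uses XSLT against the VegX schema; this is only a sketch with simplified element names:

```python
import xml.etree.ElementTree as ET

# Illustrative flattening of nested XML into one row per leaf element.
# Each row carries the attributes of all its ancestors, so the hierarchy
# can still be reconstructed from the flat table. Names are simplified.

def flatten(elem, inherited=None):
    """Yield one flat dict per leaf element, inheriting ancestor attributes."""
    row = dict(inherited or {})
    for key, value in elem.attrib.items():
        row[f"{elem.tag}_{key}"] = value
    children = list(elem)
    if not children:
        yield row
    else:
        for child in children:
            yield from flatten(child, row)
```

Because ancestor attributes are prefixed with the element name (`plot_id`, `observation_id`, ...), the flattening is lossless for attribute data — which is the question raised below about what the flattening process drops.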
- tool for user to update data with VegX schema
- lossless?
- VegX evolved to capture nested data
- VegX, VegBank designed to work with idiosyncrasies of existing datasets
- can ignore rare items to simplify data: pruning off methodological options
- values/authority file
- map plot data from VegX->database
- sometimes 1:1 mapping, but not always: e.g. missing identifiers which need to be constructed
- calculate identifiers using expressions
- specify where to get other data from
- specify primary key
- replace values in encoded way
- automatic replacements
- where clause to restrict data
- categorical aggregations: % cover
- look through aggregated data and pull relevant parts
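The mapping features listed above — constructed identifiers via expressions, coded-value replacement, and a where clause to restrict rows — could be expressed declaratively like this. The spec format and code table are invented for illustration; the real tool stores its spec in a mapping file:

```python
# Sketch of a declarative source->destination mapping of the kind described
# above. The spec format, cover codes, and column names are all hypothetical.

COVER_CODES = {"1": "0-5%", "2": "5-25%", "3": "25-50%"}  # invented code table

MAPPING = {
    "primary_key": lambda row: f"{row['plot']}-{row['subplot']}",  # constructed id
    "columns": {
        "plot_id": lambda row: row["plot"],                  # simple 1:1 mapping
        "cover_class": lambda row: COVER_CODES[row["cov"]],  # value replacement
    },
    "where": lambda row: row["cov"] in COVER_CODES,          # restrict rows
}

def apply_mapping(rows, spec):
    """Apply a mapping spec to source rows, yielding destination records."""
    for row in rows:
        if not spec["where"](row):
            continue
        rec = {"id": spec["primary_key"](row)}
        for dest, expr in spec["columns"].items():
            rec[dest] = expr(row)
        yield rec
```

Keeping the spec as data (rather than ad-hoc code) is what makes a source→destination mapping reusable and versionable, as noted later in the session.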
- check mapping on the fly
- transform whole schema
- VegX -> live modeled relational framework
- tool creates mapping file which specifies which element to map to which
- use to generate input to relational model
- growing, enhancing tool
- data that ecologists use is varied
- robust, structured data not easy to create
- that's why VegBank is a complicated model
- inverse tool: raw data to VegX? yes
- mapping requires experience, understanding of source, dest data
- need to validate mapped data
- DOM<->ER
- what tabular data will look like
- extracted definitions of methods, species code, tiers/strata, lower and upper bounds
- view results
- not significant validations
- .NET app
- by Shirley
- can answer questions about how to solve specific problems
- data, exclosure info, control, species, tag #, diameter/dbh
- what is plot, subplot
- recording dead, co-joined trees: how to read arrows off spreadsheet?
- carve up data
- ad-hoc rule: SQL script
- takes the control data on the right-hand side and unions it with the data on the left
- some rows control, some exclosures
- flattened view
- find elements within import to use
- two sources of plot
- preserve pairings of trees?
- hand-shuffle Excel spreadsheet?
- different interpretations when shuffling spreadsheet
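The "carve up the data with an ad-hoc SQL script" step above — control columns on one side of the spreadsheet, exclosure columns on the other, unioned into one flattened view — can be sketched with an in-memory query. All table and column names are invented for the example:

```python
import sqlite3

# Sketch of the ad-hoc UNION step described above: one spreadsheet holds
# control measurements in one block of columns and exclosure measurements
# in another; a UNION stacks both into a single flattened view.
# Table and column names are invented.

def carve_and_union(rows):
    """rows: (tag, control_dbh, exclosure_dbh) tuples from one spreadsheet."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE raw (tag TEXT, ctl_dbh REAL, exc_dbh REAL)")
    con.executemany("INSERT INTO raw VALUES (?, ?, ?)", rows)
    return con.execute(
        """
        SELECT tag, 'control'   AS treatment, ctl_dbh AS dbh FROM raw
        UNION ALL
        SELECT tag, 'exclosure' AS treatment, exc_dbh AS dbh FROM raw
        ORDER BY tag, treatment
        """
    ).fetchall()
```

Note that the tag column is repeated in both halves of the UNION, which is what preserves the control/exclosure pairing of trees after the data is stacked.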
- mapping validated: nested relationships, data integrity
- src->dest mapping can be reused
- start from where left off?
- initial mapping takes a long time
- later mappings much quicker
- version the mappings
- programmer or person w/ good data knowledge could make the changes
- e.g. map VegX->relational, Excel->VegX
- tool designed for expert users
- tool will be shared
- tool supported?
- NVS database can import VegX
- map autonomous groups data to VegX
- BIEN needs to comprehensively map any valid VegX document
- DDL syntax will vary, but an SQL Server script is generated
- mainly for plots, but maybe also for specimen data
- collectors have built their own tools, but we'd prefer they use the common mapping tool
- when presented with an import source, the tool infers the schema
- spreadsheet view: flattened view
- tweak into subset schemas? TurboVeg subset
- give mapping file to someone -> they will be able to map it
- Gentry plots in 4 flavors
- once been given example, can make minor adjustments
- XML schema doc has interpretation
- relational format
- viewFullOccurrence DDL SQL scripts
- mapping to VegX: enforcing tables?
- with DDL, it uses it
- don't have full mapping to VegX
- map between anything and anything?
- CTFS->VegX mapping: used tool?
- mapping tool: done item?
- proof of concept that VegX can handle most of the plot and specimen data
- we can integrate into VegX
- test schema using mapping tool
- a lot of vegetation data in partial dates
- combine authority data with VegX
- some issues that VegX can't solve
- validate against incoming schema
- if know schema going towards, can populate
- recast data to meet stds of destination
- user's guide? no
- tool meant for programmer, but ecologist can learn it
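The partial-date problem raised above (vegetation records with only a year, or year and month) can be handled by storing reduced-precision ISO 8601 strings rather than forcing a fake day or month. A minimal sketch — nothing here reflects how VegX itself encodes dates:

```python
# Minimal sketch for partial dates in vegetation data: keep whatever
# precision the record actually has as a reduced-precision ISO 8601 string,
# instead of inventing a day or month. Not the VegX encoding.

def partial_date(year, month=None, day=None):
    """Format a date at whatever precision is available."""
    if month is None:
        return f"{year:04d}"
    if day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"
```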
- is VegX the schema we're going to be going with?
- alternative is to just convert each spreadsheet individually
- if use mapping tool, don't need BIEN processes to handle range of inputs
- how general will this be? all possibilities of data ingestion?
- map anything to VegX
- details about flattening process: how much specificity lost?
- ER framework has cardinality constraints, etc.
- big frameworks get confederated, mapping much more difficult
- mapping two ERs together
- the tool is for individual datasets, not ERs
- tool in SQL Server, MS, etc.
- how to model determinations
- diameter
- XMLSpy tree diagrams
- don't need nesting to get value
- how to pass spreadsheet to XML?
For tomorrow¶
- what will db group be focusing on?
- how to constrain discussion? dates for delivery of key items
- design docs, product
- define constraints in terms of time: delivery dates and people's time available to develop this
- scientists have stake in BIEN db
- what is vision for what BIEN could be?
- what was lost in simplifying it
- simplified VegX: does it have all capabilities needed?
- what do we lose in flattening process?
- next barrier is OS
- higher level needed: what data does a scientist need for their analysis?
- map, species file w/ lat/long
- some people are data users and data providers
- return of information to project
- entering data into BIEN
- data in BIEN->make changes?
- unique chance: we have funding, people to develop this
- not too many responsibilities for data providers
- don't require data provider to map data because not worth time investment for them
- responsibility of mapping on BIEN side
- simplify tool->view it as annotating spreadsheet
- some scientists won't take time to map data even if called "annotating"
- grad students creating Morpho XML data files: not able to do it
- who is customer?
- sociocultural issue
- simplifying input is technical challenge
- small time providers put off if interface too complicated
- bigger programming investment to make simple interface vs just build db
- collection methodologies
- Gentry plots: not intimidating (8 fields)
- EBIF in Bolivia w/ 100s Ha plots, vernacular names
- historical plots that have never been entered
- students in Mexico doing plots for Masters/Ph.D.
- does BIEN accommodate info in source data?
- what do we want BIEN to do for us?
- functionality on input side vs. schema, functionality on output side
- can flatten hierarchy and still interpret as hierarchy
- field testing: just getting into that stage
- Shirley has been using tool to do mappings
- 20 spreadsheets from same source
- source -> VegX: so much there, overwhelms naive user
- how to use data
- mapping tools work for a certain spectrum of users
- usability, refinement feedback
- other tools to map source -> target
- primary driver is what needed
- look at veg-flat.xsd file
- # programmer-hours to develop framework?
- rewriting it from C#
- .NET code in Linux
- don't need to reengineer too much for Mac/Linux?
- what data/BIEN group will talk about tomorrow
- grand vision of BIEN today
- how much emphasis on components of data group
- limited funding, time for broad vision
- what are time and monetary constraints
- clarify what BIEN 3 should do, in a year from now
- need something in a year
- specify what BIEN 3.0 should do
- what BIEN should do
- by end of meeting, start developing from some codebase
- minimum deliverable, not end vision of BIEN
- Nick presented tool for community users to contribute data to BIEN
- VegBranch also does this, and has similar challenges
- prototype BIEN 3.0 organic database for people to enter data
- well-trained researcher w/ some informatics knowledge can enter data
- acquiring more data from Bolivia, Mexico
- researchers willing to use Nick's tool
- what is output? plots, traits, specimens, questions at interface
- something more to do in output
- acquire new data
- shortcomings of data addressed with new schema
- may be last time we do something to BIEN
- making BIEN 2 usable requires a lot of BIEN scripting
- BIEN 3 should still be usable in 3-4 years
- should be able to grow by itself ideally
- walk away and leave server on
- find researchers in Latin America who want to be part of data network
- point to something that's a success
- unfunded server that's running and people can enter data
- e.g. SALVIAS is completely unfunded
- SALVIAS has been on for 10 years or so, but can't grow (no new data)
- db grows without maintenance?
- plots, traits, collectors
- data entry tool w/o database?
- do db first in case run out of time, funding
- db w/ web interface
- clean API to enable other people to build tools for it
- providers making data accessible
- individual groups can't build all the tools everyone wants
- no matter how nice the tools are
- schemaless? like NoSQL
- SQL dbs rather than NoSQL
- relational db: anyone born in last 20 years can figure it out
- new generation more equipped for that kind of platform
- GeoNet: 3 govt organizations put data on earthquakes together
- biggest constraint is time
- 3 people for a year: Aaron, John/Brad, Jim?
- relieved from db responsibilities
- ~1 full time person altogether
- synergize our interests
- coordination issues
- improve acquisition of data
- have core goal or objective
- a few dozen plots collected vs. developing tools
- world vs. western hemisphere
- nothing in BIEN that constrains data geographically
- TurboVeg format ingestion tool
- globalize acceptance of data
- populate taxon names
- TNRS has new world bias
- willing to take NZ data even if no plans to use it?
- make this format compatible with others
- decide if anyone can put data in?
- weigh all possibilities and constraints and come up w/ bulleted list and work down it
- critical decisions to guide group
- tomorrow morning: science, analyses done and ongoing
- data we can analyze now
- afternoon: subgroups
- think about different subgroups: science, BIEN 3.0