2011 working group (Mo): BIEN workflow

  • in-depth look at the workflow diagram
  • range of input sources: providers
  • infrastructure around BIEN
  • diagram derived from last year's working group
  • summarize processes in square boxes
  • ingest data -> staging area (db or network location)
  • staging data gets taxon scrubbed and geovalidated
    • automated pipeline
  • integrate data with existing datasets
  • connect data to external data sources
  • e.g. trait data
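
A minimal sketch of the ingest -> staging -> scrub/geovalidate flow described above. Function names and record fields here are hypothetical placeholders, not existing BIEN code:

```python
# Sketch of the ingest -> staging -> scrub -> geovalidate flow described above.
# All names and fields are hypothetical, not existing BIEN code.

def ingest(source):
    """Pull raw provider records into the staging area."""
    return [dict(r) for r in source]

def taxon_scrub(record):
    """Standardize the verbatim taxon name (stand-in for a real name-resolution step)."""
    record["taxon_accepted"] = record["taxon_verbatim"].strip().capitalize()
    return record

def geovalidate(record):
    """Flag coordinates outside plausible bounds."""
    lat, lon = record.get("lat"), record.get("lon")
    record["geo_valid"] = (lat is not None and lon is not None
                           and -90 <= lat <= 90 and -180 <= lon <= 180)
    return record

def stage(records):
    """The automated pipeline run over everything sitting in staging."""
    return [geovalidate(taxon_scrub(r)) for r in records]

staged = stage(ingest([{"taxon_verbatim": " quercus alba ", "lat": 35.1, "lon": -111.6}]))
print(staged)
```
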
  • requirements for BIEN
  • repeatable, reliable, robust resource
  • traceable, granular data as it came in
  • reuse data and check reliability and provenance
  • over time, review taxonomic names
  • rework synonymy as needed or on a recurring basis
  • native traits feeding into confederated db?
  • analytical db
  • degree of integration
  • separate analysis data from databank data
  • view of data at a particular time
  • user/web interface: narrower set of requirements
  • fundamental requirements
  • need to communicate back to data provider
  • report back to data providers about who has used their information and what they used it for
  • annotations on the data
  • find holes in data when using it -> need a mechanism to flow that back to providers
  • process boxes
    • ingestion: different mechanisms
    • automatically harvest data
      • like GBIF harvester scripts
  • IPT
  • data providers need the capability to handle these inputs, which they often don't have
  • possibly overkill: how often do we want to refresh BIEN?
  • pipeline for smaller data providers
  • tools to allow mapping of data to spreadsheets
  • VegX captures the ontology, but it's not straightforward to enter or export data
  • standardized formats waiting to be used
    • TurboVeg plot data input (don't need VegX)
    • take spreadsheet and input data
  • European format data from TurboVeg
  • NPS plot data standard
  • need fair amount of knowledge about source data
  • when we get data, what will happen on the inside
  • automated harvesting, system of pushing data through to staging database
  • standard data format: VegX/DwC
  • validate data for structural correctness: plots, taxa
  • migrate data into staging area
  • taking closer look at breakout session
  • in order to ingest data, we have standard data formats, but also ad-hoc datasets and historical data
  • describe mapping from source to destination data
  • mapping tool for VegX->VegBank
    • flatten data
    • XML, XSD (schema language), XSLT (transforms XML into another form)
    • flatten transformation
      • high-level elements in VegX: plot, observation, taxon, stratum, individual etc.
    • tool for user to update data with VegX schema
    • lossless?
    • VegX evolved to capture nested data
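
To illustrate the flattening idea: nested plot/observation/individual elements become one row per individual, with the parent context copied down. The element names below are simplified stand-ins for the real VegX schema, and the real transform is done with XSLT rather than Python:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a nested VegX-style document; element names are
# illustrative only, not the actual VegX schema.
doc = ET.fromstring("""
<plots>
  <plot id="p1">
    <observation date="2010-07-14">
      <individual taxon="Nothofagus solandri" stratum="canopy" dbh="31.2"/>
      <individual taxon="Griselinia littoralis" stratum="subcanopy" dbh="12.5"/>
    </observation>
  </plot>
</plots>
""")

# Flatten: one row per individual, copying plot and observation context down.
rows = []
for plot in doc.findall("plot"):
    for obs in plot.findall("observation"):
        for ind in obs.findall("individual"):
            rows.append({"plot_id": plot.get("id"),
                         "obs_date": obs.get("date"),
                         **ind.attrib})

for row in rows:
    print(row)
```
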
  • VegX, VegBank designed to work with idiosyncrasies of existing datasets
    • can ignore rare items to simplify data: pruning off methodological options
  • values/authority file
  • map plot data from VegX->database
    • sometimes 1:1 mapping, but not always: e.g. missing identifiers which need to be constructed
    • calculate identifiers using expressions
    • specify where to get other data from
    • specify primary key
    • replace values in encoded way
    • automatic replacements
    • where clause to restrict data
    • categorical aggregations: % cover
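
One way to picture the mapping file described above: per-column rules with expressions (e.g. for constructing missing identifiers), value replacements (e.g. cover classes recoded to % cover), and a where clause to restrict rows. The configuration shape and field names here are invented for illustration, not the tool's actual format:

```python
# Hypothetical mapping spec: destination column -> rule.
# "expr" computes a value (e.g. constructing a missing identifier),
# "replace" recodes source values, "where" restricts which rows are mapped.
mapping = {
    "where": lambda src: src["plot_code"] != "",
    "columns": {
        "plot_id":    {"expr": lambda src: f"{src['project']}-{src['plot_code']}"},
        "cover_pct":  {"source": "cover_class",
                       "replace": {"1": 2.5, "2": 15.0, "3": 37.5, "4": 62.5, "5": 87.5}},
        "taxon_name": {"source": "species"},
    },
}

def apply_mapping(mapping, src_row):
    """Apply the per-column rules to one source row; return None if filtered out."""
    if not mapping["where"](src_row):
        return None
    out = {}
    for dest, rule in mapping["columns"].items():
        if "expr" in rule:
            out[dest] = rule["expr"](src_row)
        else:
            value = src_row[rule["source"]]
            out[dest] = rule.get("replace", {}).get(value, value)
    return out

print(apply_mapping(mapping, {"project": "GENTRY", "plot_code": "A7",
                              "cover_class": "3", "species": "Quercus alba"}))
```
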
  • look through aggregated data and pull relevant parts
  • check mapping on the fly
  • transform whole schema
  • VegX -> live modeled relational framework
  • tool creates mapping file which specifies which element to map to which
    • use to generate input to relational model
  • growing, enhancing tool
  • data that ecologists use is varied
  • robust, structured data not easy to create
    • that's why VegBank is a complicated model
  • inverse tool: raw data to VegX? yes
  • mapping requires experience and an understanding of both source and destination data
  • need to validate mapped data
  • DOM<->ER
  • what tabular data will look like
  • extracted definitions of methods, species code, tiers/strata, lower and upper bounds
  • view results
  • no significant validations at this stage
  • .NET app
    • by Shirley
    • can answer questions about how to solve specific problems
  • data, exclosure info, control, species, tag #, diameter/dbh
  • what is plot, subplot
  • recording dead, conjoined trees: how to read the arrows off the spreadsheet?
  • carve up data
  • ad-hoc rule: SQL script
    • takes the control data on the right-hand side and unions it with the data on the left (sketch below)
    • some rows are controls, some are exclosures
    • flattened view
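
The ad-hoc rule in question was a SQL script; something of roughly this shape, with table and column names invented here for illustration:

```python
# The actual rule was a SQL script against the import; this only sketches its shape.
# Table and column names are invented for illustration.
flatten_rule = """
CREATE VIEW import_flat AS
SELECT plot_id, tag_no, species, dbh, 'control'   AS treatment FROM import_control
UNION ALL
SELECT plot_id, tag_no, species, dbh, 'exclosure' AS treatment FROM import_exclosure;
"""
print(flatten_rule)
```
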
  • find elements within import to use
  • two sources of plot
  • preserve pairings of trees?
  • hand-shuffle Excel spreadsheet?
  • different interpretations when shuffling spreadsheet
  • mapping validated: nested relationships, data integrity
  • src->dest mapping can be reused
  • start from where left off?
  • initial mapping takes a long time
    • later mappings much quicker
    • version the mappings
  • programmer or person w/ good data knowledge could make the changes
  • e.g. map VegX->relational, Excel->VegX
  • tool designed for expert users
  • tool will be shared
  • tool supported?
  • NVS database can import VegX
  • map autonomous groups data to VegX
  • BIEN needs to comprehensively map any valid VegX document
  • DDL syntax will vary, but a SQL Server script is generated (sketch below)
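
Generating that DDL could look roughly like this: emit a CREATE TABLE script from the mapped columns. Column names and types below are made up; the notes only say a SQL Server script is produced:

```python
# Illustrative only: emit a CREATE TABLE script from a mapped column list.
# Column names/types are invented; the real tool targets SQL Server.
columns = [
    ("obs_id",     "INT IDENTITY(1,1) PRIMARY KEY"),
    ("plot_id",    "VARCHAR(50) NOT NULL"),
    ("taxon_name", "VARCHAR(255)"),
    ("cover_pct",  "FLOAT"),
    ("obs_date",   "DATE"),
]

ddl = ("CREATE TABLE staging_plot_obs (\n"
       + ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns)
       + "\n);")
print(ddl)
```
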
  • mainly for plots, but maybe also for specimen data
  • collectors have built their own tools, but we'd prefer them to use the common mapping tool
  • when presented with an import source, it infers the schema
  • spreadsheet view: flattened view
  • tweak into subset schemas? TurboVeg subset
  • give mapping file to someone -> they will be able to map it
  • Gentry plots in 4 flavors
  • once been given example, can make minor adjustments
  • XML schema doc has interpretation
  • relational format
  • viewFullOccurrence DDL SQL scripts
  • mapping to VegX: enforcing tables?
  • if DDL is supplied, the tool uses it
  • don't have full mapping to VegX
  • map between anything and anything?
  • CTFS->VegX mapping: used tool?
  • mapping tool: done item?
    • proof of concept that VegX can handle most of the plot and specimen data
  • we can integrate into VegX
  • test schema using mapping tool
  • a lot of vegetation data has only partial dates
  • combine authority data with VegX
  • some issues that VegX can't solve
  • validate against incoming schema
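
Validating an incoming document against its XSD can be as small as the sketch below; it assumes the third-party lxml package and local copies of the schema (e.g. the veg-flat.xsd mentioned later) and the incoming document:

```python
from lxml import etree  # third-party dependency (pip install lxml)

# Paths are placeholders for the flattened VegX schema and an incoming document.
schema = etree.XMLSchema(etree.parse("veg-flat.xsd"))
doc = etree.parse("incoming_plots.xml")

if schema.validate(doc):
    print("structurally valid")
else:
    for err in schema.error_log:
        print(err.line, err.message)
```
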
  • if we know the schema we're going towards, we can populate it
  • recast data to meet the standards of the destination
  • user's guide? no
  • tool meant for programmer, but ecologist can learn it
  • is VegX the schema we're going to be going with?
  • alternative is to just convert each spreadsheet individually
  • if we use the mapping tool, BIEN processes don't need to handle the full range of inputs
  • how general will this be? all possibilities of data ingestion?
  • map anything to VegX
  • details about flattening process: how much specificity lost?
  • ER framework has cardinality constraints, etc.
  • big frameworks get confederated, mapping much more difficult
    • mapping two ERs together
  • the tool is for individual datasets, not ERs
  • tool runs on SQL Server, MS stack, etc.
  • how to model determinations
  • diameter
  • XMLSpy tree diagrams
  • don't need nesting to get value
  • how to pass spreadsheet to XML?

For tomorrow

  • what will db group be focusing on?
    • how to constrain discussion? dates for delivery of key items
    • design docs, product
  • define constraints in terms of time: delivery dates and people's time available to develop this
  • scientists have stake in BIEN db
    • what is vision for what BIEN could be?
  • what was lost in simplifying it
  • simplified VegX: does it have all capabilities needed?
  • what do we lose in flattening process?
  • next barrier is OS
  • higher level needed: what data does a scientist need for their analysis?
    • map, species file w/ lat/long
  • some people are data users and data providers
  • return of information to project
  • entering data into BIEN
  • data in BIEN->make changes?
  • unique chance: we have funding, people to develop this
  • not too many responsibilities for data providers
  • don't require data provider to map data because not worth time investment for them
  • responsibility of mapping on BIEN side
  • simplify tool->view it as annotating spreadsheet
  • some scientists won't take time to map data even if called "annotating"
  • grad students creating Morpho XML data files: not able to do it
  • who is customer?
  • sociocultural issue
  • simplifying input is technical challenge
  • small time providers put off if interface too complicated
  • bigger programming investment to make simple interface vs just build db
  • collection methodologies
    • Gentry plots: not intimidating (8 fields)
  • EBIF in Bolivia w/ 100s Ha plots, vernacular names
  • historical plots that have never been entered
  • students in Mexico doing plots for Masters/Ph.D.
  • does BIEN accommodate info in source data?
  • what do we want BIEN to do for us?
  • functionality on input side vs. schema, functionality on output side
  • can flatten hierarchy and still interpret as hierarchy
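
On flattening without losing the hierarchy: if the flat rows keep their parent keys, the nesting can be rebuilt on demand. A small illustration with made-up plot/subplot keys:

```python
from collections import defaultdict

# Flat rows that still carry their parent keys (plot -> subplot -> taxon).
rows = [
    {"plot": "p1", "subplot": "a", "taxon": "Quercus alba"},
    {"plot": "p1", "subplot": "a", "taxon": "Acer rubrum"},
    {"plot": "p1", "subplot": "b", "taxon": "Quercus alba"},
]

# Regroup into the original nesting on demand.
tree = defaultdict(lambda: defaultdict(list))
for r in rows:
    tree[r["plot"]][r["subplot"]].append(r["taxon"])

print({plot: dict(subs) for plot, subs in tree.items()})
```
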
  • field testing: just getting into that stage
  • Shirley has been using tool to do mappings
    • 20 spreadsheets from same source
  • source -> VegX: there is so much there that it overwhelms a naive user
  • how to use data
  • mapping tools work for a certain spectrum of users
  • usability, refinement feedback
  • other tools to map source -> target
  • primary driver is what needed
  • look at veg-flat.xsd file
  • # programmer-hours to develop framework?
  • rewriting it from C#
  • .NET code on Linux
  • don't need to reengineer too much for Mac/Linux?
  • what data/BIEN group will talk about tomorrow
  • grand vision of BIEN today
  • how much emphasis on components of data group
  • limited funding, time for broad vision
  • what are time and monetary constraints
  • clarify what BIEN 3 should do, in a year from now
  • need something in a year
  • specify what BIEN 3.0 should do
  • what BIEN should do
  • by end of meeting, start developing from some codebase
  • minimum deliverable, not end vision of BIEN
  • Nick presented tool for community users to contribute data to BIEN
  • VegBranch also does this, and has similar challenges
  • prototype BIEN 3.0 organic database for people to enter data
  • well-trained researcher w/ some informatics knowledge can enter data
  • acquiring more data from Bolivia, Mexico
  • researchers willing to use Nick's tool
  • what is output? plots, traits, specimens, questions at interface
  • something more to do in output
  • acquire new data
  • shortcomings of data addressed with new schema
  • may be last time we do something to BIEN
  • making BIEN 2 usable requires a lot of BIEN scripting
  • BIEN 3 should still be usable in 3-4 years
  • should be able to grow by itself ideally
  • walk away and leave server on
  • find researchers in Latin America who want to be part of data network
  • point to something that's a success
  • unfunded server that's running and people can enter data
    • e.g. SALVIAS is completely unfunded
  • SALVIAS has been on for 10 years or so, but can't grow (no new data)
  • db grows without maintenance?
  • plots, traits, collectors
  • data entry tool w/o database?
  • do db first in case run out of time, funding
  • db w/ web interface
  • clean API to enable other people to build tools for it
  • providers making data accessible
  • individual groups can't build all the tools everyone wants
  • no matter how nice the tools are
  • schemaless? like NoSQL
  • SQL dbs rather than NoSQL
  • relational db: anyone born in last 20 years can figure it out
  • new generation more equipped for that kind of platform
  • GeoNet: 3 govt organizations put data on earthquakes together
  • biggest constraint is time
  • 3 people for a year: Aaron, John/Brad, Jim?
  • relieved from db responsibilities
  • ~1 full time person altogether
  • synergize our interests
  • coordination issues
  • improve acquisition of data
  • have core goal or objective
  • a few dozen plots collected vs. developing tools
  • world vs. western hemisphere
  • nothing in BIEN that constrains data geographically
  • TurboVeg format ingestion tool
  • globalize acceptance of data
  • populate taxon names
  • TNRS has new world bias
  • willing to take NZ data even if no plans to use it?
  • make this format compatible with others
  • decide if anyone can put data in?
  • weigh all possibilities and constraints and come up w/ bulleted list and work down it
  • critical decisions to guide group
  • tomorrow morning: science, analyses done and ongoing
  • data we can analyze now
  • afternoon: subgroups
  • think about different subgroups: science, BIEN 3.0