Database development
BIEN 3 progress
Real-time progress is available via Redmine's *Activity tab* and corresponding *Atom feed* (there is a delay before Redmine adds activity to the list)
See also BIEN3 - Progress since the 2011 working group.docx
Analytical database
- The analytical database is viewable in *vegbiendev's phpPgAdmin* (access instructions)
- Our Darwin Core export of VegBIEN now includes BIEN2 analyses such as geovalidation, detection of cultivated specimens, and higher plant group indexing
- These analyses are similar to those used to create *bien_web.observation* and *bien2.viewFullOccurrence*
- The following views are available:
| View | Contents |
|---|---|
| *analytical_stem* | Each stem and specimen in VegBIEN |
| *analytical_aggregate* | Each unique taxon in a plot, aggregated together by percent cover and DBH size class (abundance) |
| *provider_count* | The row count for each datasource |
| *taxon_trait* | Taxon traits from *bien2.TraitObservation* |
VegBIEN datasources
- Several new datasources were added: ACAD, HIBG, JBM, MT, QFA, TRT, TRTE, UBC, VASCAN, WIN
- Several existing datasources were refreshed: CVS, NCU, SALVIAS, TEAM
- The datasources in VegBIEN, along with their row counts, are now summarized in the *provider_count* table
- This information is also on the *BIEN website under Data providers*
- All BIEN2 datasources are in VegBIEN
- This includes 28 million input rows, more than twice BIEN2's 12 million rows (viewFullOccurrence)
- The VegBIEN database is hosted at *vegbiendev.nceas.ucsb.edu* and can be accessed using the instructions on the wiki under PhpPgAdmin
- If you have an account on nimoy or vegbankdev, you will also have an account on vegbiendev with the same login
- BIEN 3-related files are available on vegbiendev in /home/bien
- Note: Most datasource names are herbarium codes as defined by the *Index Herbariorum*
Validation
- The VegBIEN datasources are being validated by spot-checking sample rows from the *analytical_stem* view
- Validation status is summarized on the VegBIEN datasources page and results are under Spot-checking
- Aggregating validations status
VegCore
- The VegCore data dictionary is now hosted directly on the wiki rather than in a spreadsheet
- This allows anyone with a wiki login to edit definitions and preview their changes in Redmine (contact *Jim Regetz* to get a login)
- Terms are now grouped into tables, similar to the Darwin Core data dictionary's categories
- The *VegCore terms list* is autogenerated from the data dictionary instead of vice versa
- Data providers should use the example specimen dataset as a template when providing us with Darwin Core data
VegBIEN data dictionary
- The VegBIEN data dictionary is now available in the *phpPgAdmin* web interface
- The Comment column on the right contains table descriptions
- Clicking a table name brings up a list of fields, with descriptions in the Comment column
- Access to the schema (not the data) is provided by a passwordless public_ user, and users are instructed to log in as that user
Database permalinks
- Links to the phpPgAdmin web interface now always point to the latest stable version of the database
- This is done by running the import into a separate schema and then replacing the public schema when the import is complete
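The swap described above can be sketched as a short sequence of PostgreSQL statements. This is only an illustration under stated assumptions, not the project's actual script: the staging schema name `import_2013_01` and the `_old` suffix for the retired schema are hypothetical, and the Python helper just assembles the statements without connecting to a database.

```python
from typing import List


def schema_swap_statements(new_schema: str, live_schema: str = "public") -> List[str]:
    """Return PostgreSQL statements that promote new_schema to live_schema.

    The "_old" suffix for the retired schema is a hypothetical convention.
    """
    old_schema = f"{live_schema}_old"
    return [
        "BEGIN;",
        # Discard the copy retired by the previous swap, if any
        f'DROP SCHEMA IF EXISTS "{old_schema}" CASCADE;',
        # Move the current live schema out of the way
        f'ALTER SCHEMA "{live_schema}" RENAME TO "{old_schema}";',
        # Promote the freshly imported schema to the live name
        f'ALTER SCHEMA "{new_schema}" RENAME TO "{live_schema}";',
        "COMMIT;",
    ]


# "import_2013_01" is a hypothetical name for the schema the import ran into
for statement in schema_swap_statements("import_2013_01"):
    print(statement)
```

Because PostgreSQL DDL is transactional, wrapping the renames in one transaction means readers see either the old or the new public schema, never a missing one.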
Traits
- Taxon occurrence-level traits from *bien2.TraitObservation* are now stored in VegBIEN's *trait* table
- This information is exported for the analytical database in *taxon_trait*
Taxonomic schema
- VegBIEN's taxonomic schema has been refactored to support a variety of ways of labeling taxa:
| Method | Used by | Tables |
|---|---|---|
| first-class field for each rank | Darwin Core | taxonverbatim |
| tree of life hierarchy | NCBI | taxonlabel, taxonlabel_relationship |
| globally unique taxonomic name | VegBank | taxonlabel |
| taxon concepts | CVS | taxonconcept |
Taxonomic name resolution
- Taxonomic names in VegBIEN are now parsed, matched, and canonicalized using *TNRS*
- TNRS has been run on over 2 million input names, which are standardized to accepted names
Phylogenetic backbone
- To enable querying the database by higher classifications, the *NCBI taxonomic tree* is now imported into VegBIEN and linked up with the taxonomic names at the family and genus level
- To look up the descendants or ancestors of a taxon, use the taxonlabel_relationship *cross-link table*
Geolocation schema
- VegBIEN's geolocation schema has been refactored to support a variety of ways of locating observations:
| Method | Used by | Tables |
|---|---|---|
| first-class field for each place level | Darwin Core | place |
| hierarchy of place names | VegBank | placename |
| coordinates | Darwin Core | coordinates |
| geovalidation | BIEN2 | geoscrub place |
Geovalidation
- BIEN2's geovalidation cache has been ported to VegBIEN for direct use with BIEN2 data in VegBIEN
- A VegBIEN geovalidation workflow based on the *BIEN2 pipeline* has been developed
VegCSV
- We are using a CSV format similar to Darwin Core for plots data, in order to avoid the difficulties of working with XML documents and long XPaths
- This new format, named VegCSV, uses a vocabulary of terms (VegCore) drawn from *Darwin Core*, *VegX*, *VegBank*, and *SALVIAS* and grouped into several standard tables, with the terms serving as CSV column names
- It provides a "grab bag" of terms to map to, in the same way that DwC does
- VegCSV has been formalized in a conference call with those involved in the development of VegX
VegBIEN schema
- The VegBIEN schema has been created by significantly refactoring the *VegBank schema*
- The VegBIEN schema supports several new concepts:
- methods: plot methodology, line-intercept measurements, and size classes
- taxon class inclusions/exclusions: growth forms and plant concepts sampled/not sampled
- location determinations: successive remeasurements of plot GPS coordinates, georeferencing info
- The plantname and namedplace tables have been redesigned as trees, with each element pointing to its parent element
- We are using closure tables, which store all the paths between tree nodes, to speed up queries
- Our closure table algorithm is in svn under schemas/tree_cross-links.sql
- MySQL Workbench enables us to regularly synchronize the ERD with the SQL DDL (after it's translated into MySQL)
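To illustrate the closure-table idea mentioned above, here is a minimal sketch using SQLite from Python. It is not the algorithm in schemas/tree_cross-links.sql; the table and column names (`taxon`, `taxon_closure`, `ancestor_id`, `descendant_id`) are assumptions chosen for the example. It stores a tree with parent pointers, materializes every ancestor/descendant pair once, and then answers lookups with a plain join instead of a recursive walk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Tree stored with parent pointers (hypothetical table and column names)
    CREATE TABLE taxon (
        id INTEGER PRIMARY KEY,
        name TEXT,
        parent_id INTEGER REFERENCES taxon(id)
    );
    -- Closure table: one row per (ancestor, descendant) path, including self-pairs
    CREATE TABLE taxon_closure (
        ancestor_id INTEGER,
        descendant_id INTEGER,
        PRIMARY KEY (ancestor_id, descendant_id)
    );
""")
conn.executemany("INSERT INTO taxon VALUES (?, ?, ?)", [
    (1, "Fagales", None),
    (2, "Fagaceae", 1),
    (3, "Quercus", 2),
    (4, "Quercus alba", 3),
    (5, "Fagus", 2),
])

# Pair each node with itself, then repeatedly climb to the parent
conn.execute("""
    WITH RECURSIVE path(ancestor_id, descendant_id) AS (
        SELECT id, id FROM taxon
        UNION ALL
        SELECT taxon.parent_id, path.descendant_id
        FROM path JOIN taxon ON taxon.id = path.ancestor_id
        WHERE taxon.parent_id IS NOT NULL
    )
    INSERT INTO taxon_closure SELECT ancestor_id, descendant_id FROM path
""")


def descendants(conn, taxon_id):
    """Names of all taxa below taxon_id, found with a single join."""
    rows = conn.execute(
        "SELECT t.name FROM taxon_closure c JOIN taxon t ON t.id = c.descendant_id "
        "WHERE c.ancestor_id = ? AND c.descendant_id != ? ORDER BY t.name",
        (taxon_id, taxon_id),
    )
    return [row[0] for row in rows]


def ancestors(conn, taxon_id):
    """Names of all taxa above taxon_id, found with a single join."""
    rows = conn.execute(
        "SELECT t.name FROM taxon_closure c JOIN taxon t ON t.id = c.ancestor_id "
        "WHERE c.descendant_id = ? AND c.ancestor_id != ? ORDER BY t.name",
        (taxon_id, taxon_id),
    )
    return [row[0] for row in rows]


print(descendants(conn, 2))
print(ancestors(conn, 4))
```

The trade-off is extra storage and the need to rebuild or maintain the closure rows when the tree changes, in exchange for non-recursive lookups.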
VegCore->VegBIEN mapping
- We have a mapping from VegCore->VegBIEN
- This human-readable version is automatically generated from the machine-readable version
- The import uses the following algorithm:
- Generate an in-memory XML template from the mappings
- Insert this tree into the database in dependency order (leaves first) using the VegBank XML import algorithm and column-based import
- To test the import process, log in to vegbiendev and run:
make test --directory=/home/bien/
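The two steps of the import algorithm listed above can be sketched roughly as follows. This is a toy illustration, not the VegBIEN importer: the mapping paths and element names are hypothetical, and the sketch assumes that a node's children are the records it depends on, so a post-order walk visits dependencies before the records that reference them.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings: input column -> slash-separated path in the template.
# Each path runs from a dependent record down to the records it depends on.
MAPPINGS = {
    "scientificName": "occurrence/taxon/name",
    "plotName": "occurrence/event/place/name",
    "eventDate": "occurrence/event/date",
}


def build_template(mappings):
    """Merge all mapping paths into one in-memory XML tree."""
    root = ET.Element("template")
    for column, path in mappings.items():
        node = root
        for tag in path.split("/"):
            child = node.find(tag)
            if child is None:
                child = ET.SubElement(node, tag)
            node = child
        node.text = column  # the leaf records which input column feeds it
    return root


def insertion_order(node):
    """Post-order walk: yield a table only after every table beneath it."""
    for child in node:
        yield from insertion_order(child)
    if len(node) > 0:  # elements with children are tables; leaves are fields
        yield node.tag


template = build_template(MAPPINGS)
order = []
for top in template:
    order.extend(insertion_order(top))
print(order)
```

In this toy tree, `taxon` and `place` are inserted before `event` and `occurrence`, which is the leaves-first ordering the algorithm relies on.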
Column-based import
- We now import data by column instead of by row, providing over an order of magnitude speed improvement and taking only ~24 hours, rather than days
- The algorithm also handles many errors server-side using *wrapper functions*, which avoids the overhead of returning to the client for each error
- Details are on the wiki under Column-based import
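As a rough illustration of the difference between the two approaches, the sketch below loads the same staging data once with one statement per input row and once with a single set-based statement over the whole column. It assumes that "column-based" means issuing one INSERT ... SELECT per output column rather than one INSERT per row; the `staging` and `family` tables are hypothetical, and SQLite stands in for PostgreSQL.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical staging table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging (row_num INTEGER PRIMARY KEY, family TEXT);
        CREATE TABLE family (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    """)
    conn.executemany(
        "INSERT INTO staging (family) VALUES (?)",
        [("Fagaceae",), ("Pinaceae",), ("Fagaceae",), ("Rosaceae",)],
    )
    return conn


def import_by_row(conn):
    """Row-based: one statement, and one client round trip, per input row."""
    rows = conn.execute("SELECT family FROM staging ORDER BY row_num").fetchall()
    for (name,) in rows:
        conn.execute("INSERT OR IGNORE INTO family (name) VALUES (?)", (name,))


def import_by_column(conn):
    """Column-based: one set-based statement handles the whole column."""
    conn.execute(
        "INSERT OR IGNORE INTO family (name) SELECT DISTINCT family FROM staging"
    )


def family_names(conn):
    return sorted(row[0] for row in conn.execute("SELECT name FROM family"))


by_row = make_db()
import_by_row(by_row)
by_column = make_db()
import_by_column(by_column)
print(family_names(by_row) == family_names(by_column))
```

Both paths produce the same rows; an in-memory example this small will not show the speed difference, which comes from avoiding per-row statement overhead on large inputs.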
Data provider feedback
- We are using a new error-logging and data provider feedback mechanism, which logs each invalid value instead of each invalid row
- This eliminates duplication in the logged errors, making it much easier to see individual problems affecting the data
- Invalid values and their corresponding error messages are placed in an *errors table*, an auxiliary table created for each datasource
- SQL function calls and type casts are wrapped in an exception handler that saves errors into the errors table
- Feedback is now provided for most, if not all, places where input data causes errors
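The per-value logging described above can be sketched as follows. This is a client-side Python analogue of the idea, not the server-side SQL wrapper functions the import actually uses; the `errors` table layout and the `try_cast` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical errors table: at most one row per distinct (column, value) pair
conn.execute("""
    CREATE TABLE errors (
        column_name TEXT,
        value TEXT,
        error TEXT,
        PRIMARY KEY (column_name, value)
    )
""")


def try_cast(conn, column_name, value, cast):
    """Apply cast to value; on failure, log the value once and return None."""
    try:
        return cast(value)
    except (ValueError, TypeError) as exc:
        # INSERT OR IGNORE keeps a repeated bad value from being logged again
        conn.execute(
            "INSERT OR IGNORE INTO errors VALUES (?, ?, ?)",
            (column_name, value, str(exc)),
        )
        return None


raw_dbh = ["12.5", "n/a", "7", "n/a", "n/a"]
cleaned = [try_cast(conn, "dbh", value, float) for value in raw_dbh]
logged = conn.execute("SELECT column_name, value FROM errors").fetchall()
print(cleaned)
print(logged)
```

Here the invalid value appears in three input rows but produces a single entry in the errors table, which is what removes the duplication from the log.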
Timeline
- We created milestones and a *development timeline*
- We are using Redmine to track progress on BIEN 3 tasks
- To watch an issue, open it and click Watch in the upper-right corner of the content area
- See also our To do list
Contents
- Data requests
- Datasources
- Documentation testing
- Formats
- Highest priorities
- Meetings
- 2010-11-09 meeting
- 2011-10-13 conference call
- 2011-10-24 to 28 working group
- 2011 working group Fr BIEN Implementation
- 2011 working group Fr Summary
- 2011 working group Mo BIEN workflow
- 2011 working group Mo technical challenges
- 2011 working group Th BIEN Components
- 2011 working group Th BIEN Implementation
- 2011 working group Th Summary of subgroups
- 2011 working group Th Use cases
- 2011 working group Th VegBank conference call
- 2011 working group Tu BIEN database
- 2011 working group Tu iPToL-BIEN Phylogenetics
- 2011 working group We BIEN tools
- 2011 working group We Use cases
- 2011-12-01 conference call
- 2011-12-08 conference call
- 2011-12-15 conference call
- 2012-01-05 conference call
- 2012-01-11 NCEAS meeting
- 2012-01-12 conference call
- 2012-01-19 conference call
- 2012-02-03 conference call
- 2012-02-10 conference call
- 2012-02-17 conference call
- 2012-02-24 conference call
- 2012-03-02 conference call
- 2012-03-09 conference call
- 2012-03-16 conference call
- 2012-03-23 conference call
- 2012-04-02 conference call on VegX modifications
- 2012-04-09 conference call
- 2012-04-20 conference call
- 2012-04-27 conference call
- 2012-05-04 conference call
- 2012-06-01 conference call
- 2012-07-26 conference call
- 2012-08-03 conference call
- 2012-08-17 conference call
- 2012-08-24 conference call
- 2012-08-30 small VegCSV conference call
- 2012-09-07 conference call
- 2012-09-13 VegCSV conference call
- 2012-09-24 conference call
- 2012-10-03 conference call
- 2012-10-19 conference call
- 2012-11-02 conference call
- 2012-11-09 conference call
- 2012-11-14 conference call on data provider metadata
- 2012-11-16 conference call
- 2012-11-26 to 30 working group
- 2012-12-07 conference call
- 2012-12-14 conference call
- 2013-01-04 conference call
- 2013-01-11 conference call
- 2013-01-18 conference call
- 2013-01-24 conference call
- 2013-01-31 conference call
- 2013-02-07 conference call
- 2013-02-14 conference call
- 2013-02-21 conference call
- 2013-02-28 conference call
- 2013-03-06 conference call with Brad
- 2013-03-07 conference call
- 2013-03-14 conference call (canceled)
- 2013-03-21 conference call
- 2013-03-28 conference call
- 2013-04-04 conference call (canceled)
- 2013-04-11 conference call (canceled)
- 2013-04-19 conference call
- 2013-04-24 conference call
- 2013-05-02 conference call
- 2013-05-09 conference call
- 2013-05-16 conference call
- 2013-05-24 conference call
- 2013-05-30 conference call
- 2013-06-06 conference call
- 2013-06-13 conference call
- 2013-06-20 conference call
- 2013-06-27 conference call
- 2013-07-02 conference call
- 2013-07-03 separate conference call
- 2013-07-11 conference call
- 2013-07-19 conference call (canceled)
- 2013-07-25 conference call
- 2013-08-01 conference call
- 2013-08-08 conference call (canceled)
- 2013-08-16 conference call (canceled)
- 2013-08-22 conference call
- 2013-08-29 conference call strategy discussion
- 2013-09-05 conference call
- 2013-09-12 conference call
- 2013-09-19 conference call
- 2013-09-19 to 10-17 conference calls (summary)
- 2013-09-26 conference call
- 2013-10-03 conference call
- 2013-10-10 conference call
- 2013-10-17 conference call
- 2013-10-25 conference call
- 2013-10-31 conference call
- 2013-11-07 conference call
- 2013-11-14 conference call
- 2013-11-21 conference call
- 2013-11-28 conference call (canceled--holiday)
- 2013-12-05 conference call
- 2013-12-12 conference call
- 2013-12-17 planning conference call
- 2013-12-19 conference call
- 2013-12-26 conference call (canceled--holiday)
- 2014-01-02 conference call (canceled--holiday)
- 2014-01-09 conference call
- 2014-01-13 planning conference call
- 2014-01-16 conference call
- 2014-01-23 conference call
- 2014-01-30 conference call
- 2014-02-06 conference call
- 2014-02-13 conference call
- 2014-02-20 conference call
- 2014-02-24 working group
- 2014-02-27 conference call
- 2014-03-06 conference call
- 2014-03-13 conference call
- 2014-03-18 schema changes conference call with Brad
- 2014-03-20 conference call (canceled)
- 2014-03-27 conference call
- 2014-04-03 conference call
- 2014-04-10 conference call
- 2014-04-17 conference call
- 2014-04-23 conference call (canceled)
- 2014-05-01 conference call
- 2014-05-08 conference call
- 2014-05-15 conference call
- 2014-05-22 conference call (canceled)
- 2014-05-29 conference call
- 2014-06-05 conference call
- 2014-06-06 separate conference call on data dictionary
- 2014-06-12 conference call on data dictionary
- 2014-06-19 conference call
- 2014-06-26 conference call
- 2014-07-03 conference call
- 2014-07-10 conference call
- 2014-07-17 conference call
- 2014-07-24 conference call (canceled)
- 2014-07-31 conference call (canceled)
- 2014-08-07 conference call
- 2014-08-14 conference calls
- 2014-08-21 conference call
- 2014-08-28 conference call
- 2014-09-04 conference call
- 2014-09-11 conference call
- 2014-09-18 conference call (canceled)
- 2014-09-25 conference call (canceled)
- 2014-10-03 conference call on CVS issues
- 2014-10-09 conference call (canceled)
- 2014-10-16 conference call
- 2014-10-24 conference call on sPlot
- 2014-10-30 conference call
- Plan
- Resources
- VegBIEN
- Accessing VegBIEN
- Analyses
- Analytical database
- Conditions of use
- Connecting to vegbiendev
- Database performance tuning
- Derived columns
- Download tracking
- Import process
- Backup benchmarks
- Denormalizing a datasource
- Import steps
- General import steps
- Specific import steps
- Adding ACAD--a Darwin Core datasource
- Adding Cyrille traits--a traits datasource
- Adding Madidi--a flat-file plots datasource
- Mapping a new table in VegBank--a SQL plots datasource
- Refreshing ACAD--a Darwin Core datasource
- Refreshing CVS--an MS Access plots datasource
- Refreshing VegBank--a SQL plots datasource
- Individual datasource refresh
- New-style import
- Normalization techniques
- Result filtering
- Row-based import benchmarks
- TNRS workflow
- Maintenance FAQ
- Validation
- VegBIEN contents
- VegBIEN data dictionary
- VegBIEN FAQ
- VegBIEN schema
- Web interface
- Web interface options