Database development¶

BIEN 3 progress¶

Real-time progress is available via Redmine's *Activity tab* and corresponding *Atom feed* (there is a delay before Redmine adds activity to the list)

See also BIEN3 - Progress since the 2011 working group.docx

Analytical database ¶

The analytical database is viewable in *vegbiendev's phpPgAdmin* (access instructions)
Our Darwin Core export of VegBIEN now includes BIEN2 analyses such as geovalidation, detection of cultivated specimens, and higher plant group indexing
- These analyses are similar to those used to create *bien_web.observation* and *bien2.viewFullOccurrence*

The following views are available:

View	Contents
analytical_stem	Each stem and specimen in VegBIEN
analytical_aggregate	Each unique taxon in a plot, aggregated together by percent cover and DBH size class (abundance)
provider_count	The row count for each datasource
taxon_trait	Taxon traits from bien2.TraitObservation

VegBIEN datasources ¶

Several new datasources were added: ACAD, HIBG, JBM, MT, QFA, TRT, TRTE, UBC, VASCAN, WIN
Several existing datasources were refreshed: CVS, NCU, SALVIAS, TEAM
The datasources in VegBIEN, along with their row counts, are now summarized in the *provider_count* table
- This information is also on the *BIEN website under Data providers*
All BIEN2 datasources are in VegBIEN
This includes 28 million input rows, more than twice BIEN2's 12 million rows (viewFullOccurrence)
The VegBIEN database is hosted at *vegbiendev.nceas.ucsb.edu* and can be accessed using the instructions on the wiki under PhpPgAdmin
- If you have an account on nimoy or vegbankdev, you will also have an account on vegbiendev with the same login
- BIEN 3-related files are available on vegbiendev in /home/bien
- Note: Most datasource names are herbarium codes as defined by the *Index Herbariorum*

Validation ¶

The VegBIEN datasources are being validated by spot-checking sample rows from the *analytical_stem* view
Validation status is summarized on the VegBIEN datasources page and results are under Spot-checking
Aggregating validations status

VegCore ¶

The VegCore data dictionary is now hosted directly on the wiki rather than in a spreadsheet
- This allows anyone with a wiki login to edit definitions and preview their changes in Redmine (contact *Jim Regetz* to get a login)
Terms are now grouped into tables, similar to the Darwin Core data dictionary's categories
The *VegCore terms list* is autogenerated from the data dictionary instead of vice versa
Data providers should use the example specimen dataset as a template when providing us with Darwin Core data

VegBIEN data dictionary ¶

The VegBIEN data dictionary is now available in the *phpPgAdmin* web interface
The Comment column on the right contains table descriptions
Clicking a table name brings up a list of fields, with descriptions in the Comment column
Access to the schema (not the data) is provided by a passwordless public_ user, which the user is instructed to log in as

Database permalinks¶

Links to the phpPgAdmin web interface now always point to the latest stable version of the database
This is done by running the import into a separate schema and then replacing the public schema when the import is complete

Traits¶

Taxon occurrence-level traits from *bien2.TraitObservation* are now stored in VegBIEN's *trait* table
This information is exported for the analytical database in *taxon_trait*

Taxonomic schema¶

VegBIEN's taxonomic schema has been refactored to support a variety of ways of labeling taxa:

Method	Used by	Tables
first-class field for each rank	Darwin Core	taxonverbatim
tree of life hierarchy	NCBI	taxonlabel, taxonlabel_relationship
globally unique taxonomic name	VegBank	taxonlabel
taxon concepts	CVS	taxonconcept

Taxonomic name resolution¶

Taxonomic names in VegBIEN are now parsed, matched, and canonicalized using *TNRS*
TNRS has been run on over 2 million input names, which are standardized to accepted names

Phylogenetic backbone¶

To enable querying the database by higher classifications, the *NCBI taxonomic tree* is now imported into VegBIEN and linked up with the taxonomic names at the family and genus level
To look up the descendants or ancestors of a taxon, use the taxonlabel_relationship *cross-link table*

Geolocation schema¶

VegBIEN's geolocation schema has been refactored to support a variety of ways of locating observations:

Method	Used by	Tables
first-class field for each place level	Darwin Core	place
hierarchy of place names	VegBank	placename
coordinates	Darwin Core	coordinates
geovalidation	BIEN2 geoscrub	place

Geovalidation¶

BIEN2's geovalidation cache has been ported to VegBIEN for direct use with BIEN2 data in VegBIEN
A VegBIEN geovalidation workflow based on the *BIEN2 pipeline* has been developed

VegCSV ¶

We are using a CSV format similar to Darwin Core for plots data, in order to avoid the difficulties of working with XML documents and long XPaths
This new format, named VegCSV, uses a vocabulary of terms (VegCore) from *Darwin Core*, *VegX*, *VegBank*, and *SALVIAS* into several standard tables as CSV column names
It provides a "grab bag" of terms to map to, in the same way that DwC does
VegCSV has been formalized in a conference call with those involved in the development of VegX

VegBIEN schema ¶

The VegBIEN schema has been created by significantly refactoring the *VegBank schema*
The VegBIEN schema supports several new concepts:
- methods: plot methodology, line-intercept measurements, and size classes
- taxon class inclusions/exclusions: growth forms and plant concepts sampled/not sampled
- location determinations: successive remeasurements of plot GPS coordinates, georeferencing info
The plantname and namedplace tables have been redesigned as trees, with each element pointing to its parent element
- We are using closure tables, which store all the paths between tree nodes, to speed up queries
- Our closure table algorithm is in svn under schemas/tree_cross-links.sql
MySQL Workbench enables us to regularly synchronize the ERD with the SQL DDL (after it's translated into MySQL)

VegCore->VegBIEN mapping¶

We have a mapping from VegCore->VegBIEN
This human-readable version is automatically generated from the machine-readable version
The import uses the following algorithm:
1. Generate an in-memory XML template from the mappings
2. Insert this tree into the database in dependency order (leaves first) using the VegBank XML import algorithm and column-based import
To test the import process, login to vegbiendev and run: make test --directory=/home/bien/

Column-based import ¶

We now import data by column instead of by row, providing over an order of magnitude speed improvement and taking only ~24 hours, rather than days
The algorithm also handles many errors server-side using *wrapper functions*, which avoids the overhead of returning to the client for each error
Details are on the wiki under Column-based import

Data provider feedback¶

We are using a new error-logging and data provider feedback mechanism, which logs each invalid value instead of each invalid row
- This eliminates duplication in the logged errors, making it much easier to see individual problems affecting the data
Invalid values and their corresponding error messages are placed in an *errors table*, which is an auxiliary table for each datasource where the errors are deposited
SQL function calls and type casts are wrapped in an exception handler that saves errors into the errors table
Feedback is now provided on most if not all places where input data causes errors

Timeline¶

We created milestones and a *development timeline*
We are using Redmine to track progress on BIEN 3 tasks
You can watch an issue by clicking the issue, and in the upper-right-hand corner of the content area, clicking Watch
See also our To do list

Contents¶

Files (1)

Project

General

Profile

Wiki