Project

General

Profile

Database development

BIEN 3 progress

Priority tasks

Real-time progress is available via Redmine's *Activity tab* and corresponding *Atom feed* (there is a delay before Redmine adds activity to the list)

See also BIEN3 - Progress since the 2011 working group.docx

Analytical database

VegBIEN datasources

  • Several new datasources were added: ACAD, HIBG, JBM, MT, QFA, TRT, TRTE, UBC, VASCAN, WIN
  • Several existing datasources were refreshed: CVS, NCU, SALVIAS, TEAM
  • The datasources in VegBIEN, along with their row counts, are now summarized in the *provider_count* table
  • All BIEN2 datasources are in VegBIEN
  • This includes 28 million input rows, more than twice BIEN2's 12 million rows (viewFullOccurrence)
  • The VegBIEN database is hosted at *vegbiendev.nceas.ucsb.edu* and can be accessed using the instructions on the wiki under PhpPgAdmin
    • If you have an account on nimoy or vegbankdev, you will also have an account on vegbiendev with the same login
    • BIEN 3-related files are available on vegbiendev in /home/bien
    • Note: Most datasource names are herbarium codes as defined by the *Index Herbariorum*

Validation

VegCore

VegBIEN data dictionary

  • The VegBIEN data dictionary is now available in the *phpPgAdmin* web interface
  • The Comment column on the right contains table descriptions
  • Clicking a table name brings up a list of fields, with descriptions in the Comment column
  • Access to the schema (not the data) is provided by a passwordless public_ user, which the user is instructed to log in as

Database permalinks

  • Links to the phpPgAdmin web interface now always point to the latest stable version of the database
  • This is done by running the import into a separate schema and then replacing the public schema when the import is complete

Traits

Taxonomic schema

  • VegBIEN's taxonomic schema has been refactored to support a variety of ways of labeling taxa:
    Method Used by Tables
    first-class field for each rank Darwin Core taxonverbatim
    tree of life hierarchy NCBI taxonlabel, taxonlabel_relationship
    globally unique taxonomic name VegBank taxonlabel
    taxon concepts CVS taxonconcept

Taxonomic name resolution

  • Taxonomic names in VegBIEN are now parsed, matched, and canonicalized using *TNRS*
  • TNRS has been run on over 2 million input names, which are standardized to accepted names

Phylogenetic backbone

  • To enable querying the database by higher classifications, the *NCBI taxonomic tree* is now imported into VegBIEN and linked up with the taxonomic names at the family and genus level
  • To look up the descendants or ancestors of a taxon, use the taxonlabel_relationship *cross-link table*

Geolocation schema

  • VegBIEN's geolocation schema has been refactored to support a variety of ways of locating observations:
    Method Used by Tables
    first-class field for each place level Darwin Core place
    hierarchy of place names VegBank placename
    coordinates Darwin Core coordinates
    geovalidation BIEN2 geoscrub place

Geovalidation

VegCSV

VegBIEN schema

  • The VegBIEN schema has been created by significantly refactoring the *VegBank schema*
  • The VegBIEN schema supports several new concepts:
    • methods: plot methodology, line-intercept measurements, and size classes
    • taxon class inclusions/exclusions: growth forms and plant concepts sampled/not sampled
    • location determinations: successive remeasurements of plot GPS coordinates, georeferencing info
  • The plantname and namedplace tables have been redesigned as trees, with each element pointing to its parent element
  • MySQL Workbench enables us to regularly synchronize the ERD with the SQL DDL (after it's translated into MySQL)

VegCore->VegBIEN mapping

  • We have a mapping from VegCore->VegBIEN
  • This human-readable version is automatically generated from the machine-readable version
  • The import uses the following algorithm:
    1. Generate an in-memory XML template from the mappings
    2. Insert this tree into the database in dependency order (leaves first) using the VegBank XML import algorithm and column-based import
  • To test the import process, login to vegbiendev and run: make test --directory=/home/bien/

Column-based import

  • We now import data by column instead of by row, providing over an order of magnitude speed improvement and taking only ~24 hours, rather than days
  • The algorithm also handles many errors server-side using *wrapper functions*, which avoids the overhead of returning to the client for each error
  • Details are on the wiki under Column-based import

Data provider feedback

  • We are using a new error-logging and data provider feedback mechanism, which logs each invalid value instead of each invalid row
    • This eliminates duplication in the logged errors, making it much easier to see individual problems affecting the data
  • Invalid values and their corresponding error messages are placed in an *errors table*, which is an auxiliary table for each datasource where the errors are deposited
  • SQL function calls and type casts are wrapped in an exception handler that saves errors into the errors table
  • Feedback is now provided on most if not all places where input data causes errors

Timeline

  • We created milestones and a *development timeline*
  • We are using Redmine to track progress on BIEN 3 tasks
  • You can watch an issue by clicking the issue, and in the upper-right-hand corner of the content area, clicking Watch
  • See also our To do list

Contents