Proposed enhancements

1.5-2 years left on 14 tasks; 1.5 years spent on 6 tasks since the 2012 working group [as of 2014-09-10]

  • task order reflects the BIEN group's decisions (with modifications by Mark), where a priority order was decided

fully-documented database

To assist users in interpreting our data, we would like to document all fields in our VegBIEN schema. Definitions of each column would be placed directly in the schema as column comments, so that the data dictionary cannot become separated from the schema, as has happened in most other vegetation databases. (0.5 months left; 1 month spent)
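
For example, a column definition can be attached with a standard PostgreSQL COMMENT statement. A minimal sketch; the table and column names here are illustrative, not necessarily the actual VegBIEN ones:

    -- the definition travels with the schema in every dump and copy
    COMMENT ON COLUMN taxonoccurrence.authortaxoncode IS
        'the code under which the data provider identifies this taxon occurrence';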

automated import validation

Whenever data is transformed from one schema to another, there is the potential for data to be lost or scrambled in translation. To detect these import process errors, we plan to create an *automated validation pipeline* that will run summarizing queries on the input and output data and check that the results match. (1.5-2 months left; 4 months spent)
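
A minimal sketch of the kind of summarizing check the pipeline would run, assuming hypothetical staging (input) and vegbien (output) tables; the real pipeline would compare many such aggregates, not just row counts:

    -- the import is suspect if the input and output row counts differ
    SELECT (SELECT count(*) FROM staging.plot_occurrences)
         = (SELECT count(*) FROM vegbien.taxonoccurrence) AS counts_match;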

database testing, validation, and refinement

Previous vegetation databases have not provided assurance to their users that their data is properly represented and valid within the confederated data resource. We would like to be the first vegetation database to completely audit its data, by verifying that data has been properly loaded and standardized, as well as by performing additional checks on the validity of the provided data itself. This would enable us to fully represent to the vegetation community that our database is an accurate and reliable information resource. (1-2 months left; 8.5 months spent)
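
As one illustration of a check on the validity of the provided data itself (the table and column names are hypothetical), out-of-range coordinates can be flagged with a simple query:

    -- flag records whose coordinates fall outside the valid ranges
    SELECT *
    FROM vegbien.coordinates
    WHERE latitude  NOT BETWEEN  -90 AND  90
       OR longitude NOT BETWEEN -180 AND 180;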

database publication

To enable researchers outside of our core research group to use our data, we would like to make the contents of our database public, to the extent that conditions of use allow. This entails determining the conditions of use for each datasource, and publishing the database in a format that follows these conditions. Where needed, private data and datasources will be removed to exclude them from the public database. (1 month left; 1 month spent)
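
A hedged sketch of how private data could be excluded when building the public copy, assuming a per-datasource privacy flag (all names here are illustrative):

    -- build the public extract from non-private datasources only
    CREATE TABLE public_extract AS
    SELECT o.*
    FROM taxonoccurrence o
    JOIN datasource d USING (datasource_id)
    WHERE NOT d.is_private;  -- is_private: an assumed per-datasource flag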

user community feedback

To ensure that our user community's needs are satisfied by the BIEN database, we would like to field-test the usefulness of the resource by having the expert scientists involved in the BIEN effort use it to answer specific questions. This would enable us to fully represent to the vegetation community that our database is a useful information resource. Ideally this will be done in the context of pressing science research questions, with iterative feedback to the database developers on the satisfactoriness of the database's contents, performance, usability, etc. (0.5 months left; 1 month spent)

source data update

Some of BIEN3's source data is still from its predecessor BIEN2. In addition, this source data has often been modified from its original form, a process which can introduce errors. To ensure that BIEN3 reflects the most recent and accurate representation of its datasources, we would like to obtain new extracts from each of the datasources used in BIEN2. (1 month left; 3 months spent)

data export interface

To facilitate the use of BIEN in scientific research, we would like to create a web-based data export interface. This would enable scientists to easily download the BIEN data they need, without having to ask support personnel for a custom extract or to be expert at constructing complex SQL statements. This will involve dialogue with expert researchers about the typical dimensions along which they want to filter and aggregate their data; the interface will include a geospatial selection capability, a taxonomic selection interface, and some basic summarization routines (e.g. taxa summary/plot; counts/taxon/plot; etc.) to facilitate refined query construction. It will likely be based on phpPgAdmin's select interface. (1 month)
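
For instance, the counts/taxon/plot summarization mentioned above might reduce to a query like the following (table and column names are hypothetical), which the interface would generate on the user's behalf:

    -- counts/taxon/plot: number of occurrences of each taxon in each plot
    SELECT plot_id, taxon_name, count(*) AS occurrences
    FROM taxon_occurrence
    GROUP BY plot_id, taxon_name
    ORDER BY plot_id, occurrences DESC;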

data upload interface

To encourage the growth of BIEN through new submissions, we would like to create a web-based data loading interface which would accept data in various standard formats, such as Darwin Core and VegCore. This would allow data providers to add data to BIEN themselves, rather than needing to rely on support personnel. This will involve creating a simple loading tool, with backend validation, and summary reports provided back to the data contributor. (1 month)
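
A minimal sketch of the backend side, under assumed table and column names: the upload is loaded into a staging table, then summarized for the contributor:

    -- load a contributor's Darwin Core CSV into a staging table
    COPY staging.darwin_core FROM '/tmp/upload.csv' WITH (FORMAT csv, HEADER);

    -- summary report for the contributor: rows loaded, rows failing a basic check
    SELECT count(*) AS rows_loaded,
           count(CASE WHEN "scientificName" IS NULL THEN 1 END) AS missing_scientific_name
    FROM staging.darwin_core;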

plain-text record identifiers

broader social impact: would make debugging and managing databases much easier; if we are successful in implementing these identifiers, we may set a good example for the database community to follow.

Most relational databases use arbitrary, numeric IDs to create relationships between rows. Unfortunately, these numeric IDs are unstable and will be different each time the data is reimported, preventing them from serving as permanent, publishable identifiers for the record. Further, the numeric IDs are unique only within one instance of the database, which causes collisions when different copies of the database are merged together (such as when combining data from different data providers). We have developed a stable, human-readable, and easily mergeable type of identifier called a text ID, which we would like to implement in our database to avoid these problems. (This would also speed up data validations, by creating a single identifier that can be used to match up input and output data.) (1-2 months)
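
A minimal sketch of the idea, with illustrative table and column names: the text ID is assembled from the record's natural key instead of drawn from a sequence:

    -- a text ID is built from the record's natural key, so it stays the same
    -- across reimports and stays unique when database copies are merged
    SELECT datasource || '.' || plot_code || '.' || record_number AS text_id
    FROM taxon_occurrence;
    -- e.g. 'SALVIAS.plot17.42' rather than an arbitrary integer like 908213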

streamlined import process

broader social impact: would allow adding automatic normalization and plain-text ID creation to any PostgreSQL schema

To dramatically improve maintainability, we would like to redesign our data import process to avoid techniques that introduce unnecessary complexity. For example, we have encountered significant schema-refactoring challenges as a result of using a previous database's import approach, which we discovered was not designed for frequent schema changes[1]. We have also had problems as a result of storing our business logic outside the database itself, and would like to move business rules into the schema so that they run automatically behind the scenes, without needing special coding in each database client.

We have developed a reimplementation of our import algorithm, called *trigger-based import*, that fixes these problems, and would like to implement it in our schema. This would facilitate schema changes (including those involved in switching to our intuitive schema below), reduce validation errors in the imported data, and eventually allow a transitive proof of our data's validity (by showing that each step preserves the correctness of the data, and therefore that the entire import process does, too). (4-6 months)

[1] it required us to use large XML documents and long XPaths to define our data normalization algorithm, which would then all need to be updated for a schema change.
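
To illustrate trigger-based import (the table, function, and normalization rule here are assumptions, not our actual schema): a trigger attached to the table applies a business rule automatically, so every database client gets the same normalization without special coding:

    -- the business rule lives inside the database and runs automatically,
    -- no matter which client inserts the data
    CREATE FUNCTION normalize_taxon_name() RETURNS trigger AS $$
    BEGIN
        NEW.taxon_name := trim(NEW.taxon_name);  -- an example normalization rule
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER normalize_on_import
        BEFORE INSERT OR UPDATE ON taxon_occurrence
        FOR EACH ROW EXECUTE PROCEDURE normalize_taxon_name();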

seamlessly-expandable schema

broader social impact: would enable creating easily-expandable schemas, without the need for major refactoring when optional, non-core columns are added

Current SQL schemas are typically limited to storing explicitly-supported data. This requires cumbersome schema changes every time a new field or table needs to be added to the database. We would like to support arbitrary data by using enhanced semantic techniques such as triplestore inheritance[2] and inline key-value maps. This will make it much easier to provide new columns to scientists, because the additional columns will only need to be renamed to a common namespace rather than also added to the database. (1-2 months)

[2] in which optional columns are implemented as single-field tables whose primary keys link to the main table. Queries could then join together several of these single-field tables to create a combined result set.
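
A minimal sketch of triplestore inheritance as described in footnote [2], with hypothetical names: the optional column lives in its own single-field table whose primary key links to the main table:

    -- the optional column soil_ph lives in its own single-field table,
    -- keyed by the same primary key as the main table
    CREATE TABLE plot_soil_ph (
        plot_id integer PRIMARY KEY REFERENCES plot (plot_id),
        soil_ph numeric
    );

    -- queries join the optional column back in only when it is needed
    SELECT p.*, s.soil_ph
    FROM plot p
    LEFT JOIN plot_soil_ph s USING (plot_id);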

intuitive schema

To make it easy to find data in BIEN, we would like to switch to a more intuitive schema. For this purpose, we would use our *normalized VegCore schema*, which has been described by one of our scientists as "miles ahead of VegBIEN". Normalized VegCore provides more entities, better names, and more accurate structure than VegBIEN, and also adds inheritance, an important semantic feature that would be a first for vegetation databases. (2-3 months left; 1 month spent)
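
Inheritance could be expressed directly with PostgreSQL table inheritance; whether normalized VegCore uses this particular mechanism is our assumption, and the entity names below are illustrative:

    -- a specimen occurrence *is an* occurrence: the child table inherits
    -- all of the parent table's columns and adds its own
    CREATE TABLE occurrence (
        occurrence_id serial PRIMARY KEY,
        date_observed date
    );

    CREATE TABLE specimen_occurrence (
        catalog_number text
    ) INHERITS (occurrence);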

published exchange schema

To encourage the widespread exchange of vegetation data for the benefit of the ecological community, we would like to publish our VegCore exchange schema as a formal standard. As the successor to the popular Darwin Core format, VegCore would add numerous entities and observation types, as well as support for multi-table relational databases. (1 month)

data mapping tool

Much vegetation data is not yet in a standard exchange schema, but we would like to include it in BIEN. To help data providers map their data to a BIEN-compatible format usable by the data loading interface, we would like to implement a graphical mapping tool that would generate extracts in our VegCore exchange format. The resulting VegCore extracts would even be useful independently of BIEN, for direct data exchange between institutions. (2 months)