VegX changes

Schema

New CSV representation

While trying to load a VegX export of the CTFS database, we found several issues with large VegX files:

  • Each logical row is scattered across the XML document, because the data is grouped by top-level table instead of by row
  • As a result, the entire XML file must be loaded into memory at once before any row can be reassembled
  • The DOM tree that Python builds is much larger than the file itself (30x or more), which exhausts the available memory and crashes the import
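The first two points can be illustrated with a minimal sketch. The element and attribute names below (`plot`, `organism`, `stem`, `plotID`, `organismID`) are hypothetical and heavily simplified; they are not taken from the VegX schema. The point is only that rebuilding one logical row needs lookups into several top-level tables, so the whole document has to be parsed first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified layout: data grouped by top-level table, not by row.
# These names are illustrative only and are NOT the real VegX element names.
DOC = """
<dataset>
  <plots>
    <plot id="p1" name="Plot A"/>
    <plot id="p2" name="Plot B"/>
  </plots>
  <organisms>
    <organism id="o1" plotID="p1" tag="0001"/>
    <organism id="o2" plotID="p2" tag="0002"/>
  </organisms>
  <stems>
    <stem id="s1" organismID="o1" diameter="12.5"/>
    <stem id="s2" organismID="o2" diameter="8.0"/>
  </stems>
</dataset>
"""

def logical_rows(xml_text):
    """Reassemble (plot, organism, stem) rows from table-grouped XML."""
    # The whole document is parsed into an in-memory tree before any row
    # can be produced, because the parts of a row live in different tables.
    root = ET.fromstring(xml_text)
    plots = {p.get("id"): p for p in root.iter("plot")}
    organisms = {o.get("id"): o for o in root.iter("organism")}
    rows = []
    for stem in root.iter("stem"):
        organism = organisms[stem.get("organismID")]
        plot = plots[organism.get("plotID")]
        rows.append(
            (plot.get("name"), organism.get("tag"), float(stem.get("diameter")))
        )
    return rows

ROWS = logical_rows(DOC)
```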

We propose to solve these problems with a CSV format similar to Darwin Core for plots data, which would be both easier to import and easier to generate:

  • This new format, named VegCSV, will use just the leaf names from VegX, as well as terms from DwC and VegBank, as CSV column names.
  • It will provide a "grab bag" of terms to map to, in the same way that DwC does.
  • Nesting of data (organisms within plots, stems within organisms) will be represented as multiple DB tables or CSV files, with each child record holding a foreign key to its parent record
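As a rough sketch of the foreign-key layout, assuming one CSV per nesting level: the column names `plotID`, `plotName`, `organismID`, `stemID`, and `diameter` are hypothetical placeholders, since the actual VegCSV terms are not fixed here. The parent tables are indexed by key, and the largest table (stems) is read one row at a time.

```python
import csv
import io

# Hypothetical VegCSV files, inlined as strings to keep the sketch self-contained.
# Column names are assumptions; only identificationLabel appears in this proposal.
PLOTS_CSV = "plotID,plotName\np1,Plot A\np2,Plot B\n"
ORGANISMS_CSV = "organismID,plotID,identificationLabel\no1,p1,0001\no2,p2,0002\n"
STEMS_CSV = "stemID,organismID,diameter\ns1,o1,12.5\ns2,o1,3.1\ns3,o2,8.0\n"

def index_by(csv_text, key):
    """Load a parent table into a dict keyed by its ID column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}

def stem_rows(plots_csv, organisms_csv, stems_csv):
    """Yield one joined row per stem, following foreign keys upward."""
    plots = index_by(plots_csv, "plotID")
    organisms = index_by(organisms_csv, "organismID")
    # The child table is streamed row by row instead of being held in memory.
    for stem in csv.DictReader(io.StringIO(stems_csv)):
        organism = organisms[stem["organismID"]]
        plot = plots[organism["plotID"]]
        yield (
            plot["plotName"],
            organism["identificationLabel"],
            float(stem["diameter"]),
        )

STEM_ROWS = list(stem_rows(PLOTS_CSV, ORGANISMS_CSV, STEMS_CSV))
```

In this sketch only the parent tables are kept in memory; if those are also too large, the same join could be done inside a database after a bulk CSV load.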

We hope that this format will preserve the content of VegX, while solving the issues involved with using XML.

Changes

Structural

  • Allow individualOrganismObservation to be nested inside aggregateOrganismObservation, as an alternative to the current linking method
    • Currently, individualOrganismObservation and aggregateOrganismObservation can only be linked through a common taxonNameUsageConcept
  • Allow all top-level tables to alternatively be nested inside their parent elements
    • e.g. individualOrganismObservation inside plotObservation
    • This duplicates the existing pointer-based method of connecting parent and child tables, so importers will have to handle both forms
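A minimal sketch of what supporting both forms might look like for an importer. The `id` and `plotObservationID` attributes are hypothetical simplifications chosen for brevity; the way VegX actually encodes IDs and pointers is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing both forms. Attribute names are assumptions.
DOC = """
<dataset>
  <plotObservation id="po1">
    <individualOrganismObservation id="io1"/>
  </plotObservation>
  <plotObservation id="po2"/>
  <individualOrganismObservation id="io2" plotObservationID="po2"/>
</dataset>
"""

def parent_of_each_observation(xml_text):
    """Map each individualOrganismObservation ID to its plotObservation ID."""
    root = ET.fromstring(xml_text)
    parents = {}
    # Nested form: the parent is the enclosing element.
    for plot_obs in root.iter("plotObservation"):
        for child in plot_obs.findall("individualOrganismObservation"):
            parents[child.get("id")] = plot_obs.get("id")
    # Pointer form: a top-level child names its parent explicitly.
    for child in root.findall("individualOrganismObservation"):
        parents[child.get("id")] = child.get("plotObservationID")
    return parents

PARENTS = parent_of_each_observation(DOC)
```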

Convert user-defined fields to first-class fields

21 of 26 fields (in 4 tables) will be converted:

  • plotObservation
    • methodNarrative
    • parentPlotObservationID
    • precipitation
    • sourceAccessionCode
  • individualOrganismObservation
    • collectionDate
    • growthForm
    • sourceAccessionCode
  • individualOrganism
    • authorPlantCode
    • identificationLabel: multiple copies will be allowed to accommodate tag2
  • abioticObservation: may be standardized to the soil exchange schema
    • acidity
    • base
    • calcium
    • carbon
    • cationExchangeCapacity
    • clay
    • conductivity
    • organic
    • sand
    • silt
    • sodium
    • texture

Other fields will remain user-defined:

  • individualOrganismObservation
    • canopyForm
    • canopyPosition
    • censusNo
    • heightFirstBranch
    • lianaInfestation
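The split above can be expressed as a simple lookup table. The sketch below assumes that a record's user-defined fields arrive as a name-to-value dict; the `split_fields` helper and that input shape are hypothetical, while the field names are the ones listed above.

```python
# Fields to be promoted to first-class fields, per table (from the lists above).
PROMOTED = {
    "plotObservation": {
        "methodNarrative", "parentPlotObservationID",
        "precipitation", "sourceAccessionCode",
    },
    "individualOrganismObservation": {
        "collectionDate", "growthForm", "sourceAccessionCode",
    },
    "individualOrganism": {"authorPlantCode", "identificationLabel"},
    "abioticObservation": {
        "acidity", "base", "calcium", "carbon", "cationExchangeCapacity",
        "clay", "conductivity", "organic", "sand", "silt", "sodium", "texture",
    },
}

def split_fields(table, user_defined):
    """Split user-defined fields into (first-class, still user-defined).

    Hypothetical helper: `user_defined` is assumed to be a dict of
    field name -> value for one record of `table`.
    """
    promoted = PROMOTED.get(table, set())
    first_class = {k: v for k, v in user_defined.items() if k in promoted}
    remaining = {k: v for k, v in user_defined.items() if k not in promoted}
    return first_class, remaining

FIRST_CLASS, REMAINING = split_fields(
    "individualOrganismObservation",
    {"growthForm": "tree", "censusNo": "3", "canopyForm": "round"},
)
PROMOTED_COUNT = sum(len(fields) for fields in PROMOTED.values())
```

Here `PROMOTED_COUNT` matches the 21 converted fields, and fields such as censusNo and canopyForm fall through to the user-defined remainder.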

Closed lists

  • /*s/taxonConcept/Rank/@code/TaxonomicRankEnum/TaxonomicRankSpeciesGroupEnum: add auth