VegCSV vs. VegX

XML

Advantages

  • Validatable according to a schema
  • Free-form nesting of elements

Disadvantages

  • XML data is more difficult to create and parse than CSVs [1]
  • XML is a plaintext format and, like any plaintext, is not directly queryable
  • In-memory representations of XML are neither persistent nor indexed, making any queries very slow

VegX

Advantages

  • Top-level tabular structure is similar to a relational database
  • Already exported by some data providers (CTFS, NVS)
  • Standardized in a *formal process*

Disadvantages

  • VegX's combination of top-level tables and deeply normalized subtrees complicates parsing and generation by forcing a hierarchical structure on input data that may not have one
    For example, the 1+ GB VegX files from CTFS must be loaded into memory all at once in order to access related rows scattered across the separate top-level tables in the DOM tree
  • Producing VegX therefore requires preprocessing the input data by joining tables together or splitting them apart, as well as mapping or creating foreign keys between the VegX tables
  • VegX groups top-level types together and connects them with foreign keys ("pointers") in a similar way to a relational database, but has none of the searchability or querying advantages of SQL
  • We have no dependencies on a VegX intermediate format
    Specifically, we don't need to create intermediate VegX files for our mapping process, so we don't gain anything from using long VegX XPaths as intermediate terms
  • When changing the mapping destination of a data field in VegX, generally the entire VegX file must be regenerated
  • Not driven by a particular aggregator's desired import format, nor by a particular data provider's desired export format
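
The foreign-key point above can be illustrated with a minimal sketch. The element and attribute names below (plots, observations, plotID) are hypothetical and much simpler than the real VegX schema; they only mimic the pattern of separate top-level tables connected by pointers. Resolving a pointer requires an index over the whole parsed tree, which is why the complete document has to be in memory first:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document. This is NOT the real VegX schema; it only
# imitates top-level "tables" whose rows point at each other by ID.
DOC = """
<dataset>
  <plots>
    <plot id="p1"><name>Plot A</name></plot>
    <plot id="p2"><name>Plot B</name></plot>
  </plots>
  <observations>
    <observation plotID="p2"><species>Quercus alba</species></observation>
    <observation plotID="p1"><species>Acer rubrum</species></observation>
  </observations>
</dataset>
"""

def resolve_observations(xml_text):
    """Return (plot name, species) pairs by following plotID pointers."""
    # Parses the entire document into an in-memory tree.
    root = ET.fromstring(xml_text)
    # The pointer targets live in a different top-level table, so an index
    # over all plots has to be built before any observation can be resolved.
    plots_by_id = {p.get("id"): p.findtext("name") for p in root.iter("plot")}
    rows = []
    for obs in root.iter("observation"):
        rows.append((plots_by_id[obs.get("plotID")], obs.findtext("species")))
    return rows

rows = resolve_observations(DOC)
print(rows)
```

In a denormalized CSV, the plot name would simply be another column on the observation row, so each row could be processed as it is read.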

[1] Note that XML is still useful (and used by us) for representing configuration information (such as mappings) and as an in-memory tree data structure. However, it is not useful for bulk data storage.

VegCSV CSV representation

Advantages

VegCSV is intentionally a looser format than VegX, for the following reasons:

  • Provides a format closer to the data provider's own data, which significantly lowers the cost of creating a standard export
  • Avoids excessive structure, to ensure that the data provider can include all of their original data if they are willing
    Constraints for required columns are imposed by the data aggregator instead of the exchange schema, as different aggregators have different standards for the data they are willing to accept
  • Darwin Core data is automatically VegCSV-compliant, without needing to first be split apart into 1:1 tables
  • Very easy to expand, simply by exporting a custom column and having data providers gradually converge on a standard name for that column (e.g. cultivated)
    This provides a much more robust way of storing and exchanging user-defined data than that provided by VegX on some XML elements, and without the need for a formal standardization process
  • CSVs can store data in the hierarchical format of the input data, rather than imposing a particular structure
  • Dependency relationships are indicated simply by defining the import order of the CSVs
  • Driven directly by an existing import algorithm for an existing aggregating database (VegBIEN)
    i.e. there is a customer for the product
  • Driven directly by an existing export format (DwC) for existing datasources (all the DwC data, plus several plots datasources)
    i.e. there are (many) providers of the product
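
As a rough sketch of how required-column constraints might live with the aggregator rather than in the exchange format, the check below compares a CSV header against one aggregator's own list. The required set and the plotName column are assumptions made for illustration, not an actual VegBIEN rule; extra columns such as cultivated are passed through rather than rejected:

```python
import csv
import io

# Hypothetical requirements of one aggregator. Another aggregator could use a
# different set without any change to the exchange format itself.
REQUIRED_COLUMNS = {"scientificName", "plotName"}

def check_csv(text, required=REQUIRED_COLUMNS):
    """Return (missing required columns, extra provider-defined columns)."""
    reader = csv.DictReader(io.StringIO(text))
    header = set(reader.fieldnames or [])
    missing = sorted(required - header)
    extra = sorted(header - required)
    return missing, extra

# A provider export with one custom column ("cultivated").
SAMPLE = "plotName,scientificName,cultivated\nPlot A,Acer rubrum,false\n"
missing, extra = check_csv(SAMPLE)
print(missing, extra)
```

Here the file passes (nothing is missing) and the custom column is reported only so that the aggregator can decide whether to map it.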

Disadvantages

  • Less uniform, so may require some post-processing by the data aggregator (e.g. LEFT JOINing taxon concepts)
    • But this reduces the burden on the data provider to export their data into a standard schema
  • Not directly validatable, because the validation depends on the needs of the data aggregator using the data
    • But a standard relational schema for VegCSV could be constructed, which would have validation inherent in the database constraints
    • This would be optional, as DwC data is generally denormalized and it is not necessary to pre-normalize it for our import process
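
The two mitigations above can be sketched with an in-memory SQLite database. The table and column names are hypothetical, chosen only to show the shape of the approach: a NOT NULL constraint stands in for schema-level validation, and a LEFT JOIN attaches taxon concepts while keeping rows that have no match:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables; the constraint plays the role of validation.
conn.executescript("""
CREATE TABLE observation (plot TEXT NOT NULL, taxon_name TEXT);
CREATE TABLE taxon_concept (taxon_name TEXT PRIMARY KEY, accepted_name TEXT);
INSERT INTO observation VALUES ('Plot A', 'Acer rubrum'), ('Plot A', 'Unknown sp.');
INSERT INTO taxon_concept VALUES ('Acer rubrum', 'Acer rubrum L.');
""")

# Post-processing on the aggregator side: unmatched observations are kept,
# with NULL (None in Python) for the missing taxon concept.
joined = conn.execute("""
SELECT o.plot, o.taxon_name, t.accepted_name
FROM observation AS o
LEFT JOIN taxon_concept AS t ON t.taxon_name = o.taxon_name
ORDER BY o.taxon_name
""").fetchall()

# Validation through database constraints: a row without a plot is rejected.
try:
    conn.execute("INSERT INTO observation VALUES (NULL, 'Acer rubrum')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(joined, rejected)
```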

VegCore terms

Advantages

  • For ease of mapping and maintainability, it is preferable to map to short, descriptive terms as the intermediate schema
  • Datasources' column names are often the same as a standard DwC term, in which case the intermediate mapping can be applied automatically
    This auto-mapping saves significant time in the mapping process
  • Since multiple datasources often use the same, nonstandard name for a field, these nonstandard mappings can be captured in a global mapping that augments the auto-mapping of standard names, further saving time
  • When changing the mapping destination of a data field in VegCSV, only the mapping of that column must be changed
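
A minimal sketch of the auto-mapping idea described above, under stated assumptions: the three standard terms are real DwC term names, but the global mapping entries, the dbh_cm column, and the map_columns helper are invented for illustration and are not taken from our mapping code:

```python
# A few standard DwC term names; a real list would be much longer.
STANDARD_TERMS = {"scientificName", "decimalLatitude", "decimalLongitude"}

# Hypothetical global mapping of nonstandard names shared by several datasources.
GLOBAL_MAPPING = {"lat": "decimalLatitude", "species": "scientificName"}

def map_columns(columns, overrides=None):
    """Map datasource columns to intermediate terms.

    Per-datasource overrides win, then exact matches on standard terms,
    then the global mapping. Anything else is left for manual mapping.
    """
    overrides = overrides or {}
    mapping = {}
    unmapped = []
    for col in columns:
        if col in overrides:
            mapping[col] = overrides[col]
        elif col in STANDARD_TERMS:
            mapping[col] = col  # auto-mapped: name already is a standard term
        elif col in GLOBAL_MAPPING:
            mapping[col] = GLOBAL_MAPPING[col]
        else:
            unmapped.append(col)
    return mapping, unmapped

mapping, unmapped = map_columns(["scientificName", "lat", "dbh_cm"])
print(mapping, unmapped)
```

Changing where one column goes then means editing a single entry (or passing a single override), without touching the other columns.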

Disadvantages

  • The same term may have different meanings to different data providers
  • CSV doesn't have a standard way of representing a data dictionary, where XML has XSD annotations
    • However, CSVs can be imported into a standardized relational database schema which represents term definitions in a field or table comment

Conclusion

Ultimately, VegBIEN gains little from VegX as an exchange schema, but much from Darwin Core and its VegCSV extension.

For these reasons, we have switched to mapping our plots data via VegCSV terms. Eventually, it is hoped that VegBIEN data providers will use VegCSV as the preferred exchange format for plots data. Also, as many of the new VegCSV terms are consistent with DwC's design and naming conventions, we hope VegCSV will eventually become an extension to DwC.