VegCSV vs. VegX ¶

XML ¶

Advantages¶

Validatable according to a schema
Free-form nesting of elements

Disadvantages¶

XML data is more difficult to create and parse than CSVs¹
XML is a plaintext format, and plaintext cannot be queried directly
In-memory representations of XML are neither persistent nor indexed, making any queries very slow
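
To illustrate the parsing-effort point, here is a minimal sketch that reads the same two records from a CSV and from an XML document. The column names, element names, and sample data are hypothetical and chosen only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; column and element names are illustrative only.
CSV_TEXT = "plot,species,count\nP1,Quercus alba,3\nP2,Acer rubrum,5\n"
XML_TEXT = """<plots>
  <plot id="P1"><observation><species>Quercus alba</species><count>3</count></observation></plot>
  <plot id="P2"><observation><species>Acer rubrum</species><count>5</count></observation></plot>
</plots>"""

def rows_from_csv(text):
    # Each CSV row is already a flat record keyed by column name.
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def rows_from_xml(text):
    # The reader must know the nesting in order to flatten each record.
    rows = []
    for plot in ET.fromstring(text).iter("plot"):
        for obs in plot.iter("observation"):
            rows.append({
                "plot": plot.get("id"),
                "species": obs.findtext("species"),
                "count": obs.findtext("count"),
            })
    return rows

csv_rows = rows_from_csv(CSV_TEXT)
xml_rows = rows_from_xml(XML_TEXT)
```

Both functions yield the same flat records, but the XML reader has to encode knowledge of the document's nesting, while the CSV reader needs only the header row.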

VegX ¶

Advantages¶

Top-level tabular structure is similar to a relational database
Already exported by some data providers (CTFS, NVS)
Standardized in a *formal process*

Disadvantages¶

VegX's combination of top-level tables and deeply normalized subtrees complicates parsing and generation by forcing a hierarchical structure on input data that may not have one
- For example, the 1+ GB VegX files from CTFS must be loaded into memory all at once in order to access related rows scattered throughout the DOM tree in the separate top-level tables
- This requires preprocessing the input data by joining tables together or splitting them apart, as well as mapping or creating foreign keys between the VegX tables
VegX groups top-level types together and connects them with foreign keys ("pointers") in a similar way to a relational database, but has none of the searchability or querying advantages of SQL
We have no dependencies on a VegX intermediate format
- Specifically, we don't need to create intermediate VegX files for our mapping process, so we gain nothing from using long VegX XPaths as intermediate terms
When changing the mapping destination of a data field in VegX, generally the entire VegX file must be regenerated
Not driven by a particular aggregator's desired import format, nor by a particular data provider's desired export format
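
The pointer-following cost described above can be sketched as follows. The element and attribute names are invented for illustration and are not the real VegX schema. The point is only that related rows live in separate top-level sections, so the reader parses the whole document and builds its own index before it can join them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented for
# illustration and do not come from the VegX schema.
DOC = """<dataset>
  <plots>
    <plot id="p1"><name>Ridge</name></plot>
    <plot id="p2"><name>Valley</name></plot>
  </plots>
  <observations>
    <observation plotID="p2"><species>Acer rubrum</species></observation>
    <observation plotID="p1"><species>Quercus alba</species></observation>
  </observations>
</dataset>"""

def join_observations(xml_text):
    # Parses the entire document into memory before any lookup can happen.
    root = ET.fromstring(xml_text)
    # Build an index by hand; the tree itself provides none.
    plots_by_id = {plot.get("id"): plot for plot in root.iter("plot")}
    joined = []
    for obs in root.iter("observation"):
        plot = plots_by_id[obs.get("plotID")]  # follow the "pointer"
        joined.append((plot.findtext("name"), obs.findtext("species")))
    return joined

joined = join_observations(DOC)
```

A relational database would perform the same join with a persistent index and a single query. Here the index exists only for the lifetime of the process and must be rebuilt on every load.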

¹ Note that XML is still useful (and used by us) for representing configuration information (such as mappings) and as an in-memory tree data structure. However, it is not well suited to bulk data storage.

VegCSV CSV representation¶

Advantages¶

VegCSV is intentionally a looser format than VegX, for the following reasons:

Provides a format closer to the data provider's own data, which significantly lowers the cost of creating a standard export
Avoids excessive structure, to ensure that the data provider can include all of their original data if they are willing
Constraints for required columns are imposed by the data aggregator instead of the exchange schema, as different aggregators have different standards for the data they are willing to accept
Darwin Core data is automatically VegCSV-compliant, without needing to first be split apart into 1:1 tables
Very easy to extend: a data provider simply exports a custom column, and providers gradually converge on a standard name for that column (e.g. cultivated)
- This provides a much more robust way of storing and exchanging user-defined data than the mechanism VegX provides on some XML elements, and requires no formal standardization process
CSVs can store data in the hierarchical format of the input data, rather than imposing a particular structure
Dependency relationships are indicated simply by defining the import order of the CSVs
Driven directly by an existing import algorithm for an existing aggregating database (VegBIEN)
- i.e. there is a customer for the product
Driven directly by an existing export format (DwC) for existing datasources (all the DwC data, plus several plots datasources)
- i.e. there are (many) providers of the product
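
As a minimal sketch of aggregator-imposed constraints, the check below lets each aggregator declare its own required columns and ignores any extra columns the provider exports. The column names are Darwin Core terms, but which of them are required is an assumption made for this example, not VegBIEN's actual policy.

```python
import csv
import io

# Hypothetical aggregator policy: the names are Darwin Core terms, but the
# choice of which ones are required is an assumption for illustration.
REQUIRED_COLUMNS = {"scientificName", "decimalLatitude", "decimalLongitude"}

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required columns absent from the CSV header, sorted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    # Extra provider columns (e.g. "cultivated") are simply ignored here.
    return sorted(set(required) - header)

# A provider export with a custom "cultivated" column and one missing column.
EXPORT = "scientificName,decimalLatitude,cultivated\nAcer rubrum,35.9,no\n"
missing = missing_columns(EXPORT)
```

Because the required set lives with the aggregator rather than in the exchange format, a second aggregator can pass a different `required` set without any change to the provider's export.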

Disadvantages¶

Less uniform, so may require some post-processing by the data aggregator (e.g. LEFT JOINing taxon concepts)
- But this reduces the burden on the data provider to export their data into a standard schema
Not directly validatable, because the validation depends on the needs of the data aggregator using the data
- But a standard relational schema for VegCSV could be constructed, which would have validation inherent in the database constraints
- This would be optional, as DwC data is generally denormalized and it is not necessary to pre-normalize it for our import process
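
The post-processing step mentioned above (LEFT JOINing taxon concepts) might look like the following sketch, which uses an in-memory SQLite database. The table and column names are hypothetical and do not reflect the VegBIEN schema.

```python
import sqlite3

# Hypothetical tables; names are illustrative, not the VegBIEN schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (plot TEXT, taxon_name TEXT);
CREATE TABLE taxon_concept (taxon_name TEXT PRIMARY KEY, accepted_name TEXT);
INSERT INTO observation VALUES ('P1', 'Quercus alba'), ('P2', 'Acer rubrum L.');
INSERT INTO taxon_concept VALUES ('Acer rubrum L.', 'Acer rubrum');
""")

# LEFT JOIN keeps every observation, even when no taxon concept matches.
rows = conn.execute("""
SELECT o.plot, o.taxon_name, t.accepted_name
FROM observation AS o
LEFT JOIN taxon_concept AS t ON t.taxon_name = o.taxon_name
ORDER BY o.plot
""").fetchall()
```

Observations without a matching concept come back with a NULL accepted name instead of being dropped, so the aggregator can decide later how to resolve them.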

VegCore terms¶

Advantages¶

For ease of mapping and maintainability, it is preferable to map to short, descriptive terms as the intermediate schema
Datasources' column names are often the same as a standard DwC term, in which case the intermediate mapping can be applied automatically
- This auto-mapping saves significant time in the mapping process
- Since multiple datasources often use the same nonstandard name for a field, these nonstandard mappings can be captured in a global mapping that augments the auto-mapping of standard names, further saving time
When changing the mapping destination of a data field in VegCSV, only the mapping of that column must be changed
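
The auto-mapping and global mapping described above can be sketched roughly as follows. This is not the actual mapping tooling. The standard term names are Darwin Core terms, while the global mapping entries, the function name, and the lookup order are assumptions made for illustration.

```python
# A few Darwin Core terms, standing in for the full list of standard terms.
STANDARD_TERMS = {"scientificName", "decimalLatitude", "decimalLongitude", "recordedBy"}

# Hypothetical global mapping of nonstandard names shared by many datasources.
GLOBAL_MAP = {
    "lat": "decimalLatitude",
    "long": "decimalLongitude",
    "collector": "recordedBy",
}

def map_columns(columns, datasource_map=None):
    """Map datasource columns to terms; return (mapping, unmapped columns)."""
    datasource_map = datasource_map or {}
    mapping, unmapped = {}, []
    for col in columns:
        if col in datasource_map:      # per-datasource override wins
            mapping[col] = datasource_map[col]
        elif col in STANDARD_TERMS:    # auto-mapping of standard names
            mapping[col] = col
        elif col in GLOBAL_MAP:        # shared nonstandard names
            mapping[col] = GLOBAL_MAP[col]
        else:
            unmapped.append(col)       # left for manual mapping
    return mapping, unmapped

auto_mapping, needs_manual = map_columns(["scientificName", "lat", "plotCode"])
```

Only the columns left in `needs_manual` require attention, and changing where one column maps means editing a single entry rather than regenerating a whole file.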

Disadvantages¶

The same term may have different meanings to different data providers
- Darwin Core partially addresses this by providing a *standard data dictionary*
CSV doesn't have a standard way of representing a data dictionary, whereas XML has XSD annotations
- However, CSVs can be imported into a standardized relational database schema which represents term definitions in a field or table comment

Conclusion¶

Ultimately, VegBIEN gains little from VegX as an exchange schema, but much from Darwin Core and its VegCSV extension

For these reasons, we have switched to mapping our plots data via VegCSV terms. We hope that VegBIEN data providers will eventually adopt VegCSV as the preferred exchange format for plots data. Also, as many of the new VegCSV terms are consistent with DwC's design and naming conventions, we hope VegCSV will eventually become an extension to DwC.
