input.Makefile: Removed no longer used custom via maps code, so that map files no longer need a prefix (which is always the same) specifying that they map through Veg+. Veg+ thus serves as the single gateway to VegBIEN, which avoids ever again having to maintain two copies of the mappings, as was the case when DwC and VegX XPaths were separate gateways. This will assist in untying the complex mapping logic in input.Makefile from file naming conventions in mappings/, and simplify the task of grouping each map with the CSV it maps.
input.Makefile: Removed no longer used DB inputs section, because all of our inputs are either CSV or (rarely) XML. This removes a significant amount of dead code that will make it easier to refactor input.Makefile to use custom CSV import orders.
mappings/Veg+-VegCore.specimens.csv: Added mappings for miscellaneous terms
mappings/Veg+.terms.csv: Added miscellaneous terms
to_do/: svn:ignore OpenOffice lock files
inputs/import.stats.xls: Updated with stats from latest import. The import time for SpeciesLink (the slowest datasource) went back down to 9 hours after replacing the slower _merge with _alt.
Added new autogen mappings/VegCore.self.specimens.csv (not currently used)
Merged DwC (including DwC1) and VegCSV mappings into new Veg+ schema. This involves replacing occurrences of DwC and VegCSV with Veg+ (or sometimes VegCore) everywhere, as described in <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/VegCSV-DwC_merging>.
README.TXT: Schema changes: Updated filenames of PDF ERD exports
Regenerated vegbien.ERD exports
xpath.py: parse(): _value(): Support '+' as a word character that doesn't need to be quoted
intersect: Fixed bug where test for ignore option needed to be removed, because ignore is not supported by this program
util.py: list_subset(): Fixed bug where using '+' to append the rest of the list didn't work if '+' was the first index, because max() cannot be called on an empty list
mappings/DwC2-VegBIEN.specimens.csv: Added VegCSV mappings, to enable use of one VegCSV-VegBIEN mapping for specimens and plots data
inputs/XAL/maps/DwC.specimens.csv: Remapped FieldNumber to recordNumber because this historical DwC term (http://rs.tdwg.org/dwc/terms/history/index.htm#fieldNumber-2009-04-24) has close to the same meaning as recordNumber, but not the same meaning as the current fieldNumber term
inputs/SpeciesLink/maps/DwC.specimens.csv: Remapped fieldNumber to recordNumber because term usage was inconsistent with DwC definition. Datasources often confuse this term, because it seems like the collection number, but is actually the author code for the event (VegBank's authorObsCode).
mappings/DwC2-VegBIEN.specimens.csv: catalogNumber: Added additional VegCSV mappings for mergability. taxonoccurrence.authortaxoncode: Added alternative mappings from VegCSV for mergability.
xml_func.py: simplify(): Apply pass-through optimizations for _if statements with no condition (which means false). This faciliates automated testing after an _if statement has been added, because the put template provided as part of the automated test will only change for those datasources that actually have a condition entry for the _if statement, which greatly reduces the number of tests that need to be accepted. (Note that the path before the _if will still be included as an empty path if there are no other mappings to that table, because the _if statement does not surround it.)
mappings/VegCSV-VegBIEN.specimens.csv: Added DwC mappings, to enable use of one VegCSV-VegBIEN mapping for specimens and plots data
schemas/vegbien.sql: Moved collectionnumber from specimenreplicate to plantobservation to replace authorplantcode, since these terms are used analogously in plots and specimens data. This code is really the DwC recordNumber (VegBIEN collectionnumber), which "serves as a link between field notes and an Occurrence record, such as a specimen [or plots data] collector's number" (http://rs.tdwg.org/dwc/terms/#recordNumber). Also, this prevents a specimenreplicate from incorrectly being created when plots data provides an authorplantcode.
mappings/DwC2-VegBIEN.specimens.csv: Mapped individualID for mergability with VegCSV
mappings/DwC2-VegBIEN.specimens.csv, VegCSV-VegBIEN.specimens.csv: Split occurrenceID into occurrenceID and individualID, where individualID refers to the plant in plots data and occurrenceID refers to the specimen in specimens data. This prevents plant sourceaccessioncodes from being mapped to the specimenreplicate, which was messing up stems mappings for the parent plantobservation. It also avoids mapping the specimenreplicate sourceaccessioncode to additional tables where it isn't needed. (Note that occurrenceID is needed for location to ensure that each specimen gets its own location to make locationdeterminations on. Everything else is directly or indirectly scoped by location when its own sourceaccessioncode isn't specified.)
mappings/DwC2-VegBIEN.specimens.csv, VegCSV-VegBIEN.specimens.csv: taxonoccurrence: Removed catalogNumber mapping because the catalogNumber applies only to the specimen, not to the occurrence, especially in plots data
mappings/DwC2-VegBIEN.specimens.csv, VegCSV-VegBIEN.specimens.csv: taxonoccurrence: Map everything except occurrenceID (which is globally unique) to new authortaxoncode, which only needs to be unique within the locationevent
schemas/vegbien.sql: taxonoccurrence: Renamed taxonoccurrence_locationevent_1_to_1 to taxonoccurrence_unique_within_locationevent and added new authortaxoncode to it
schemas/vegbien.sql: taxonoccurrence: Added authortaxoncode to store unique keys that are unique within the locationevent rather than within the datasource
inputs/SALVIAS-CSV/maps/VegCSV.organisms.csv: Added _alt to height_m, stem_height_m to choose between them when both are specified (rather than having bin/map choose their priority order based on their order in the map). Note that when both of the heights are specified, they are always either the same, or height_m is invalid (see <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/SALVIAS_issues#Some-organisms-have-one-stem-but-different-heights-in-the-organisms-and-stems-tables>).
bin/map: collision_suffix: Setting back to _alt to test if _merge caused the SpeciesLink slowdown. SpeciesLink contains a huge number of equivalent columns due to each DwC term being present with namespaces for all versions of the DwC schema, and these columns can be combined either using _alt or _merge. _merge is only useful if the values in different versions of the same DwC field are different, which is not likely the case.
inputs/import.stats.xls: Updated with stats from latest import. The import time for SpeciesLink (the slowest datasource) doubled, to 16 hours, most likely due to replacing _alt with the slower _merge, which preserves more input data.
mappings/DwC2-VegBIEN.specimens.csv, VegCSV-VegBIEN.specimens.csv: occurrenceID: Mapped to location.authorlocationcode instead of sourceaccessioncode so that it would not override any location- or event-related IDs in location.authorlocationcode merely by being mapped to the sourceaccessioncode field (which takes precedence over the authorlocationcode when specified)
mappings/VegCSV-VegBIEN.specimens.csv: occurrenceID: Mapped to specimenreplicate.sourceaccessioncode for mergability with DwC
mappings/VegCSV-VegBIEN.specimens.csv: Mapped voucherType to indirect voucher _if statements' conditions
mappings/VegCSV-VegBIEN.specimens.csv: locationID: location.sourceaccessioncode mapping: Added /_alt suffix for mergability with DwC
mappings/DwC2-VegBIEN.specimens.csv: collectionID: Mapped to location.authorlocationcode as merge with collectionCode, the same way as it is for specimenreplicate.collectioncode_dwc
schemas/vegbien.sql: location: location_unique_within_datasource_by_authorlocationcode unique index: Added `parent_id IS NULL` condition so that an authorlocationcode is not unintentionally treated as globally unique when a parent location is available (which implies that the authorlocationcode is a subplot code)
mappings/VegCSV-VegBIEN.specimens.csv: catalogNumber: Added location.authorlocationcode mapping for mergability with DwC
mappings/DwC2-VegBIEN.specimens.csv: location.authorlocationcode mappings: Added /_alt/3 for mergability with VegCSV mappings to same field
mappings/DwC2-VegBIEN.specimens.csv: catalogNumber: Wrapped all mappings in direct voucher _if for mergability with VegCSV
mappings/DwC2-VegBIEN.specimens.csv: catalogNumber: Moved direct/indirect voucher _if inwards to wrap just the value of catalognumber_dwc, not the catalognumber_dwc field node, to match the corresponding VegCSV mapping
mappings/DwC2-VegBIEN.specimens.csv: Replaced _alt with _merge where applicable to avoid losing source data on import when multiple fields collide
mappings/VegCSV-VegBIEN.specimens.csv: Cleaned up using `make mappings/`
schemas/functions.sql: join_strs_transform(): Use STRICT optimization to avoid needing to manually check if the state value or input value is NULL (http://www.postgresql.org/docs/8.3/static/sql-createaggregate.html#AEN51596)
schemas/functions.sql: join_strs(), join_strs_transform(): Reversed order of params to enable strict optimization, which replaces the state value with the first parameter, which used to be the delimiter (http://www.postgresql.org/docs/8.3/static/sql-createaggregate.html#AEN51596)
Renamed join_strs_transform_preserve_empty() to join_strs_transform() now that there are no other join_strs_transform_...() functions
schemas/functions.sql: Removed no longer used join_strs_transform_fold_empty()
schemas/functions.sql: join_strs() aggregate: Use join_strs_transform_preserve_empty() as an optimization because all our data has already had '' replaced with NULL by sql_io.cleanup_table() in csv2db. This will help speed up _merges now that they are performed on a large scale in the slowest datasource, SpeciesLink.
bin/map: collision_suffix: Changed to use _merge instead of _alt to avoid losing source data on import when multiple fields collide
bin/map: Preventing collisions if multiple inputs mapping to same output: Made collision suffix configurable so it can easily be changed
mappings/DwC2-VegBIEN.specimens.csv, VegCSV-VegBIEN.specimens.csv: taxonoccurrence.sourceaccessioncode mappings: Added catalogNumber mapping, which takes precendence over recordNumber and is applicable to specimens data and direct vouchers. recordNumber should only be used as a last resort (before the taxon name) because this is collector-assigned and often not unique within anything.
mappings/VegCSV-VegBIEN.specimens.csv: catalogNumber: Moved direct/indirect voucher _ifs inwards to wrap just the value of catalognumber_dwc, not the catalognumber_dwc field node, so that a future SQL function implementation of _if only needs to concern itself with returning one value or another, not with handling XML subtrees. The previous moving of the _ifs in r3942 was intended to effect this, but the _ifs weren't moved in far enough to wrap just the value.
mappings/VegCSV-VegBIEN.specimens.csv: eventDate mappings: Removed collectiondate mapping because the eventDate refers only to the plot event. Added /_alt suffixes for mergability with DwC.
mappings/DwC2-VegBIEN.specimens.csv, DwC1-DwC2.specimens.csv: Split eventDate into eventDate and dateCollected, where eventDate refers only to the date of the sampling event, but dateCollected also refers to the date the particular specimen was collected. (This distinction is important in merging with VegCSV, because in plots data, these two fields are distinct.) Remapped datasources with dateCollected-related fields to new dateCollected.
bin/map: Run new xml_func.simplify() on the root before printing the put template, so that _alts and _merges with only one element for the current datasource will be printed in their simplified form (with the _alt/_merge removed). This faciliates automated testing after an _alt/_merge suffix has been added, because the put template provided as part of the automated test will only change for those datasources that actually have an entry for both mappings, which greatly reduces the number of tests that need to be accepted.
xml_func.py: Added simplify()
xpath.py: put_obj(): Use new get_values(), so that the returned nodes are not modified by XML tree transformations, such as those performed by xml_func.process()
Added get_values()
xml_dom.py: is_empty(): Treat whitespace-only text nodes (including text nodes containing empty strings) as empty. This will also support None equivalents in text nodes, because they are isspace_none_str, which is considered whitespace.
xml_func.py: _map(): Don't remove None params, because are valid values and must be supported. This will become an issue once empty strings in text nodes are considered equivalent to None.
xml_func.py: _units(): Don't remove None params, because are valid values and must be supported. This will become an issue once empty strings in text nodes are considered equivalent to None.
xml_func.py: _name(): Fixed bug where needed to pass None values through and handle no name parts to properly support NULL propagation
xml_dom.py: value(), set_value(): Use new strings.isspace_none_str as sentinel None equivalent, to support cloning text nodes containing a sentinel None
strings.py: Added isspace_none_str to support clone-safe sentinel str values that pass isspace()
xml_dom.py: is_whitespace(): Also consider empty text nodes to be whitespace
xml_dom.py: is_whitespace(): Support text nodes whose value() is None by using .nodeValue instead
xml_dom.py: set_value(): Don't set the value of a text node to None by removing it, because this prevents the node from being reused. Instead use a sentinel string value to denote None, and map to and from it.
strings.py: Added none_str and helper class NonInternedStr to support sentinel str values
xml_dom.py: set_value(): Support setting the value of a text node to None, by removing it
Removed trailing whitespace on non-empty lines
sql_io.py: put_table(): DuplicateKeyException: is_literals: Fixed bug where sql.select() needed to select on just the join_cols, not the whole mapping
xml_func.py: process(): Removed support for no longer used structural functions
xml_func.py: Removed no longer used structural functions
mappings/for_review/DwC2-VegBIEN.specimens.fields.csv: input root: Removed DwC XML path info since DwC is now a CSV schema
mappings/DwC2-VegBIEN.specimens.csv: eventDate: Also map to obsstartdate/obsenddate, since the collectiondate is also the event date for specimens data, and for mergability with VegCSV
mappings/VegCSV-VegBIEN.specimens.csv: eventDate: Added mappings to obsstartdate/obsenddate, since users of this field (currently SALVIAS census_date) intend it as the plot event's date. Keep the mapping to collectiondate because a non-range plot event date is also the collectiondate of all organisms in that plot event.
schemas/py_functions.sql: parse_date_range(): Always return a value for end date, even if string is not a date range. This enables using _dateRangeEnd() as a filter function on anything intended as an end date.
mappings/DwC2-VegBIEN.specimens.csv, VegCSV-VegBIEN.specimens.csv: eventDate: collectiondate mapping: Removed _dateRangeStart filter because the eventDate (obsstartdate) is only valid as the date the specimen was collected if it is a single date, not a date range. (It is still valid as the obsstartdate/obsenddate if it's a range.)
mappings/Veg+.terms.csv: Added dateCollected
input via maps: Removed _date/date filter from date fields because the main mappings now have _date around all dates, so this filter is redundant
inputs/SALVIAS-CSV/maps/VegCSV.organisms.csv: census_date: Don't map directly to the year, as this field is allowed to be a full date even though our data sample contains only years. Note that _date/date will automatically detect plain years and treat them as years, and so will casts to timestamp.
inputs/SALVIAS*/maps/VegCSV.organisms.csv: census_date: Documented that this is for the subplot, not the organism, as all organisms in a subplot have the same value for it
mappings/DwC2-VegBIEN.specimens.csv: verbatimLatitude/verbatimLongitude: Fixed mappings to use _alt/2 instead of _alt/1 to avoid collisions with decimalLatitude/decimalLongitude
schemas/functions.sql: _merge(): Changed sort_orders to match the $-variable name instead of the function parameter name, so each line of the VALUES clause would use the same number for both
schemas/functions.sql: _merge(): Filter out NULL values as optimization so DISTINCT ON only has to consider non-NULL values
schemas/functions.sql: join_strs(): Return NULL if all strings were NULL or ''. This fixes unexpected behavior in _merge() where all elements are NULL but the return value is non-NULL.
schemas/functions.sql: Added join_strs_transform_preserve_empty() and use it in join_strs_transform_fold_empty()
schemas/functions.sql: Renamed join_strs_() to join_strs_transform_fold_empty() for clarity and to indicate that it's for use by the join_strs() aggregate
mappings/DwC2-VegBIEN.specimens.csv: recordNumber: Added VegCSV mappings for it
mappings/DwC2-VegBIEN.specimens.csv: occurrenceID: Added VegCSV mappings for it
mappings/DwC2-VegBIEN.specimens.csv: mappings to /location/sourceaccessioncode: Added _alt to prioritize them properly
inputs/UNCC/maps/DwC.specimens.csv: herbarium: Fixed mapping to go to institutionCode instead of collectionCode
mappings/DwC2-VegBIEN.specimens.csv: Remapped institutionCode/collectionCode/catalogNumber location mappings to location.authorlocationcode
schemas/vegbien.ERD.mwb: Reset methodtaxonclass lines so that only one needs to be repositioned after syncing with the schema
mappings/VegCSV-VegBIEN.specimens.csv: locationID: Removed mapping to locationevent.sourceaccessioncode, because locationID relates to the plot, not the plot event. (The locationevent is scoped by the location when the sourceaccessioncode and authoreventcode are not specified, so duplicate elimination will still occur correctly.)
mappings/DwC2-VegBIEN.specimens.csv: Mapped locationID, for mergability with VegCSV
mappings/VegCSV-VegBIEN.specimens.csv: plotName: Removed authoreventcode mapping because plotName relates to the plot, not the plot event. (The locationevent is scoped by the location when the authoreventcode is not specified, so duplicate elimination will still occur correctly.) Instead map only authoreventcode-related fields (currently CVS's authorObsCode) to authoreventcode, via DwC's (confusingly-named) fieldNumber ("An identifier given to the event in the field").
schemas/vegbien.sql: locationevent: locationevent_unique_within_location: Added authoreventcode to index. It was already in the locationevent_unique_within_*parent*_by_authoreventcode index, but also needed to be in the no-parent (non-subplot) index. This fixes locationevent duplicate elimination when a locationevent sourceaccessioncode is not specified.
schemas/vegbien.sql: location: location_unique_within_datasource unique index: Added COALESCE and `WHERE sourceaccessioncode IS NOT NULL` now that sourceaccessioncode is nullable. Renamed location_unique_within_datasource and location_unique_authorlocationcode to location_unique_within_datasource_by_... to show that both are alternatives for globally unique keys. schemas/vegbien.ERD.mwb: Moved elements slightly to reduce the number of lines that need to be repositioned after syncing with the schema.
mappings/DwC2-VegBIEN.specimens.csv: Mapped verbatimElevation and samplingProtocol, for mergability with VegCSV