review: Support limiting the XPath simplifying to custom columns, rather than always the first two
review: Usage message: Fixed typo
Added mappings/for_review/VegBIEN-DwC2.specimens.csv, generated by inverting for_review/DwC2-VegBIEN.specimens.csv. This will be used to help translate VegX->VegCSV.
mappings: Made VegX-VegBIEN.organisms.csv, VegX-VegBIEN.plots.csv symlinks to VegX-VegBIEN.stems.csv instead of building them in the Makefile by copying VegX-VegBIEN.stems.csv, since these files are now always the same
mappings/VegX-VegBIEN.stems.csv: _if that maps to specimenreplicate via plantobservation or voucher: Refactored to map right-hand side of _eq in the left-hand side mapping, rather than in all then/else mappings. Distinguish this _if statement from others using new name param.
xml_func.py: _if(): Documented that can add `name` param to distinguish separate _if statements
xml_func.py: _if(): Made cond optional. When it's not specified or None, it is treated as False. This supports cases where all elements of the condition are required but not mapped to.
mappings/VegX-VegBIEN.stems.csv: _if that maps to specimenreplicate via plantobservation or voucher: Refactored to map voucherType directly into _if/cond/_eq/left rather than mapping it to a temporary _ignore location and retrieving it with _ref
xml_func.py: Removed no longer used _simplifyPath(), which is now a built-in function of db_xml.put()
xml_func.py: _eq(): Documented that '' (empty node) is returned if a value was not mapped to, not if a value was None, since None arguments are no longer removed by process() (now XML functions do this manually with conv_items())
xml_func.py: _ref(): Only display "XPath reference target missing" warning if target node does not exist, not if it exists but is empty
xpath.py: get(): reference expansion: Use get_1() and check for None result instead of using get(), which returns multiple nodes when we just want the first
mappings/VegX-VegBIEN.stems.csv: Reversed XPaths so that they start with location instead of plantobservation
lib/common.Makefile: Added $(cp)
mappings/Makefile: Include lib/common.Makefile
lib/common.Makefile: Added $(CP)
inputs/import.stats.xls: Updated with stats from latest import
mappings/VegX-VegBIEN.stems.csv: Reversed input XPaths so that they start with plot instead of individualOrganismObservation as stem
inputs/CTFS: Disabled maps because CTFS is not yet compatible with reversed XPaths, but the effort required to make it compatible is not worth including in the current commit. We lose only 2 test rows of test VegX data by doing this, since the full CTFS VegX files were never able to be imported.
ch_root, ch_root_via: Documented that these are usually not idempotent operations
mappings/VegX-VegBIEN.stems.csv: input (VegX) root: Removed tcs namespace URL to simplify the XPath reversing process. It isn't needed now that we don't generate intermediate XML documents in the automated tests (because intermediate formats are no longer required to be XML schemas).
mappings/DwC2-VegBIEN.specimens.csv: Reversed XPaths so that they start with location instead of specimenreplicate
README.TXT: WinMerge setup: Documented how to get to Compare Options page
README.TXT: WinMerge setup: Added step to set Whitespace to Ignore change
README.TXT: Moved WinMerge setup to separate section. Changed Moved block detection link to the Configuration page.
mappings/VegX-VegBIEN.stems.csv: Expanded {} expressions using expand_braces, so that each distinct output for the same input is on its own line, improving readability. This will also help enable search-and-replace reversing of XPaths for the re-rooting to location.
mappings/VegX-VegBIEN.stems.csv: VegX XPaths: Expanded {} expressions using expand_braces, so that later use of expand_braces on the file would not affect the VegX output mappings of the inputs' via maps (VegX.organisms.csv, etc.)
mappings/DwC2-VegBIEN.specimens.csv: Expanded {} expressions using expand_braces, so that each distinct output for the same input is on its own line, improving readability. This will also help enable search-and-replace reversing of XPaths for the re-rooting to location.
README.TXT: Accepting test cases: Documented that when refactoring mappings, it's helpful to use WinMerge to detect moved lines
expand_braces: Fixed bug where needed to get next line from stdin in raw mode, so that \ won't be parsed as escape chars
join: Fixed bug where when an input to mapped to multiple outputs, the joined row for each output needed to be output separately using writer.writerow()
sort_map: Remove duplicates resulting from multiple outputs for the same input. mappings/Makefile: $(mkSelfMap): Removed uniq now that sort_map does this.
mappings/Makefile: $(mkSelfMap): Run uniq on the output to remove duplicates resulting from multiple outputs for the same input
expand_braces: Also expand XPaths containing [], with up to one level of nesting (which is the most we currently use), because many {} XPaths do in fact contain []. Debug-print intermediate values when env var expand_braces_debug is true. Added usage message.
expand_braces: Fixed bug where ./{ and brackets with commas inside {} are unparseable, and should not be expanded
expand_braces: Fixed bug where `head -1` seemed to read more lines than just the first, causing EOF to be returned after the first line, by using `read` instead. Support data containing \r (such as Excel-dialect CSVs) by removing it. Fixed bug where ./{...} was not being properly escaped.
Added expand_braces
mappings: location: Removed centerlatitude/centerlongitude mappings because the lat/long should be in only one place: the locationdetermination. It is up to the database querier to decide which locationdetermination(s) to use as the coordinates for a plot/specimen.
bin/map: input is CSV: Removed unused map_ var
bin/map: Documented that it's multi-safe (supports an input appearing multiple times)
subtract: Documented that it's multi-safe (supports an input appearing multiple times)
join: Made it multi-safe (supports an input appearing multiple times)
lib/common.Makefile: Added empty clean target to make sure `make clean` always works
root Makefile, input.Makefile: Maps validation: Treat missing join mappings differently from missing non-empty join mappings, because they indicate mapping to an invalid location, which is a bug. Factored maps validation code out into new lib/mappings.Makefile.
lib/common.Makefile: Added vars for chars not allowed in make targets. Added functions/vars to replace "_" with " ".
root Makefile: Include lib/common.Makefile
input.Makefile: Include lib/common.Makefile
intersect: Documented that it's multi-safe (supports an input appearing multiple times)
union: Documented that it's multi-safe (supports an input appearing multiple times)
mappings/DwC2-VegBIEN.specimens.csv: Moved shared /specimenreplicate root to mappings in preparation for reversing the XPaths so that parent table paths (such as location) don't contain a prefix for child tables (specimenreplicate, locationevent, etc.). This reversing will avoid the need to "ch_root" the child table map to obtain maps for parent tables with the prefixes removed, allowing all hierarchical levels to use the same map spreadsheet.
ch_root: Support column headers without a root, for non-hierarchical formats such as DwC
lib/common.Makefile: rsync: Time the rsync operation
in_place: Wrap EXIT handler in shell function so that "-escaping can easily be used on the temp file path
in_place: Documented that doesn't update file on error
DwC mappings: Removed ':/list/' root (full version: '::[@xmlns:dcterms=http://purl.org/dc/terms/]/list/') from map spreadsheets to simplify the boilerplate in each file. Since intermediate DwC XML files no longer need to be produced for automated tests, these roots are not needed.
inputs/import.stats.xls: Moved independent-import data to separate tab so that it wouldn't get moved to the side whenever a new column of simultaneous-import data is inserted. It is also no longer updated, because all column-based imports are now done simultaneously.
Use strings.ustr() or strings.urepr() everywhere that columns are stringified, in order to support column names with non-ASCII characters (such as in the Madidi data)
strings.py: concat(): Convert args to raw (non-Unicode) strings first, so that multi-byte Unicode sequences are considered by # of bytes instead of # of chars. This is necessary because PostgreSQL truncates identifiers by # of bytes instead of # of chars, so that identifiers will actually be less than 63 chars long when some chars were multi-byte.
strings.py: ustr(): Call str() method manually like urepr() to avoid Unicode errors when the returning string is non-ASCII
strings.py: Added urepr() and use it in repr_no_u(), to better support repr() return values with non-ASCII characters. Avoiding repr() also provides a more complete stack trace in the case of such errors.
schemas/vegbien.sql: plantobservation: plantobservation_aggregateoccurrence_count_1() trigger: Don't raise an error if existing count was >1, because there are in fact datasets (notably SALVIAS) where input records for individual stems may themselves contain aggregate data (such as plant and stem counts). For this data, we have an anomalous condition where an aggregateoccurrence has count >1 but contains one plantobservation, due to the plant/stem count being included in the first stem's record. (See <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/SALVIAS_issues#Data-interpretation-issues> for more info on this problem.) Note that our desired 1:1 relationship between aggregateoccurrence and plantobservation is still guaranteed by a constraint, but the anomalous data may still cause irregularities later on in the analysis.
sql_io.py: put_table(): Ignoring all rows on unrecoverable errors: Also support the case where has_joins == True, by setting it to False so that the no-joins case is effectively used
inputs/import.stats.xls: Moved Simultaneously above Independently because that is how we are now running the imports
Regenerated vegbien.ERD exports
schemas/vegbien.sql: 1_to_1 and *_unique_within unique indexes with a `WHERE sourceaccessioncode IS NULL` filter: Added IS NULL filters for other unique keys, so that these fallback indexes would only be used if there was no (or no other) way to uniquely identify their tables. For *_1_to_1 unique indexes, this is the case for specimens data.
schemas/vegbien.sql: stemobservation: Replaced stemobservation_unique_code unique constraint with stemobservation_unique_within_plantobservation unique index that uses COALESCE and WHERE ... IS NOT NULL appropriately, to work with sql_gen's use of COALESCE indexes and (for the renaming) to better reflect what it does
schemas/vegbien.ERD.mwb: Synced with schema
schemas/vegbien.sql: 1_to_1 and *_unique_within unique indexes intended to operate only when sourceaccessioncode is NULL: Changed to use `sourceaccessioncode IS NULL` WHERE condition instead of COALESCE element, since the sourceaccessioncode is not actually needed for the uniquification (it is already globally unique within the datasource if it's not NULL; this just covers the case where it is NULL)
schemas/vegbien.sql: unique_within unique indexes used for 1:1 relationships: Renamed to __1_to_1 to better reflect what they do
schemas/vegbien.sql: plantobservation: Corrected plantobservation_aggregateoccurrence_id_1_to_1's name to plantobservation_aggregateoccurrence_1_to_1 because it's 1:1 with aggregateoccurrence, not aggregateoccurrence_id. Made it a unique index for consistency with our general method of expressing unique constraints on potentially nullable columns.
schemas/vegbien.sql: specimenreplicate: Renamed specimenreplicate_unique_plantobservation to specimenreplicate_plantobservation_1_to_1 to better reflect what it does
schemas/vegbien.sql: locationevent unique indexes: Renamed to unique_within to better reflect what they do
schemas/vegbien.sql: location: Removed redundant location_unique_sourceaccessioncode unique constraint, which has been replaced by location_unique_within_datasource
schemas/vegbien.sql: Reset foreign key constraint names to autogenerated defaults for consistency
schemas/vegbien.sql: Renamed *_unique_datasource unique indexes to *_unique_within_datasource to better reflect what they do
schemas/vegbien.sql: locationevent: Renamed locationevent_unique_accessioncode to locationevent_unique_within_location to better reflect what it does
schemas/vegbien.sql: specimenreplicate: Renamed specimenreplicate_unique_accessioncode to specimenreplicate_unique_within_datasource to better reflect what it does
schemas/vegbien.sql: stemobservation: Renamed stemobservation_unique_accessioncode to stemobservation_unique_within_plantobservation and also apply it to NULL sourceaccessioncodes, so that a plantobservation can have a single stemobservation for its single stem's traits without needing a separate sourceaccessioncode for it
schemas/vegbien.sql: aggregateoccurrence: Removed redundant aggregateoccurrence_unique_accessioncode unique constraint, which has been replaced by aggregateoccurrence_unique_within_taxonoccurrence
schemas/vegbien.sql: plantnamescope: Added CHECK constraint to ensure that at least one key column is specified (an empty plantnamescope doesn't make sense; use NULL instead)
ch_root: Don't require both the input and output mappings to contain their respective new roots, since sometimes only one or the other root is being subset. This will occur, for example, in mappings that are flat on the input but normalized on the output, such as VegCSV.
VegBIEN: Reversed aggregateoccurrence<->plantobservation relationship to point from plantobservation->aggregateoccurrence, so plantobservation could be scoped by aggregateoccurrence in the same way as all other core tables are scoped by their parent tables. This reversed direction was an anomaly due to the need to have a trigger auto-set aggregateoccurrence.count to 1 when there was an associated plantobservation. This was most easily accomplished on the aggregateoccurrence table itself, but required the reversed relationship. The trigger has now been reimplemented on plantobservation, which externally updates aggregateoccurrence.count.
input.Makefile: Testing: diffing test outputs: Ignore changes in whitespace, due to e.g. different indent levels. This facilitates accepting tests when an element has been nested inside another element (or unnested), by showing only the opening and closing tags of the new outer element.
dicts.py: DictProxy: Fixed bug where default value for inner param needed to be created in the constructor, or else every default instance would use and modify the same dictionary
db_xml.py: put(): wrap_e(): Call augment_error() to add the current node to the error message
db_xml.py: put(): Raise an error if there are multiple fields with the same name, instead of silently overwriting the first with the second. This generally indicates the need to use `:[@merge=1]` on the fields in question.
dicts.py: Added OnceOnlyDict and helper exception KeyExistsError
dicts.py: DictProxy: Added default value for inner param to facilitate creating empty wrapped dicts
bin/map: out_is_db: row-based mode: Debug-log the processed XML tree produced by xml_func.process()
sql_io.py: put_table(): Fixed bug where Missing mapping for NOT NULL column errors should actually be warnings because sometimes the mappings include extra tables which aren't used by the dataset
schemas/vegbien.sql: aggregateoccurrence: Added UNIQUE INDEX that makes an aggregateoccurrence unique within a taxonoccurrence. When the sourceaccessioncode isn't specified (as for individual organisms data, where this goes in plantobservation and taxonoccurrence), this ensures a 1:1 relationship between aggregateoccurrence and taxonoccurrence.
schemas/vegbien.sql: taxonoccurrence: Added UNIQUE INDEX that makes a taxonoccurrence unique within a locationevent. When the sourceaccessioncode isn't specified (as for specimens data), this ensures a 1:1 relationship between taxonoccurrence and locationevent.
mappings/VegX-VegBIEN.stems.csv: binomial (full) plantname: Also mapped to an alternative for taxonoccurrence.sourceaccessioncode, for aggregate plots data that distinguishes taxonoccurrences only by plantname (such as CVS)
exc.py: e_msg(): Fixed bug where exceptions with nothing in e.args (such as StopIteration) caused a failed assertion. Fixed bug where exceptions with multiple values in e.args (such as certain IOErrors) caused a failed assertion.
sql.py: flatten(): Documented that shouldn't cache query because the temp table will usually be truncated after use