schemas/: svn:ignore log files
Added inputs/.NCBI/. This uses many of the new schema and mappings features, such as taxonconcept.sourceaccessioncode and parentTaxonID
mappings/VegCore-VegBIEN.csv: identifyingtaxonomicname: Don't create if taxonconcept has an explicit parent, because the taxonName (which is generally only a component of the full taxonomic name, e.g. specificEpithet) is not globally unique. Datasources that provide name components in such a way that levels at or below family can't be directly concatenated cannot currently receive an identifyingtaxonomicname for input to TNRS.
mappings/VegCore-VegBIEN.csv: taxonName->identifyingtaxonomicname: Don't include the rank with the taxonName, because TNRS only allows the rank to be included in the taxonomic name if it's infraspecific (otherwise, it returns no match or an invalid match due to the presence of what it sees as an invalid term or a name component)
mappings/VegCore-VegBIEN.csv: Mapped taxonName to the TNRS input taxonconcept's identifyingtaxonomicname
mappings/VegCore-VegBIEN.csv: Only forward taxonRank to the parent taxonconcept (which stores the infraspecific taxonconcept when the infraspecificEpithet is provided) if there is no explicit parent provided via parentTaxonID/etc.
mappings/VegCore-VegBIEN.csv: Mapped parentScientificNameID, parentTaxonConceptID, parentTaxonID
mappings/VegCore.csv: Added parentScientificNameID, parentTaxonConceptID, parentTaxonID
input.Makefile: $(inDatasrc): Also include the vegbien_dest $schemas in the search_path, so that the datasource's SQL scripts (create.sql, etc.) can use VegBIEN functions and types
lib/common.Makefile: Added $(comma)
inputs/test_taxonomic_names/_scrub/public.sql: Regenerated with schema changes
input.Makefile: Maps building: %/.map.csv.last_cleanup: Fixed bug where $(coreMap) needed to be included as a prerequisite, because even though it is not used directly in this target's recipe, it is used by targets invoked via recursive make after the main recipe runs. In general, whenever targets forward commands to a recursive make target, they also need to forward those recursive targets' prerequisites by including them in their own prerequisites list.
mappings/VegCore-VegBIEN.csv: Mapped taxonConceptID, taxonID, scientificNameID to taxonconcept.sourceaccessioncode. Note that taxonconcept stores all of these taxonomic entities, using creator_id+creationdate, taxonname+rank+parent_id, and identifyingtaxonomicname, respectively.
mappings/VegCore-VegBIEN.csv: Mapped taxonName
mappings/VegCore.csv: Added taxonName
schemas/vegbien.ERD.mwb: Fixed lines
schemas/vegbien.sql: Copied functions in the functions schema that are also used by the public schema to the public schema, so that reinstalling the functions schema would not cause anything that depends on a function in it to be cascadingly deleted. Currently, this just affects analytical_db_view, which uses _fraction_to_percent().
schemas/vegbien.sql: taxonconcept: Added taxonconcept_2_propagate_accepted_concept_id() trigger to auto-populate the accepted_concept_id
schemas/vegbien.sql: taxonconcept.sourceaccessioncode: Added descriptive comment
schemas/vegbien.sql: taxonconcept.accepted_concept_id: Added descriptive comment
Regenerated vegbien.ERD exports
schemas/vegbien.sql: taxonconcept: Added sourceaccessioncode, and allow it to scope the taxonconcept when provided
schemas/vegbien.sql: taxonconcept: Renamed canon_concept_id to matched_concept_id, because this is actually the closest-match taxonconcept in the match hierarchy (datasource concept -> parsed concept -> matched concept -> accepted concept) rather than the accepted synonym, which goes in accepted_concept_id
schemas/vegbien.sql: taxonconcept: Added accepted_concept_id
schemas/vegbien.sql: taxonconcept.canon_concept_id: comment: Changed "accepted synonym" to "closest match", since canon_concept_id is actually a hierarchy from datasource concept -> parsed concept -> matched concept -> accepted concept
schemas/vegbien.sql: taxonconcept: Added order # to trigger names so they run in a defined order (triggers are run in alphabetical order)
README.TXT: Use new revision # in log filenames to get all the logs for an import. Changed <datetime> to <version> because the rotated public schema now also includes the svn revision.
lib/common.Makefile: $(version): Include both the svn revision when make was started as well as the svn revision when the command is actually run (when these values differ), in case svn was updated between the time an import was started and the time a particular table started being imported. Because tables within a datasource are imported sequentially, it is possible that an update would have happened before the last table started importing.
Makefile: Moved setting of $(root) before include of lib/common.Makefile because it's used by lib/common.Makefile
Factored OS section out from Makefile, input.Makefile into lib/common.Makefile
Makefile, input.Makefile: Use new $(version), which unlike $(date) also includes the svn revision, to version log files, etc. This way, the working copy can be put back to the way it was at the time of a given import (excluding changes to nonversioned files). This also makes it easier to get all the log files for a particular import when different tables' imports started at different times.
Makefile: Added $(root) for use with $(rootRevision)
lib/common.Makefile: Added $(version), to replace $(date) for versioning log files, etc., and helper function $(rootRevision)
lib/common.Makefile: Added $(revision)
input.Makefile: Removed no longer used $(SED)
lib/common.Makefile: Added $(sed)
Factored $(date) out from Makefile, input.Makefile into lib/common.Makefile
sql_io.py: put_table(): DuplicateKeyException: Fixed bug where indexes with conditions needed to have the input rows filtered by the condition, to prevent trying to retrieve an existing/inserted row using a join on the index columns when the index in fact does not apply. This fixes a bug in the import of taxonconcept where the taxonconcept_0_unique_identifying_name unique index has a condition which was not satisfied for input rows with no identifyingtaxonomicname, causing any input row with NULL in this column to match all taxonconcepts with a NULL identifyingtaxonomicname. This uses ignore_cond()'s new support for constraints that did not fail at least once.
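The effect of filtering input rows by the index's condition can be illustrated with a small sketch. This is not the sql_io.py code: it swaps in SQLite (via Python's sqlite3 module) for the real database, and the taxon/input tables, the taxon_unique_name index, and match_existing() are hypothetical names chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxon (id INTEGER PRIMARY KEY, name TEXT);
    -- Partial unique index: rows with a NULL name are outside the index.
    CREATE UNIQUE INDEX taxon_unique_name ON taxon (name)
        WHERE name IS NOT NULL;
    INSERT INTO taxon (id, name) VALUES (1, 'Quercus alba'), (2, NULL), (3, NULL);
    CREATE TABLE input (name TEXT);
    INSERT INTO input (name) VALUES ('Quercus alba'), (NULL);
""")

def match_existing(filter_by_index_cond):
    """Look up existing rows for the input rows via a NULL-safe join on the
    index column, optionally keeping only input rows the index applies to."""
    where = "WHERE input.name IS NOT NULL" if filter_by_index_cond else ""
    query = (
        "SELECT input.name, taxon.id FROM input "
        "JOIN taxon ON taxon.name IS input.name "
        + where + " ORDER BY taxon.id"
    )
    return conn.execute(query).fetchall()

# Without the filter, the NULL input row matches every NULL-named taxon.
unfiltered = match_existing(False)
# With the filter, only rows actually covered by the index are matched.
filtered = match_existing(True)
```

In this sketch the unfiltered lookup pairs the NULL input row with both NULL-named rows, while the filtered lookup returns only the 'Quercus alba' match.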
sql_io.py: put_table(): ignore_cond(): Added support for constraints that did not fail at least once, and therefore should not be required to simplify to a non-false value. As part of this, only track the failed constraint in the errors table if it actually failed at least once based on the deleted row count or the `failed` param.
sql_gen.py: map_expr(): Fixed bug where names were being replaced when they were inside another name. This occurred with combined names created by sql_io.into_table_name().
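One way to avoid replacing a name that sits inside a longer name is to match whole identifiers only. The sketch below shows that general technique under stated assumptions; replace_names() is a hypothetical helper and is not claimed to be how map_expr() is implemented.

```python
import re

def replace_names(expr, mapping):
    """Replace each key of mapping in expr with its value, but only where the
    key appears as a whole identifier (not as part of a longer name)."""
    if not mapping:
        return expr
    # Try longer names first so a short name never shadows a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], expr)
```

For example, replacing `a` with `x` in `a = ab + a` leaves the `ab` identifier untouched.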
sql.py: ConstraintException: message: Wrap condition in strings.as_tt()
sql.py: run_query(): DuplicateKeyException: Also retrieve the index's condition using new index_cond()
sql.py: Added index_cond()
sql_io.py: put_table(): insert_into_pkeys(): Take a query as the param instead of sql.mk_select()'s params, to allow the caller to pass in any query without needing insert_into_pkeys() to manually pass through those args
sql.py: constraint_cond(): Fixed NotImplementedError message to apply to this function
sql_io.py: put_table(): ignore_cond(): Log message: Replaced "don't" with "do not" so the apostrophe wouldn't break syntax highlighting when viewing the log file in a text editor
input.Makefile: Staging tables installation: Don't delete %/header.csv on error, because header.csv is a byproduct rather than the primary output and is created roughly atomically
schemas/vegbien.sql: *_ancestor tables: Added descriptive comment that these are ancestor cross link tables
csvs.py: sniff(): Support multi-char delims using \t, such as \t|\t used by NCBI. Support custom line suffixes, such as \t| used by NCBI.
csvs.py: TsvReader.next(): Remove only the autodetected line ending instead of any standard line ending. Note that this requires all header override files to use the same line ending as the CSV they override, which is now the case.
csvs.py: is_tsv(): Support multi-char delimiters by checking only the first char of the delimiter
csvs.py: sniff(): Also autodetect the line ending
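Line-ending autodetection can be sketched as finding the first line terminator in a sample of the file. This is a minimal illustration, not the csvs.sniff() code; detect_line_ending() is a hypothetical name.

```python
import re

def detect_line_ending(sample):
    """Return the first line terminator found in sample ("\r\n", "\n" or
    "\r"), or None if the sample contains no line terminator."""
    # "\r\n" is listed first so it is not mistaken for a bare "\r".
    match = re.search(r"\r\n|\n|\r", sample)
    return match.group(0) if match else None
```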
inputs/test_taxonomic_names/Taxon/+header.txt: Changed line endings to \r\n to match testNames.txt line endings. This will be necessary when the line ending is autodetected by csvs.sniff().
csvs.py: TsvReader.next(): Renamed raw_contents var to line, since this is just the line with the ending removed
strings.py: Replaced no longer used contains_any() with find_any(), which returns any found substring, or None if none of the substrings were found
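A function with the described behavior might look like the sketch below. The parameter names and the choice to return the first match in list order are assumptions; the log only states that any found substring, or None, is returned.

```python
def find_any(haystack, needles):
    """Return the first of needles (in list order) that occurs in haystack,
    or None if none of them occur."""
    for needle in needles:
        if needle in haystack:
            return needle
    return None
```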
csvs.py: Modify csv.Dialect._validate() to ignore "delimiter must be a 1-character string" errors, in order to support multi-char delimiters used by TsvReader
csvs.py: TsvReader: Use str.split() instead of csv.reader().next() to parse the row, for efficiency and to support multi-char delimiters. This is possible because the TSV dialect doesn't use CSV parsing features other than the delimiter and newline-escaping (which is handled separately).
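Parsing a row with str.split() can be sketched as follows, using the NCBI-style `\t|\t` delimiter and `\t|` line suffix mentioned elsewhere in this log. parse_tsv_line() and its parameters are hypothetical and do not reproduce TsvReader; newline-escaping, which the log says is handled separately, is left out.

```python
def parse_tsv_line(line, delimiter="\t|\t", line_suffix="\t|", line_ending="\n"):
    """Split one raw line into fields using a (possibly multi-char) delimiter."""
    # Remove only the expected line ending, not any standard line ending.
    if line_ending and line.endswith(line_ending):
        line = line[:-len(line_ending)]
    # Remove the custom line suffix, if the format uses one.
    if line_suffix and line.endswith(line_suffix):
        line = line[:-len(line_suffix)]
    # str.split() accepts delimiters of any length, unlike csv.reader.
    return line.split(delimiter)
```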
input.Makefile: $(exts): Added .dmp
csvs.py: delims: Added |
Removed no longer used inputs/.public/. Use inputs/.TNRS/ and inputs/.TNRS/tnrs/tnrs.make instead.
README.TXT: Documentation: To import and scrub just the test taxonomic names: Added steps to restore the original DB when the test scrub is complete
inputs/test_taxonomic_names/test_scrub: Also export the results to inputs/test_taxonomic_names/_scrub/
inputs/test_taxonomic_names/test_scrub: Use a regular for .. in loop with a list of what's being processed in each iteration (match_input_names, parse_accepted_names)
inputs/.TNRS/tnrs/map.csv: Mapped Genus_score, Specific_epithet_score
mappings/VegCore-VegBIEN.csv: Mapped matchedGenusFit_fraction, matchedSpeciesFit_fraction. Reordered canon_concept_fit_fraction _maxs in the order they would be used if _alt were being used instead.
mappings/VegCore.csv: Added matchedSpeciesFit_fraction
mappings/VegCore.csv: matchedFamilyFit_fraction: Source the "matched" to Family_matched, which is a closer fit than Name_matched. matchedGenusFit_fraction: Fixed Genus_matched source to use #detailed_download instead of #simple_download.
mappings/VegCore.csv: Added matchedGenusFit_fraction
README.TXT: Removed extra trailing whitespace
README.TXT: Documentation: To import and scrub just the test taxonomic names: Use new inputs/test_taxonomic_names/test_scrub
Added inputs/test_taxonomic_names/test_scrub
schemas/vegbien.sql: taxonconcept: Renamed canon_taxonconcept_id to canon_concept_id to shorten the name, which is used often
schemas/vegbien.sql: taxonconcept: Added taxonconcept_canon_concept_min_fit() trigger to remove the canon_concept_id link from insufficient matches. These occur when e.g. a name in another language is approximated to a latin name or when the input name is not a proper taxon but TNRS provides a best-guess match anyway.
inputs/.TNRS/tnrs/map.csv: Mapped Family_score to new matchedFamilyFit_fraction
mappings/VegCore-VegBIEN.csv: Use matchedFamilyFit_fraction as canon_concept_fit_fraction when greater than matchedTaxonFit_fraction, because if there is at least a matched family, there is a valid taxonconcept to attach to
xml_func.py: Simplifying functions: Added _min, _max as passthroughs
schemas/functions.sql: Added _max(), _min()
mappings/VegCore.csv: Added matchedFamilyFit_fraction
mappings/VegCore-VegBIEN.csv: Remapped matchedTaxonFit_fraction to the verbatim* taxonconcept, because this is actually for the verbatim* concept's fit to the matched concept, not the matched concept's fit to the accepted concept
inputs/.TNRS/tnrs/map.csv: Restored *-prefixed output terms for unmapped terms that had initially been mapped to OMIT but could reasonably match to something in the future. Continue mapping Name_number to OMIT because it isn't globally unique (it identifies the name only within one TNRS batch).
inputs/.TNRS/tnrs/map.csv: Mapped Overall_score to new matchedTaxonFit_fraction
mappings/VegCore-VegBIEN.csv: Mapped matchedTaxonFit_fraction to _set_canon_taxonconcept(canon_concept_fit_fraction)
mappings/VegCore.csv: Added matchedTaxonFit_fraction
schemas/vegbien.sql: _set_canon_taxonconcept(): Also set the canon_concept_fit_fraction
schemas/vegbien.sql: taxonconcept: Added canon_concept_fit_fraction to store the closeness of fit of the canon_concept
sql.py: mk_update(): in_place: Convert columns of type character varying to text so that they can be merge-joined with text columns. Note that these two types are equivalent but not aliases of one another, so the explicit type change is needed.
sql_gen.py: Added canon_type()
sql.py: mk_update(): in_place: Factored retrieval of column type out into separate statement for clarity
schemas/functions.sql: _join*(): Fixed bug where these functions were returning '' instead of NULL when only NULL inputs were provided, because array_to_string() always returns a non-NULL string. Functions must always return NULL in place of '' to ensure that empty strings do not find their way into VegBIEN, and to prevent inconsistencies between row-based and column-based import (row-based import folds empty strings to NULL while column-based import relies on having a clean input table).
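The intended NULL-versus-empty-string behavior can be modeled in Python, with None standing in for SQL NULL. This is only a model of the semantics described above, not the SQL in schemas/functions.sql; join_non_null() is a hypothetical name.

```python
def join_non_null(delimiter, values):
    """Join the non-None values with delimiter, returning None (the stand-in
    for SQL NULL) instead of '' when there is nothing to join."""
    parts = [v for v in values if v is not None]
    # An empty result is folded to None so '' never reaches the caller.
    return delimiter.join(parts) or None
```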
sql_io.py: cleanup_table(): Use sql.table_pkey_col() instead of sql.pkey_col() so that only an actual pkey column is removed from the list of columns to clean. This fixes a bug where the first column in the table was not cleaned up if there was no pkey. Note that this bug only affected newly re-created staging tables, because staging tables previously had a special row_num pkey column added if they did not already have a pkey. The row_num column is now added by column-based import instead.
sql.py: table_pkey_col(): Raise a DoesNotExistException if the table has no pkey