mappings/VegCore-VegBIEN.csv: Don't create matched taxonlabel if taxonName was provided. This fixes a bug where an NCBI node was incorrectly pointing to a TNRS name, when the reference should only be the other way around. This may also fix the TNRS slowdown, if it was caused by circular matched_label_id references.
schemas/vegbien.sql: taxonlabel_2_set_canon_label_id_on_insert(): Fixed bug where canon_label_id also needs to be set based on matched_label_id here, not just in taxonlabel_2_set_canon_label_id_on_update(), because the matched_label_id could be specified when the taxonlabel is first created
schemas/vegbien.sql: taxonlabel_2_set_canon_label_id_on_*(): Fixed bug where need to use := instead of = to perform assignment of canon_label_id
schemas/tree_cross-links.sql: Updated for schema changes
schemas/vegbien.sql: taxonlabel_update_ancestors(): Include ancestors for both parent_id and matched_label_id rather than just one or the other. This avoids needing to delete existing ancestors for the parent_id when a matched_label_id is added and overrides it. This should reduce the TNRS import time if the slowdown was due to the need to delete parent_id ancestors when later adding a matched_label_id (which only occurs in a separate step in the TNRS datasource).
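
The idea of following both links can be sketched in Python as below. This is only an illustration, not the PL/pgSQL in schemas/vegbien.sql: the function name `collect_ancestors` and the dict layout of `labels` are hypothetical.

```python
def collect_ancestors(labels, label_id):
    """Return the ids reachable from label_id via parent_id or matched_label_id.

    `labels` maps id -> {"parent_id": ..., "matched_label_id": ...}
    (a hypothetical in-memory stand-in for the taxonlabel table).
    """
    ancestors = set()
    pending = [label_id]
    while pending:
        current = pending.pop()
        row = labels.get(current, {})
        for link in ("parent_id", "matched_label_id"):
            target = row.get(link)
            # Skip missing links and ids already visited; this also stops cycles
            if target is not None and target != label_id and target not in ancestors:
                ancestors.add(target)
                pending.append(target)
    return ancestors


labels = {
    1: {"parent_id": None, "matched_label_id": None},  # root
    2: {"parent_id": 1, "matched_label_id": None},
    3: {"parent_id": None, "matched_label_id": 2},
    4: {"parent_id": 3, "matched_label_id": 2},
}
print(sorted(collect_ancestors(labels, 4)))
```

Because both links are always followed, adding a matched_label_id later only adds ancestors and never requires deleting the ones found via parent_id.
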
sql_io.py: put_table(): ensure_cond(): Fixed bug where test if any rows failed cond did not check if cur != None (which is the case when cond == sql_gen.true_expr) before checking cur.rowcount
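
A minimal sketch of the guard described above, assuming a DB-API cursor that is None when no query was run. The helper name `any_rows_failed` is hypothetical, and sqlite3 is used here only as a convenient stand-in for the real database.

```python
import sqlite3


def any_rows_failed(cur):
    """True only if a query actually ran and affected at least one row."""
    # cur is None when the condition was trivially true and nothing was executed
    return cur is not None and cur.rowcount > 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (-1,)])

# Remove rows that fail the condition a > 0
cur = conn.execute("DELETE FROM t WHERE NOT (a > 0)")
print(any_rows_failed(cur), any_rows_failed(None))
```
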
sql_gen.py: simplify_expr(): Don't require () around NULL IS NULL and NULL IS NOT NULL because extra parentheses are not provided in index conditions, only in check constraint conditions
inputs/import.stats.xls: Updated import times. The TNRS import has slowed down significantly, possibly due to a bug in the autopopulation of the taxonlabel_relationship table when the input data contains cycles.
sql_io.py: put_table(): Assertion that into and full_in_table have the same row count: Allow into to have more rows than full_in_table, in case an input row matched multiple output rows. This should not happen for a properly-configured database, but seems to happen periodically nevertheless (currently, to the MO datasource) and should not abort the import when it does.
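
The relaxed assertion might look roughly like the following sketch. The function name `check_row_counts` and the warning text are assumptions, not the actual code in sql_io.py.

```python
import warnings


def check_row_counts(into_count, full_in_count):
    """Return the number of surplus output rows; fail only if rows are missing."""
    assert into_count >= full_in_count, "fewer output rows than input rows"
    extra = into_count - full_in_count
    if extra:
        # Surplus rows mean some input row matched multiple output rows
        warnings.warn(f"{extra} input row(s) matched multiple output rows")
    return extra
```
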
sql.py: parse_exception(): "could not create unique index" DuplicateKeyException: Fixed bug where can't use make_DuplicateKeyException() because it tries to retrieve information about the index in question, but the index it was trying to create doesn't exist
schemas/vegbien.sql: analytical_db_view: Renamed datasource's taxonverbatim to datasource_taxonverbatim to distinguish it from the other taxonverbatims that are joined on (parsed_taxonverbatim, accepted_taxonverbatim)
inputs/.NCBI/nodes/create.sql: Make genus (mostly) globally unique by removing kingdom Animalia, which has significant genus overlap with plants. This reduces the number of duplicated genera from 578 to 65 (determined with `SELECT name_txt, count(*), array_agg(rank) FROM "NCBI".nodes GROUP BY name_txt HAVING count(*) > 1 AND 'genus' = ALL (array_agg(rank))`).
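
For readers who prefer to see the grouping logic outside SQL, here is a rough Python equivalent of the query's filter. The function name `duplicated_genera` and the sample rows are made up for illustration and are not NCBI data.

```python
from collections import defaultdict


def duplicated_genera(nodes):
    """nodes: iterable of (name_txt, rank) pairs.

    Returns the names that occur more than once and whose every
    occurrence has rank 'genus'.
    """
    ranks_by_name = defaultdict(list)
    for name_txt, rank in nodes:
        ranks_by_name[name_txt].append(rank)
    return sorted(
        name
        for name, ranks in ranks_by_name.items()
        if len(ranks) > 1 and all(rank == "genus" for rank in ranks)
    )


# Placeholder rows, not real taxa
sample = [
    ("Aus", "genus"), ("Aus", "genus"),   # duplicated genus
    ("Bus", "genus"), ("Bus", "family"),  # duplicated, but not all genus
    ("Cus", "genus"),                     # unique
]
print(duplicated_genera(sample))
```
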
inputs/.NCBI/nodes/create.sql: Added foreign key on parent tax_id with covering index
input.Makefile: Staging tables installation: Added %/uninstall, %/reinstall to allow reinstalling individual tables
sql_io.py: put_table(): ensure_cond(): When adding the failed condition to the errors table, also include the original, untranslated condition from the DB schema in addition to the translation of the condition into the input schema
sql_io.py: track_data_error(): Fixed bug where errors whose column had no srcs (indicated by () ) were incorrectly being ignored. This affected NOT NULL exceptions where the column was not provided by the dataset.
sql_gen.py: If no cols had srcs, return [] instead of the [()] that itertools.product() would have returned
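
The `[()]` result comes from calling itertools.product() with no iterables, which yields exactly one empty tuple. A sketch of the guard follows, assuming the sources arrive as one list per column. The function name `src_combinations` is hypothetical.

```python
import itertools


def src_combinations(col_srcs):
    """col_srcs: one list of sources per column (hypothetical layout)."""
    # Keep only the columns that actually have sources
    with_srcs = [srcs for srcs in col_srcs if srcs]
    if not with_srcs:
        # itertools.product() with no arguments would yield a single ()
        return []
    return list(itertools.product(*with_srcs))


print(list(itertools.product()))  # the unguarded behavior
print(src_combinations([]))
print(src_combinations([["a", "b"], ["c"]]))
```
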
sql_io.py: track_data_error(): Support errors with no columns by inserting a single entry with column set to NULL
strings.py: Added join()
sql_io.py: mk_errors_table(): Made "column" column nullable, because some errors (such as check constraint violations) don't have any corresponding columns if their columns weren't provided in the input data
inputs/test_taxonomic_names/test_scrub: `make inputs/.TNRS/reinstall`: Use new $schema_only option so that an empty TNRS schema is installed rather than one containing inputs/.TNRS/data.sql
inputs/.TNRS/: Added data.sql containing the test_taxonomic_names TNRS results, so that a new installation of VegBIEN will contain the necessary data to make the tests pass, including the TNRS import test
input.Makefile: Staging tables installation: If $schema_only option is set, only install .sql files ending in schema.sql
inputs/Makefile: $(rsyncLogs): Use $(rsync) instead of $(rsync*) now that it supports excluding just temp files and .svn rather than all .*
lib/common.Makefile: rsync: $(rsync): Exclude .svn, #, and .DS_Store rather than all .* because dirs beginning with . created by the user (such as .NCBI, .TNRS) should be included in the sync
Added inputs/REMIB/Specimen.src/.map.csv.last_cleanup
Added inputs/bien_web/observation/+header.csv
input.Makefile: Staging tables installation: $(dbExports): When putting schemas first, don't require a . before "schema" to allow the entire filename to be schema.sql
inputs/test_taxonomic_names/_scrub/public.test_taxonomic_names.sql, TNRS.sql: Regenerated with schema and mappings changes
inputs/.TNRS/tnrs/map.csv: Added _nullIf filter to remove "Unknown" values for Accepted_name_family
README.TXT: Generate the local TNRS cache from the test_taxonomic_names rather than syncing it with the vegbiendev TNRS cache, so that the automated test's inserted row count stays the same regardless of the contents of the full-DB TNRS cache
README.TXT: Backups: Added TNRS cache section
inputs/.TNRS/tnrs/test.xml.ref: Accepted inserted row count using TNRS cache created from test_taxonomic_names. Using a standard set of names for the test ensures that the inserted row count will not change when the full-DB TNRS cache changes.
inputs/.TNRS/schema.sql: tnrs_accepted_names: Prepend the Accepted_name_family to the taxonomic name that will be submitted back to TNRS for parsing, because TNRS input names now always include the family when it's provided
inputs/.TNRS/schema.sql: tnrs_accepted_names: Use simpler array_to_string() instead of || and COALESCE to put together the taxonomic name that will be submitted back to TNRS for parsing. Note that this requires defining an IMMUTABLE wrapper function for array_to_string(), because pg_catalog.array_to_string() is declared STABLE but indexes require functions to be IMMUTABLE (http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg156323.html).
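
The simplification relies on array_to_string() skipping NULL elements when no null string is given, so no per-part COALESCE is needed. A Python analogue of that joining behavior is sketched below. The function name `join_name_parts` is hypothetical.

```python
def join_name_parts(parts, sep=" "):
    """Join the non-None parts, the way array_to_string() skips NULL elements."""
    return sep.join(part for part in parts if part is not None)


# Family first, then the name parts; missing parts are simply dropped
print(join_name_parts(["Fabaceae", "Inga", None]))
```
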
inputs/.TNRS/schema.sql: Don't hardcode the schema name
input.Makefile: Staging tables installation: sql/install: Provide the datasource's schema to the script in :schema, so it can refer to its own elements explicitly when it's not possible to rely on the search_path. This is the case for functions that have the same signature as (and are intended to replace) a pg_catalog function, because the pg_catalog function will be used in preference to the datasource function regardless of the search_path.
input.Makefile: Staging tables installation: $(cleanup): If a cleanup.sql is provided, only run it and don't do default cleanup, to allow tables to override rather than just add to default cleanup operations. This prevents the automatic replacement of certain strings (sql_io.null_strs) with NULL on TNRS, and keeps the TNRS cache mostly as it was output by the TNRS service. Note that empty strings are still replaced with NULL by COPY FROM in sql_io.append_csv(). This is necessary for TNRS import to work properly, because although '' generally means NULL, it is not treated that way by PostgreSQL.
input.Makefile: Staging tables installation: Moved custom cleanup.sql cleanup operations to main $(cleanup) function, so custom cleanup operations would run whenever any target (such as %/install) invokes $(cleanup), not just manually through %/cleanup
sql.py: parse_exception(): function MissingCastException: If first param's type is anyelement (for polymorphic function, which had mismatched arg types), use type text, as all types can cast to it
sql_io.py: cast(): Set the created function's value param type to anyelement to support any input type, not just text
mappings/VegCore-VegBIEN.csv: Only prepend the family to the concatenated scientificName for TNRS if it ends in -aceae (using _taxon_family_require_std()), to avoid sending unsupported, nonstandard families to TNRS which it will place in Unmatched_terms
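
The filter can be pictured with the Python sketch below. The real _taxon_family_require_std() is an SQL function in schemas/vegbien.sql, so the Python name and the exact None handling here are assumptions.

```python
def taxon_family_require_std(family):
    """Return the family only if it has the standard -aceae ending, else None."""
    if family is not None and family.endswith("aceae"):
        return family
    return None


def name_for_tnrs(family, scientific_name):
    """Prepend the family to the name only when the family is standard."""
    parts = [taxon_family_require_std(family), scientific_name]
    return " ".join(part for part in parts if part is not None)


print(name_for_tnrs("Poaceae", "Poa annua"))
print(name_for_tnrs("Compositae", "Bellis perennis"))
```
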
schemas/vegbien.sql: Added _taxon_family_require_std()
mappings/VegCore-VegBIEN.csv: Prepend the family to the concatenated scientificName input to TNRS, so that TNRS can use it to disambiguate the genus
tnrs_db: Making TNRS request: Fixed bug where needed to remove else block now that there is no except block
tnrs.py: retrieval_request_template: Turn on taxonomic_constraint (to match family before genus) and source_sorting (to always return any result from the first source before returning results from any other sources, regardless of match %)
Regenerated vegbien.ERD exports
mappings/VegCore.csv: speciesBinomial: Changed definition to genus+specificEpithet, not genus+species, to match the scientific meaning of specificEpithet vs. species
schemas/vegbien.sql: taxonverbatim: Renamed species to specific_epithet to avoid confusion with the scientific meaning of species (genus+specificEpithet), since this field contains just the specific epithet
input.Makefile: Verification of import: verify: Use tables from the verify/*.ref files themselves rather than from the datasource's subdirs, in order to match the tables in mappings/verify.*.sql
schemas/vegbien.sql: analytical_db_view: Added stemobservation.tag, stemobservation.height_m for use in plot change over time analysis <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Plot_change_over_time_analysis>
schemas/vegbien.sql: analytical_db_view: Fixed typo in scientificNameWithMorphospecies
schemas/vegbien.sql: analytical_db_view: Renamed columns to VegCore names (https://projects.nceas.ucsb.edu/nceas/projects/bien/repository/raw/mappings/VegCore.csv)
mappings/VegCore.csv: Added cultivatedBasis
mappings/VegCore.csv: Added scientificNameWithMorphospecies
mappings/VegCore.csv: Added speciesBinomial
schemas/vegbien.sql: analytical_db_view: Generate species by concatenating genus and specific epithet, since according to Brad this field is actually the binomial, not the specificEpithet
schemas/vegbien.sql: Removed no longer used plot_change_over_time view. Use one of the queries at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Plot_change_over_time_analysis> instead.
mappings/VegCore-VegBIEN.csv: location: Populate sourceaccessioncode with locationID + subplot when subplot is unique only within the parent plot, so that location always has a sourceaccessioncode to use as the plotCode in analytical_db_view
lib/PostgreSQL-MySQL.csv: Remove views because they can contain arbitrary expressions, whose syntax may not be compatible with MySQL
schemas/vegbien.sql: analytical_db_view: Use location.sourceaccessioncode as plotCode instead of authorlocationcode because authorlocationcode isn't globally unique (for subplots, it's only unique within the parent plot)
schemas/vegbien.sql: plantobservation: Made taxonoccurrence_id optional when sourceaccessioncode is specified, so that aggregateoccurrence doesn't get pruned away in datasource tables that link just a stemobservation to a plantobservation (and therefore don't provide a taxonoccurrence to satisfy the previous taxonoccurrence_id NOT NULL constraint)
schemas/vegbien.sql: aggregateoccurrence: Made taxonoccurrence_id optional when sourceaccessioncode is specified, so that aggregateoccurrence doesn't get pruned away in datasource tables that link just a stemobservation to a plantobservation (and therefore don't provide a taxonoccurrence to satisfy the previous taxonoccurrence_id NOT NULL constraint)
schemas/vegbien.sql: taxonoccurrence: Added taxonoccurrence_required_key check constraint to ensure that all taxonoccurrences are properly identified, and empty taxonoccurrences are properly pruned. This fixes a bug where taxon-only and stem-only data did not properly prune the taxonoccurrence that would otherwise get created because it's included in the mappings.
sql_io.py: put_table(): insert_into_pkeys(): Use new sql.add_pkey_or_index() instead of sql.add_pkey() in order to just print a warning if for some reason there were duplicate entries for an input row in the iteration's pkeys table. This should provide a workaround for bugs (often in the schema itself, related to its unique indexes) that cause an input row to match multiple output rows when joining on the output table using the unique constraint's columns.
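
The fallback pattern can be sketched as follows. This is not the actual sql.add_pkey_or_index(): it uses sqlite3 and a unique index as stand-ins for PostgreSQL and a primary key, and the function and index names are hypothetical.

```python
import sqlite3
import warnings


def add_unique_or_plain_index(conn, table, col):
    """Try a unique index on table(col); on duplicates, warn and add a plain one.

    Returns True if the unique index was created.
    """
    name = f"{table}_{col}_idx"
    try:
        conn.execute(f'CREATE UNIQUE INDEX "{name}" ON "{table}" ("{col}")')
        return True
    except sqlite3.IntegrityError as e:
        # Duplicate values: keep the import going with a non-unique index
        warnings.warn(f"{table}.{col} contains duplicates ({e}); using plain index")
        conn.execute(f'CREATE INDEX "{name}" ON "{table}" ("{col}")')
        return False
```

The import can then continue with lookups still indexed, while the warning flags the duplicate rows for later investigation.
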
sql.py: Added add_pkey_or_index()
sql.py: parse_exception(): Parse "could not create unique index ... Key is duplicated" errors as DuplicateKeyException
sql.py: parse_exception(): DuplicateKeyException: Factored out creation of DuplicateKeyException into helper function
inputs/import.stats.xls: Updated import times
tnrs_db: Removed tnrs.InvalidResponse exception handler that retries the query because the current query does not track which names have been submitted to but not processed by TNRS, so the error would continue to happen repeatedly
schemas/vegbien.sql: location: Added index on parent_id to speed up plot change over time joins
schemas/vegbien.sql: location: Added index on creator_id to speed up analytical_db_view joins
schemas/vegbien.sql: stemobservation: Added index on plantobservation_id to speed up analytical_db_view joins
schemas/vegbien.sql: Added initial plot_change_over_time view
Added inputs/bien_web/
schemas/vegbien.sql: analytical_db_view: Reordered taxonoccurrence.growthform to put it after the bien_web.observation fields
schemas/vegbien.sql: analytical_db_view: Include taxonoccurrence.growthform
schemas/vegbien.sql: analytical_db_view: Generate taxonMorphospecies by concatenating the scientificName to the morphospecies
schemas/vegbien.sql: analytical_db_view: Fixed bug where needed to take taxonomic name components from the accepted taxonlabel's taxonverbatim instead of the datasource's taxonverbatim, which does not contain the accepted name
schemas/vegbien.sql: analytical_db_view: identifiedBy: Added NULLIF to keep empty strings out of the analytical DB
schemas/vegbien.sql: analytical_db_view: Fixed bug where needed to take morphospecies from the parsed taxonlabel's taxonverbatim, where it has been parsed out, instead of the datasource's taxonverbatim, which has it as part of the verbatim input name
analytical_db_view: Added stemobservation.xposition_m, yposition_m
inputs/.TNRS/tnrs/map.csv: Added new Time_submitted field
inputs/REMIB/Specimen/header.csv: Regenerated for new staging tables format
inputs/.TNRS/tnrs/test.xml.ref: Accepted correct inserted row count, which most likely became detached from the primary row count when the TNRS cache was cleared and repopulated with test data
schemas/vegbien.sql: analytical_db_view: Reordered joins in path order, putting datasource before location. This will enable more naturally reusing the SELECT query for other analyses.
mappings/VegCore-VegBIEN.csv: TNRS<->NCBI attachment: Do not include rank in the mapping because taxonomicname is globally unique, and thus it isn't used in looking up the NCBI taxonlabel
mappings/VegCore-VegBIEN.csv: TNRS<->NCBI attachment: Also attach TNRS genus to NCBI backbone. This causes attachment to be made with as many of family and genus as are provided and have an entry in NCBI.
mappings/VegCore-VegBIEN.csv: family -> NCBI backbone: Removed extra path after _if statement's cond/_exists
mappings/VegCore-VegBIEN.csv: Instead of connecting the acceptedFamily to the NCBI backbone, connect the family for the TNRS matched taxonlabel. This connects more families and also connects the same set of fields as will be connected for the genus.
mappings/VegCore-VegBIEN.csv: TNRS<->NCBI attachment: Fixed bug where needed to attach accepted family to NCBI using taxonomicname, which is globally unique, rather than taxonepithet, which is only unique within the parent taxon
inputs/.TNRS/tnrs/: Added Time_submitted column at beginning and populate it in tnrs_db with the time the batch TNRS request was submitted
csvs.py: RowNumFilter: Use new ColInsertFilter
csvs.py: Added ColInsertFilter
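
A column-inserting filter of this kind could look like the generator below. The interface of the real ColInsertFilter in csvs.py is not shown in this log, so the function name `insert_column` and its parameters are assumptions.

```python
import csv
import io


def insert_column(rows, header, make_value, index=0):
    """Yield each row with one extra column inserted at `index`.

    The first row is treated as the header; for later rows the new cell
    is make_value(row_num), with data rows numbered from 1.
    """
    for row_num, row in enumerate(rows):
        new_row = list(row)
        new_row.insert(index, header if row_num == 0 else make_value(row_num))
        yield new_row


reader = csv.reader(io.StringIO("a,b\n1,2\n3,4\n"))
# A row-number column is one use; a fixed timestamp would be another
for row in insert_column(reader, "row_num", str):
    print(row)
```
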
schemas/vegbien.sql: Removed no longer used _is_higher_taxon(). Use _has_taxonomic_name() or _taxonomic_name_is_epithet() instead.
mappings/VegCore-VegBIEN.csv: taxonName->taxonepithet: Use new _taxonomic_name_is_epithet() instead of _is_higher_taxon(), because it's more specific to the filtering task for this field