To Do

VegBIEN schema

  1. scope specimenreplicate by collectionnumber when no catalognumber present
  2. individualCount should be 1 for specimens
  3. taxondetermination: Add constraint trigger to make sure exactly one taxondetermination (not zero) per taxonoccurrence is always marked current (see the trigger sketch after this list)
  4. {commname,commstatus}.source_id should be scoping
  5. store verbatim date
  6. form scientificNameWithMorphospecies differently for specimens and plots
    • use scientificName for specimens
  1. remove the no-longer-used centerlatitude/centerlongitude? the lat/long now go in locationdetermination
  2. Change taxonrank's forma value to form to match TCS?
  3. move plantobservation scoping fields to taxonoccurrence, because these tables are 1:1
  4. support raw location name in its own field, distinct from locationNarrative
  5. partial indexes should be full where possible, so that they can be used to query the database
  6. specimenreplicate: require catalognumber_dwc in check constraint, even if plantobservation_id provided
    • first need to ensure plots data doesn't use any specimenreplicate fields except for that
  7. Store times in a binary format in VegBIEN
  8. add locationdetermination notes on how lat/long converted from input data, if any
  9. Move plantobservation.stemcount to aggregateoccurrence, for cases where number of stems is known, but not which stems go to each individual
  10. Normalize fields ending in numbers
    • e.g. growthFormType
  11. add specimens and traits capability
  12. compare to CTFS
  1. remove subproviders from provider_count that don't have any rows in VegBIEN
  2. make taxonoccurrence.locationevent_id NOT NULL: Instead, nullable only when sourceaccessioncode is specified
    • requires running all the tables' automated tests in one transaction (or in commit mode), so that the existing parent tables can be connected to
  3. make taxonoccurrence.locationevent_id nullable only when sourceaccessioncode is specified
    • but better to require a locationevent, and look up the taxonoccurrence by its sourceaccessioncode if the locationevent_id is NULL
      (Note that this will not trigger a DuplicateKeyException, because the NullValueException will be triggered first, so the existing import process can't yet do this.)
  4. store full name of person instead of/in addition to parsed first/last name
  5. fuzzing based on access level
  6. make different hierarchical levels for DwC taxonRank and infraspecificEpithet
  7. add project.parentProject_id?
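
For the taxondetermination constraint trigger above (item 3 in the first list), a minimal sketch, assuming an iscurrent flag on taxondetermination (the flag and trigger names are assumptions; adjust to the actual VegBIEN schema):

  CREATE FUNCTION taxondetermination_check_current() RETURNS trigger AS $$
  DECLARE
      occurrence_id integer;
  BEGIN
      IF TG_OP = 'DELETE' THEN occurrence_id := OLD.taxonoccurrence_id;
      ELSE occurrence_id := NEW.taxonoccurrence_id;
      END IF;
      IF (SELECT count(*) FROM taxondetermination
          WHERE taxonoccurrence_id = occurrence_id AND iscurrent) != 1 THEN
          RAISE EXCEPTION 'taxonoccurrence % must have exactly one current taxondetermination', occurrence_id;
      END IF;
      RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  CREATE CONSTRAINT TRIGGER taxondetermination_current
      AFTER INSERT OR UPDATE OR DELETE ON taxondetermination
      DEFERRABLE INITIALLY DEFERRED FOR EACH ROW
      EXECUTE PROCEDURE taxondetermination_check_current();

Making the trigger DEFERRABLE INITIALLY DEFERRED lets the import insert a taxonoccurrence and its first taxondetermination in the same transaction without tripping the check partway through.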

denormalized VegCore

  1. canon: support two terms having the same simplified form, which will be disambiguated using ? like in redmine_synonyms' output
  2. mark terms sourced from VegX
  3. *slopeAspect, etc.: add units
  4. add native

VegCSV

  1. reorganize VegCSV vs. VegX into a table with two columns

VegPath

  1. web/main/: Handle symlinked dirs in .htaccess files that contain self-referential paths, e.g. VegBIEN/.htaccess > don't redirect subdir paths

VegX schema

See VegX schema

Mappings

  1. for TNRS, map the Unmatched_terms (morphospeciesSuffix) to NULL if a Specific_epithet_matched was provided
  2. populate all datasources' import_order.txt
  3. adding a subdir auto-adds it to import_order.txt
  4. map ND to NULL (e.g. in REMIB.Specimen.accession_number, locality)
  5. handle taxonomic names that are actually comments, like "NO SPECIES ON PLOT"
  6. map TEAM site placename metadata
  7. map CVS.taxonObservation_ growthForm fields
  8. validate CVS (after VegBank problems have been fixed)
  9. Translate ranks to taxonrank enum values
    • especially needed for NCBI.higher_taxa.rank: SELECT DISTINCT rank FROM "NCBI".nodes
  10. store whether a source is top-level
  11. analytical_stem TNRS names: when combining, merge a name containing just a family with the family field, so the family is not duplicated
    • this occurs when Name_matched_rank = family
  12. Set taxonomicStatus on higher taxa
  13. Place cf/aff in taxonlabel and populate with TNRS.Annotations
  14. dataGeneralizations is confidentialityStatus
  15. don't copy collectiondate to locationevent if mapping a specific TaxonOccurrence
  16. move datasources' custom mappings (along with the comments) to mappings/VegCore.thesaurus.csv
  17. migrate mappings so that collectionnumber is used for authorSpecimenCode instead of catalognumber_dwc
  18. resolve SALVIAS SourceVoucher/coll_number/Ind ambiguity: coll_number should really be recordNumber, but that's currently Ind
  19. check that SALVIAS SourceVoucher/coll_number is globally unique, since it is being used as such in indirect vouchers
  20. _eq(): compare values case-insensitively
    • this will support SALVIAS DetType "Indirect" matching "indirect"
      select "DetType", count(*) from "SALVIAS".organisms group by "DetType" 
      
  21. map DwC 1.21 terms to official DwC
  22. correctly support looking up a plantobservation using just its sourceaccessioncode (not also its aggregateoccurrence_id)
    • possibly by making aggregateoccurrence_id nullable when sourceaccessioncode is specified
    • the mappings currently work around this by also providing a taxonoccurrence whenever a plantobservation is needed
  23. import all tables in same public schema, without rolling back after each test, so that stemobservations will link up with existing plantobservations
  24. handle "day is out of range for month" errors by replacing the day with 15 (mid-month); see the sketch after this list
    • need to parse the date into parts first
  25. remove main maps' mappings comments that only relate to a specific datasource
  26. map minimumElevationInMeters to elevation/_avg/max, filtered by _rangeEnd
  27. filter dateCollected->collectiondate mapping with _dateRangeStart?
    • is it valid to have a collection date that's a range? do any datasources have this?
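
For the "day is out of range for month" item above (item 24), a minimal sketch once the date has been parsed into parts, assuming PostgreSQL 9.4+'s make_date() (the helper name is hypothetical; the quoted error itself comes from the Python date parser, but the SQL analogue raises datetime_field_overflow):

  -- hypothetical helper: fall back to mid-month when the day part is invalid
  CREATE FUNCTION date_or_midmonth(year integer, month integer, day integer)
  RETURNS date AS $$
  BEGIN
      RETURN make_date(year, month, day);
  EXCEPTION WHEN datetime_field_overflow THEN -- e.g. 2012-02-31
      RETURN make_date(year, month, 15);
  END;
  $$ LANGUAGE plpgsql IMMUTABLE;
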
  1. figure out which BIEN2 datasources from viewFullOccurrence.DataSource (SurveyType = 'Specimen') are in VegBIEN
  2. convert unit suffixes in verbatim fields
  3. Handle invalid lat/long (99.9, 999.9, etc.) in all core maps (currently just done for value 0 in DwC)
  4. Handle date ranges in all date fields (esp. DwC)
  5. Unescape \% in e.g. ACAD ID 16551
  6. Change the long DwC column name to just the DwC label in the datasource mappings
  7. Parse time fields into standard format
  8. Fix eventDate/verbatimEventDate mappings so they correspond to TDWG
  9. append YMD dates using " " so that if full date is in one field, it will be parsed correctly
    • but need to handle empty YMD fields: maybe check for full date in one field as special case
    • for examples, see vegbien "ARIZ"."specimens.errors"
  10. Casting to timestamps: add UTC timezone if no existing timezone
  11. join together min/max elevation values before splitting them apart so that any range in the min field will automatically be parsed as a range
  12. constrain all child tables with a default unique index that makes them 1:1 with their immediate parent
    • core tables should have this already
  13. to support col-based _map, add a _dict built-in function that puts its args into a dict, which becomes a PostgreSQL *hstore* (see the sketch after this list)
  14. make _if a built-in function, which vertically subsets the rows according to the given filter
    • would likely require handling then and else in separate _if statements using new XML function _not
    • built-in function could just handle passing the parent fkey through to the then element, and then a relational function with a must-be-true check constraint on cond could do the subsetting
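
For the _dict item above (item 13), a minimal sketch of what _dict might reduce to, using hstore's parallel-arrays constructor (the function signature is an assumption):

  CREATE EXTENSION IF NOT EXISTS hstore;

  -- hypothetical _dict: zip parallel key/value arrays into an hstore
  CREATE FUNCTION _dict(keys text[], values text[]) RETURNS hstore AS $$
      SELECT hstore(keys, values)
  $$ LANGUAGE sql IMMUTABLE;

  -- usage: SELECT _dict(ARRAY['rank','epithet'], ARRAY['species','annua']) -> 'rank';
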
  1. Map fields with no join mapping:
    make missing_mappings
    • associatedMedia
    • associatedSequences
    • basisOfRecord
    • bibliographicCitation
    • coordinatePrecision
    • countryCode
    • datasetName
    • day
    • dynamicProperties
    • endDayOfYear
    • eventRemarks
    • eventTime
    • geodeticDatum
    • georeferenceProtocol
    • georeferenceRemarks
    • georeferenceSources
    • georeferenceVerificationStatus
    • higherGeography: Datasources with it always also have place names divided out by rank
    • identificationRemarks
    • interpretationType
    • island: Not used
    • islandGroup: Not used
    • language
    • lifeStage
    • locationRemarks
    • modified
    • month
    • municipality
    • occurrenceRemarks
    • otherCatalogNumbers
    • ownerInstitutionCode
    • preparations
    • relatedResourceID
    • relationshipOfResource
    • reproductiveCondition
    • rightsHolder
    • startDayOfYear: Datasources with it always also have month
    • subgenus
    • type
    • typeStatus
    • verbatimDepth
    • verbatimSRS
    • year
  1. Map fields with no input mapping
    make missing_mappings
    cat unmapped_terms.csv
  1. Convert degree-minutes-seconds to decimal degrees (see the sketch below)
  2. check that each table has the needed unique index(es); only needed for ones we map to
  3. use date only in datasources that need the extra parsing provided by dateutil: _._date(date) has been removed
  4. make method unique within the datasource or locationevent instead of globally unique
  5. map infraspecificEpithet to the field indicated by taxonRank: not applicable because we are using a hierarchical schema for epithets and the analytical DB does not contain infraspecificEpithet
  6. Map DwC day (aka julianDay): Datasources with it always also have month
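
For the degree-minutes-seconds conversion above, the arithmetic is degrees + minutes/60 + seconds/3600, with the sign applied last; a minimal sketch (the helper name and signature are assumptions):

  -- hypothetical helper: e.g. 34°3'30" S -> -34.0583...
  CREATE FUNCTION dms_to_decimal(degrees numeric, minutes numeric,
      seconds numeric, direction text DEFAULT NULL) RETURNS numeric AS $$
      SELECT (degrees + minutes/60 + seconds/3600)
          * CASE WHEN direction IN ('S', 'W') THEN -1 ELSE 1 END
  $$ LANGUAGE sql IMMUTABLE;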

Fixes

  1. add validation to PostgreSQL util.set_col_names() to check that the column being renamed is the correct column. This is necessary to prevent errors when the map.csv columns don't correspond 1:1 to the staging table columns (e.g. when one input column maps to multiple outputs, or a data refresh changes the column names).
  2. sql.py run_query(): savepoint-level down before running parse_exception(), so that you don't get "current transaction is aborted, commands ignored until end of transaction block" errors
    e.g. happens when running verbosity=4 make scrub on the test_taxonomic_names (generated with inputs/test_taxonomic_names/test_scrub) as of r9756
  3. support UTF-8 input files (e.g. MO refresh)
  4. check whether threatened field is still populated correctly after switch to new TNRS import method
  5. import_all's after_import() should ensure tnrs.make is continuously unlocked for at least a minute before trying to acquire the lock, to allow other waiting processes to acquire it first
  6. sql_io.put_table(): each col_default should only be evaluated once, and replaced with its value
    • because col_defaults are sometimes copied, the copies would need to be updated, too
  7. input.Makefile: %/install should set pipefail when teeing output to log so errors cause make to stop
  8. db.col_info() and related functions should use the search_path
  9. all functions that take an errors_table should accept a None value for it
  10. sql_io.put_table(): ensure input and output columns match up
    • use function to do each insert incrementally and return the input pkey along with the output pkey from INSERT RETURNING
  11. change "Missing mapping for NOT NULL column" warnings to errors
    • first need to remove empty parent tables in xml_func.simplify() so they don't generate this warning
  12. when two paths map to same place, and a node contains two text elements, need user-friendly error to indicate this
    • currently, the error is "AttributeError: Text instance has no attribute 'tagName'"
    • this happens if two paths are identical except one has _alt at the end, because _alt will only be autoappended to the one without it
  13. sql_gen.map_expr(): Don't replace quoted identifier where it is preceded by double quotes (indicating embedded double quotes)
  14. sql.py: run_query(): Parse error messages' value strings containing embedded quotes
  15. sql_io.put_table(): ignore(): handle cols that have been wrapped in func calls (casts, etc.)
  16. When setting the value of a text element, raise an error if that element already contains child elements (and vice versa)
  17. Fix duplicate elimination for tables that have nullable columns in their unique constraints (see the sketch after this list for one approach)
    SELECT conname, attname
    FROM pg_constraint
    JOIN pg_attribute ON attrelid = conrelid AND attnum = ANY (conkey)
    WHERE
        conname like '%_unique'
        and not attnotnull
    ORDER BY conname, attnum
    
  18. sql_io.put_table(): ensure_cond(): Handle case where is_literals is False but some of the columns in the condition are literals, not input columns
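
For the nullable-unique-columns fix above (item 17), one approach is to replace the plain unique constraint with a unique index that COALESCEs each nullable column to a sentinel, so NULLs compare as equal for duplicate-elimination purposes; a minimal sketch with hypothetical column choices:

  -- without the COALESCEs, rows with a NULL place_id would never collide,
  -- because NULLs are distinct under SQL unique constraints
  CREATE UNIQUE INDEX locationevent_unique ON locationevent (datasource_id,
      COALESCE(place_id, 2147483647), COALESCE(obsstartdate, 'infinity'::date));
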
  1. Only replace IDs (*ID) with abbrs, so that plantname in /*_id/*/plantname doesn't get abbreviated
  2. set up read-only DB user for people to use to browse the DB
  3. add fki indexes on all fkey source columns (see the query after this list)
  4. TNRS-scrub the names in taxon_trait_view using the new ScrubbedTaxon view
  5. fix race condition in scrubbing daemons' lockfile algorithm, which frequently allows 2-3 scrub.make instances to process the same set of rows at once
  6. figure out what causes the "could not create unique index ... key is duplicated" errors and whether this is repeatable or random: occurs when one input row matches multiple output rows, due to different imports using different unique indexes of the same table
    • see inputs/REMIB/Specimen/logs/2012-09-21-16-37-57.log.sql, inputs/VegBank/stemcount/logs/2012-09-21-17-56-19.log.sql
    • appears to be related to index conditions, where not all rows satisfy the condition
  7. Deal with the missing plantnames error in the SALVIAS organisms import: see plotObservations.PlotObsID = 145483: hasn't been a problem in a while
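
For the fki indexes item above, the fkey source columns can be listed from the catalog so that CREATE INDEX statements can be generated for any that lack an index; a sketch along these lines:

  SELECT conrelid::regclass AS tablename, attname AS fkey_col
  FROM pg_constraint
  JOIN pg_attribute ON attrelid = conrelid AND attnum = ANY (conkey)
  WHERE contype = 'f'
  ORDER BY 1, 2;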

Features

  1. rename all README.TXT to _README.TXT so they sort at the top of the folder
  2. change plain-text wiki code blocks to language blocks, now that language blocks no longer display with line numbers
  3. db_xml.put(): Add runtime _if optimization like for _alt
  4. recluster tables periodically on pkey to facilitate joins and updates by pkey
  5. sql_gen.simplify_expr(): Support identifiers with embedded ()
  6. move _alt optimization that just returns the first arg if it's non-NULL to xml_func.simplify() (after tagging the XML tree with the nullability of each node)
  7. staging tables and derived temp tables: apply a NOT NULL constraint to every column that will accept it
  8. tnrs_db: Lock TNRS.tnrs for writing to ensure that no two instances of tnrs_db are performing TNRS requests simultaneously (which would overburden and crash TNRS)
  9. sql_io.put_table(): Try import first with no rows in input table, so input table only needs to be generated if there are no unrecoverable errors in the zero-row run
  10. sql_io.put_table(): Support doing lookups of existing records without requiring a DuplicateKeyException, to support cases where one of the duplicate key columns is NOT NULL and not provided in the current hierarchical level
  11. Remove id="-1" from import templates
  12. Add separating line between each datasource in verbose make output
  13. escape XML tag names
  14. filename sorting supports negative numbers
  15. sql_io.py: put_table(): don't generate output pkeys table if the caller doesn't need it
  16. support NULL in all SQL function params that have a default value, and use coalesce() to apply the default value (see the sketch after this list)
  17. _dateRangeStart/_dateRangeEnd autodetect the range and date part separators
    • currently, only dates containing " " (space) are supported
  18. highlight/pretty-print UserWarnings to make them visible like exceptions
    • should allow them to be used with error_stats
  19. sql_io.put_table(): Allow col_defaults to contain output table column names, in the same way as default
  20. join: Add option to print "No input mapping" error even if there is a comment on the mapping
  21. support CSVs whose quotes are escaped with "\"
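
For the coalesce() defaults item above (item 16), a minimal sketch (the function and its default are hypothetical):

  -- coalesce() applies the default even when NULL is passed explicitly for units
  CREATE FUNCTION to_meters(value numeric, units text DEFAULT 'm')
  RETURNS numeric AS $$
      SELECT value * CASE coalesce(units, 'm') WHEN 'm' THEN 1 WHEN 'cm' THEN 0.01 END
  $$ LANGUAGE sql IMMUTABLE;
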
  1. Print summary stats before exiting if user sends SIGINT, SIGTERM, etc. to map
  2. Print command to restart import where it left off if user sends SIGINT, SIGTERM, etc. to map
  3. Restart import where it left off if user sends SIGHUP to map
  1. join: Support "bare" join column labels without a root, which should be treated as compatible with any root
  2. Mark autogenerated maps as such with a comment so that the user doesn't accidentally edit them
    • or don't keep them in version control (but then all make dependencies need to be on the machine where the code is checked out); keeping them in version control helps detect unwanted diffs
  3. Handle seasons in dates
  4. Handle unknown characters in dates (fuzzy option to parse()?)
  5. Don't require a {} XPath expression to be preceded by an element to attach the other_branches to
  6. Filter log files to allow comparison using diff
    • use debug2redmine?
  7. Compare filtered 2012-08-03 and 2012-08-01 import log files using WinMerge diff to ensure that they do the same thing (with different XML trees)
  1. Escape names of everything being inserted into the DB from a make target
    • This will help prevent SQL injection attacks when VegBIEN becomes public
  2. Set ON DELETE fkey behavior for nullable fields to SET NULL instead of CASCADE
  3. Warn if there's an index missing on a column used in a WHERE clause
    • need to support indexes on multiple columns
  4. In XPaths, make / following -> optional
  5. automate collision elimination of column names in cat_csv
    • see README.TXT section: "For every file with an error 'column "..." specified more than once'"
  6. sql.py: index_pkey(): recover if pkey exists
    • Error message: multiple primary keys for table "<table>" are not allowed
  7. see if there's a way to get exception detail info in SQLERRM (probably not, but would be useful for errors tables)
  8. db_xml.put_table(): don't subset table if less than partition_size and getting all rows
    • CREATE TABLE AS is fast (<1s), but the subsequent ANALYZE is comparatively slow (8s) (vegbiendev:/home/bien/inputs/Madidi/import/organisms.2012-07-27-22-54-00.log.sql)
    • use EXPLAIN's row count
  9. Garbage collect target records created for a source record to point to, where the source record is never inserted because of an error
    • but sometimes, only the target record is needed and the source record just happens to be part of the output mapping
  10. use param names in info_schema to order params when SQL functions with named params not supported (on old versions of Postgres)
  11. not all PL/Python exceptions should be translated to data_exception, because some should be handled by the import mechanism (e.g. not-null constraint errors)
  12. Make .last_cleanup targets silently (with -s)
  13. Only run tests on inputs whose maps have changed according to svn st
    • but run on all inputs if the schema has changed
  1. Remove verbose make output when checking whether external files are up to date (especially in make test)
  2. Parallelize import so it uses all 4 cores (less priority with col-based import, but still useful)
  3. Splitting sourcelist.name to sourcename.name should also split on ,
  4. sql_io.put_table(): use a left anti-join to remove existing rows before trying to insert new rows, in order to avoid creating holes in the indexes when the duplicate inserts are rolled back (see the sketch after this list)
  5. For derived maps installation, redirect stderr to the install log file
  6. make all map tools (join, etc.) case-insensitive
    • eliminates the need for case-sensitive/insensitive mappings
  7. Reconnect to the database if the connection is lost: hasn't been an issue in a long time; might have only been a problem for MySQL inputs, which are now CSV exports
    • Would be fixed by handling the error in run_query() and disconnecting
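
For the left anti-join item above (item 4), a minimal sketch (the staging table and column names are hypothetical):

  -- insert only rows not already in taxonlabel, so there are no duplicate-key
  -- rollbacks to leave holes in the indexes
  INSERT INTO taxonlabel (taxonomicname)
  SELECT DISTINCT i.taxonomicname
  FROM staging_input i
  LEFT JOIN taxonlabel t ON t.taxonomicname = i.taxonomicname
  WHERE t.taxonlabel_id IS NULL; -- anti-join: keep only unmatched rows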

Refactorings

  1. have local machine and vegbiendev back up separately to jupiter, rather than synchronizing via jupiter, which introduces unnecessary complexity in the local machine/vegbiendev synchronization process
    • this would avoid the need for many of the .rsync_filter/.rsync_ignore files, and the separate commands for syncing different parts of the directory tree
  2. inputs/input.Makefile $(svnFilesGlob): move unversioned files into separate subdirs with svn:ignore * , to avoid needing to explicitly add every versioned file by running make inputs/<datasrc>/add
  3. Auto-detect the CSV's NULL value and store this in the CSV dialect, for use by csv.reader
    • Make TsvReader use the dialect's NULL value
  4. Use the CSV dialect's NULL value
  5. remove dependency on $(bin)/join
  6. remove no longer used prefixes code
  7. Make xml_dom.NodeEntryIter return a namedtuple
  8. Use raise instead of raise e where possible to preserve whole stack trace
  9. aggregate SQL functions: use an array param to support an arbitrary # of args (see the sketch after this list)
    • _name is part of this, because it simplifies when it has only one arg
  10. sql_gen.py: to_str(): render renamed items (NamedCol, etc.) using a hint param that defines whether to include the AS "..." renaming or just the value
  11. Move all SQL query-generating functions to sql_gen.py
  12. Change all val parameters to value to standardize named parameters
  13. Split util into multiple libs
  14. sql_io.py put_table(): generate in_tables from the cols in the mapping param
  15. sql.py mk_select(): use new conditions syntax
    • once everything that uses conds uses new syntax, remove: elif isinstance(conds, dict): conds = conds.items()
  16. In map, factor out WrapIter/ListDict code into a common function
  17. Move map code that doesn't relate to command line invocation to separate lib file
  18. Handle parsing and getting of metadata in xpath.py parse() and get()
  19. For readability, use :: instead of CAST() in PostgreSQL queries (but retain CAST() usage for other DB engines)
  20. Makefiles: move after-line comments before the line when the comment isn't indented
  21. sql.py run_query(): don't remove PL/Python prefix
    • but first, row-based import would need to parse errors using wrapper functions, because it doesn't use errors tables
  22. use operator classes which compare NULL literally instead of COALESCE() in indexes
  23. db_xml.put_special_funcs _simplifyPath(): don't need to xpath.parse(next) ?
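
For the aggregate-functions item above (item 9), PostgreSQL's VARIADIC keeps the array param transparent to callers; a minimal sketch (the real _name semantics may differ):

  CREATE FUNCTION _name(VARIADIC parts text[]) RETURNS text AS $$
      SELECT array_to_string(parts, ' ') -- with one arg, this simplifies to that arg
  $$ LANGUAGE sql IMMUTABLE;

  -- usage: SELECT _name('Poa'), _name('Poa', 'annua');
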
  1. In common.Makefile, change the default $src_server (sync server) from vegbiendev to jupiter

Suggestions

  1. Look into using Sybase Powerbuilder or IBM Enterprise Vision to map data
    • The TACC (Texas Advanced Computing Center) people might have individual licenses they could let us use
  2. "For a next plot type I would suggest TurboVeg" (e-mail from Bob Peet on 2011-12-1)

Working group output

  1. either modify the loading scripts to use VegBIEN, or create a BIEN 2 -> VegBIEN loading script
  2. analytical database
    • version controlled
  3. validation

BIEN

  1. deficiencies in existing data
  2. time component
  3. taxonomy versioning: versioned DB backups
  4. taxon traits table
  5. good data entry tool
  6. UI and tools for porting data to and from VegX/VegCSV
  1. use cases
    • Brad to request from BIEN members; will compile for Aaron.
    • each use case will consist of:
      • analysis for which data was used (publication or in prep)
      • raw data sample
      • summary of manipulations needed to make data useable
      • shortcomings of data, challenges during data compilation/preparation

Mapping

  1. talk to Nick Spencer about mapping engine
    • will be made publicly available online
    • engine reusable for VegBIEN mapping? no: it is VegX-specific and just maps to VegX top-level tables, not nested XML paths

Databanks

  1. contact(s) for RAINFOR
  2. access to databanks' internal databases rather than just their source data
  1. CTFS schema
  2. login for SALVIAS: clone on nimoy in salvias_plots MySQL database