To Do

VegBIEN schema

  1. scope specimenreplicate by collectionnumber when no catalognumber present
  2. individualCount should be 1 for specimens
  3. taxondetermination: Add constraint trigger to make sure exactly one taxondetermination (not zero) per taxonoccurrence is always marked current (see the trigger sketch after this list)
  4. {commname,commstatus}.source_id should be scoping
  5. store verbatim date
  6. form scientificNameWithMorphospecies differently for specimens and plots
    • use scientificName for specimens
  1. remove the no-longer-used centerlatitude/centerlongitude? the lat/long now go in locationdetermination
  2. Change taxonrank's forma value to form to match TCS?
  3. move plantobservation scoping fields to taxonoccurrence, because these tables are 1:1
  4. support raw location name in its own field, distinct from locationNarrative
  5. partial indexes should be full where possible, so that they can be used to query the database
  6. specimenreplicate: require catalognumber_dwc in check constraint, even if plantobservation_id provided
    • first need to ensure plots data doesn't use any specimenreplicate fields except for that
  7. Store times in a binary format in VegBIEN
  8. add locationdetermination notes on how lat/long converted from input data, if any
  9. Move plantobservation.stemcount to aggregateoccurrence, for cases where number of stems is known, but not which stems go to each individual
  10. Normalize fields ending in numbers
    • e.g. growthFormType
  11. add specimens and traits capability
  12. compare to CTFS
  1. remove subproviders from provider_count that don't have any rows in VegBIEN
  2. make taxonoccurrence.locationevent_id NOT NULL: Instead, nullable only when sourceaccessioncode is specified
    • requires running all the tables' automated tests in one transaction (or in commit mode), so that the existing parent tables can be connected to
  3. make taxonoccurrence.locationevent_id nullable only when sourceaccessioncode is specified
    • but better to require a locationevent, and look up the taxonoccurrence by its sourceaccessioncode if the locationevent_id is NULL
      (Note that this will not trigger a DuplicateKeyException, because the NullValueException will be triggered first, so the existing import process can't yet do this.)
  4. store full name of person instead of/in addition to parsed first/last name
  5. fuzzing based on access level
  6. make different hierarchical levels for DwC taxonRank and infraspecificEpithet
  7. add project.parentProject_id?
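
For the taxondetermination constraint trigger above (item 3 in the first list), a minimal sketch, assuming an iscurrent flag on taxondetermination (the flag and trigger names are assumptions; adjust to the actual VegBIEN schema):

  CREATE FUNCTION taxondetermination_check_current() RETURNS trigger AS $$
  DECLARE
      occurrence_id integer;
  BEGIN
      IF TG_OP = 'DELETE' THEN occurrence_id := OLD.taxonoccurrence_id;
      ELSE occurrence_id := NEW.taxonoccurrence_id;
      END IF;
      IF (SELECT count(*) FROM taxondetermination
          WHERE taxonoccurrence_id = occurrence_id AND iscurrent) != 1 THEN
          RAISE EXCEPTION 'taxonoccurrence % must have exactly one current taxondetermination', occurrence_id;
      END IF;
      RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  CREATE CONSTRAINT TRIGGER taxondetermination_current
      AFTER INSERT OR UPDATE OR DELETE ON taxondetermination
      DEFERRABLE INITIALLY DEFERRED FOR EACH ROW
      EXECUTE PROCEDURE taxondetermination_check_current();

Making the trigger DEFERRABLE INITIALLY DEFERRED lets the import insert a taxonoccurrence and its first taxondetermination in the same transaction without tripping the check partway through.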

denormalized VegCore

  1. canon: support two terms having the same simplified form, which will be disambiguated using ? like in redmine_synonyms' output
  2. mark terms sourced from VegX
  3. *slopeAspect, etc.: add units
  4. add native

VegCSV

  1. reorganize VegCSV vs. VegX into a table with two columns

VegPath

  1. web/main/: Handle symlinked dirs in .htaccess files that contain self-referential paths, e.g. VegBIEN/.htaccess > don't redirect subdir paths

VegX schema

See VegX schema

Mappings

  1. for TNRS, map the Unmatched_terms (morphospeciesSuffix) to NULL if a Specific_epithet_matched was provided
  2. populate all datasources' import_order.txt
  3. adding a subdir auto-adds it to import_order.txt
  4. map ND to NULL (e.g. in REMIB.Specimen.accession_number, locality)
  5. handle taxonomic names that are actually comments, like "NO SPECIES ON PLOT"
  6. map TEAM site placename metadata
  7. map CVS.taxonObservation_ growthForm fields
  8. validate CVS (after VegBank problems have been fixed)
  9. Translate ranks to taxonrank enum values
    • especially needed for NCBI.higher_taxa.rank: SELECT DISTINCT rank FROM "NCBI".nodes
  10. store whether a source is top-level
  11. analytical_stem TNRS names: when combining, merge a name containing just a family with the family field, so the family is not duplicated
    • this occurs when Name_matched_rank = family
  12. Set taxonomicStatus on higher taxa
  13. Place cf/aff in taxonlabel and populate with TNRS.Annotations
  14. dataGeneralizations is confidentialityStatus
  15. don't copy collectiondate to locationevent if mapping a specific TaxonOccurrence
  16. move datasources' custom mappings (along with the comments) to mappings/VegCore.thesaurus.csv
  17. migrate mappings so that collectionnumber is used for authorSpecimenCode instead of catalognumber_dwc
  18. resolve SALVIAS SourceVoucher/coll_number/Ind ambiguity: coll_number should really be recordNumber, but that's currently Ind
  19. check that SALVIAS SourceVoucher/coll_number is globally unique, since it is being used as such in indirect vouchers
  20. _eq(): compare values case-insensitively
    • this will support SALVIAS DetType "Indirect" matching "indirect"
      select "DetType", count(*) from "SALVIAS".organisms group by "DetType" 
      
  21. map DwC 1.21 terms to official DwC
  22. correctly support looking up a plantobservation using just its sourceaccessioncode (not also its aggregateoccurrence_id)
    • possibly by making aggregateoccurrence_id nullable when sourceaccessioncode is specified
    • the mappings currently work around this by also providing a taxonoccurrence whenever a plantobservation is needed
  23. import all tables in same public schema, without rolling back after each test, so that stemobservations will link up with existing plantobservations
  24. handle "day is out of range for month" errors by replacing the day with 15 (mid-month); see the sketch after this list
    • need to parse the date into parts first
  25. remove main maps' mappings comments that only relate to a specific datasource
  26. map minimumElevationInMeters to elevation/_avg/max, filtered by _rangeEnd
  27. filter dateCollected->collectiondate mapping with _dateRangeStart?
    • is it valid to have a collection date that's a range? do any datasources have this?
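
For the "day is out of range for month" item above (item 24), a minimal sketch once the date has been parsed into parts, assuming PostgreSQL 9.4+'s make_date() (the helper name is hypothetical; the quoted error itself comes from the Python date parser, but the SQL analogue raises datetime_field_overflow):

  -- hypothetical helper: fall back to mid-month when the day part is invalid
  CREATE FUNCTION date_or_midmonth(year integer, month integer, day integer)
  RETURNS date AS $$
  BEGIN
      RETURN make_date(year, month, day);
  EXCEPTION WHEN datetime_field_overflow THEN -- e.g. 2012-02-31
      RETURN make_date(year, month, 15);
  END;
  $$ LANGUAGE plpgsql IMMUTABLE;
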
  1. figure out which BIEN2 datasources from viewFullOccurrence.DataSource (SurveyType = 'Specimen') are in VegBIEN
  2. convert unit suffixes in verbatim fields
  3. Handle invalid lat/long (99.9, 999.9, etc.) in all core maps (currently just done for value 0 in DwC)
  4. Handle date ranges in all date fields (esp. DwC)
  5. Unescape \% in e.g. ACAD ID 16551
  6. Change the long DwC column name to just the DwC label in the datasource mappings
  7. Parse time fields into standard format
  8. Fix eventDate/verbatimEventDate mappings so they correspond to TDWG
  9. append YMD dates using " " so that if full date is in one field, it will be parsed correctly
    • but need to handle empty YMD fields: maybe check for full date in one field as special case
    • for examples, see vegbien "ARIZ"."specimens.errors"
  10. Casting to timestamps: add UTC timezone if no existing timezone
  11. join together min/max elevation values before splitting them apart so that any range in the min field will automatically be parsed as a range
  12. constrain all child tables with a default unique index that makes them 1:1 with their immediate parent
    • core tables should have this already
  13. to support col-based _map, add a _dict built-in function that puts its args into a dict, which becomes a PostgreSQL *hstore* (see the sketch after this list)
  14. make _if a built-in function, which vertically subsets the rows according to the given filter
    • would likely require handling then and else in separate _if statements using new XML function _not
    • built-in function could just handle passing the parent fkey through to the then element, and then a relational function with a must-be-true check constraint on cond could do the subsetting
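
For the _dict item above (item 13), a minimal sketch of what _dict might reduce to, using hstore's parallel-arrays constructor (the function signature is an assumption):

  CREATE EXTENSION IF NOT EXISTS hstore;

  -- hypothetical _dict: zip parallel key/value arrays into an hstore
  CREATE FUNCTION _dict(keys text[], values text[]) RETURNS hstore AS $$
      SELECT hstore(keys, values)
  $$ LANGUAGE sql IMMUTABLE;

  -- usage: SELECT _dict(ARRAY['rank','epithet'], ARRAY['species','annua']) -> 'rank';
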
  1. Map fields with no join mapping:
    make missing_mappings
    • associatedMedia
    • associatedSequences
    • basisOfRecord
    • bibliographicCitation
    • coordinatePrecision
    • countryCode
    • datasetName
    • day
    • dynamicProperties
    • endDayOfYear
    • eventRemarks
    • eventTime
    • geodeticDatum
    • georeferenceProtocol
    • georeferenceRemarks
    • georeferenceSources
    • georeferenceVerificationStatus
    • higherGeography: Datasources with it always also have place names divided out by rank
    • identificationRemarks
    • interpretationType
    • island: Not used
    • islandGroup: Not used
    • language
    • lifeStage
    • locationRemarks
    • modified
    • month
    • municipality
    • occurrenceRemarks
    • otherCatalogNumbers
    • ownerInstitutionCode
    • preparations
    • relatedResourceID
    • relationshipOfResource
    • reproductiveCondition
    • rightsHolder
    • startDayOfYear: Datasources with it always also have month
    • subgenus
    • type
    • typeStatus
    • verbatimDepth
    • verbatimSRS
    • year
  1. Map fields with no input mapping
    make missing_mappings
    cat unmapped_terms.csv
  1. Convert degree-minutes-seconds to decimal degrees (see the sketch below)
  2. check that each table has the needed unique index(es); only needed for ones we map to
  3. use date only in datasources that need the extra parsing provided by dateutil: _._date(date) has been removed
  4. make method unique within the datasource or locationevent instead of globally unique
  5. map infraspecificEpithet to the field indicated by taxonRank: not applicable because we are using a hierarchical schema for epithets and the analytical DB does not contain infraspecificEpithet
  6. Map DwC day (aka julianDay): Datasources with it always also have month
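
For the degree-minutes-seconds conversion above, the arithmetic is degrees + minutes/60 + seconds/3600, with the sign applied last; a minimal sketch (the helper name and signature are assumptions):

  -- hypothetical helper: e.g. 34°3'30" S -> -34.0583...
  CREATE FUNCTION dms_to_decimal(degrees numeric, minutes numeric,
      seconds numeric, direction text DEFAULT NULL) RETURNS numeric AS $$
      SELECT (degrees + minutes/60 + seconds/3600)
          * CASE WHEN direction IN ('S', 'W') THEN -1 ELSE 1 END
  $$ LANGUAGE sql IMMUTABLE;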

Fixes

  1. add validation to PostgreSQL util.set_col_names() to check that the column being renamed is the correct column. This is necessary to prevent errors when the map.csv columns don't correspond 1:1 to the staging table columns (e.g. when one input column maps to multiple outputs, or a data refresh changes the column names).
  2. sql.py run_query(): savepoint-level down before running parse_exception(), so that you don't get "current transaction is aborted, commands ignored until end of transaction block" errors
    e.g. happens when running verbosity=4 make scrub on the test_taxonomic_names (generated with inputs/test_taxonomic_names/test_scrub) as of r9756
  3. support UTF-8 input files (e.g. MO refresh)
  4. check whether threatened field is still populated correctly after switch to new TNRS import method
  5. import_all's after_import() should ensure tnrs.make is continuously unlocked for at least a minute before trying to acquire the lock, to allow other waiting processes to acquire it first
  6. sql_io.put_table(): each col_default should only be evaluated once, and replaced with its value
    • because col_defaults are sometimes copied, the copies would need to be updated, too
  7. input.Makefile: %/install should set pipefail when teeing output to log so errors cause make to stop
  8. db.col_info() and related functions should use the search_path
  9. all functions that take an errors_table should accept a None value for it
  10. sql_io.put_table(): ensure input and output columns match up
    • use function to do each insert incrementally and return the input pkey along with the output pkey from INSERT RETURNING
  11. change "Missing mapping for NOT NULL column" warnings to errors
    • first need to remove empty parent tables in xml_func.simplify() so they don't generate this warning
  12. when two paths map to same place, and a node contains two text elements, need user-friendly error to indicate this
    • currently, the error is "AttributeError: Text instance has no attribute 'tagName'"
    • this happens if two paths are identical except one has _alt at the end, because _alt will only be autoappended to the one without it
  13. sql_gen.map_expr(): Don't replace quoted identifier where it is preceded by double quotes (indicating embedded double quotes)
  14. sql.py: run_query(): Parse error messages' value strings containing embedded quotes
  15. sql_io.put_table(): ignore(): handle cols that have been wrapped in func calls (casts, etc.)
  16. When setting the value of a text element, raise an error if that element already contains child elements (and vice versa)
  17. Fix duplicate elimination for tables that have nullable columns in their unique constraints (see the sketch after this list for one approach)
    SELECT conname, attname
    FROM pg_constraint
    JOIN pg_attribute ON attrelid = conrelid AND attnum = ANY (conkey)
    WHERE
        conname like '%_unique'
        and not attnotnull
    ORDER BY conname, attnum
    
  18. sql_io.put_table(): ensure_cond(): Handle case where is_literals is False but some of the columns in the condition are literals, not input columns
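
For the nullable-unique-columns fix above (item 17), one approach is to replace the plain unique constraint with a unique index that COALESCEs each nullable column to a sentinel, so NULLs compare as equal for duplicate-elimination purposes; a minimal sketch with hypothetical column choices:

  -- without the COALESCEs, rows with a NULL place_id would never collide,
  -- because NULLs are distinct under SQL unique constraints
  CREATE UNIQUE INDEX locationevent_unique ON locationevent (datasource_id,
      COALESCE(place_id, 2147483647), COALESCE(obsstartdate, 'infinity'::date));
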
  1. Only replace IDs (*ID) with abbrs, so that plantname in /*_id/*/plantname doesn't get abbreviated
  2. set up read-only DB user for people to use to browse the DB
  3. add fki indexes on all fkey source columns (see the query after this list)
  4. TNRS-scrub the names in taxon_trait_view using the new ScrubbedTaxon view
  5. fix race condition in scrubbing daemons' lockfile algorithm, which frequently allows 2-3 scrub.make instances to process the same set of rows at once
  6. figure out what causes the "could not create unique index ... key is duplicated" errors and whether this is repeatable or random: occurs when one input row matches multiple output rows, due to different imports using different unique indexes of the same table
    • see inputs/REMIB/Specimen/logs/2012-09-21-16-37-57.log.sql, inputs/VegBank/stemcount/logs/2012-09-21-17-56-19.log.sql
    • appears to be related to index conditions, where not all rows satisfy the condition
  7. Deal with the missing plantnames error in the SALVIAS organisms import: see plotObservations.PlotObsID = 145483: hasn't been a problem in a while
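
For the fki indexes item above, the fkey source columns can be listed from the catalog so that CREATE INDEX statements can be generated for any that lack an index; a sketch along these lines:

  SELECT conrelid::regclass AS tablename, attname AS fkey_col
  FROM pg_constraint
  JOIN pg_attribute ON attrelid = conrelid AND attnum = ANY (conkey)
  WHERE contype = 'f'
  ORDER BY 1, 2;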

Features

  1. rename all README.TXT to _README.TXT so they sort at the top of the folder
  2. change plain-text wiki code blocks to language blocks, now that language blocks no longer display with line numbers
  3. db_xml.put(): Add runtime _if optimization like for _alt
  4. recluster tables periodically on pkey to facilitate joins and updates by pkey
  5. sql_gen.simplify_expr(): Support identifiers with embedded ()
  6. move _alt optimization that just returns the first arg if it's non-NULL to xml_func.simplify() (after tagging the XML tree with the nullability of each node)
  7. staging tables and derived temp tables: apply a NOT NULL constraint to every column that will accept it
  8. tnrs_db: Lock TNRS.tnrs for writing to ensure that no two instances of tnrs_db are performing TNRS requests simultaneously (which would overburden and crash TNRS)
  9. sql_io.put_table(): Try import first with no rows in input table, so input table only needs to be generated if there are no unrecoverable errors in the zero-row run
  10. sql_io.put_table(): Support doing lookups of existing records without requiring a DuplicateKeyException, to support cases where one of the duplicate key columns is NOT NULL and not provided in the current hierarchical level
  11. Remove id="-1" from import templates
  12. Add separating line between each datasource in verbose make output
  13. escape XML tag names
  14. filename sorting supports negative numbers
  15. sql_io.py: put_table(): don't generate output pkeys table if the caller doesn't need it
  16. support NULL in all SQL function params that have a default value, and use coalesce() to apply the default value (see the sketch after this list)
  17. _dateRangeStart/_dateRangeEnd autodetect the range and date part separators
    • currently, only dates containing " " (space) are supported
  18. highlight/pretty-print UserWarnings to make them visible like exceptions
    • should allow them to be used with error_stats
  19. sql_io.put_table(): Allow col_defaults to contain output table column names, in the same way as default
  20. join: Add option to print "No input mapping" error even if there is a comment on the mapping
  21. support CSVs whose quotes are escaped with "\"
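
For the coalesce() defaults item above (item 16), a minimal sketch (the function and its default are hypothetical):

  -- coalesce() applies the default even when NULL is passed explicitly for units
  CREATE FUNCTION to_meters(value numeric, units text DEFAULT 'm')
  RETURNS numeric AS $$
      SELECT value * CASE coalesce(units, 'm') WHEN 'm' THEN 1 WHEN 'cm' THEN 0.01 END
  $$ LANGUAGE sql IMMUTABLE;
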
  1. Print summary stats before exiting if user sends SIGINT, SIGTERM, etc. to map
  2. Print command to restart import where it left off if user sends SIGINT, SIGTERM, etc. to map
  3. Restart import where it left off if user sends SIGHUP to map
  1. join: Support "bare" join column labels without a root, which should be treated as compatible with any root
  2. Mark autogenerated maps as such with a comment so that the user doesn't accidentally edit them
    • or don't keep them in version control (but then all make dependencies need to be on the machine where the code is checked out); keeping them in version control helps detect unwanted diffs
  3. Handle seasons in dates
  4. Handle unknown characters in dates (fuzzy option to parse()?)
  5. Don't require a {} XPath expression to be preceded by an element to attach the other_branches to
  6. Filter log files to allow comparison using diff
    • use debug2redmine?
  7. Compare filtered 2012-08-03 and 2012-08-01 import log files using WinMerge diff to ensure that they do the same thing (with different XML trees)
  1. Escape names of everything being inserted into the DB from a make target
    • This will help prevent SQL injection attacks when VegBIEN becomes public
  2. Set ON DELETE fkey behavior for nullable fields to SET NULL instead of CASCADE
  3. Warn if there's an index missing on a column used in a WHERE clause
    • need to support indexes on multiple columns
  4. In XPaths, make / following -> optional
  5. automate collision elimination of column names in cat_csv
    • see README.TXT section: "For every file with an error 'column "..." specified more than once'"
  6. sql.py: index_pkey(): recover if pkey exists
    • Error message: multiple primary keys for table "<table>" are not allowed
  7. see if there's a way to get exception detail info in SQLERRM (probably not, but would be useful for errors tables)
  8. db_xml.put_table(): don't subset table if less than partition_size and getting all rows
    • CREATE TABLE AS is fast (<1s), but the subsequent ANALYZE is comparatively slow (8s) (vegbiendev:/home/bien/inputs/Madidi/import/organisms.2012-07-27-22-54-00.log.sql)
    • use EXPLAIN's row count
  9. Garbage collect target records created for a source record to point to, where the source record is never inserted because of an error
    • but sometimes, only the target record is needed and the source record just happens to be part of the output mapping
  10. use param names in info_schema to order params when SQL functions with named params not supported (on old versions of Postgres)
  11. not all PL/Python exceptions should be translated to data_exception, because some should be handled by the import mechanism (e.g. not-null constraint errors)
  12. Make .last_cleanup targets silently (with -s)
  13. Only run tests on inputs whose maps have changed according to svn st
    • but run on all inputs if the schema has changed
  1. Remove verbose make output when checking whether external files are up to date (especially in make test)
  2. Parallelize import so it uses all 4 cores (less priority with col-based import, but still useful)
  3. Splitting sourcelist.name to sourcename.name should also split on ,
  4. sql_io.put_table(): use a left anti-join to remove existing rows before trying to insert new rows, in order to avoid creating holes in the indexes when the duplicate inserts are rolled back (see the sketch after this list)
  5. For derived maps installation, redirect stderr to the install log file
  6. make all map tools (join, etc.) case-insensitive
    • eliminates the need for case-sensitive/insensitive mappings
  7. Reconnect to the database if the connection is lost: hasn't been an issue in a long time; might have only been a problem for MySQL inputs, which are now CSV exports
    • Would be fixed by handling the error in run_query() and disconnecting
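
For the left anti-join item above (item 4), a minimal sketch (the staging table and column names are hypothetical):

  -- insert only rows not already in taxonlabel, so there are no duplicate-key
  -- rollbacks to leave holes in the indexes
  INSERT INTO taxonlabel (taxonomicname)
  SELECT DISTINCT i.taxonomicname
  FROM staging_input i
  LEFT JOIN taxonlabel t ON t.taxonomicname = i.taxonomicname
  WHERE t.taxonlabel_id IS NULL; -- anti-join: keep only unmatched rows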

Refactorings

  1. have local machine and vegbiendev back up separately to jupiter, rather than synchronizing via jupiter, which introduces unnecessary complexity in the local machine/vegbiendev synchronization process
    • this would avoid the need for many of the .rsync_filter/.rsync_ignore files, and the separate commands for syncing different parts of the directory tree
  2. inputs/input.Makefile $(svnFilesGlob): move unversioned files into separate subdirs with svn:ignore * , to avoid needing to explicitly add every versioned file by running make inputs/<datasrc>/add
  3. Auto-detect the CSV's NULL value and store this in the CSV dialect, for use by csv.reader
    • Make TsvReader use the dialect's NULL value
  4. Use the CSV dialect's NULL value
  5. remove dependency on $(bin)/join
  6. remove no longer used prefixes code
  7. Make xml_dom.NodeEntryIter return a namedtuple
  8. Use raise instead of raise e where possible to preserve whole stack trace
  9. aggregate SQL functions: use an array param to support an arbitrary # of args (see the sketch after this list)
    • _name is part of this, because it simplifies when it has only one arg
  10. sql_gen.py: to_str(): render renamed items (NamedCol, etc.) using a hint param that defines whether to include the AS "..." renaming or just the value
  11. Move all SQL query-generating functions to sql_gen.py
  12. Change all val parameters to value to standardize named parameters
  13. Split util into multiple libs
  14. sql_io.py put_table(): generate in_tables from the cols in the mapping param
  15. sql.py mk_select(): use new conditions syntax
    • once everything that uses conds uses new syntax, remove: elif isinstance(conds, dict): conds = conds.items()
  16. In map, factor out WrapIter/ListDict code into a common function
  17. Move map code that doesn't relate to command line invocation to separate lib file
  18. Handle parsing and getting of metadata in xpath.py parse() and get()
  19. For readability, use :: instead of CAST() in PostgreSQL queries (but retain CAST() usage for other DB engines)
  20. Makefiles: move after-line comments before the line when the comment isn't indented
  21. sql.py run_query(): don't remove PL/Python prefix
    • but first, row-based import would need to parse errors using wrapper functions, because it doesn't use errors tables
  22. use operator classes which compare NULL literally instead of COALESCE() in indexes
  23. db_xml.put_special_funcs _simplifyPath(): don't need to xpath.parse(next) ?
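
For the aggregate-functions item above (item 9), PostgreSQL's VARIADIC keeps the array param transparent to callers; a minimal sketch (the real _name semantics may differ):

  CREATE FUNCTION _name(VARIADIC parts text[]) RETURNS text AS $$
      SELECT array_to_string(parts, ' ') -- with one arg, this simplifies to that arg
  $$ LANGUAGE sql IMMUTABLE;

  -- usage: SELECT _name('Poa'), _name('Poa', 'annua');
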
  1. In common.Makefile, change the default $src_server (sync server) from vegbiendev to jupiter

Suggestions

  1. Look into using Sybase Powerbuilder or IBM Enterprise Vision to map data
    • The TACC (Texas Advanced Computing Center) people might have individual licenses they could let us use
  2. "For a next plot type I would suggest TurboVeg" (e-mail from Bob Peet on 2011-12-1)

Working group output

  1. either modify the loading scripts to use VegBIEN, or create a BIEN 2 -> VegBIEN loading script
  2. analytical database
    • version controlled
  3. validation

BIEN

  1. deficiencies in existing data
  2. time component
  3. taxonomy versioning: versioned DB backups
  4. taxon traits table
  5. good data entry tool
  6. UI and tools for porting data to and from VegX/VegCSV
  1. use cases
    • Brad to request from BIEN members; will compile for Aaron.
    • each use case will consist of:
      • analysis for which data was used (publication or in prep)
      • raw data sample
      • summary of manipulations needed to make data useable
      • shortcomings of data, challenges during data compilation/preparation

Mapping

  1. talk to Nick Spencer about mapping engine
    • will be made publicly available online
    • engine reusable for VegBIEN mapping? no: it is VegX-specific and just maps to VegX top-level tables, not nested XML paths

Databanks

  1. contact(s) for RAINFOR
  2. access to databanks' internal databases rather than just their source data
  1. CTFS schema
  2. login for SALVIAS: clone on nimoy in salvias_plots MySQL database