Project

General

Profile

Aggregating validations status


see also pipeline

see also validations on sparse datasources


issues

these apply in general to the datasources:

  1. Brad's queries do not follow the standard root->leaf join sequence, which makes it difficult to determine if his output queries are correct. the joins in these queries must generally be rewritten to use the standard root->leaf order before they can be confirmed correct.
    • current recommended join sequence (root->leaf order, as used in plots #6a):
      FROM plot
      JOIN locationevent USING (plot_id)
      LEFT JOIN project USING (project_id)
      JOIN taxonoccurrence USING (locationevent_id)
      JOIN aggregateoccurrence USING (taxonoccurrence_id)
      JOIN plantobservation USING (aggregateoccurrence_id)
      JOIN stemobservation USING (plantobservation_id)
      
    • note that recent schema changes have made it possible to significantly improve parts of the join sequence. these improvements should be made in all queries that do not yet use the recommended join sequence.
  2. tables that have undergone schema changes must always be vacuumed before querying them, or Postgres will use a slow query plan
  3. some of the queries need to be run with various query planner options disabled, to prevent the query planner from choosing a slow plan. (incorrect query plans are an ongoing bug in Postgres, because it does not support index hints and by default does not follow the join order.)
    • the following must always be off:
      SET enable_seqscan = off;
      
    • for queries that use an explicit join order (as well as some queries that don't), they will often be faster if you turn off join_collapse_limit to force using your join order. however, if you do this, you sometimes need to turn off other options as well:
      SET join_collapse_limit = 1; -- turn it off
      SET enable_sort = off;
      SET enable_hashjoin = off; -- better to keep this on if possible
      SET enable_mergejoin = off;
      
      fortunately, the combination of all of these is usually sufficient to get Postgres to use the right query plan.
  4. verbatim (unscrubbed) data is sometimes not stored in the database
    1. eg. text dates, and component dates that are missing components, are generally stored only in their parsed representation

datasources

crossed out = 100% done

in the order they will be validated in: [as decided in the 2014-04-03 conference call and an e-mail from Brad ]


representative:

BIEN2 traits

SALVIAS

task progress/# left at step
input query exists 2 left: #20,21
% input query written 19/21 = 90%
input query mapped to staging tables -
tested outside of pipeline -
query runs without errors -
% input query ready 19/21 = 90%
added to pipeline -
diff runs without errors -
checked for duplicated rows -
no duplicated rows -
% queries match 19/21 = 90%
standard join sequence 7 left: #11,12,13,14,16,17,18
cause of diff found 2 left: #6,10
output side appears correct -
input side appears correct -
% queries verified correct 10/21 = 48%
bugs found -
bugs fixed 1 left: #19
% bugs fixed 9/21 = 43%
requires reload -
% successful diff 9/21 = 43%
no diff #1,2,3,4,5,7,8,9,15
allowable diff -

bugs

  1. locationevent incorrectly being identified by date_start, which is not intended to be identifying
    • creates diff in #6,6a,10,10a,19
    • this occurs in the 3 plotMetadata_ rows with valid date_start date values: UYLN003,SUC-05,YAN-02 (see "SALVIAS"."plotMetadata_.errors")
    • causes 3 outer locationevents to be missing project_ids: UYLN003,SUC-05,YAN-02
    • also, there is an unexpected error:
      could not create unique index "locationevent_pkeys_pkey" 
      DETAIL:  Key ("PlotID")=(26119) is duplicated.
      
  2. some stem records are missing (#6,6a)
    • note that some of #6a's diff is instead due to the missing project_ids (above)

mismatches

  1. VegBIEN's definition of a plot is different from Brad's
    • Brad's output queries were written assuming that the location table stores only top-level plots, but will need to be rewritten because location also stores subplots
    • extracting just the top-level plots from the subplot hierarchy also required numerous schema changes
    • breaks #18
  2. the output queries depend on plots being grouped into projects, which is not the case for all plots datasources
  3. the taxon name is formed differently in the input and output queries
  4. note that queries #6,17 have mismatches on the same plot_codes. this may suggest that a similar error is affecting both.

NY

task progress/# left at step
% queries match 100%
input side appears correct 2 left: #8,9
% queries verified correct 14/16 = 88%
bugs found -
bugs fixed 4 left: #3,12,13(+a,b),15
% bugs fixed 10/16 = 63%
requires reload 5 left: #1,6,7,10,16
% successful diff 5/16 = 31%
no diff #2,4,5,11,14
allowable diff -

bugs

  1. a few rows (0.1%) are not imported
    • creates diff in #1 (1061 out of 941,107 rows)
    • note that there are 1060 rows with accessionNumber = NULL:
      SELECT "accessionNumber", COUNT(*) FROM "NY"."Ecatalog_all" GROUP BY "accessionNumber" HAVING COUNT(*) > 1
      
    • before fixing this, NY should be reloaded to fix the NULL accessionNumbers bug (below), and the existence of this bug rechecked
  2. habitat, substrate, vegetation should be stored in locationRemarks, not locality
    • creates diff in #15
    • requires schema change: add location.location_remarks
    • requires mappings change: mappings/VegCore-VegBIEN.csv: fields incorrectly mapped to location.locationnarrative (locality): remap to new location.location_remarks instead
  3. sometimes, multiple input fields are concatenated together in the XPath mappings to form a single output column
    • these fields should instead be concatenated together at stage II (source-general derived columns), so that their concatenation can be compared directly with what's stored in the database
    • creates diff in #15, which concatenates locality and habitat into locationnarrative
  4. only coordinates with both a latitude and a longitude are stored, due to NOT NULL constraints on latitude_deg and longitude_deg
    • affects 65/941,107 rows = 0.007% (#13b)
    • creates diff in #13[+a,b] (#13a: rows 1-53)
  5. verbatim dates are not stored in the database
    • this prevents later generating a parsed date for records our parsing functions were unable to parse
    • creates diff in #12
  6. verbatim coordinates are not stored in the database when the column is a valid double precision
    • this discards trailing zeros that would be useful in determining the precision of the value, in the absence of a coordinatePrecision
    • creates diff in #13[+a,b] (#13a: rows 54+)
  7. families are scrubbed (replacing 'Unknown' with NULL) rather than stored verbatim
    • creates diff in #3

fixed

  1. NULL accessionNumbers are preventing rows with an accessionNumber from having their specimenreplicate-related fields imported (other tables' fields are unaffected)
    • creates diff in #1,10,12,16
    • see error in NY.Ecatalog_all.errors > row 14 (accessionNumber) > ERROR: duplicate key value violates unique constraint "specimenreplicate_plantobservation_1_to_1", which applies a filter "NY.Ecatalog_all.accessionNumber" IS NULL to satisfy the unique constraint
  2. coordinates_unique unique constraint: add coordsaccuracy_m
    • might create diff in #13
  3. subspecies needs to be available in taxonverbatim.subspecies for use by the validations queries
    • creates diff in #6,7

mismatches

  • input queries that use the scientificName need to prepend the family to match what is used on the output side

provider-validated:

VegBank

task progress/# left at step
% output query ready 100%
input query exists 1 left: #20
% input query written 19/20 = 95%
input query mapped to staging tables 19 left: #1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
tested outside of pipeline -
query runs without errors -
% input query ready 0/20 = 0%

mismatches

  1. include the following columns: see #915
  2. the taxon names should be compared side-by-side instead of in separate lists (Martha): see VegBank.tnrs_names.xls, _plots_20_tnrs_names .

[the scripts need to] 1) form taxon names correctly before sending them to TNRS and 2) re-form the taxon names correctly after TNRS processing (including morphospecies). I think you need the following:

  • verbatim name
  • name as input to TNRS
  • name output from TNRS
  • name output from TNRS with the morphospecies concatenated again [including the morphospecies suffix separately, since it doesn't always make sense to concatenate it to the taxon name]

plots, denormalized:

CVS

CTFS


specimens:

NY

ARIZ

MO

U

UNCC

BRIT

HVAA

TEX

NCU

Canadensys:

ACAD

HIBG

JBM

MT

QFA

TRT

TRTE

UBC

WIN

REMIB

SpeciesLink

GBIF


special cases:

FIA


plots, normalized:

TEAM

Madidi

NVS


progress bar template

  • remove any sections that are not applicable to the datasource
  • to calculate the numerator of a % done: numerator of % done in previous section (or total # queries) - sum of all "# left" in tasks in current section
task progress/# left at step
output query exists [count of queries that need this step next] left: #[list of these queries]
% output query written 0/total_#_queries = 0%
output query mapped to VegBIEN -
tested outside of pipeline -
query runs without errors -
column list done -
runs without slowdowns -
% output query ready 0/total_#_queries = 0%
input query exists -
% input query written 0/total_#_queries = 0%
input query mapped to staging tables -
tested outside of pipeline -
query runs without errors -
% input query ready 0/total_#_queries = 0%
added to pipeline -
diff runs without errors -
checked for duplicated rows -
no duplicated rows -
% queries match 0/total_#_queries = 0%
standard join sequence -
cause of diff found -
output side appears correct -
input side appears correct -
% queries verified correct 0/total_#_queries = 0%
bugs found -
bugs fixed -
% bugs fixed 0/total_#_queries = 0%
requires reload -
% successful diff 0/total_#_queries = 0%
no diff #[list of these queries]
allowable diff #[list of these queries]

bugs

  1. ...

fixed

  1. bug

mismatches

  1. ...