- Table of contents
- Aggregating validations status
- datasources
- representative:
- provider-validated:
- plots, denormalized:
- specimens:
- special cases:
- plots, normalized:
- progress bar template
Aggregating validations status¶
- Aggregating validations on sparse datasources
- Aggregating validations pipeline
- Aggregating validations refactoring
- NY aggregating validations
- SALVIAS aggregating validations
pipeline¶
validations on sparse datasources¶
issues¶
these apply in general to the datasources:
- Brad's queries do not follow the standard root->leaf join sequence, which makes it difficult to determine if his output queries are correct. the joins in these queries must generally be rewritten to use the standard root->leaf order before they can be confirmed correct.
- current recommended join sequence (root->leaf order, as used in plots #6a):
FROM plot JOIN locationevent USING (plot_id) LEFT JOIN project USING (project_id) JOIN taxonoccurrence USING (locationevent_id) JOIN aggregateoccurrence USING (taxonoccurrence_id) JOIN plantobservation USING (aggregateoccurrence_id) JOIN stemobservation USING (plantobservation_id)
- note that recent schema changes have made it possible to significantly improve parts of the join sequence. these improvements should be made in all queries that do not yet use the recommended join sequence.
- current recommended join sequence (root->leaf order, as used in plots #6a):
- tables that have undergone schema changes must always be vacuumed before querying them, or Postgres will use a slow query plan
- some of the queries need to be run with various query planner options disabled, to prevent the query planner from choosing a slow plan. (incorrect query plans are an ongoing bug in Postgres, because it does not support index hints and by default does not follow the join order.)
- the following must always be off:
SET enable_seqscan = off;
- for queries that use an explicit join order (as well as some queries that don't), they will often be faster if you turn off
join_collapse_limit
to force using your join order. however, if you do this, you sometimes need to turn off other options as well:
fortunately, the combination of all of these is usually sufficient to get Postgres to use the right query plan.SET join_collapse_limit = 1; -- turn it off SET enable_sort = off; SET enable_hashjoin = off; -- better to keep this on if possible SET enable_mergejoin = off;
- the following must always be off:
- verbatim (unscrubbed) data is sometimes not stored in the database
- eg. text dates, and component dates that are missing components, are generally stored only in their parsed representation
datasources¶
crossed out = 100% done
in the order they will be validated in: [as decided in the 2014-04-03 conference call and an e-mail from Brad ]
representative:¶
BIEN2 traits¶
SALVIAS¶
task | progress/# left at step |
input query exists | 2 left: #20,21 |
% input query written | 19/21 = 90% |
input query mapped to staging tables | - |
tested outside of pipeline | - |
query runs without errors | - |
% input query ready | 19/21 = 90% |
added to pipeline | - |
diff runs without errors | - |
checked for duplicated rows | - |
no duplicated rows | - |
% queries match | 19/21 = 90% |
standard join sequence | 7 left: #11,12,13,14,16,17,18 |
cause of diff found | 2 left: #6,10 |
output side appears correct | - |
input side appears correct | - |
% queries verified correct | 10/21 = 48% |
bugs found | - |
bugs fixed | 1 left: #19 |
% bugs fixed | 9/21 = 43% |
requires reload | - |
% successful diff | 9/21 = 43% |
no diff | #1,2,3,4,5,7,8,9,15 |
allowable diff | - |
bugs¶
locationevent
incorrectly being identified bydate_start
, which is not intended to be identifying- creates diff in #6,6a,10,10a,19
- this occurs in the 3
plotMetadata_
rows with validdate_start
date values:UYLN003,SUC-05,YAN-02
(see"SALVIAS"."plotMetadata_.errors"
) - causes 3 outer
locationevents
to be missingproject_ids
:UYLN003,SUC-05,YAN-02
- also, there is an unexpected error:
could not create unique index "locationevent_pkeys_pkey" DETAIL: Key ("PlotID")=(26119) is duplicated.
- some stem records are missing (#6,6a)
- note that some of #6a's diff is instead due to the missing
project_ids
(above)
- note that some of #6a's diff is instead due to the missing
mismatches¶
- VegBIEN's definition of a plot is different from Brad's
- Brad's output queries were written assuming that the
location
table stores only top-level plots, but will need to be rewritten becauselocation
also stores subplots - extracting just the top-level plots from the subplot hierarchy also required numerous schema changes
- breaks #18
- Brad's output queries were written assuming that the
- the output queries depend on plots being grouped into projects, which is not the case for all plots datasources
- the taxon name is formed differently in the input and output queries
- breaks #13,14
- note that queries #6,17 have mismatches on the same
plot_codes
. this may suggest that a similar error is affecting both.
NY¶
task | progress/# left at step |
% queries match | 100% |
input side appears correct | 2 left: #8,9 |
% queries verified correct | 14/16 = 88% |
bugs found | - |
bugs fixed | 4 left: #3,12,13(+a,b),15 |
% bugs fixed | 10/16 = 63% |
requires reload | 5 left: #1,6,7,10,16 |
% successful diff | 5/16 = 31% |
no diff | #2,4,5,11,14 |
allowable diff | - |
bugs¶
- a few rows (0.1%) are not imported
- creates diff in #1 (1061 out of 941,107 rows)
- note that there are 1060 rows with
accessionNumber = NULL
:SELECT "accessionNumber", COUNT(*) FROM "NY"."Ecatalog_all" GROUP BY "accessionNumber" HAVING COUNT(*) > 1
- before fixing this, NY should be reloaded to fix the NULL accessionNumbers bug (below), and the existence of this bug rechecked
habitat
,substrate
,vegetation
should be stored inlocationRemarks
, notlocality
- creates diff in #15
- requires schema change: add
location.location_remarks
- requires mappings change:
mappings/VegCore-VegBIEN.csv
: fields incorrectly mapped tolocation.locationnarrative
(locality): remap to newlocation.location_remarks
instead
- sometimes, multiple input fields are concatenated together in the XPath mappings to form a single output column
- these fields should instead be concatenated together at stage II (source-general derived columns), so that their concatenation can be compared directly with what's stored in the database
- creates diff in #15, which concatenates
locality
andhabitat
intolocationnarrative
- only coordinates with both a latitude and a longitude are stored, due to
NOT NULL
constraints onlatitude_deg
andlongitude_deg
- verbatim dates are not stored in the database
- this prevents later generating a parsed date for records our parsing functions were unable to parse
- creates diff in #12
- verbatim coordinates are not stored in the database when the column is a valid
double precision
- this discards trailing zeros that would be useful in determining the precision of the value, in the absence of a coordinatePrecision
- creates diff in #13[+a,b] (#13a: rows 54+)
- families are scrubbed (replacing
'Unknown'
withNULL
) rather than stored verbatim- creates diff in #3
fixed¶
NULLaccessionNumbers
are preventing rows with anaccessionNumber
from having theirspecimenreplicate
-related fields imported (other tables' fields are unaffected)- creates diff in #1,10,12,16
- see error in
NY.Ecatalog_all.errors
> row 14 (accessionNumber
) >ERROR: duplicate key value violates unique constraint "specimenreplicate_plantobservation_1_to_1"
, which applies a filter"NY.Ecatalog_all.accessionNumber" IS NULL
to satisfy the unique constraint
coordinates_unique
unique constraint: addcoordsaccuracy_m
- might create diff in #13
subspecies needs to be available intaxonverbatim.subspecies
for use by the validations queries- creates diff in #6,7
mismatches¶
- input queries that use the scientificName need to prepend the family to match what is used on the output side
- affects #6,7,8,9
provider-validated:¶
VegBank¶
task | progress/# left at step |
% output query ready | 100% |
input query exists | 1 left: #20 |
% input query written | 19/20 = 95% |
input query mapped to staging tables | 19 left: #1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 |
tested outside of pipeline | - |
query runs without errors | - |
% input query ready | 0/20 = 0% |
mismatches¶
- include the following columns: see #915
the taxon names should be compared side-by-side instead of in separate lists (Martha):see VegBank.tnrs_names.xls, _plots_20_tnrs_names .
[the scripts need to] 1) form taxon names correctly before sending them to TNRS and 2) re-form the taxon names correctly after TNRS processing (including morphospecies). I think you need the following:
verbatim namename as input to TNRSname output from TNRSname output from TNRS with the morphospecies concatenated again[including the morphospecies suffix separately, since it doesn't always make sense to concatenate it to the taxon name]
plots, denormalized:¶
CVS¶
CTFS¶
specimens:¶
NY¶
ARIZ¶
MO¶
U¶
UNCC¶
BRIT¶
HVAA¶
TEX¶
NCU¶
Canadensys:¶
ACAD¶
HIBG¶
JBM¶
MT¶
QFA¶
TRT¶
TRTE¶
UBC¶
WIN¶
REMIB¶
SpeciesLink¶
GBIF¶
special cases:¶
FIA¶
plots, normalized:¶
TEAM¶
Madidi¶
NVS¶
progress bar template¶
- remove any sections that are not applicable to the datasource
- to calculate the numerator of a % done: numerator of % done in previous section (or total # queries) - sum of all "# left" in tasks in current section
task | progress/# left at step |
output query exists | [count of queries that need this step next] left: #[list of these queries] |
% output query written | 0/total_#_queries = 0% |
output query mapped to VegBIEN | - |
tested outside of pipeline | - |
query runs without errors | - |
column list done | - |
runs without slowdowns | - |
% output query ready | 0/total_#_queries = 0% |
input query exists | - |
% input query written | 0/total_#_queries = 0% |
input query mapped to staging tables | - |
tested outside of pipeline | - |
query runs without errors | - |
% input query ready | 0/total_#_queries = 0% |
added to pipeline | - |
diff runs without errors | - |
checked for duplicated rows | - |
no duplicated rows | - |
% queries match | 0/total_#_queries = 0% |
standard join sequence | - |
cause of diff found | - |
output side appears correct | - |
input side appears correct | - |
% queries verified correct | 0/total_#_queries = 0% |
bugs found | - |
bugs fixed | - |
% bugs fixed | 0/total_#_queries = 0% |
requires reload | - |
% successful diff | 0/total_#_queries = 0% |
no diff | #[list of these queries] |
allowable diff | #[list of these queries] |
bugs¶
- ...
fixed¶
bug
mismatches¶
- ...