2014-04-03 conference call¶
Martha's notes¶
upcoming¶
- the next conference call is next week at the usual time (Th. 9am PT/9am Tucson/12pm ET)
availability¶
- Brad will now be able to stay until the end of April, not March
- however, Brad will not be available, even for conference calls, after April
- Mark will be around for the next 2 weeks, then traveling
- please add your availability for spring 2014 to the *spreadsheet*:
Loading Google Spreadsheet...
decisions¶
aggregating validations¶
- we do want to run all queries for each datasource, even those queries that are not applicable to all datasources, because they contain columns that some datasources do not provide (Brad)
however, this is actually not necessary, as shown in the validations on sparse datasources infographic - even though the validations just test the normalization mechanism, not the source-specific preprocessing, we want to test all datasources, in case the example datasource used to develop the queries does not test every edge case
- assume person running scripts has access to the shell as well as the DB (ie. the pipeline does not have to be self-contained inside the DB) (Brad)
specimens aggregating validations¶
will implement workaround for the slowdown in query #12 instead of moving validations sequencing to shell scripts to prevent a single query from rolling back the the set of diffs
note that if there are future query slowdowns, this can also be fixed by settingstatement_timeout
locally onpublic_validations.rematerialize_out_view(text, regclass)
.
plots aggregating validations¶
- defer denormalizing normalized plots datasources (TEAM, Madidi) so that this doesn't hold up the aspects of the validations that need to get done while Brad is available
import process¶
- should switch to importing to a denormalized VegCore schema, so that we don't import each datasource differently
Postgres upgrades¶
- in general, wait a bit before upgrading (Martha)
to do for Aaron¶
aggregating validations¶
specimens queriesimplement workaround for the slowdown in query #12run pipeline on NY to generate diffs
- plots queries
- write denormalized plots input queries, using VegBank as the example datasource
- finish fixing plots output queries
- validate datasources
- SALVIAS
- denormalized plots datasources: VegBank, CVS, CTFS
- specimens: NY, others
- FIA (special case, with separate input queries)
- normalized plots datasources: TEAM, Madidi
- denormalize
- validate
later¶
- move validations sequencing to shell scripts
import process¶
estimate how long it would take to switch to importing to a denormalized VegCore schemasee Importing to a denormalized VegCore schema- create denormalized VegCore schema
- should contain all the terms in the VegCore data dictionary
- switch to importing to a denormalized VegCore schema
- denormalize the remaining normalized datasources: TEAM, Madidi
- materialize the denormalized views of the staging tables
- rename all specimens tables and materialized views to the same table name
- for each datasource, add
NULL
columns for all VegCore terms that are not present in the datasource - remove source-specific columns
- re-order columns to match the order in the denormalized VegCore schema
info¶
TNRS¶
- after-next PHP upgrade will break TNRS (Brad)
MySQL¶
- MySQL nested queries can be slow, because MySQL sometimes won't use indexes for them (Brad)
- can solve this by running the nested query to a temp table and then using the temp table