Project

General

Profile

2014-04-03 conference call

Martha's notes

upcoming

  • the next conference call is next week at the usual time (Th. 9am PT/9am Tucson/12pm ET)

availability

  • Brad will now be able to stay until the end of April, not March
  • however, Brad will not be available, even for conference calls, after April
  • Mark will be around for the next 2 weeks, then traveling
  • please add your availability for spring 2014 to the *spreadsheet*:

Loading Google Spreadsheet...

decisions

aggregating validations

  • we do want to run all queries for each datasource, even those queries that are not applicable to all datasources, because they contain columns that some datasources do not provide (Brad)
    however, this is actually not necessary, as shown in the validations on sparse datasources infographic
  • even though the validations just test the normalization mechanism, not the source-specific preprocessing, we want to test all datasources, in case the example datasource used to develop the queries does not test every edge case
  • assume person running scripts has access to the shell as well as the DB (ie. the pipeline does not have to be self-contained inside the DB) (Brad)

specimens aggregating validations

  • will implement workaround for the slowdown in query #12 instead of moving validations sequencing to shell scripts to prevent a single query from rolling back the the set of diffs
    note that if there are future query slowdowns, this can also be fixed by setting statement_timeout locally on public_validations.rematerialize_out_view(text, regclass) .

plots aggregating validations

import process

  • should switch to importing to a denormalized VegCore schema, so that we don't import each datasource differently

Postgres upgrades

  • in general, wait a bit before upgrading (Martha)

to do for Aaron

aggregating validations

  1. specimens queries
    1. implement workaround for the slowdown in query #12
    2. run pipeline on NY to generate diffs
  2. plots queries
    1. write denormalized plots input queries, using VegBank as the example datasource
    2. finish fixing plots output queries
  3. validate datasources
    1. SALVIAS
    2. denormalized plots datasources: VegBank, CVS, CTFS
    3. specimens: NY, others
    4. FIA (special case, with separate input queries)
    5. normalized plots datasources: TEAM, Madidi
      1. denormalize
      2. validate

later

  • move validations sequencing to shell scripts

import process

  1. estimate how long it would take to switch to importing to a denormalized VegCore schema see Importing to a denormalized VegCore schema
  2. create denormalized VegCore schema
  3. switch to importing to a denormalized VegCore schema
    1. denormalize the remaining normalized datasources: TEAM, Madidi
    2. materialize the denormalized views of the staging tables
    3. rename all specimens tables and materialized views to the same table name
    4. for each datasource, add NULL columns for all VegCore terms that are not present in the datasource
    5. remove source-specific columns
    6. re-order columns to match the order in the denormalized VegCore schema

info

TNRS

  • after-next PHP upgrade will break TNRS (Brad)

MySQL

  • MySQL nested queries can be slow, because MySQL sometimes won't use indexes for them (Brad)
    • can solve this by running the nested query to a temp table and then using the temp table