Project

General

Profile

Individual datasource refresh

datasource removal runtimes

  • small
    • NVS: 1.5 min ("1m40.764s"); before fixes: 1.2 h ("72m42.756s") = 50x improvement
  • medium-sized
    • bien2_traits: 2.2 min ("2:14.36")
  • large
    • CVS: 7.5 min ("453137.787 ms")

Optimization

  1. avoid ANALYZE-ing the output table after every step for small datasources. this is only needed for large datasources, where the updated statistics would change the query plan for queries on the output table.
    autoanalyzes now occur instead of the explicit ANALYZEs; verify with:
    sudo grep -F 'automatic analyze' /var/lib/postgresql/9.1/main/pg_log/postgresql-2013-08-31_*
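the automatic analyzes are triggered by the autovacuum thresholds; a quick way to inspect them (real pg_settings names; the values shown on a stock install are the server defaults, not project-specific tuning):

```sql
-- show the settings that control when autovacuum runs an automatic
-- analyze on a table (threshold + scale_factor * rows)
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_analyze_threshold',
               'autovacuum_analyze_scale_factor');
```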
  2. ensure that SELECTs on output tables always use an index scan
    e.g. inputs/NVS/AggregateOccurrence/logs/r10837.log.sql:3164 uses a seq scan/sort (search for ^Took \d:\d[1-9]):
    INSERT INTO "in#7_distinct" 
    SELECT
    DISTINCT ON ("in#7"."locationevent_pkeys.out.locationevent_id", "in#7"."_alt(1=Verbatim Code, 2=Species Code).result")
    "in#7".*
    FROM "in#7" 
    LEFT JOIN "taxonoccurrence" ON
    COALESCE("taxonoccurrence"."locationevent_id", CAST(2147483647 AS integer)) = COALESCE("in#7"."locationevent_pkeys.out.locationevent_id", CAST(2147483647 AS integer))
    AND COALESCE("taxonoccurrence"."authortaxoncode", CAST('\N' AS text)) = COALESCE("in#7"."_alt(1=Verbatim Code, 2=Species Code).result", CAST('\N' AS text))
    WHERE "taxonoccurrence"."taxonoccurrence_id" IS NULL
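one way to avoid the seq scan/sort would be an expression index on the inner side of the anti-join that matches the COALESCE-wrapped join keys verbatim; a sketch (the index name is illustrative, and this assumes the planner can match the expressions exactly):

```sql
-- hypothetical expression index on "taxonoccurrence" matching the
-- COALESCE-wrapped join columns in the query above, so the anti-join
-- can use an index scan instead of a seq scan/sort
CREATE INDEX taxonoccurrence_coalesce_join_idx ON taxonoccurrence (
    (COALESCE(locationevent_id, CAST(2147483647 AS integer))),
    (COALESCE(authortaxoncode, CAST('\N' AS text)))
);
```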
    
  3. check each datasource's logs for any remaining slow queries (after re-running the column-based import), to ensure that there are indexes on all necessary columns:
    1. open each inputs/{.,}*/*/logs/r10848.log.sql
    2. search for ^Took \d:\d[1-9]
    3. in matches, look for queries whose EXPLAIN contains Seq Scan
      • inputs/GBIF/raw_occurrence_record_plants/logs/r10848.log.sql:978574 uses Seq Scan/Sort, even though an equivalent test query correctly uses Index Scan:
        (but maybe the query planner chooses a Seq Scan because it estimates that most of the place entries come from GBIF? this explanation would likely apply only to GBIF, since it is the largest datasource.)
        SELECT *
        FROM "place" WHERE
        "place"."source_id" = 13
        AND COALESCE("place"."continent", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."country", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."stateprovince", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."county", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."coordinates_id", CAST(2147483647 AS integer)) = 2147483647
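the three log-scanning steps above can be sketched as a small shell function (the directory layout and revision number are taken from the text; the grep patterns are illustrative):

```shell
# scan import logs for slow queries and flag logs whose EXPLAIN output
# contains a Seq Scan
scan_logs() {
    # $1: root directory containing <datasource>/<table>/logs/ trees
    for log in "$1"/*/*/logs/r10848.log.sql; do
        [ -f "$log" ] || continue
        # step 2: find queries that took >= 1 min ("Took M:SS" lines)
        if grep -Eq '^Took [0-9]:[0-9][1-9]' "$log"; then
            # step 3: report how many plan lines used a Seq Scan
            echo "$log: $(grep -c 'Seq Scan' "$log") Seq Scan(s)"
        fi
    done
}
```

usage: scan_logs inputs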
        
  4. test separate refresh of each datasource
  5. avoid selecting rows back out of a large table (currently done when a unique constraint is violated and the existing row needs to be fetched)
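    PostgreSQL 9.1 (the version in the log paths) has no built-in upsert, but on 9.5+ the select-back can be avoided entirely with INSERT ... ON CONFLICT ... RETURNING; a sketch with illustrative table/column names, assuming a unique constraint on the conflict columns:

```sql
-- on PostgreSQL 9.5+, an upsert can return the row's key in the same
-- statement, avoiding a separate SELECT after a unique-constraint
-- violation (names are illustrative, not the project's actual schema)
INSERT INTO taxonoccurrence (locationevent_id, authortaxoncode)
VALUES (42, 'ABC')
ON CONFLICT (locationevent_id, authortaxoncode)
    DO UPDATE SET authortaxoncode = EXCLUDED.authortaxoncode
RETURNING taxonoccurrence_id;
```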