Individual datasource refresh¶
datasource removal runtimes¶
- small
- NVS: 1.5 min ("1m40.764s"); before fixes: 1.2 h ("72m42.756s") = 50x improvement
- medium-sized
- bien2_traits: 2 min ("2:14.36")
- large
- CVS: 7.5 min ("453137.787 ms")
Optimization¶
avoidANALYZE
-ing the output table after every step for small datasources. this is only needed for large datasources which would change the query plan for queries on the output table.
autoanalyzes are now occurring instead of the explicitANALYZEs
:sudo grep -F 'automatic analyze' /var/lib/postgresql/9.1/main/pg_log/postgresql-2013-08-31_*
ensure SELECTs on output tables always use an index scan
e.g.inputs/NVS/AggregateOccurrence/logs/r10837.log.sql:3164
uses a seq scan/sort (search for^Took \d:\d[1-9]
):INSERT INTO "in#7_distinct" SELECT DISTINCT ON ("in#7"."locationevent_pkeys.out.locationevent_id", "in#7"."_alt(1=Verbatim Code, 2=Species Code).result") "in#7".* FROM "in#7" LEFT JOIN "taxonoccurrence" ON COALESCE("taxonoccurrence"."locationevent_id", CAST(2147483647 AS integer)) = COALESCE("in#7"."locationevent_pkeys.out.locationevent_id", CAST(2147483647 AS integer)) AND COALESCE("taxonoccurrence"."authortaxoncode", CAST('\N' AS text)) = COALESCE("in#7"."_alt(1=Verbatim Code, 2=Species Code).result", CAST('\N' AS text)) WHERE "taxonoccurrence"."taxonoccurrence_id" IS NULL
- check each datasource's logs for any remaining slow queries (after re-running column-based import), to ensure that there are indexes on all necessary columns:
- open each
inputs/{.,}*/*/logs/r10848.log.sql
- search for
^Took \d:\d[1-9]
- in matches, look for queries whose EXPLAIN contains
Seq Scan
inputs/GBIF/raw_occurrence_record_plants/logs/r10848.log.sql:978574
uses Seq Scan/Sort, even though an equivalent test query correctly uses Index Scan:
(but maybe the query planner does a Seq Scan because it feels that most of theplace
entries come from GBIF? this explanation would likely only be applicable to GBIF, since it is the largest datasource.)SELECT * FROM "place" WHERE "place"."source_id" = 13 AND COALESCE("place"."continent", CAST('\N' AS text)) = '\N' AND COALESCE("place"."country", CAST('\N' AS text)) = '\N' AND COALESCE("place"."stateprovince", CAST('\N' AS text)) = '\N' AND COALESCE("place"."county", CAST('\N' AS text)) = '\N' AND COALESCE("place"."coordinates_id", CAST(2147483647 AS integer)) = 2147483647
- open each
- test separate refresh of each datasource
- avoid selecting rows back out of a large table (used when a unique constraint is violated and the existing row needs to be fetched)