Project

General

Profile

Individual datasource refresh

datasource removal runtimes

  • small
    • NVS: 1.5 min ("1m40.764s"); before fixes: 1.2 h ("72m42.756s") = 50x improvement
  • medium-sized
    • bien2_traits: 2.2 min ("2:14.36")
  • large
    • CVS: 7.5 min ("453137.787 ms")

Optimization

  1. avoid ANALYZE-ing the output table after every step for small datasources. this is only needed for large datasources, where the updated statistics would change the query plan for queries on the output table.
    autoanalyzes now occur instead of the explicit ANALYZEs; verify with:
    sudo grep -F 'automatic analyze' /var/lib/postgresql/9.1/main/pg_log/postgresql-2013-08-31_*
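the automatic analyzes are triggered by the autovacuum thresholds; a quick way to inspect them (real pg_settings names; the values shown on a stock install are the server defaults, not project-specific tuning):

```sql
-- show the settings that control when autovacuum runs an automatic
-- analyze on a table (threshold + scale_factor * rows)
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_analyze_threshold',
               'autovacuum_analyze_scale_factor');
```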
  2. ensure that SELECTs on output tables always use an index scan
    e.g. inputs/NVS/AggregateOccurrence/logs/r10837.log.sql:3164 uses a seq scan/sort (search for ^Took \d:\d[1-9]):
    INSERT INTO "in#7_distinct" 
    SELECT
    DISTINCT ON ("in#7"."locationevent_pkeys.out.locationevent_id", "in#7"."_alt(1=Verbatim Code, 2=Species Code).result")
    "in#7".*
    FROM "in#7" 
    LEFT JOIN "taxonoccurrence" ON
    COALESCE("taxonoccurrence"."locationevent_id", CAST(2147483647 AS integer)) = COALESCE("in#7"."locationevent_pkeys.out.locationevent_id", CAST(2147483647 AS integer))
    AND COALESCE("taxonoccurrence"."authortaxoncode", CAST('\N' AS text)) = COALESCE("in#7"."_alt(1=Verbatim Code, 2=Species Code).result", CAST('\N' AS text))
    WHERE "taxonoccurrence"."taxonoccurrence_id" IS NULL
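one way to avoid the seq scan/sort would be an expression index on the inner side of the anti-join that matches the COALESCE-wrapped join keys verbatim; a sketch (the index name is illustrative, and this assumes the planner can match the expressions exactly):

```sql
-- hypothetical expression index on "taxonoccurrence" matching the
-- COALESCE-wrapped join columns in the query above, so the anti-join
-- can use an index scan instead of a seq scan/sort
CREATE INDEX taxonoccurrence_coalesce_join_idx ON taxonoccurrence (
    (COALESCE(locationevent_id, CAST(2147483647 AS integer))),
    (COALESCE(authortaxoncode, CAST('\N' AS text)))
);
```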
    
  3. check each datasource's logs for any remaining slow queries (after re-running the column-based import), to ensure that there are indexes on all necessary columns:
    1. open each inputs/{.,}*/*/logs/r10848.log.sql
    2. search for ^Took \d:\d[1-9]
    3. in matches, look for queries whose EXPLAIN contains Seq Scan
      • inputs/GBIF/raw_occurrence_record_plants/logs/r10848.log.sql:978574 uses Seq Scan/Sort, even though an equivalent test query correctly uses Index Scan:
        (but maybe the query planner chooses a Seq Scan because it estimates that most of the place entries come from GBIF? this explanation would likely apply only to GBIF, since it is the largest datasource.)
        SELECT *
        FROM "place" WHERE
        "place"."source_id" = 13
        AND COALESCE("place"."continent", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."country", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."stateprovince", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."county", CAST('\N' AS text)) = '\N'
        AND COALESCE("place"."coordinates_id", CAST(2147483647 AS integer)) = 2147483647
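the three log-scanning steps above can be sketched as a small shell function (the directory layout and revision number are taken from the text; the grep patterns are illustrative):

```shell
# scan import logs for slow queries and flag logs whose EXPLAIN output
# contains a Seq Scan
scan_logs() {
    # $1: root directory containing <datasource>/<table>/logs/ trees
    for log in "$1"/*/*/logs/r10848.log.sql; do
        [ -f "$log" ] || continue
        # step 2: find queries that took >= 1 min ("Took M:SS" lines)
        if grep -Eq '^Took [0-9]:[0-9][1-9]' "$log"; then
            # step 3: report how many plan lines used a Seq Scan
            echo "$log: $(grep -c 'Seq Scan' "$log") Seq Scan(s)"
        fi
    done
}
```

usage: scan_logs inputs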
        
  4. test separate refresh of each datasource
  5. avoid selecting rows back out of a large table (currently done when a unique constraint is violated and the existing row needs to be fetched)
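    PostgreSQL 9.1 (the version in the log paths) has no built-in upsert, but on 9.5+ the select-back can be avoided entirely with INSERT ... ON CONFLICT ... RETURNING; a sketch with illustrative table/column names, assuming a unique constraint on the conflict columns:

```sql
-- on PostgreSQL 9.5+, an upsert can return the row's key in the same
-- statement, avoiding a separate SELECT after a unique-constraint
-- violation (names are illustrative, not the project's actual schema)
INSERT INTO taxonoccurrence (locationevent_id, authortaxoncode)
VALUES (42, 'ABC')
ON CONFLICT (locationevent_id, authortaxoncode)
    DO UPDATE SET authortaxoncode = EXCLUDED.authortaxoncode
RETURNING taxonoccurrence_id;
```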