Task #887
Updated by Aaron Marcuse-Kubitza over 10 years ago
h3. issue
* in the last full-database import, this caused disk space errors in 29 of [[VegBIEN_contents#datasources|41 datasources]]:
<pre>
ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk
export version=r13016
grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql
# and uniqify by datasource
</pre>
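a minimal sketch of the "uniqify by datasource" step above, assuming the matching paths all have the form @inputs/<datasource>/<table>/logs/$version.log.sql@ (so the datasource name is the second path component):
<pre>
# list the distinct datasources whose logs contain the error
grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql \
	| cut -d/ -f2 | sort -u
# count them (29 in the last full-database import)
grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql \
	| cut -d/ -f2 | sort -u | wc -l
</pre>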
* in a test run, importing even moderate-sized datasources (e.g. NY, SALVIAS) uses up all the available disk space and crashes the import:
<pre>
ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk
export version=test_import
grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql
# the list includes all the datasources in the test run: NY, SALVIAS
</pre>
* because even the small test run crashed, the problem is unlikely to be just a matter of data volume, and is most likely a bug in Postgres or Linux itself. this unfortunately means that we *can't run the import until we find the Postgres/Linux bug* that is causing the problem.
* the disk fills up during queries that normally run quickly but now appear to run indefinitely, until all available disk space is consumed (note the @query_start@ times above each query, compared to when the snapshot was taken; a sketch of how to take such a snapshot follows the code blocks below):
<pre><code class="SQL">
pg_stat_activity snapshot taken: "Apr 2 11:42:26 2014"
running for 5.5 h:
"query_start"
"2014-04-02 06:20:01.952174-07"
/*NVS.StemObservation*/ CREATE TEMP TABLE "stemobservation_pkeys" AS
SELECT
"in#11"."row_num"
, "stemobservation"."stemobservation_id" AS "out.stemobservation_id"
FROM "in#11"
JOIN "stemobservation" ON COALESCE("stemobservation"."height_m", CAST('NaN' AS double precision)) = COALESCE("in#11"."NVS.StemObservation.Height::double", CAST('NaN' AS double precision))
/* EXPLAIN:
Merge Join (cost=1045217.21..95909185.15 rows=6323041431 width=8)
Merge Cond: ((COALESCE(stemobservation.height_m, 'NaN'::double precision)) = (COALESCE("in#11"."NVS.StemObservation.Height::double", 'NaN'::double precision)))
-> Sort (cost=30039.56..30492.76 rows=181279 width=12)
Sort Key: (COALESCE(stemobservation.height_m, 'NaN'::double precision))
-> Seq Scan on stemobservation (cost=0.00..14206.79 rows=181279 width=12)
-> Materialize (cost=1015177.64..1050057.81 rows=6976033 width=12)
-> Sort (cost=1015177.64..1032617.73 rows=6976033 width=12)
Sort Key: (COALESCE("in#11"."NVS.StemObs
*/
running for 21 h:
"query_start"
"2014-04-01 14:56:05.573418-07"
/*CVS.taxon_observation.***/ CREATE TEMP TABLE "location_pkeys" AS
SELECT
"in#5"."row_num"
, "location"."location_id" AS "out.location_id"
FROM "in#5"
JOIN "location" ON ("location"."accesslevel" = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" OR ("location"."accesslevel" IS NULL AND "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL))
/* EXPLAIN:
Nested Loop (cost=0.00..83049883712.45 rows=27683225000 width=8)
Join Filter: ((location.accesslevel = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel") OR ((location.accesslevel IS NULL) AND ("in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL)))
-> Seq Scan on location (cost=0.00..175778.45 rows=5536645 width=8)
-> Materialize (cost=0.00..35434.00 rows=1000000 width=8)
-> Seq Scan on "in#5" (cost=0.00..30434.00 rows=1000000 width=8)
*/
</code></pre><pre><code class="SQL">
config: Ubuntu 14.04 with official Ubuntu version of Postgres
pg_stat_activity snapshot taken: "2014-04-23 19:20:59.165896-07"
running for 2 h:
"query_start"
"2014-04-23 17:23:43.400618-07"
/*ARIZ.omoccurrences*/ CREATE TEMP TABLE "location_pkeys" AS
SELECT
"in#6"."occurrenceID"
, "location"."location_id" AS "out.location_id"
FROM "in#6"
JOIN "location" ON COALESCE("location"."elevation_m", CAST('NaN' AS double precision)) = COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", CAST('NaN' AS double precision))
/* EXPLAIN:
Merge Join (cost=52039.20..3508151.98 rows=230343442 width=8)
Merge Cond: ((COALESCE(location.elevation_m, 'NaN'::double precision)) = (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision)))
-> Sort (cost=23780.86..24261.44 rows=192230 width=12)
Sort Key: (COALESCE(location.elevation_m, 'NaN'::double precision))
-> Seq Scan on location (cost=0.00..6910.30 rows=192230 width=12)
-> Sort (cost=28258.34..28857.47 rows=239654 width=12)
Sort Key: (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision))
-
*/
uses 350 GB for sort temp storage:
# before stopping import (160 GB free):
df
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vegbiendev-root 2.0T 1.8T 160G 92% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 16G 4.0K 16G 1% /dev
tmpfs 3.2G 400K 3.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 16G 152K 16G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/vda1 485M 66M 394M 15% /boot
# after stopping import (516 GB free):
df
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vegbiendev-root 2.0T 1.4T 516G 73% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 16G 4.0K 16G 1% /dev
tmpfs 3.2G 400K 3.2G 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 16G 152K 16G 1% /run/shm
none 100M 0 100M 0% /run/user
/dev/vda1 485M 66M 394M 15% /boot
</code></pre>
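a minimal sketch of how such a @pg_stat_activity@ snapshot can be taken (the database name @vegbien@ and the exact columns captured are assumptions):
<pre>
ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk
# snapshot the currently active queries, oldest first, with how long each has been running
psql vegbien -c "SELECT now() AS snapshot_taken, query_start, now() - query_start AS running_for, query
	FROM pg_stat_activity WHERE state = 'active' ORDER BY query_start;"
</pre>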
* may be related to the same problem that caused joining on the wrong columns (#902)
* the problem is not necessarily caused by #884, because sort temp files consume disk space only in proportion to the table size, which for a small import would not be nearly enough to fill the disk
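to check where the space is going while an import runs, Postgres's temp-file statistics and the on-disk sort temp directory can both be watched (a sketch; the database name @vegbien@, the 9.3/@main@ cluster and the default Ubuntu data-directory location are assumptions):
<pre>
# cumulative temp-file usage per database since the last statistics reset
psql vegbien -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_space
	FROM pg_stat_database ORDER BY temp_bytes DESC;"
# current size of the sort temp directory
sudo du -sh /var/lib/postgresql/9.3/main/base/pgsql_tmp
</pre>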
h3. fix
h3(. VM rollback approach
# restore vegbiendev to last working configuration
# upgrade Postgres (see the sketch after this list)
# see if this reproduces the problem
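a minimal sketch of the Postgres upgrade step, assuming the stock Ubuntu packaging and an existing @main@ cluster (the version numbers are placeholders):
<pre>
sudo apt-get update
sudo apt-get install postgresql-9.3   # install the newer major version alongside the old one
sudo pg_dropcluster --stop 9.3 main   # drop the empty cluster that the package creates
sudo pg_upgradecluster 9.1 main       # migrate the existing cluster to the newly installed version
</pre>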
h3(. Postgres rollback approach
# roll back Postgres to the version it was at in the last successful import
** this may require building Postgres from source, because past _revisions_ of the same numeric version might only be available in version control, not in binary form via apt-get (which numbers packages by numeric version); see the sketch after this list
** if this isn't possible, it may be necessary to downgrade to Postgres 9.2 (which will unfortunately be missing some features that we now use)
# see if this fixes the problem
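a rough sketch of how the rollback could be attempted (the package name, version numbers and tag are placeholders; whether the needed revision is still available as a binary is exactly the open question above):
<pre>
# check which package revisions of the current major version apt still offers
apt-cache policy postgresql-9.3
# if the needed revision is listed, pin it explicitly
sudo apt-get install postgresql-9.3=<revision>
# otherwise, build that exact revision from the upstream version-control history
git clone git://git.postgresql.org/git/postgresql.git
cd postgresql
git checkout REL9_3_4   # or the specific commit that the last successful import ran on
./configure && make && sudo make install
</pre>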
h3(. clean VM approach
# install the database from scratch on a clean VM (VirtualBox); a sketch of standing up such a VM follows this list
** this would involve adding any missing dependencies to our install scripts
# test the import in the clean VM with a sample datasource to see if that reproduces the problem
** if it does, we know it's a bug in Postgres/Ubuntu and can troubleshoot using VM images with different Postgres/Ubuntu versions
** if it doesn't, the problem is specific to the vegbiendev VM, and we would reset vegbiendev to a clean Ubuntu install and reinstall our dependencies
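a minimal sketch of standing up such a clean VM, using Vagrant on top of VirtualBox (the use of Vagrant and the box name are assumptions; @ubuntu/trusty64@ is the stock Ubuntu 14.04 image):
<pre>
vagrant init ubuntu/trusty64 && vagrant up && vagrant ssh
# inside the VM: install the stock Ubuntu Postgres, then run our install scripts and a sample-datasource import
sudo apt-get update && sudo apt-get install postgresql
</pre>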