Task #887: fix disk space leak that fills the disk and crashes the import - BIEN 3 - NCEAS Projects

Actions

Copy link

Task #887

closed

fix disk space leak that fills the disk and crashes the import

Added by Aaron Marcuse-Kubitza almost 11 years ago. Updated almost 8 years ago.

Status:

Rejected

Priority:

Normal

Assignee:

Aaron Marcuse-Kubitza

Start date:

Due date:

% Done:

100%

Estimated time:

(Total: 0.00 h)

Activity type:

Description

the bug that triggers this Postgres bug (#902) has now been fixed, so no need to fix this

issue¶

in the last full-database import, this caused disk space errors in 29 of 41 datasources:

ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk
export version=r13016
grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql
# and uniqify by datasource

in a test run, importing even moderate-sized datasources (eg. NY, SALVIAS) causes all the available disk space to be used up, and crashes the import:

ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk
export version=test_import
grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql
# the list includes all the datasources in the test run: NY, SALVIAS

because the test run crashed as well, the problem is most likely a bug in Postgres itself. this unfortunately means that we can't run the import until we find the system bug that is causing the problem.
- based on the testing done for issue #902 (join columns), this is almost certainly not an Ubuntu bug, because it also occurs on Mac OS X
the problem still occurs after upgrading to Ubuntu 14.04, both with the compatibility version of Postgres we had been using before the upgrade as well as with the official Ubuntu version supplied with Ubuntu 14.04

broken queries¶

the disk fills up during queries that normally run quickly but now seem to run indefinitely until the disk space is used up (note the query_start times above each query, compared to when the snapshot was taken):

pg_stat_activity snapshot taken: "Apr  2 11:42:26 2014" 

running for 5.5 h:
"query_start" 
"2014-04-02 06:20:01.952174-07" 
/*NVS.StemObservation*/    CREATE TEMP TABLE "stemobservation_pkeys" AS
SELECT
"in#11"."row_num" 
, "stemobservation"."stemobservation_id" AS "out.stemobservation_id" 
FROM "in#11" 
JOIN "stemobservation" ON COALESCE("stemobservation"."height_m", CAST('NaN' AS double precision)) = COALESCE("in#11"."NVS.StemObservation.Height::double", CAST('NaN' AS double precision))
/* EXPLAIN:
Merge Join  (cost=1045217.21..95909185.15 rows=6323041431 width=8)
  Merge Cond: ((COALESCE(stemobservation.height_m, 'NaN'::double precision)) = (COALESCE("in#11"."NVS.StemObservation.Height::double", 'NaN'::double precision)))
  ->  Sort  (cost=30039.56..30492.76 rows=181279 width=12)
        Sort Key: (COALESCE(stemobservation.height_m, 'NaN'::double precision))
        ->  Seq Scan on stemobservation  (cost=0.00..14206.79 rows=181279 width=12)
  ->  Materialize  (cost=1015177.64..1050057.81 rows=6976033 width=12)
        ->  Sort  (cost=1015177.64..1032617.73 rows=6976033 width=12)
              Sort Key: (COALESCE("in#11"."NVS.StemObs
*/

running for 21 h:
"query_start" 
"2014-04-01 14:56:05.573418-07" 
/*CVS.taxon_observation.***/    CREATE TEMP TABLE "location_pkeys" AS
SELECT
"in#5"."row_num" 
, "location"."location_id" AS "out.location_id" 
FROM "in#5" 
JOIN "location" ON ("location"."accesslevel" = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" OR ("location"."accesslevel" IS NULL AND "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL))
/* EXPLAIN:
Nested Loop  (cost=0.00..83049883712.45 rows=27683225000 width=8)
  Join Filter: ((location.accesslevel = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel") OR ((location.accesslevel IS NULL) AND ("in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL)))
  ->  Seq Scan on location  (cost=0.00..175778.45 rows=5536645 width=8)
  ->  Materialize  (cost=0.00..35434.00 rows=1000000 width=8)
        ->  Seq Scan on "in#5"  (cost=0.00..30434.00 rows=1000000 width=8)
*/

config: Ubuntu 14.04 with official Ubuntu version of Postgres

import command:
declare -ax inputs=(inputs/{ARIZ,SALVIAS}/)
export version=test_import
. bin/import_all

pg_stat_activity snapshot taken: "2014-04-23 19:20:59.165896-07" 

running for 2 h:
"query_start" 
"2014-04-23 17:23:43.400618-07" 
/*ARIZ.omoccurrences*/    CREATE TEMP TABLE "location_pkeys" AS
SELECT
"in#6"."occurrenceID" 
, "location"."location_id" AS "out.location_id" 
FROM "in#6" 
JOIN "location" ON COALESCE("location"."elevation_m", CAST('NaN' AS double precision)) = COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", CAST('NaN' AS double precision))
/* EXPLAIN:
Merge Join  (cost=52039.20..3508151.98 rows=230343442 width=8)
  Merge Cond: ((COALESCE(location.elevation_m, 'NaN'::double precision)) = (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision)))
  ->  Sort  (cost=23780.86..24261.44 rows=192230 width=12)
        Sort Key: (COALESCE(location.elevation_m, 'NaN'::double precision))
        ->  Seq Scan on location  (cost=0.00..6910.30 rows=192230 width=12)
  ->  Sort  (cost=28258.34..28857.47 rows=239654 width=12)
        Sort Key: (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision))
        -
*/

uses >350 GB for sort temp storage (disk space increases until out of space):

# before stopping import (160 GB free):
df
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vegbiendev-root  2.0T  1.8T  160G  92% /
none                         4.0K     0  4.0K   0% /sys/fs/cgroup
udev                          16G  4.0K   16G   1% /dev
tmpfs                        3.2G  400K  3.2G   1% /run
none                         5.0M     0  5.0M   0% /run/lock
none                          16G  152K   16G   1% /run/shm
none                         100M     0  100M   0% /run/user
/dev/vda1                    485M   66M  394M  15% /boot

# after stopping import (516 GB free):
df
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vegbiendev-root  2.0T  1.4T  516G  73% /
none                         4.0K     0  4.0K   0% /sys/fs/cgroup
udev                          16G  4.0K   16G   1% /dev
tmpfs                        3.2G  400K  3.2G   1% /run
none                         5.0M     0  5.0M   0% /run/lock
none                          16G  152K   16G   1% /run/shm
none                         100M     0  100M   0% /run/user
/dev/vda1                    485M   66M  394M  15% /boot

may be related to the same problem that caused joining on the wrong columns (#902)
the problem is not necessarily caused by #884, because sort temp files consume disk space only in proportion to the table size, which for a small import would not be nearly enough to fill the disk

fix¶

see proposed steps in #905

Subtasks 2 (0 open — 2 closed)

Actions

Copy link

Updated by Aaron Marcuse-Kubitza almost 11 years ago

Description updated (diff)

Actions

Copy link

Updated by Aaron Marcuse-Kubitza almost 11 years ago

Description updated (diff)

Actions

Copy link

Updated by Aaron Marcuse-Kubitza almost 11 years ago

Subject changed from fix Postgres bug that causes all the available disk space to be used up to fix disk space leak that causes all the available disk space to be used up
Description updated (diff)

Actions

Copy link

Updated by Aaron Marcuse-Kubitza almost 11 years ago

Description updated (diff)

Actions

Copy link

Updated by Aaron Marcuse-Kubitza almost 11 years ago

Description updated (diff)

Actions

Copy link

Updated by Aaron Marcuse-Kubitza almost 11 years ago

Subject changed from fix disk space leak that causes all the available disk space to be used up to fix disk space leak that fills the disk and crashes the import

Actions

Copy link