Task #887: fix disk space leak that fills the disk and crashes the import - BIEN 3 - NCEAS Projects

Task #887

Updated by Aaron Marcuse-Kubitza over 10 years ago

h3. issue 

 * in the last full-database import, this caused disk space errors in 29 of [[VegBIEN_contents#datasources|41 datasources]]: 
 <pre> 
 ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk 
 export version=r13016 
 grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql 
 # and uniqify by datasource 
 </pre> 
 * in a test run, importing even moderate-sized datasources (eg. NY, SALVIAS) causes all the available disk space to be used up, and crashes the import: 
 <pre> 
 ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk 
 export version=test_import 
 grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql 
 # the list includes all the datasources in the test run: NY, SALVIAS 
 </pre> 
 * because the test run crashed as well, the problem is most likely a bug in Postgres or Linux itself. this unfortunately means that we *can't run the import until we find the Postgres/Linux bug* that is causing the problem. 

 h3((. h3. broken queries 

 * the disk fills up during queries that normally run quickly but now seem to run indefinitely until the disk space is used up (note the @query_start@ times above each query, compared to when the snapshot was taken): 
 <pre><code class="SQL"> 
 pg_stat_activity snapshot taken: "Apr    2 11:42:26 2014" 

 running for 5.5 h: 
 "query_start" 
 "2014-04-02 06:20:01.952174-07" 
 /*NVS.StemObservation*/ 	 CREATE TEMP TABLE "stemobservation_pkeys" AS 
 SELECT 
 "in#11"."row_num" 
 , "stemobservation"."stemobservation_id" AS "out.stemobservation_id" 
 FROM "in#11" 
 JOIN "stemobservation" ON COALESCE("stemobservation"."height_m", CAST('NaN' AS double precision)) = COALESCE("in#11"."NVS.StemObservation.Height::double", CAST('NaN' AS double precision)) 
 /* EXPLAIN: 
 Merge Join    (cost=1045217.21..95909185.15 rows=6323041431 width=8) 
   Merge Cond: ((COALESCE(stemobservation.height_m, 'NaN'::double precision)) = (COALESCE("in#11"."NVS.StemObservation.Height::double", 'NaN'::double precision))) 
   ->    Sort    (cost=30039.56..30492.76 rows=181279 width=12) 
         Sort Key: (COALESCE(stemobservation.height_m, 'NaN'::double precision)) 
         ->    Seq Scan on stemobservation    (cost=0.00..14206.79 rows=181279 width=12) 
   ->    Materialize    (cost=1015177.64..1050057.81 rows=6976033 width=12) 
         ->    Sort    (cost=1015177.64..1032617.73 rows=6976033 width=12) 
               Sort Key: (COALESCE("in#11"."NVS.StemObs 
 */ 

 running for 21 h: 
 "query_start" 
 "2014-04-01 14:56:05.573418-07" 
 /*CVS.taxon_observation.***/ 	 CREATE TEMP TABLE "location_pkeys" AS 
 SELECT 
 "in#5"."row_num" 
 , "location"."location_id" AS "out.location_id" 
 FROM "in#5" 
 JOIN "location" ON ("location"."accesslevel" = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" OR ("location"."accesslevel" IS NULL AND "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL)) 
 /* EXPLAIN: 
 Nested Loop    (cost=0.00..83049883712.45 rows=27683225000 width=8) 
   Join Filter: ((location.accesslevel = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel") OR ((location.accesslevel IS NULL) AND ("in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL))) 
   ->    Seq Scan on location    (cost=0.00..175778.45 rows=5536645 width=8) 
   ->    Materialize    (cost=0.00..35434.00 rows=1000000 width=8) 
         ->    Seq Scan on "in#5"    (cost=0.00..30434.00 rows=1000000 width=8) 
 */ 
 </code></pre><pre><code class="SQL"> 
 config: Ubuntu 14.04 with official Ubuntu version of Postgres 
 pg_stat_activity snapshot taken: "2014-04-23 19:20:59.165896-07" 

 running for 2 h: 
 "query_start" 
 "2014-04-23 17:23:43.400618-07" 
 /*ARIZ.omoccurrences*/ 	 CREATE TEMP TABLE "location_pkeys" AS 
 SELECT 
 "in#6"."occurrenceID" 
 , "location"."location_id" AS "out.location_id" 
 FROM "in#6" 
 JOIN "location" ON COALESCE("location"."elevation_m", CAST('NaN' AS double precision)) = COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", CAST('NaN' AS double precision)) 
 /* EXPLAIN: 
 Merge Join    (cost=52039.20..3508151.98 rows=230343442 width=8) 
   Merge Cond: ((COALESCE(location.elevation_m, 'NaN'::double precision)) = (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision))) 
   ->    Sort    (cost=23780.86..24261.44 rows=192230 width=12) 
         Sort Key: (COALESCE(location.elevation_m, 'NaN'::double precision)) 
         ->    Seq Scan on location    (cost=0.00..6910.30 rows=192230 width=12) 
   ->    Sort    (cost=28258.34..28857.47 rows=239654 width=12) 
         Sort Key: (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision)) 
         - 
 */ 

 uses >350 GB for sort temp storage (disk space increases until out of space): 

 # before stopping import (160 GB free): 
 df 
 Filesystem                     Size    Used Avail Use% Mounted on 
 /dev/mapper/vegbiendev-root    2.0T    1.8T    160G    92% / 
 none                           4.0K       0    4.0K     0% /sys/fs/cgroup 
 udev                            16G    4.0K     16G     1% /dev 
 tmpfs                          3.2G    400K    3.2G     1% /run 
 none                           5.0M       0    5.0M     0% /run/lock 
 none                            16G    152K     16G     1% /run/shm 
 none                           100M       0    100M     0% /run/user 
 /dev/vda1                      485M     66M    394M    15% /boot 

 # after stopping import (516 GB free): 
 df 
 Filesystem                     Size    Used Avail Use% Mounted on 
 /dev/mapper/vegbiendev-root    2.0T    1.4T    516G    73% / 
 none                           4.0K       0    4.0K     0% /sys/fs/cgroup 
 udev                            16G    4.0K     16G     1% /dev 
 tmpfs                          3.2G    400K    3.2G     1% /run 
 none                           5.0M       0    5.0M     0% /run/lock 
 none                            16G    152K     16G     1% /run/shm 
 none                           100M       0    100M     0% /run/user 
 /dev/vda1                      485M     66M    394M    15% /boot 
 </code></pre> 

 * may be related to the same problem that caused joining on the wrong columns (#902) 
 * the problem is not necessarily caused by #884, because sort temp files consume disk space only in proportion to the table size, which for a small import would not be nearly enough to fill the disk 

 h3. fix 

 try each of the following, in order: 

 h3(. alternate OS approach 

 # test the import on Mac (eg. the local testing machine, which already has most of the dependencies) 
 # if that doesn't work, try the clean VM approach below 

 h3(. clean VM approach 

 # install the database from scratch on a clean VM (VirtualBox) 
 ** this would involve adding any missing dependencies to our install scripts 
 # test the import in the clean VM with a sample datasource to see if that reproduces the problem 
 ** if it does, we know it's a bug in Postgres/Ubuntu and can troubleshoot using VM images with different Postgres/Ubuntu versions 
 ** if it doesn't, it's a problem specific to just the vegbiendev VM and we would reset the vegbiendev VM to a clean Ubuntu install and reinstall our dependencies 

 h3(. Postgres rollback approach 

 # roll back Postgres to the version it was at in the last successful import 
 ** this may require building Postgres from source, because past _revisions_ of the same numeric version might only be available in version control, not in binary form via apt-get (which numbers packages by numeric version) 
 ** if this isn't possible, it may be necessary to downgrade to Postgres 9.2 (which will unfortunately be missing some features that we now use) 
 # see if this fixes the problem 

 h3(. VM rollback approach 

 # restore vegbiendev to last working configuration 
 # upgrade Postgres 
 # see if this reproduces the problem

Back

Project

General

Profile

Task #887