Task #887: fix disk space leak that fills the disk and crashes the import - BIEN 3 - NCEAS Projects

Task #887

Updated by Aaron Marcuse-Kubitza over 10 years ago

h3. issue 

 * in the last full-database import, this caused disk space errors in 29 of [[VegBIEN_contents#datasources|41 datasources]]: 
 <pre> 
 ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk 
 export version=r13016 
 grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql 
 # and uniqify by datasource 
 </pre> 
 * in a test run, importing even moderate-sized datasources (eg. NY, SALVIAS) causes all the available disk space to be used up, and crashes the import: 
 <pre> 
 ssh -t vegbiendev.nceas.ucsb.edu exec sudo su - aaronmk 
 export version=test_import 
 grep --files-with-matches -F "No space left on device" inputs/{.,}*/*/logs/$version.log.sql 
 # the list includes all the datasources in the test run: NY, SALVIAS 
 </pre> 
 * because the test run crashed as well, the problem is most likely a bug in Postgres or Linux itself. this unfortunately means that we *can't run the import until we find the Postgres/Linux bug* that is causing the problem. 
 * the problem still occurs after upgrading to Ubuntu 14.04, both with the compatibility version of Postgres we had been using before the upgrade as well as with the official Ubuntu version supplied with Ubuntu 14.04 

 h3((. broken queries 

 * the disk fills up during queries that normally run quickly but now seem to run indefinitely until the disk space is used up (note the @query_start@ times above each query, compared to when the snapshot was taken): 
 <pre><code class="SQL"> 
 pg_stat_activity snapshot taken: "Apr    2 11:42:26 2014" 

 running for 5.5 h: 
 "query_start" 
 "2014-04-02 06:20:01.952174-07" 
 /*NVS.StemObservation*/ 	 CREATE TEMP TABLE "stemobservation_pkeys" AS 
 SELECT 
 "in#11"."row_num" 
 , "stemobservation"."stemobservation_id" AS "out.stemobservation_id" 
 FROM "in#11" 
 JOIN "stemobservation" ON COALESCE("stemobservation"."height_m", CAST('NaN' AS double precision)) = COALESCE("in#11"."NVS.StemObservation.Height::double", CAST('NaN' AS double precision)) 
 /* EXPLAIN: 
 Merge Join    (cost=1045217.21..95909185.15 rows=6323041431 width=8) 
   Merge Cond: ((COALESCE(stemobservation.height_m, 'NaN'::double precision)) = (COALESCE("in#11"."NVS.StemObservation.Height::double", 'NaN'::double precision))) 
   ->    Sort    (cost=30039.56..30492.76 rows=181279 width=12) 
         Sort Key: (COALESCE(stemobservation.height_m, 'NaN'::double precision)) 
         ->    Seq Scan on stemobservation    (cost=0.00..14206.79 rows=181279 width=12) 
   ->    Materialize    (cost=1015177.64..1050057.81 rows=6976033 width=12) 
         ->    Sort    (cost=1015177.64..1032617.73 rows=6976033 width=12) 
               Sort Key: (COALESCE("in#11"."NVS.StemObs 
 */ 

 running for 21 h: 
 "query_start" 
 "2014-04-01 14:56:05.573418-07" 
 /*CVS.taxon_observation.***/ 	 CREATE TEMP TABLE "location_pkeys" AS 
 SELECT 
 "in#5"."row_num" 
 , "location"."location_id" AS "out.location_id" 
 FROM "in#5" 
 JOIN "location" ON ("location"."accesslevel" = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" OR ("location"."accesslevel" IS NULL AND "in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL)) 
 /* EXPLAIN: 
 Nested Loop    (cost=0.00..83049883712.45 rows=27683225000 width=8) 
   Join Filter: ((location.accesslevel = "in#5"."CVS.taxon_observation.**.accessRights::accesslevel") OR ((location.accesslevel IS NULL) AND ("in#5"."CVS.taxon_observation.**.accessRights::accesslevel" IS NULL))) 
   ->    Seq Scan on location    (cost=0.00..175778.45 rows=5536645 width=8) 
   ->    Materialize    (cost=0.00..35434.00 rows=1000000 width=8) 
         ->    Seq Scan on "in#5"    (cost=0.00..30434.00 rows=1000000 width=8) 
 */ 
 </code></pre><pre><code class="SQL"> 
 config: Ubuntu 14.04 with official Ubuntu version of Postgres 
 pg_stat_activity snapshot taken: "2014-04-23 19:20:59.165896-07" 

 running for 2 h: 
 "query_start" 
 "2014-04-23 17:23:43.400618-07" 
 /*ARIZ.omoccurrences*/ 	 CREATE TEMP TABLE "location_pkeys" AS 
 SELECT 
 "in#6"."occurrenceID" 
 , "location"."location_id" AS "out.location_id" 
 FROM "in#6" 
 JOIN "location" ON COALESCE("location"."elevation_m", CAST('NaN' AS double precision)) = COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", CAST('NaN' AS double precision)) 
 /* EXPLAIN: 
 Merge Join    (cost=52039.20..3508151.98 rows=230343442 width=8) 
   Merge Cond: ((COALESCE(location.elevation_m, 'NaN'::double precision)) = (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision))) 
   ->    Sort    (cost=23780.86..24261.44 rows=192230 width=12) 
         Sort Key: (COALESCE(location.elevation_m, 'NaN'::double precision)) 
         ->    Seq Scan on location    (cost=0.00..6910.30 rows=192230 width=12) 
   ->    Sort    (cost=28258.34..28857.47 rows=239654 width=12) 
         Sort Key: (COALESCE("in#6"."_alt(1=verbatimElevation, 2=_avg(1=minimumElevat.result::double", 'NaN'::double precision)) 
         - 
 */ 

 uses >350 GB for sort temp storage (disk space increases until out of space): 

 # before stopping import (160 GB free): 
 df 
 Filesystem                     Size    Used Avail Use% Mounted on 
 /dev/mapper/vegbiendev-root    2.0T    1.8T    160G    92% / 
 none                           4.0K       0    4.0K     0% /sys/fs/cgroup 
 udev                            16G    4.0K     16G     1% /dev 
 tmpfs                          3.2G    400K    3.2G     1% /run 
 none                           5.0M       0    5.0M     0% /run/lock 
 none                            16G    152K     16G     1% /run/shm 
 none                           100M       0    100M     0% /run/user 
 /dev/vda1                      485M     66M    394M    15% /boot 

 # after stopping import (516 GB free): 
 df 
 Filesystem                     Size    Used Avail Use% Mounted on 
 /dev/mapper/vegbiendev-root    2.0T    1.4T    516G    73% / 
 none                           4.0K       0    4.0K     0% /sys/fs/cgroup 
 udev                            16G    4.0K     16G     1% /dev 
 tmpfs                          3.2G    400K    3.2G     1% /run 
 none                           5.0M       0    5.0M     0% /run/lock 
 none                            16G    152K     16G     1% /run/shm 
 none                           100M       0    100M     0% /run/user 
 /dev/vda1                      485M     66M    394M    15% /boot 
 </code></pre> 

 * may be related to the same problem that caused joining on the wrong columns (#902) 
 * the problem is not necessarily caused by #884, because sort temp files consume disk space only in proportion to the table size, which for a small import would not be nearly enough to fill the disk 

 h3. fix 

 try each of the following, in order: 

 h3(. alternate OS approach 

 # test the import on Mac (eg. the local testing machine, which already has most of the dependencies) 
 # if that doesn't work, try the clean VM approach below 

 h3(. clean VM approach 

 # install the database from scratch on a clean VM (VirtualBox) 
 ** this would involve adding any missing dependencies to our install scripts 
 # test the import in the clean VM with a sample datasource to see if that reproduces the problem 
 ** if it does, we know it's a bug in Postgres/Ubuntu and can troubleshoot using VM images with different Postgres/Ubuntu versions 
 ** if it doesn't, it's a problem specific to just the vegbiendev VM and we would reset the vegbiendev VM to a clean Ubuntu install and reinstall our dependencies 

 h3(. Postgres rollback approach 

 # roll back Postgres to the version it was at in the last successful import 
 ** this may require building Postgres from source, because past _revisions_ of the same numeric version might only be available in version control, not in binary form via apt-get (which numbers packages by numeric version) 
 ** if this isn't possible, it may be necessary to downgrade to Postgres 9.2 (which will unfortunately be missing some features that we now use) 
 # see if this fixes the problem 

 h3(. VM rollback approach 

 # restore vegbiendev to last working configuration 
 # upgrade Postgres 
 # see if this reproduces the problem

Back

Project

General

Profile

Task #887