Task #905
Updated by Aaron Marcuse-Kubitza over 10 years ago
_see #887, #902_

h3. -alternate OS approach-

p(. _tried, and the problem also occurs on Mac, so using other approaches_

# -fix @make install@, which sets up the entire BIEN installation and dependencies-
# -test the import on the local testing machine (a Mac), which already has most of the dependencies-
# -if that doesn't work, try the other approaches below-

h3. -working-backup approach-

p(. _restoring to a working backup allows us to successfully run the import, so that we can then test system changes to see what broke things_

# -restore vegbiendev to the last working configuration-
# -get the restored VM to work on VirtualBox-
## -install bootloader-
## -install device drivers- _the Linux VM configuration does not support the VirtualBox ethernet device natively, so it must be configured manually_
# -set up the database configuration[1] (the files and data that are not part of a Postgres-level backup)-
# -install the database contents[1]- there are 2 possible strategies for this:
** -a) fix the bugs in the version of @make install@ available in that backup, and run it to build the database from the flat files. the advantage of this approach is that @make install@ is very fast (~15 min) when restoring only a single datasource in addition to the metadata datasources, while @pg_restore@ would take a day or more, or require a custom script to do a selective restore.- _not using this method_
** -b) find the nearest database backup, restore it, and undo/redo the modifications made between the time of the VM backup and the time of the database backup. this involves either restoring the full DB (~1 day on vegbiendev, likely slower on an external USB drive) or writing a script to do a selective restore of the data needed for the test import, and then figuring out which modifications were made to the DB before/since the VM backup. (see "README.TXT":http://vegpath.org/fs/config/VirtualBox_VMs/vegbiendev/README.TXT > to set up the VM)-
# -test the import and verify that it works-
# -perform system upgrades and see if the import still works (see "README.TXT":http://vegpath.org/fs/config/VirtualBox_VMs/vegbiendev/README.TXT > to install system updates)- _the import does still work[2], so the join columns bug (#902) is fortunately not in a system upgrade._
** -if not, selectively install system upgrades to narrow down the problem- _N/A_
# -update svn and see if the import still works (see "README.TXT":http://vegpath.org/fs/config/VirtualBox_VMs/vegbiendev/README.TXT > to test a different revision)- _it doesn't, so the bug is somewhere in our code_
** -if not, test different revisions to narrow down the problem- _the problem was caused by r12153, so debugging can be limited to the code affected by that commit_

fn1(. the backups from that time did not include the database, because of "incorrect information in the Postgres documentation":http://www.postgresql.org/docs/9.3/static/backup-file.html stating that a filesystem-level backup of the live database would not be useful (in fact, Postgres usually keeps around enough WAL files[3], and can be configured to keep even more of them, so a filesystem-level backup would work perfectly well in most cases). this incorrect information (essentially a disclaimer) is likely there because the Postgres developers do not want users upset at them if their filesystem-level backup happens to be unrestorable.
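As a side note to footnote 1: on a Postgres 9.3 server, the WAL-retention settings that determine whether a filesystem-level copy of the live cluster is likely to be restorable can be inspected with a query along these lines (an illustrative sketch against the standard @pg_settings@ view, not the actual vegbiendev configuration):

<pre><code class="SQL">
-- show how much WAL the server keeps around; raising wal_keep_segments
-- (in postgresql.conf) retains more segments in pg_xlog, making a
-- filesystem-level copy of the live cluster more likely to be restorable
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_segments', 'checkpoint_segments',
               'archive_mode', 'archive_command');
</code></pre>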
fn2(. note that there is a slowdown, where the import freezes for 25 min on a query before resuming (see the @pg_stat_activity@ sketch after the footnotes):
<pre><code class="SQL">
pg_stat_activity snapshot taken: "15:34:51"
"query_start"   "2014-07-14 14:56:59.482365-07"
"backend_start" "2014-07-14 14:43:41.845428-07"

query running for 25 min ("0:23:56.119442" in inputs/ARIZ/omoccurrences/logs/test_import.log.sql,
but strangely 40 min ("14:56:59".."15:34:51") according to pg_stat_activity)

[2] DB query: non-cacheable: Took 0:23:56.119442 sec
/*ARIZ.omoccurrences*/ CREATE TEMP TABLE "coordinates_pkeys" AS
SELECT
    "in#41"."*occurrenceID"
    , "coordinates"."coordinates_id" AS "out.coordinates_id"
FROM "in#41"
JOIN "coordinates" ON
    "coordinates"."source_id" = 5
    AND "coordinates"."latitude_deg" = "in#41"."_nullIf(decimalLatitude).result"
    AND "coordinates"."longitude_deg" = "in#41"."_nullIf(decimalLongitude).result"
    AND COALESCE("coordinates"."verbatimlatitude", CAST('\N' AS text)) = COALESCE(NULL, CAST('\N' AS text))
    AND COALESCE("coordinates"."verbatimlongitude", CAST('\N' AS text)) = COALESCE(NULL, CAST('\N' AS text))
    AND COALESCE("coordinates"."verbatimcoordinates", CAST('\N' AS text)) = COALESCE("in#41"."ARIZ.omoccurrences.verbatimCoordinates", CAST('\N' AS text))
    AND COALESCE("coordinates"."footprintgeometry_dwc", CAST('\N' AS text)) = COALESCE("in#41"."ARIZ.omoccurrences.footprintWKT", CAST('\N' AS text))
/* EXPLAIN:
Nested Loop  (cost=5333.13..5366.19 rows=1 width=8)
  ->  Index Scan using coordinates_unique on coordinates  (cost=0.41..18.01 rows=1 width=84)
        Index Cond: ((source_id = 5) AND (COALESCE(verbatimlatitude, '\N'::text) = '\N'::text) AND (COALESCE(verbatimlongitude, '\N'::text) = '\N'::text))
  ->  Bitmap Heap Scan on "in#41"  (cost=5332.72..5348.17 rows=1 width=73)
        Recheck Cond: ((COALESCE("ARIZ.omoccurrences.verbatimCoordinates", '\N'::text) = COALESCE(coordinates.verbatimcoordinates, '\N'::text)) AND (COALESCE("ARIZ.omoccurrences.footprintWKT", '\N'::text) = COALESCE(coordinates.footprintgeometry_dwc, '\N'::text)))
        Filter: ((coordinates.latitude_deg = "_nullIf(decimalLatitude).result") AND (coordinates.longitude_deg = "_nullIf(decimalLongitude).result"))
        ->  Bitmap Index Scan on "in#41_coalesce_coalesce1_coalesce2_coalesce3_idx"  (cost=0.00..5332.72 rows=4 width=0)
              Index Cond: ((COALESCE("ARIZ.omoccurrences.verbatimCoordinates", '\N'::text) = COALESCE(coordinates.verbatimcoordinates, '\N'::text)) AND (COALESCE("ARIZ.omoccurrences.footprintWKT", '\N'::text) = COALESCE(coordinates.footprintgeometry_dwc, '\N'::text)))
*/
</code></pre>

fn3(. except when the database is under very heavy load, such as at the beginning of a full-database import
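The timing figures in footnote 2 came from @pg_stat_activity@; a query along the following lines (a sketch using the standard Postgres 9.2+ column names, not the exact query that produced the snapshot above) reports how long each active statement has been running:

<pre><code class="SQL">
-- list active statements with their run time, to spot queries that have
-- been executing far longer than expected (such as the 25-min freeze above)
SELECT pid,
       now() - query_start   AS query_runtime,
       now() - backend_start AS backend_age,
       left(query, 100)      AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;
</code></pre>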
h3. tracing approach

# attempt to trace where the incorrect columns list is being introduced
** it is generally time-consuming to trace data through complex code
** narrowing down the affected section of code (using the working-backup approach) would reduce the amount of code that needs to be traced

h3. rewriting approach

# rewrite the section of code containing the bug
** in this case, that code section is the most complex part of our codebase, so this would be potentially very time-consuming
** narrowing down the affected section of code (using the working-backup approach) would reduce the amount of code that needs to be rewritten

---

h3. -clean VM approach-

p(. _not using this approach because we have a backup from the time of the last successful import, so we don't need to obtain/install the dependencies in the versions they had at the last successful import_

# -prepare clean VMs-
# -fix @make install@, which sets up the entire BIEN installation and dependencies-
** normally, we do not reinstall the DB from scratch, so the bugs in @make install@ only become apparent when it is run on a partial installation
# install the database from scratch on a clean VM (VirtualBox)
** this would involve adding any missing dependencies to our install scripts
# test the import in the clean VM with a sample datasource to see if that reproduces the problem
** if it does, we know it's a bug in Postgres/Ubuntu and can troubleshoot using VM images with different Postgres/Ubuntu versions
** if it doesn't, it's a problem specific to the vegbiendev VM, and we would reset the vegbiendev VM to a clean Ubuntu install and reinstall our dependencies

h3. -Postgres rollback approach-

p(. _not using this approach because it only works if the problem is with Postgres, but it could also be a problem in the other import code (e.g. Python)_

# roll back Postgres to the version it was at in the last successful import (see the version-check sketch below)
** this may require building Postgres from source, because past _revisions_ of the same numeric version might only be available in version control, not in binary form via apt-get (which numbers packages by numeric version)
** if this isn't possible, it may be necessary to downgrade to Postgres 9.2 (which will unfortunately be missing some features that we now use)
# see if this fixes the problem
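Whichever rollback option were chosen, it would help to first record the exact server version the failing import is running against, so that the rollback target (or the past revision to build from source) is unambiguous; a minimal check using standard Postgres commands:

<pre><code class="SQL">
-- record the exact version before downgrading; version() includes the
-- point release and build details, SHOW gives just the numeric version
SELECT version();
SHOW server_version;
</code></pre>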