Task #905
Updated by Aaron Marcuse-Kubitza over 10 years ago
_see #887, #902_

h3. -alternate OS approach-

p(. _tried, and the problem also occurs on Mac, so using other approaches_

# -fix @make install@, which sets up the entire BIEN installation and dependencies-
# -test the import on the local testing machine (a Mac), which already has most of the dependencies-
# -if that doesn't work, try the other approaches below-

h3. -working-backup approach-

p(. _restoring to a working backup allows us to successfully run the import, so that we can then test system changes to see what broke things_

# -restore vegbiendev to the last working configuration-
# -get the restored VM to work on VirtualBox-
## -install bootloader-
## -install device drivers- _the Linux VM configuration does not support the VirtualBox ethernet device natively, so it must be configured manually_
# -set up the database configuration[1] (the files and data that are not part of a Postgres-level backup)-
# -install the database contents[1]- there are 2 possible strategies for this:
** -a) fix the bugs in the version of @make install@ available in that backup, and run it to build the database from the flat files. the advantage of this approach is that @make install@ is very fast (~15 min) when restoring only a single datasource in addition to the metadata datasources, while @pg_restore@ would take a day or more, or require a custom script to do a selective restore.- _not using this method_
** -b) find the nearest database backup, restore it, and undo/redo the modifications made between the time of the VM backup and the time of the database backup. this involves either restoring the full DB (~1 day on vegbiendev, likely slower on an external USB drive) or writing a script to do a selective restore of the data needed for the test import, and then figuring out which modifications were made to the DB before/since the VM backup. (see "README.TXT":http://vegpath.org/fs/config/VirtualBox_VMs/vegbiendev/README.TXT > to set up the VM)-
# -test the import and verify that it works-
# -perform system upgrades and see if the import still works (see "README.TXT":http://vegpath.org/fs/config/VirtualBox_VMs/vegbiendev/README.TXT > to install system updates)- _the import does still work[2], so the join columns bug (#902) is fortunately not in a system upgrade._
** -if not, selectively install system upgrades to narrow down the problem- _N/A_
# -update svn and see if the import still works (see "README.TXT":http://vegpath.org/fs/config/VirtualBox_VMs/vegbiendev/README.TXT > to test a different revision)- _it doesn't, so the bug is somewhere in our code_
** -if not, test different revisions to narrow down the problem- _the problem was caused by r12153, so debugging can be limited to the code affected by that commit_

fn1(. the backups from that time did not include the database, because of "incorrect information in the Postgres documentation":http://www.postgresql.org/docs/9.3/static/backup-file.html stating that a filesystem-level backup of the live database would not be useful (in fact, Postgres usually keeps around enough WAL files[3], and can be configured to keep even more of them, so a filesystem-level backup would work perfectly well in most cases). this incorrect information (essentially a disclaimer) is likely there because the Postgres developers do not want users upset at them if their filesystem-level backup happens to be unrestorable.
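As a side note to footnote 1: on a Postgres 9.3 server, the WAL-retention settings that determine whether a filesystem-level copy of the live cluster is likely to be restorable can be inspected with a query along these lines (an illustrative sketch against the standard @pg_settings@ view, not the actual vegbiendev configuration):

<pre><code class="SQL">
-- show how much WAL the server keeps around; raising wal_keep_segments
-- (in postgresql.conf) retains more segments in pg_xlog, making a
-- filesystem-level copy of the live cluster more likely to be restorable
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('wal_keep_segments', 'checkpoint_segments',
               'archive_mode', 'archive_command');
</code></pre>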
fn2(. note that there is a slowdown, where the import freezes for 25 min on a query before resuming (see the @pg_stat_activity@ sketch after the footnotes):
<pre><code class="SQL">
pg_stat_activity snapshot taken: "15:34:51"
"query_start"   "2014-07-14 14:56:59.482365-07"
"backend_start" "2014-07-14 14:43:41.845428-07"

query running for 25 min ("0:23:56.119442" in inputs/ARIZ/omoccurrences/logs/test_import.log.sql,
but strangely 40 min ("14:56:59".."15:34:51") according to pg_stat_activity)

[2] DB query: non-cacheable: Took 0:23:56.119442 sec
/*ARIZ.omoccurrences*/ CREATE TEMP TABLE "coordinates_pkeys" AS
SELECT
    "in#41"."*occurrenceID"
    , "coordinates"."coordinates_id" AS "out.coordinates_id"
FROM "in#41"
JOIN "coordinates" ON
    "coordinates"."source_id" = 5
    AND "coordinates"."latitude_deg" = "in#41"."_nullIf(decimalLatitude).result"
    AND "coordinates"."longitude_deg" = "in#41"."_nullIf(decimalLongitude).result"
    AND COALESCE("coordinates"."verbatimlatitude", CAST('\N' AS text)) = COALESCE(NULL, CAST('\N' AS text))
    AND COALESCE("coordinates"."verbatimlongitude", CAST('\N' AS text)) = COALESCE(NULL, CAST('\N' AS text))
    AND COALESCE("coordinates"."verbatimcoordinates", CAST('\N' AS text)) = COALESCE("in#41"."ARIZ.omoccurrences.verbatimCoordinates", CAST('\N' AS text))
    AND COALESCE("coordinates"."footprintgeometry_dwc", CAST('\N' AS text)) = COALESCE("in#41"."ARIZ.omoccurrences.footprintWKT", CAST('\N' AS text))
/* EXPLAIN:
Nested Loop  (cost=5333.13..5366.19 rows=1 width=8)
  ->  Index Scan using coordinates_unique on coordinates  (cost=0.41..18.01 rows=1 width=84)
        Index Cond: ((source_id = 5) AND (COALESCE(verbatimlatitude, '\N'::text) = '\N'::text) AND (COALESCE(verbatimlongitude, '\N'::text) = '\N'::text))
  ->  Bitmap Heap Scan on "in#41"  (cost=5332.72..5348.17 rows=1 width=73)
        Recheck Cond: ((COALESCE("ARIZ.omoccurrences.verbatimCoordinates", '\N'::text) = COALESCE(coordinates.verbatimcoordinates, '\N'::text)) AND (COALESCE("ARIZ.omoccurrences.footprintWKT", '\N'::text) = COALESCE(coordinates.footprintgeometry_dwc, '\N'::text)))
        Filter: ((coordinates.latitude_deg = "_nullIf(decimalLatitude).result") AND (coordinates.longitude_deg = "_nullIf(decimalLongitude).result"))
        ->  Bitmap Index Scan on "in#41_coalesce_coalesce1_coalesce2_coalesce3_idx"  (cost=0.00..5332.72 rows=4 width=0)
              Index Cond: ((COALESCE("ARIZ.omoccurrences.verbatimCoordinates", '\N'::text) = COALESCE(coordinates.verbatimcoordinates, '\N'::text)) AND (COALESCE("ARIZ.omoccurrences.footprintWKT", '\N'::text) = COALESCE(coordinates.footprintgeometry_dwc, '\N'::text)))
*/
</code></pre>

fn3(. except when the database is under very heavy load, such as at the beginning of a full-database import
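The timing figures in footnote 2 came from @pg_stat_activity@; a query along the following lines (a sketch using the standard Postgres 9.2+ column names, not the exact query that produced the snapshot above) reports how long each active statement has been running:

<pre><code class="SQL">
-- list active statements with their run time, to spot queries that have
-- been executing far longer than expected (such as the 25-min freeze above)
SELECT pid,
       now() - query_start   AS query_runtime,
       now() - backend_start AS backend_age,
       left(query, 100)      AS query_snippet
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start;
</code></pre>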
h3. tracing approach

# attempt to trace where the incorrect columns list is being introduced
** it is generally time-consuming to trace data through complex code
** narrowing down the affected section of code (using the working-backup approach) would reduce the amount of code that needs to be traced

h3. rewriting approach

# rewrite the section of code containing the bug
** in this case, that code section is the most complex part of our codebase, so this would be potentially very time-consuming
** narrowing down the affected section of code (using the working-backup approach) would reduce the amount of code that needs to be rewritten

---

h3. -clean VM approach-

p(. _not using this approach because we have a backup from the time of the last successful import, so we don't need to obtain/install the dependencies in the versions they had at the last successful import_

# -prepare clean VMs-
# -fix @make install@, which sets up the entire BIEN installation and dependencies-
** normally, we do not reinstall the DB from scratch, so the bugs in @make install@ only become apparent when it is run on a partial installation
# install the database from scratch on a clean VM (VirtualBox)
** this would involve adding any missing dependencies to our install scripts
# test the import in the clean VM with a sample datasource to see if that reproduces the problem
** if it does, we know it's a bug in Postgres/Ubuntu and can troubleshoot using VM images with different Postgres/Ubuntu versions
** if it doesn't, it's a problem specific to the vegbiendev VM, and we would reset the vegbiendev VM to a clean Ubuntu install and reinstall our dependencies

h3. -Postgres rollback approach-

p(. _not using this approach because it only works if the problem is with Postgres, but it could also be a problem in the other import code (e.g. Python)_

# roll back Postgres to the version it was at in the last successful import (see the version-check sketch below)
** this may require building Postgres from source, because past _revisions_ of the same numeric version might only be available in version control, not in binary form via apt-get (which numbers packages by numeric version)
** if this isn't possible, it may be necessary to downgrade to Postgres 9.2 (which will unfortunately be missing some features that we now use)
# see if this fixes the problem
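Whichever rollback option were chosen, it would help to first record the exact server version the failing import is running against, so that the rollback target (or the past revision to build from source) is unambiguous; a minimal check using standard Postgres commands:

<pre><code class="SQL">
-- record the exact version before downgrading; version() includes the
-- point release and build details, SHOW gives just the numeric version
SELECT version();
SHOW server_version;
</code></pre>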