Task #905
closed100%
Description
alternate OS approach¶
tried, and problem also occurs on Mac, so using other approaches
fix`make install`
, which sets up the entire BIEN installation and dependenciestest the import on the local testing machine (a Mac), which already has most of the dependenciesif that doesn't work, try the other approaches below
working-backup approach¶
restoring to a working backup allows us to successfully run the import, so that we can then test system changes to see what broke things
restore vegbiendev to last working configurationget restored VM to work on VirtualBoxinstall bootloaderinstall device driversthe Linux VM configuration does not support the VirtualBox ethernet device natively, so it must be configured manually
set up database configuration1 (the files and data that are not part of a Postgres-level backup)install database contents1
there are 2 possible strategies for this:a) fix the bugs in the version ofnot using this methodmake install
available in that backup, and run it to build the database from the flat files. the advantage of this approach is thatmake install
is very fast (~15 min) when restoring only a single datasource in addition to the metadata datasources, whilepg_restore
would take a day or more or require a custom script to do a selective restore.b) find the nearest database backup, restore it, and undo/redo modifications between the time of the VM backup and the time of the database backup. this involves either restoring the full DB (~1 day on vegbiendev, likely slower on an external USB drive) or writing a script to do a selective restore of the data needed for the test import; and then figuring out which modifications were made to the DB before/since the VM backup. (see README.TXT > to set up the VM)
test import and verify that it worksperform system upgrades and see if import still works (see README.TXT > to install system updates)the import does still work2, so the join columns bug (#902) is fortunately not in a system upgrade.if not, selectively install system upgrades to narrow down the problemN/A
update svn and see if import still works (see README.TXT > to test a different revision)it doesn't, so this means the bug is somewhere in our codeif not, test different revisions to narrow down the problemthe problem was caused by r12153, so debugging can be limited to the code affected by this commit
1 the backups from that time did not include the database because of incorrect information in the Postgres documentation stating that a filesystem-level backup of the live database would not be useful (in fact Postgres usually keeps around enough WAL files3, and can be configured to keep around even more of them, so that a filesystem-level backup would work perfectly fine in most cases). this incorrect information (essentially a disclaimer) is likely there because the Postgres developers do not want users being upset at them if their filesystem-level backup happens to be unrestorable.
2 note that there is a slowdown, where the import freezes for 25 min on a query, before resuming:
pg_stat_activity snapshot taken: "15:34:51"
"query_start"
"2014-07-14 14:56:59.482365-07"
"backend_start"
"2014-07-14 14:43:41.845428-07"
query running for 25 min ("0:23:56.119442" in inputs/ARIZ/omoccurrences/logs/test_import.log.sql, but strangely 40 min ("14:56:59".."15:34:51") according to pg_stat_activity)
[2] DB query: non-cacheable:
Took 0:23:56.119442 sec
/*ARIZ.omoccurrences*/
CREATE TEMP TABLE "coordinates_pkeys" AS
SELECT
"in#41"."*occurrenceID"
, "coordinates"."coordinates_id" AS "out.coordinates_id"
FROM "in#41"
JOIN "coordinates" ON
"coordinates"."source_id" = 5
AND "coordinates"."latitude_deg" = "in#41"."_nullIf(decimalLatitude).result"
AND "coordinates"."longitude_deg" = "in#41"."_nullIf(decimalLongitude).result"
AND COALESCE("coordinates"."verbatimlatitude", CAST('\N' AS text)) = COALESCE(NULL, CAST('\N' AS text))
AND COALESCE("coordinates"."verbatimlongitude", CAST('\N' AS text)) = COALESCE(NULL, CAST('\N' AS text))
AND COALESCE("coordinates"."verbatimcoordinates", CAST('\N' AS text)) = COALESCE("in#41"."ARIZ.omoccurrences.verbatimCoordinates", CAST('\N' AS text))
AND COALESCE("coordinates"."footprintgeometry_dwc", CAST('\N' AS text)) = COALESCE("in#41"."ARIZ.omoccurrences.footprintWKT", CAST('\N' AS text))
/* EXPLAIN:
Nested Loop (cost=5333.13..5366.19 rows=1 width=8)
-> Index Scan using coordinates_unique on coordinates (cost=0.41..18.01 rows=1 width=84)
Index Cond: ((source_id = 5) AND (COALESCE(verbatimlatitude, '\N'::text) = '\N'::text) AND (COALESCE(verbatimlongitude, '\N'::text) = '\N'::text))
-> Bitmap Heap Scan on "in#41" (cost=5332.72..5348.17 rows=1 width=73)
Recheck Cond: ((COALESCE("ARIZ.omoccurrences.verbatimCoordinates", '\N'::text) = COALESCE(coordinates.verbatimcoordinates, '\N'::text)) AND (COALESCE("ARIZ.omoccurrences.footprintWKT", '\N'::text) = COALESCE(coordinates.footprintgeometry_dwc, '\N'::text)))
Filter: ((coordinates.latitude_deg = "_nullIf(decimalLatitude).result") AND (coordinates.longitude_deg = "_nullIf(decimalLongitude).result"))
-> Bitmap Index Scan on "in#41_coalesce_coalesce1_coalesce2_coalesce3_idx" (cost=0.00..5332.72 rows=4 width=0)
Index Cond: ((COALESCE("ARIZ.omoccurrences.verbatimCoordinates", '\N'::text) = COALESCE(coordinates.verbatimcoordinates, '\N'::text)) AND (COALESCE("ARIZ.omoccurrences.footprintWKT", '\N'::text) = COALESCE(coordinates.footprintgeometry_dwc, '\N'::text)))
*/
3 except when the database is under very heavy load, such as at the beginning of full-database import
tracing approach¶
- attempt to trace where the incorrect columns list is being introduced
- generally time-consuming to trace data through complex code
- narrowing down the affected section of code (using the working-backup approach) would reduce the amount of code that needs to be traced
rewriting approach¶
- rewrite the section of code with the bug
- in this case, that code section is the most complex part of our codebase, so this would be potentially very time-consuming
- narrowing down the affected section of code (using the working-backup approach) would reduce the amount of code that needs to be rewritten
clean VM approach¶
not using this approach because we have a backup from the time of the last successful import, so we don't need to obtain/install dependencies in the versions they had in the last successful import
prepare clean VMsfix`make install`
, which sets up the entire BIEN installation and dependencies- normally, we do not reinstall the DB from scratch, so the bugs in
`make install`
only become apparent when it is run on a partial installation
- normally, we do not reinstall the DB from scratch, so the bugs in
- install the database from scratch on a clean VM (VirtualBox)
- this would involve adding any missing dependencies to our install scripts
- test the import in the clean VM with a sample datasource to see if that reproduces the problem
- if it does, we know it's a bug in Postgres/Ubuntu and can troubleshoot using VM images with different Postgres/Ubuntu versions
- if it doesn't, it's a problem specific to just the vegbiendev VM and we would reset the vegbiendev VM to a clean Ubuntu install and reinstall our dependencies
Postgres rollback approach¶
not using this approach because it only works if the problem is with Postgres, but it could also be a problem in the other import code (eg. Python)
- roll back Postgres to the version it was at in the last successful import
- this may require building Postgres from source, because past revisions of the same numeric version might only be available in version control, not in binary form via apt-get (which numbers packages by numeric version)
- if this isn't possible, it may be necessary to downgrade to Postgres 9.2 (which will unfortunately be missing some features that we now use)
- see if this fixes the problem
Updated by Aaron Marcuse-Kubitza over 10 years ago
- Description updated (diff)
- % Done changed from 0 to 30
Updated by Aaron Marcuse-Kubitza over 10 years ago
- Description updated (diff)
- % Done changed from 40 to 50
Updated by Aaron Marcuse-Kubitza over 10 years ago
- Description updated (diff)
- % Done changed from 100 to 70
Updated by Aaron Marcuse-Kubitza over 10 years ago
- Description updated (diff)
- % Done changed from 70 to 80
Updated by Aaron Marcuse-Kubitza over 10 years ago
- % Done changed from 80 to 90
Updated by Aaron Marcuse-Kubitza over 10 years ago
- Subject changed from narrow down the cause of the incorrect join columns and disk space leak to narrow down the cause of the import bug (incorrect join columns and disk space leak)
Updated by Aaron Marcuse-Kubitza over 10 years ago
- Description updated (diff)
- Status changed from New to Resolved
- % Done changed from 90 to 100
bug fixed in r14074