Activity - BIEN 3 - NCEAS Projects

Activity

From 06/29/2012 to 07/28/2012

07/27/2012

11:12 PM Revision 3673: Regenerated vegbien.ERD exports: Aaron Marcuse-Kubitza
11:10 PM Revision 3672: schemas/vegbien.ERD.mwb: Fixed lines: Aaron Marcuse-Kubitza
11:08 PM Revision 3671: schemas/vegbien.ERD.mwb: Synced with schema: Aaron Marcuse-Kubitza
10:51 PM Revision 3670: bin/map: Call sys.stdout.flush() after every call to sys.stdout.write() to avoid interleaved stdout/stderr output due to stdout buffering: Aaron Marcuse-Kubitza
10:48 PM Revision 3669: bin/map: Call sys.stdout.flush() after every call to sys.stdout.write() to avoid interleaved stdout/stderr output due to stdout buffering: Aaron Marcuse-Kubitza
10:13 PM Revision 3668: schemas/vegbien.sql: *_unique_datasource UNIQUE INDEXes: Removed COALESCE() from datasource_id and datasource_id IS NOT NULL filter, because datasource_id is now always NOT NULL: Aaron Marcuse-Kubitza
10:07 PM Revision 3667: schemas/filter_ERD.csv: Removed AUTO_INCREMENT because that is not added to any other tables: Aaron Marcuse-Kubitza
10:05 PM Revision 3666: Regenerated schemas/vegbien.my.sql: Aaron Marcuse-Kubitza
10:04 PM Revision 3665: schemas/vegbien.sql: specimenreplicate: Inherit datasource_id from taxonoccurrence instead of defining it independently: Aaron Marcuse-Kubitza
09:56 PM Revision 3664: xml_func.py: Removed no longer needed local XML functions that have been translated to SQL functions: Aaron Marcuse-Kubitza
09:52 PM Revision 3663: input.Makefile: Testing: Removed VegBIEN.%.xml test because the import.%.xml test output includes the template tree that it's inserting, so there is no need to generate the XML tree in a separate test. This will also remove the need to maintain local XML functions that have already been translated to DB functions for the sole purpose of this automated test.: Aaron Marcuse-Kubitza
09:40 PM Revision 3662: schemas/vegbien.sql: Made datasource_id required on every table that has it, to trigger the automatic population of it by sql_io.put_table()'s col_defaults: Aaron Marcuse-Kubitza
09:38 PM Revision 3661: Moved importing of col_defaults from db_xml.put_table() to bin/map, so that it also happens in row-based mode. Note that this causes a DB entry for the datasource to always be created, even if the datasource has no mappings or no rows.: Aaron Marcuse-Kubitza
09:13 PM Revision 3660: Use new exc.reraise() where exc.raise_() was used, so that the stack trace is preserved when the exception is rethrown: Aaron Marcuse-Kubitza
09:11 PM Revision 3659: exc.py: reraise(): Take optional exception argument so it can be invoked in the same way as raise_(). Interestingly, this missing parameter does not produce the usual "...() takes no arguments (1 given)" error when the function is called inside an except block.: Aaron Marcuse-Kubitza
09:04 PM Revision 3658: exc.py: Added reraise(): Aaron Marcuse-Kubitza
09:02 PM Revision 3657: db_xml.py: put(): Inserting node: Wrap sql_io.put_table() call in catch-all exception handler that calls on_error_() (wrapper for error handler provided by caller) and returns None. This both adds additional debugging info to the exception (in on_error_()) and allows recovery from arbitrary exceptions that happen in sql_io.put_table(), so that an exception does not abort the import.: Aaron Marcuse-Kubitza
08:50 PM Revision 3656: exc.py: get_e_tracebacks_str(): Use the current system traceback if the exception doesn't contain its own traceback(s): Aaron Marcuse-Kubitza
08:35 PM Revision 3655: schemas/vegbien.sql: specimenreplicate: Added locationevent fkey, since fkeys are not inherited from parent tables: Aaron Marcuse-Kubitza
08:30 PM Revision 3654: schemas/vegbien.sql: Added datasource_id fkey constraints to all tables that needed it: Aaron Marcuse-Kubitza
08:21 PM Revision 3653: bin/map: out_is_db: Use col_defaults in row-based mode as well: Aaron Marcuse-Kubitza
08:02 PM Revision 3652: db_xml.py: Renamed put_table_special_funcs to put_special_funcs because it is now used by put() as well: Aaron Marcuse-Kubitza
08:00 PM Revision 3651: db_xml.py: Moved put() before the functions that use it: Aaron Marcuse-Kubitza
07:58 PM Revision 3650: db_xml.py: Renamed _put_table_part() to put(), replacing the existing put() whose functionality it now performs: Aaron Marcuse-Kubitza
07:52 PM Revision 3649: db_xml.py: _put_table_part(): Reordered params to match put(), so that it can eventually be substituted for it: Aaron Marcuse-Kubitza
07:44 PM Revision 3648: db_xml.py: _put_table_part(): Allow being invoked directly by adding defaults for parameters: Aaron Marcuse-Kubitza
07:41 PM Revision 3647: db_xml.py: put(): Use _put_table_part(). This will ensure that all the put-related functionality is in one place, rather than duplicated.: Aaron Marcuse-Kubitza
07:30 PM Revision 3646: db_xml.py: _put_table_part(): Append the node to errors handled with on_error(): Aaron Marcuse-Kubitza
07:29 PM Revision 3645: sql_io.py: Added own SyntaxError class to replace built-in SyntaxError because it stringifies to only the first line: Aaron Marcuse-Kubitza
06:46 PM Revision 3644: input.Makefile: Testing: Removed $(via).%.xml tests because they require the via format (DwC/VegX) to be XML, but we want to flatten VegX into a DwC-like set of CSV column names: Aaron Marcuse-Kubitza
06:45 PM Revision 3643: Removed inputs/NY/test/VegX.specimens.xml.ref because NY is not mapped via VegX: Aaron Marcuse-Kubitza
06:31 PM Revision 3642: input.Makefile: Testing: Renamed import.*.out tests to end in .xml because they now contain XML import trees for validation, and this extension turns on XML syntax highlighting in a text editor: Aaron Marcuse-Kubitza
06:03 PM Revision 3641: bin/map: out_is_db: Output the put template to stdout so it will be validated in the automated testing: Aaron Marcuse-Kubitza
05:41 PM Revision 3640: xml_func.py: process(): If local XML function can't be found, just replace with last param instead of returning an error. This allows DB-only functions to be ignored in XML output mode.: Aaron Marcuse-Kubitza
05:32 PM Revision 3639: sql_gen.py: ColDict.__setitem__(): Fixed bug where None value should not be replaced with column default value if column has no underlying table: Aaron Marcuse-Kubitza
05:27 PM Revision 3638: sql.py: DbConn.col_info(): If column does not exist, raise sql_gen.NoUnderlyingTableException: Aaron Marcuse-Kubitza
04:58 PM Revision 3637: sql_io.py: put_table(): In log messages, use `.to_str(db)` instead of repr() where possible to use the SQL syntax of the DB driver: Aaron Marcuse-Kubitza
04:51 PM Revision 3636: sql_io.py: put_table(): ignore(): Replacing invalid value with NULL in nullable column: Corrected log message to "Replacing invalid value ... with NULL in column ..." because the rows with that value are not ignored in that case: Aaron Marcuse-Kubitza
04:47 PM Revision 3635: sql.py: run_query(): InvalidValueException: Parse any exception ending in "out of range", not just "field value out of range", in order to support errors that the timezone is out of range: Aaron Marcuse-Kubitza
04:35 PM Revision 3634: schemas/py_functions.sql: _dateRange*(): Made functions STRICT because they return NULL on NULL input: Aaron Marcuse-Kubitza

07/26/2012

09:53 PM Revision 3633: sql_io.py: put(): Use a simple case of put_table(), which now supports everything put() needs. This will enable all row-based and column-based processing to be maintained in the same function, put_table(), and avoids the need to reimplement any column-based functionality (like SQL functions) in put().: Aaron Marcuse-Kubitza
09:51 PM Revision 3632: xml_dom.py: NodeTextEntryIter: Allow empty values through as None, and instead filter them out in TextEntryOnlyIter using new helper function non_empty(). This allows XML functions to decide for themselves whether empty values should be filtered out, because process() will now no longer automatically remove them. This will enable process() to work with SQL functions, which *must not* have empty values filtered out because this will remove required, but nullable, arguments.: Aaron Marcuse-Kubitza
09:45 PM Revision 3631: xml_func.py: Use conv_items() in every XML function that needs empty (NULL) entries removed, so that they are not dependent on what process() does to the items: Aaron Marcuse-Kubitza
09:43 PM Revision 3630: sql_io.py: put_table(): ignore(): Support invalid literals in addition to invalid column values. This also allows put_table() to fully support being called by put().: Aaron Marcuse-Kubitza
08:55 PM Revision 3629: xml_func.py: process(): In row-based mode, if function is not explicitly a relational function but does not exist as a local XML function, treat it as a relational function. This will help in merging sql_io.put() and put_table(), since put() did not support SQL functions but put_table() does, and this ensures that a SQL function is always used if the local XML function has been removed in favor of it.: Aaron Marcuse-Kubitza
08:37 PM Revision 3628: sql_io.py: put_table(): Removed into param to set a custom into table name because put_table() now has all the info it needs to generate this name automatically, and callers are no longer providing it: Aaron Marcuse-Kubitza
07:56 PM Revision 3627: bin/map: by_col: db_xml.put_table() call: Use new col_defaults param to automatically set datasource_id to the in_label (datasource name): Aaron Marcuse-Kubitza
07:46 PM Revision 3626: xpath.py: path2xml(): Skip to tree created inside root, since that is how callers want to use the returned node: Aaron Marcuse-Kubitza
07:45 PM Revision 3625: db_xml.py: put_table(): Import col_defaults to translate nodes to pkeys: Aaron Marcuse-Kubitza
07:44 PM Revision 3624: db_xml.py: _put_table_part(): Support no in_table, for iterations with only literal values: Aaron Marcuse-Kubitza
07:27 PM Revision 3623: sql_io.py: put_table(): is_literals: When ignoring all rows, return default value instead of always None: Aaron Marcuse-Kubitza
06:35 PM Revision 3622: db_xml.py: put_table(): Removed parent_ids_loc and next params since these are only used in the recursion: Aaron Marcuse-Kubitza
06:17 PM Revision 3621: db_xml.py: put_table(): Split into an outer function that sets up the database environment and subsets in_table, and a (recursive) inner function that imports the data: Aaron Marcuse-Kubitza
05:55 PM Revision 3620: db_xml.py: put_table(): Subsetting and partitioning in_table: Documented that it's OK to do this even if table already the right size because it takes <1 sec: Aaron Marcuse-Kubitza
05:43 PM Revision 3619: sql_io.py: put_table(): Use is_function where caller-provided is_func was used, since is_function determines whether something is a function based on whether it actually exists as a SQL function instead of just whether its name starts with "_". Removed now-unneeded is_func param.: Aaron Marcuse-Kubitza
05:36 PM Revision 3618: sql_io.py: put_table(): Added col_defaults param and use it if there's a missing mapping for a NOT NULL column. This requires callers passing arguments by position to add an empty value for this parameter.: Aaron Marcuse-Kubitza
04:48 PM Revision 3617: bin/map: by_col: Only clear errors table if doing full re-import starting from row 0, not if restarting import at a later row: Aaron Marcuse-Kubitza
04:47 PM Revision 3616: input.Makefile: Import to VegBIEN: Fixed bug where `&>>` was used to append stdout and stderr to the log file, but is not supported on Mac OS X. Replaced with `&>` (overwrite instead of append) because log file is unique by date/time the import runs, so there won't be an existing log file that would be overwritten.: Aaron Marcuse-Kubitza
04:34 PM Revision 3615: schemas/vegbien.sql: Added datasource_id to all tables with a sourceaccessioncode (and corresponding *_unique_datasource constraint on these columns) so they can be directly looked up using just the input table's own fkey to parent. This will enable loading hierarchical (plots) data without "breadcrumbs", a huge benefit! Also added sourceaccessioncode wherever there was a datasource_id, to standardize on these names as being the columns that link directly to the input table rows.: Aaron Marcuse-Kubitza
01:15 PM Revision 3614: README.TXT: Datasource setup: Installing the staging tables: View the logs: Fixed bug in tail syntax to also work on Linux: Aaron Marcuse-Kubitza

07/25/2012

11:04 PM Revision 3613: Added inputs/Madidi/ with empty mappings: Aaron Marcuse-Kubitza
11:01 PM Revision 3612: README.TXT: Datasource setup: Populating the src/ subdir with input data: Added step to make sure each header in multiple part files for a table is EXACTLY the same: Aaron Marcuse-Kubitza
10:56 PM Revision 3611: README.TXT: Datasource setup: Installing the staging tables: Added steps to deal with colliding column names in the flat file headers. Added command to view the logs.: Aaron Marcuse-Kubitza
10:53 PM Revision 3610: csv2db: log(): sys.stderr.write(): Run strings.to_raw_str() on message to handle Unicode chars: Aaron Marcuse-Kubitza
10:52 PM Revision 3609: csv2db: Run strings.to_unicode() on column names to handle Unicode chars: Aaron Marcuse-Kubitza
10:36 PM Revision 3608: csv2db: esc_name(): Use db.esc_name(): Aaron Marcuse-Kubitza
09:25 PM Revision 3607: Added inputs/BIEN2.datasources.xlsx (formerly bien_data_sources.xlsx in nimoy:/home/bien/raw_data/): Aaron Marcuse-Kubitza
09:06 PM Revision 3606: exc.py: e_msg(): Added assertions to check that e.args is compatible with this function: Aaron Marcuse-Kubitza
08:59 PM Revision 3605: exc.py: Use new e_str() where its definition was used: Aaron Marcuse-Kubitza
08:54 PM Revision 3604: exc.py: Use new Unicode-safe e_msg() instead of strings.ustr() on exceptions: Aaron Marcuse-Kubitza
08:47 PM Revision 3603: exc.py: e_msg(): Run strings.ustr() on the returned string so it will be appendable to other Unicode strings: Aaron Marcuse-Kubitza
08:43 PM Revision 3602: exc.py: Added e_msg(), e_str() (from SQL py_functions._date()): Aaron Marcuse-Kubitza
02:06 PM Revision 3601: db_xml.py: put_table(): Adding fkey to parent: Fixed bug where should only add parent_ids_loc table to list of tables not to truncate if it's a column, because it is sometimes just a pkey value when that iteration contained only literals: Aaron Marcuse-Kubitza
01:56 PM Revision 3600: inputs/import.stats.xls: Updated with stats from latest import: Aaron Marcuse-Kubitza
01:42 PM Revision 3599: inputs/import.stats.xls: Corrected date of last import: Aaron Marcuse-Kubitza

07/24/2012

09:52 AM Revision 3598: sql_gen.py: plpythonu_error_handler: Fixed bug where PL/Python exceptions could not be filtered by strings after the first line, because only the "message" portion of the exception is available in SQLERRM: Aaron Marcuse-Kubitza
09:35 AM Revision 3597: schemas/py_functions.sql: _date(): YMD parsing: Fixed bug where exception for ValueError needed to be stored in local var so its message could be parsed: Aaron Marcuse-Kubitza
09:33 AM Revision 3596: sql_gen.py: plpythonu_error_handler: Always raise PL/Python exceptions as data_exception so they go in the errors table, instead of aborting the iteration: Aaron Marcuse-Kubitza
09:16 AM Revision 3595: sql_gen.py: plpythonu_error_handler: Fixed bug where not all PL/Python exceptions start with "PL/Python: " (e.g. on PostgreSQL 9.1 on vegbiendev), so the PL/Python prefix must be optional. Refactored to put IF clause for non-PL/Python exception at end for a more logical ordering of the conditions.: Aaron Marcuse-Kubitza
08:41 AM Revision 3594: Added inputs/CVS/: Aaron Marcuse-Kubitza
08:40 AM Revision 3593: README.TXT: Datasource setup: Added steps to place the relevant files under version control: Aaron Marcuse-Kubitza
08:31 AM Revision 3592: README.TXT: Datasource setup: Accepting the test cases: Don't auto-accept the initial tests because there could be bugs in the initial mappings that would be revealed upon inspecting the test output: Aaron Marcuse-Kubitza
08:14 AM Revision 3591: sql_gen.py: plpythonu_error_handler: Added section comment before handler block, so that it's clear in the (very long) wrapper function definition what the block is doing: Aaron Marcuse-Kubitza
07:59 AM Revision 3590: input.Makefile: Documentation: import/steps.by_col.sql: Added -s to make to avoid echoing make commands to the log file: Aaron Marcuse-Kubitza
07:46 AM Revision 3589: README.TXT: Moved Reinstall all datasources at once to Schema changes and renamed it to Reinstall staging tables to reflect that it is only necessary when the staging table format is changed: Aaron Marcuse-Kubitza
07:43 AM Revision 3588: README.TXT: Datasource setup: Updating vegbiendev: Added step to also install the staging tables on vegbiendev: Aaron Marcuse-Kubitza
07:42 AM Revision 3587: README.TXT: Datasource setup: Moved Install the staging tables before Map each table's columns because the install can run in the background while you're mapping. It must, however, come after Auto-create the map spreadsheets because it uses the filenames of the created maps to determine which staging tables to create.: Aaron Marcuse-Kubitza
07:40 AM Revision 3586: README.TXT: Datasource setup: Adding a new datasource: Changed <short_name> to <name> to match usage elsewhere. Documented that it may not contain spaces, and should be abbreviated.: Aaron Marcuse-Kubitza
07:33 AM Revision 3585: README.TXT: Datasource setup: Added steps to update vegbiendev: Aaron Marcuse-Kubitza
07:31 AM Revision 3584: inputs/Makefile: Input data: Added upload target: Aaron Marcuse-Kubitza
07:21 AM Revision 3583: README.TXT: Datasource setup: Added steps to accept the test cases and commit: Aaron Marcuse-Kubitza
07:18 AM Revision 3582: README.TXT: Datasource setup: Added step to install the staging tables: Aaron Marcuse-Kubitza
07:18 AM Revision 3581: bin/map: in_is_xml: doc2rows(): "Root not found in input" warning: Changed "error" to "warning" to match the type of error condition signaled: Aaron Marcuse-Kubitza
07:15 AM Revision 3580: bin/map: map_rows(): out_is_db: Changed `id_node != None` assertion to a warning because this is a normal circumstance in the base case where there are no mappings: Aaron Marcuse-Kubitza
07:13 AM Revision 3579: input.Makefile: Testing: Added test/accept-all: Aaron Marcuse-Kubitza
07:11 AM Revision 3578: csv2db: COPY FROM: Fixed %-injection bug where column names' %s were not escaped prior to cursor.mogrify(), by changing the code to use inline db.esc_value() instead: Aaron Marcuse-Kubitza
06:37 AM Revision 3577: bin/map: in_is_xml: doc2rows(): "Root not found in input" error: Changed SystemExit to a warning because this is a normal circumstance in the base case where the input XML file contains no rows: Aaron Marcuse-Kubitza
06:12 AM Revision 3576: README.TXT: Datasource setup: Documented how to map each table's columns: Aaron Marcuse-Kubitza
05:57 AM Revision 3575: README.TXT: Datasource setup: Changed "Auto-create the src column spreadsheets" to "Auto-create map spreadsheets" and updated command to bootstrap all maps, including newly-autogeneratable via maps: Aaron Marcuse-Kubitza
05:50 AM Revision 3574: input.Makefile: Maps building: maps/$(via).%.csv: Auto-create by copying the src map if doesn't exist. Existing maps discovery: Look up via format in src maps' roots if no via map already exists.: Aaron Marcuse-Kubitza
05:46 AM Revision 3573: src_map: Fixed bug where non-header rows needed to be materialized with empty fields for each column in the header: Aaron Marcuse-Kubitza
04:27 AM Revision 3572: input.Makefile: Maps building: Via maps cleanup: Match maps/$(via).%.csv with pattern instead of $(viaMaps) var so that a non-existing via map will have the recipe run, too. When auto-creating via maps is later added, this will be required.: Aaron Marcuse-Kubitza
04:07 AM Revision 3571: inputs/*/maps/src.*.csv: Regenerated using new src_map output format: Aaron Marcuse-Kubitza
04:06 AM Revision 3570: parallelproc.py: MultiProducerPool: Removed warning if not using parallel processing because this also gets generated when it's explicitly turned off, which is currently the case and clutters up stderr when testing: Aaron Marcuse-Kubitza
03:57 AM Revision 3569: src_map: Also add columns for the output mappings and comments, so that the src map can be directly copied for use as the via map (DwC.specimens.csv, etc.). The output mapping column name must be provided by the caller, which input.Makefile maps/src.%.csv provides using the new mappings roots.: Aaron Marcuse-Kubitza
03:52 AM Revision 3568: Added mappings/roots for use in creating src maps: Aaron Marcuse-Kubitza
03:41 AM Revision 3567: input.Makefile: Maps building: maps/src.%.csv: Clean up by passing through `$(bin)/cols '*'` whenever it's changed. This ensures that the CSV dialect is always consistently Python's Excel dialect. (Note that this dialect actually uses \r\n as the line ending. The \n line endings were from src maps generated by a previous version of bin/src_map.): Aaron Marcuse-Kubitza
03:28 AM Revision 3566: input.Makefile: Maps building: maps/$(via).%.full.csv: Removed alternate rule when $(srcMap) doesn't exist, because this effect is actually achieved by the no-prereqs rule for maps/src.%.csv, which causes make to think it exists when matching pattern rules even if its recipe doesn't actually create it: Aaron Marcuse-Kubitza
03:23 AM Revision 3565: input.Makefile: Maps building: maps/$(via).%.full.csv: Added alternate rule when $(srcMap) doesn't exist: Aaron Marcuse-Kubitza
03:21 AM Revision 3564: inputs/CTFS/maps/: Removed unneeded src.organisms.csv since there is an way to deal with it not existing in input.Makefile: Aaron Marcuse-Kubitza
03:18 AM Revision 3563: inputs/CTFS/maps/: Removed unneeded .VegX.plots.csv.last_cleanup: Aaron Marcuse-Kubitza
02:13 AM Revision 3562: inputs/*/maps/src.*.csv: Standardized line endings to \n: Aaron Marcuse-Kubitza
01:56 AM Revision 3561: input.Makefile: Maps building: maps/$(via).%.full.csv: Added the src map as a prerequisite so it would be rebuilt when the src map changes. This is possible now that every datasource has at least an empty src map. (An empty src map is now treated the same way as a non-existing one.): Aaron Marcuse-Kubitza
01:52 AM Revision 3560: inputs/*/maps/src.*.csv: Removed extraneous quotes around fields, which are added by Excel but not by Python: Aaron Marcuse-Kubitza
01:49 AM Revision 3559: inputs/*/maps/src.*.csv: Removed extraneous quotes around fields, which are added by Excel but not by Python: Aaron Marcuse-Kubitza
01:41 AM Revision 3558: inputs/CTFS: Added empty maps/src.organisms.csv so that every table of every datasource has a src map: Aaron Marcuse-Kubitza
12:18 AM Revision 3557: README.TXT: Datasource setup: Documented how to populate the src/ subdir with input data: Aaron Marcuse-Kubitza

07/23/2012

10:52 PM Revision 3556: Added inputs/CVS/: Aaron Marcuse-Kubitza
10:28 PM Revision 3555: sql_gen.py: plpythonu_error_handler: Translate specific Python exception types to PostgreSQL error codes (ValueError -> data_exception) instead of assuming everything is a data_exception. When removing the PL/Python prefix, preserve the Python exception class in a DETAIL message. Support non-PL/Python internal_errors by re-raising them.: Aaron Marcuse-Kubitza
10:25 PM Revision 3554: sql_gen.py: Added reraise_exc: Aaron Marcuse-Kubitza
10:21 PM Revision 3553: schemas/py_functions.sql: _date(): Raise (or pass through) ValueErrors directly instead of wrapping them in FormatExceptions, to simplify the code. This will also enable later translation of ValueErrors to data_exceptions. When year is required and missing, output a parsable 'null value in column year violates not-null constraint' error.: Aaron Marcuse-Kubitza
09:48 PM Revision 3552: sql_io.py: put_table(): log_exc(): Handle infinite loops from repeated exceptions by removing all rows, instead of just aborting with a failed assertion: Aaron Marcuse-Kubitza
09:36 PM Revision 3551: sql_io.py: put_table(): is_function: Fixed bug where special case for unrecoverable errors needed to avoid creating an empty output pkeys table because function mode defines the returned pkeys table separately: Aaron Marcuse-Kubitza
09:08 PM Revision 3550: sql_io.py: put_table(): is_function: Factored defining the error handling wrapper function out of the main loop because it only needs to run once. Don't log "Trying to insert new rows" in function mode because it's inaccurate.: Aaron Marcuse-Kubitza
07:14 PM Revision 3549: sql_gen.py: Exceptions: Added suppress_exc and use it in ExcHandler.to_str(): Aaron Marcuse-Kubitza
06:53 PM Revision 3548: README.TXT: Backups: After a new import: Added step to delete previous imports so they won't bloat the full DB backup. (Note that these imports have already been backed up, and only the most recent import needs to be live in the DB.): Aaron Marcuse-Kubitza
06:48 PM Revision 3547: README.TXT: Backups: Documented what to do after a new import: Aaron Marcuse-Kubitza
06:39 PM Revision 3546: backups/Makefile: Full DB: Added vegbien.backup/all to run both test and rotate: Aaron Marcuse-Kubitza
06:24 PM Revision 3545: README.TXT: Renamed Maintenance section to Backups for clarity: Aaron Marcuse-Kubitza
06:19 PM Revision 3544: backups/Makefile: %.sql: When testing, turn it off so make won't skip `%.sql: %` in favor of it: Aaron Marcuse-Kubitza
06:07 PM Revision 3543: backups/Makefile: Split %.backup and %.sql into separate targets for clarity: Aaron Marcuse-Kubitza
05:56 PM Revision 3542: inputs/import.stats.xls: Updated with stats from latest import. Note that this import adds data provider feedback for SQL functions as well as additional date processing using _date().: Aaron Marcuse-Kubitza

07/20/2012

07:10 AM Revision 3541: schemas/py_functions.sql: _date(): Re-enabled now that exceptions thrown are properly handled. FormatException: Support raising parsable data_exceptions when provided with the value that was invalid. Date parsing mode: Return date as the value in FormatException so it can be filtered out automatically by column-based import.: Aaron Marcuse-Kubitza
07:06 AM Revision 3540: sql_io.py: put_table(): is_function: Creating error handling wrapper function: Fixed bug where needed to cast NULL returned in error handler to appropriate type, because it's contained within a SELECT query which does not do implicit casts from type unknown: Aaron Marcuse-Kubitza
07:03 AM Revision 3539: sql_gen.py: Cast: Support types which are Code objects: Aaron Marcuse-Kubitza
06:05 AM Revision 3538: sql_io.py: func_wrapper_exception_handler(): Use new sql_gen.merge_not_null() to try to ensure that NULL values are not folded (which would cause the concatenated values not to match up with the concatenated column names). Note that this adds a dependency on the db object, which callers must now provide.: Aaron Marcuse-Kubitza
06:03 AM Revision 3537: sql_gen.py: Added merge_not_null(): Aaron Marcuse-Kubitza
06:03 AM Revision 3536: sql_gen.py: Added try_mk_not_null(): Aaron Marcuse-Kubitza
05:54 AM Revision 3535: sql_gen.py: Renamed ArrayJoin to ArrayMerge to avoid confusion with Join (a SQL construct): Aaron Marcuse-Kubitza
05:46 AM Revision 3534: sql_io.py: put_table(): is_function: Creating error handling wrapper function: Set srcs on row_var so that the column type and nullability info of row_var's columns can be retrieved for use with sql_gen.ensure_not_null(): Aaron Marcuse-Kubitza
05:38 AM Revision 3533: sql_gen.py: RowExcIgnore.to_str(): Compare self.row_var to global const row_var using == to allow caller to provide a copy of row_var with the underlying table set appropriately: Aaron Marcuse-Kubitza
05:35 AM Revision 3532: sql_gen.py: underlying_table(): Support derived tables and row vars by obtaining the underlying table from the srcs: Aaron Marcuse-Kubitza
05:25 AM Revision 3531: sql_io.py: put_table(): Setting pkeys of missing rows: Fixed bug where also needed to do this when is_function if an empty pkeys table was created (due to an error that could not be localized to a row): Aaron Marcuse-Kubitza
05:16 AM Revision 3530: sql_io.py: put_table(): After main loop: If is_literals, return immediately to avoid needing to test for is_literals in all the code that follows (which only applies to the normal case): Aaron Marcuse-Kubitza
04:43 AM Revision 3529: sql_gen.py: RowExcIgnore: If a custom row_var is used, require it to already be defined. This also allows sql_io.ExcToErrorsTable to place the column var definition in the outer DECLARE, eliminating the extra DECLARE block.: Aaron Marcuse-Kubitza
04:30 AM Revision 3528: sql_io.py: put_table(): is_function: Creating error handling wrapper function: Use new sql_gen.row_var: Aaron Marcuse-Kubitza
04:28 AM Revision 3527: sql_gen.py: RowExcIgnore: Created global constant for default row_var for callers to use: Aaron Marcuse-Kubitza
04:24 AM Revision 3526: sql_gen.py: RowExcIgnore.to_str(): Moved SQL comment explaining the use of an EXCEPTION block for each individual row to Python code to avoid cluttering the logged SQL code: Aaron Marcuse-Kubitza
04:19 AM Revision 3525: sql_io.py: put_table(): is_function: Creating error handling wrapper function: Handle errors using new func_wrapper_exception_handler(), which saves any data_exceptions in the errors table in addition to handling PL/Python errors: Aaron Marcuse-Kubitza
04:13 AM Revision 3524: sql_io.py: Added func_wrapper_exception_handler(): Aaron Marcuse-Kubitza
04:10 AM Revision 3523: sql_gen.py: Added ArrayJoin: Aaron Marcuse-Kubitza
04:10 AM Revision 3522: sql_gen.py: Added Array and to_Array(): Aaron Marcuse-Kubitza
02:47 AM Revision 3521: sql_gen.py: Added List and inherit from it in Tuple: Aaron Marcuse-Kubitza
02:45 AM Revision 3520: sql_gen.py: Renamed Tuple to Row and List to Tuple to more accurately reflect the datatype generated by each class (a Tuple being merely a grouping of values): Aaron Marcuse-Kubitza
02:43 AM Revision 3519: sql_gen.py: Moved Composite types to Literal values section as a subsection, since Composite types was really about just the input syntaxes for these types: Aaron Marcuse-Kubitza
02:32 AM Revision 3518: sql_gen.py: Replaced srcs_str() with cross_join_srcs() which more correctly combines the srcs of each column using a Cartesian product. Eventually, the entire tree of srcs will need to be preserved instead of flattened in order to properly attribute errors to a specific column or set of columns.: Aaron Marcuse-Kubitza
02:03 AM Revision 3517: sql_gen.py: srcs_str(): Fixed bug where needed to filter out columns with no srcs so that there aren't empty elements in the ","-separated list: Aaron Marcuse-Kubitza
02:00 AM Revision 3516: sql_gen.py: Added has_srcs(): Aaron Marcuse-Kubitza
01:44 AM Revision 3515: sql_gen.py: Added NestedExcHandler: Aaron Marcuse-Kubitza
01:44 AM Revision 3514: sql_gen.py: Added srcs_str(): Aaron Marcuse-Kubitza
01:43 AM Revision 3513: sql_gen.py: as_Col(): Support non-Code, non-string inputs by making them Literals: Aaron Marcuse-Kubitza
01:42 AM Revision 3512: sql_gen.py: Added is_col() and use it in is_table_col(): Aaron Marcuse-Kubitza

07/19/2012

11:54 PM Revision 3511: sql_io.py: ExcToErrorsTable: Require users to explicitly specify an expression for the value that caused the error, instead of assuming that a variable named "value" already exists. This allows a value expression to be computed only if needed for error handling.: Aaron Marcuse-Kubitza
11:22 PM Revision 3510: sql_gen.py: Moved __repr__() from ExcHandler to BaseExcHandler: Aaron Marcuse-Kubitza
11:21 PM Revision 3509: sql_gen.py: Added BaseExcHandler and inherit from it in ExcHandlers: Aaron Marcuse-Kubitza
10:58 PM Revision 3508: sql_io.py: cast(): Determining if will be saving errors: Don't add extra check if isinstance(col, sql_gen.Col) because the special case for sql_gen.Literal handles supported non-columns: Aaron Marcuse-Kubitza
10:56 PM Revision 3507: sql_io.py: data_exception_handler(): Removed no longer needed db param: Aaron Marcuse-Kubitza
10:47 PM Revision 3506: sql_io.py: Added ExcToErrorsTable, which separates out the errors table inserting code from the exception handling code. data_exception_handler(): Refactored to use new sql_gen.data_exception_handler() and ExcToErrorsTable.: Aaron Marcuse-Kubitza
10:43 PM Revision 3505: sql_gen.py: Added data_exception_handler: Aaron Marcuse-Kubitza
10:08 PM Revision 3504: sql_io.py: data_exception_handler(): Refactored to use new sql_gen.ExcToWarning when not using an errors table: Aaron Marcuse-Kubitza
10:03 PM Revision 3503: sql_gen.py: Added ExcToWarning: Aaron Marcuse-Kubitza
10:02 PM Revision 3502: schemas/vegbien.sql: taxondetermination: taxondetermination_taxonoccurrence_id_fkey(): Fixed bug where string containing a \-escape needed an "E" prefix: Aaron Marcuse-Kubitza
09:42 PM Revision 3501: sql_io.py: data_exception_handler(): Require the caller to provide a statement to return a default value in case of error, rather than assuming the caller can accept a return value of NULL: Aaron Marcuse-Kubitza
09:27 PM Revision 3500: sql_io.py: data_exception_handler(): Refactored to use new sql.define_func(): Aaron Marcuse-Kubitza
09:20 PM Revision 3499: sql_io.py: put_table(): is_function: Calling function on input rows: Convert PL/Python exceptions (internal_errors) to data_exceptions using sql_gen.plpythonu_error_handler and an error handling wrapper function: Aaron Marcuse-Kubitza
09:10 PM Revision 3498: debug2redmine.csv: EXPLAIN comments: Fixed bug where needed to also match whitespace at beginning of line (indent): Aaron Marcuse-Kubitza
09:07 PM Revision 3497: Use sql_gen.ReturnQuery where RETURN QUERY was previously manually prepended: Aaron Marcuse-Kubitza
09:05 PM Revision 3496: sql_gen.py: Added ReturnQuery: Aaron Marcuse-Kubitza
08:48 PM Revision 3495: sql.py: define_func(): Fixed bug where next_version() needed to have module name removed since it's in the same module: Aaron Marcuse-Kubitza
08:47 PM Revision 3494: sql.py: mk_select(): Added explain param to turn off automatically running EXPLAIN on the created query. This is useful for SELECT statements which use local variables in PL/pgSQL functions.: Aaron Marcuse-Kubitza
08:44 PM Revision 3493: sql_gen.py: with_table(): Only set the table if the passed-in value is a Col or FunctionCall: Aaron Marcuse-Kubitza
08:41 PM Revision 3492: sql_gen.py: Added Tuple: Aaron Marcuse-Kubitza
08:41 PM Revision 3491: sql_gen.py: Added List and use it in Values.to_str(): Aaron Marcuse-Kubitza
08:14 PM Revision 3490: sql.py: Added define_func(): Aaron Marcuse-Kubitza
07:07 PM Revision 3489: Use sql_gen.SetOf where SETOF was previously manually prepended: Aaron Marcuse-Kubitza
07:06 PM Revision 3488: sql_gen.py: Added SetOf: Aaron Marcuse-Kubitza
07:06 PM Revision 3487: sql_gen.py: FunctionDef: Support return_types which are Code objects: Aaron Marcuse-Kubitza
06:55 PM Revision 3486: Use sql_gen.ColType where %TYPE was previously manually appended: Aaron Marcuse-Kubitza
06:54 PM Revision 3485: sql_gen.py: Added ColType: Aaron Marcuse-Kubitza
06:47 PM Revision 3484: Use sql_gen.RowType where %ROWTYPE was previously manually appended: Aaron Marcuse-Kubitza
06:45 PM Revision 3483: sql_gen.py: Added RowType: Aaron Marcuse-Kubitza
06:45 PM Revision 3482: sql_gen.py: RowExcIgnore: Accept row types which are Code objects: Aaron Marcuse-Kubitza
06:42 PM Revision 3481: sql_gen.py: TypedCol: Accept types which are Code objects: Aaron Marcuse-Kubitza
06:34 PM Revision 3480: sql_io.py: data_exception_handler(): Documented that the invalid value must be in a local variable of type text: Aaron Marcuse-Kubitza
06:33 PM Revision 3479: sql_io.py: data_exception_handler(): Documented that the invalid value must be in a local variable of type text: Aaron Marcuse-Kubitza
06:32 PM Revision 3478: sql_io.py: put_table(): is_function: Creating empty pkeys table so its row type can be used: Don't do this if is_literals because special error handling does not apply to that: Aaron Marcuse-Kubitza
06:13 PM Revision 3477: sql_io.py: put_table(): is_function: Create empty pkeys table before calling function on all rows so its row type can later be used in an error handling wrapper function: Aaron Marcuse-Kubitza
05:33 PM Revision 3476: input.Makefile: Staging tables: import/install-%: Run csv2db with a nice increment of +5 to avoid interfering with the user's other processes: Aaron Marcuse-Kubitza
05:28 PM Revision 3475: root map: Run bin/map with a nice increment of +5 to avoid interfering with the user's other processes: Aaron Marcuse-Kubitza
05:24 PM Revision 3474: sql_io.py: put_table(): Handle psycopg2.extensions.TransactionRollbackError by retrying the last query: Aaron Marcuse-Kubitza
05:00 PM Revision 3473: sql_io.py: Creating an empty output pkeys table: Assert that there are no join columns, so that the input pkeys table will be created correctly for the empty output pkeys table: Aaron Marcuse-Kubitza
04:53 PM Revision 3472: sql_io.py: put_table(): Creating an empty output pkeys table: Added "output" to clarify that the created table contains just the output pkeys, and must be joined with the input pkeys table: Aaron Marcuse-Kubitza
04:39 PM Revision 3471: sql_gen.py: FunctionDef: Renamed args to params: Aaron Marcuse-Kubitza
04:35 PM Revision 3470: sql_gen.py: FunctionDef: Accept parameters as FunctionParam objects instead of strings: Aaron Marcuse-Kubitza
04:32 PM Revision 3469: sql_gen.py: Added FunctionParam: Aaron Marcuse-Kubitza
04:04 PM Revision 3468: sql_gen.py: Added plpythonu_error_handler: Aaron Marcuse-Kubitza

07/18/2012

11:06 PM Revision 3467: Autogenerated SQL code: Use new strings.indent() where needed: Aaron Marcuse-Kubitza
11:05 PM Revision 3466: strings.py: Added indent(): Aaron Marcuse-Kubitza
10:50 PM Revision 3465: sql_io.py: data_exception_handler(): Refactored to use sql_gen.RowExcIgnore: Aaron Marcuse-Kubitza
10:31 PM Revision 3464: sql_io.py: cast(): Refactored to use sql_gen.FunctionDef: Aaron Marcuse-Kubitza
10:28 PM Revision 3463: sql_gen.py: ExcHandler: Removed extra newline after handler: Aaron Marcuse-Kubitza
10:27 PM Revision 3462: sql.py: mk_insert_select(): ignore: RETURN QUERY statement: Added back missing newline after ';': Aaron Marcuse-Kubitza
10:23 PM Revision 3461: sql_gen.py: FunctionDef: Added support for parameters: Aaron Marcuse-Kubitza
10:03 PM Revision 3460: sql_io.py: cast(): Just use the first word of the type in the function name to help avoid name collisions. Note that type name collisions that may be introduced by this change are not a problem because the function name is versioned. (The caching mechanism prevents versioning when the function has the same name and definition as an already-defined function.): Aaron Marcuse-Kubitza
09:59 PM Revision 3459: sql_io.py: Added data_exception_handler() and use it in cast(): Aaron Marcuse-Kubitza
09:58 PM Revision 3458: sql_gen.py: ExcHandler.to_str(): Removed extra newline after body: Aaron Marcuse-Kubitza
08:41 PM Revision 3457: sql_gen.py: ExcHandler: Added __repr__() since it's not a Code object: Aaron Marcuse-Kubitza
08:17 PM Revision 3456: sql_gen.py: FunctionDef: Support custom function modifiers: Aaron Marcuse-Kubitza
08:04 PM Revision 3455: sql_gen.py: RowExcIgnore: Changed exc param to exc_handler to allow user to specify handler code for the exception: Aaron Marcuse-Kubitza
08:01 PM Revision 3454: sql_gen.py: Added ExcHandler, unique_violation_handler: Aaron Marcuse-Kubitza
07:53 PM Revision 3453: sql_gen.py: RowExcIgnore: Don't automatically add 'RETURN QUERY' before the with_row code or ';' after it: Aaron Marcuse-Kubitza
07:02 PM Revision 3452: sql_gen.py: RowExcIgnore: Allow user to specify a custom row var name: Aaron Marcuse-Kubitza
06:55 PM Revision 3451: sql_gen.py: FunctionDef: Don't automatically add 'SETOF ' before the return type: Aaron Marcuse-Kubitza
06:44 PM Revision 3450: sql.py: mk_insert_select(): embeddable: Use new sql_gen.RowExcIgnore: Aaron Marcuse-Kubitza
06:44 PM Revision 3449: sql_gen.py: Added RowExcIgnore: Aaron Marcuse-Kubitza
06:12 PM Revision 3448: sql_gen.py: FunctionDef: Determine the lang from the body's Code object instead of receiving it as a parameter: Aaron Marcuse-Kubitza
06:10 PM Revision 3447: sql_gen.py: as_Code(): Fixed bug where needed to handle inputs that are already Code objects: Aaron Marcuse-Kubitza
05:52 PM Revision 3446: sql_gen.py: Code: Added lang instance var: Aaron Marcuse-Kubitza
05:49 PM Revision 3445: sql_gen.py: Fixed bug where Code subclasses needed to call Code.__init__() in their __init__() function. BasicObject: Fixed bug where __init__() expected a value param, when in fact the value param is something added by certain subclasses.: Aaron Marcuse-Kubitza
05:31 PM Revision 3444: sql_gen.py: FunctionDef: body param: Support Code inputs in addition to strings: Aaron Marcuse-Kubitza
05:14 PM Revision 3443: sql.py: mk_insert_select(): embeddable: Use new sql_gen.FunctionDef: Aaron Marcuse-Kubitza
05:13 PM Revision 3442: sql_gen.py: Added FunctionDef: Aaron Marcuse-Kubitza
03:49 PM Revision 3441: README.TXT: Schema changes: Documented how to reinstall errors tables: Aaron Marcuse-Kubitza
03:46 PM Revision 3440: csv2db: Creating errors table: Only drop existing errors table in errors_table_only mode, so that errors tables are not unintentionally deleted when `make inputs/install` is run. This helps to make `make install` idempotent.: Aaron Marcuse-Kubitza
03:40 PM Revision 3439: README.TXT: Maintenance: Full DB: Changed commands to autorotate the created backup and then test and restore a rotated backup: Aaron Marcuse-Kubitza
03:31 PM Revision 3438: backups/Makefile: Added %.backup/rotate: Aaron Marcuse-Kubitza
02:58 PM Revision 3437: backups/Makefile: Rearranged sections so that backup targets, which apply to both Archived imports and Full DB, are at the top of a common Backups section: Aaron Marcuse-Kubitza
02:54 PM Revision 3436: inputs/import.stats.xls: Updated with stats from latest import: Aaron Marcuse-Kubitza

07/17/2012

11:09 PM Revision 3435: schemas/py_functions.sql: Disabled _date() because it does not yet output errors in a format parsable by the import process, and the import process does not yet trap errors produced by SQL functions: Aaron Marcuse-Kubitza
11:00 PM Revision 3434: sql_io.py: put_table(): Determining if can use optimization for only literal values: Fixed bug where needed initial value for reduce(): Aaron Marcuse-Kubitza
10:52 PM Revision 3433: sql_io.py: put_table(): Needing >= one column for INSERT SELECT: Fixed bug where can't add pkey column if calling a function instead of outputting to a table: Aaron Marcuse-Kubitza
10:36 PM Revision 3432: sql_io.py: put_table(): Optimization for only literal values: Also support an empty in_tables list, for use by put(): Aaron Marcuse-Kubitza
10:20 PM Revision 3431: sql_io.py: put_table(): Added optimization for only literal values, which does the same operations as put() but with the additional error handling of put_table(): Aaron Marcuse-Kubitza
10:17 PM Revision 3430: pg_dump_vegbien: Don't use SET SESSION AUTHORIZATION because it doesn't work with the py_functions schema (it requires PL/Python functions to be created as user postgres and then the owner changed to bien, which SET SESSION AUTHORIZATION won't do): Aaron Marcuse-Kubitza
08:53 PM Revision 3429: sql_gen.py: Added is_literal() and use it where isinstance(..., Literal) is used: Aaron Marcuse-Kubitza
08:44 PM Revision 3428: db_xml.py: put_table(): Divide fields into input columns and literal values: Translate values: Allow literal values other than strings or None (from the XML parsing), because sql_io.put_table() is getting an optimization for iterations containing only literal values, which just returns the pkey of the single row for these values (which is usually an integer) instead of a temp table with the same value in each row: Aaron Marcuse-Kubitza
08:28 PM Revision 3427: bin/map: by_col: Stripping XML functions not in the DB: Remove DB functions based on whether a plain SQL function of that name exists, rather than whether a relational function (i.e. a table) of that name exists. This will allow column-based import to use plain SQL functions that don't have a corresponding relational function.: Aaron Marcuse-Kubitza
08:23 PM Revision 3426: db_xml.py: Don't remove any explicit pkey because the output table may be a SQL function, which does not have a pkey. This feature only existed to support importing VegBank XML exports, which we don't use (and which would be incompatible with the schema anyway).: Aaron Marcuse-Kubitza
08:19 PM Revision 3425: sql.py: function_exists(): Fixed bug where select() needed to be run with auto-rollback in case it raised an exception: Aaron Marcuse-Kubitza
08:08 PM Revision 3424: xml_func.py: process(): Changed rel_funcs param to a callback is_rel_func, so that caller can specify any dynamic function to determine if a name is a relational function rather than having to list out all known relational functions: Aaron Marcuse-Kubitza
07:54 PM Revision 3423: sql.py: function_exists(): Use simpler cast to regproc instead of query of information_schema.routines to determine if function exists. When the schema is not specified, this also limits the schemas checked to the search_path instead of the whole DB.: Aaron Marcuse-Kubitza
07:51 PM Revision 3422: schemas/functions.sql, py_functions.sql: Renamed trigger functions to avoid collisions with plain SQL functions of the same name but different signatures, so that the plain SQL functions can be uniquely identified by their name without also requiring their signature: Aaron Marcuse-Kubitza
07:39 PM Revision 3421: sql.py: mk_select(): In queries without a FROM clause, don't order by pkey: Aaron Marcuse-Kubitza
07:15 PM Revision 3420: sql.py: mk_select(): Support queries without a FROM clause: Aaron Marcuse-Kubitza
07:03 PM Revision 3419: sql.py: Added DoesNotExistException and parse it in run_query(): Aaron Marcuse-Kubitza
06:46 PM Revision 3418: sql_io.py: put_table(): Removed no longer used conds var (invalid rows are removed from the in_table using sql.delete() instead of being filtered out in the main select): Aaron Marcuse-Kubitza
06:43 PM Revision 3417: sql_io.py: put_table(): Removed no longer used distinct_on var (sql.distinct_table() handles filtering the join_cols): Aaron Marcuse-Kubitza
06:24 PM Revision 3416: schemas/py_functions.sql: _date(): Just run str() on the returned datetime because it will usually be converted to a PostgreSQL timestamp anyway, so excluding the time from the string isn't necessary: Aaron Marcuse-Kubitza
06:15 PM Revision 3415: schemas/py_functions.sql: Added _date(): Aaron Marcuse-Kubitza
06:14 PM Revision 3414: sql.py: run_query(): Exception parsing: Remove PL/Python prefix from exception message so that the regexps can match at the beginning of the message: Aaron Marcuse-Kubitza
05:50 PM Revision 3413: sql_io.py: put_table(): Handle sql.InvalidValueExceptions by filtering the value out of all input columns. This will be useful for SQL functions that raise exceptions.: Aaron Marcuse-Kubitza
04:49 PM Revision 3412: schemas/vegbien.sql: namedplace, plantname: *_unique UNIQUE INDEX: Reordered columns to put rank after parent_id and plantname so that these columns, which are usually input table columns, can be used in a merge join index scan, while rank, which is usually a literal value, can applied as an index filter condition after the merge join: Aaron Marcuse-Kubitza
04:42 PM Revision 3411: sql.py: distinct_table(): Removed literal values from UNIQUE INDEXes because the query planner did not seem to use them to do a merge join: Aaron Marcuse-Kubitza
04:01 PM Revision 3410: README.TXT: Maintenance: Full DB: Documented how to test full DB backup: Aaron Marcuse-Kubitza
03:47 PM Revision 3409: backups/Makefile: Added %.backup/test: Aaron Marcuse-Kubitza

07/16/2012

08:32 PM Revision 3408: README.TXT: Documented maintenance of full DB (back up/restore): Aaron Marcuse-Kubitza
08:23 PM Revision 3407: backups/Makefile: Full DB backups: Added vegbien.backup: Aaron Marcuse-Kubitza
08:22 PM Revision 3406: pg_dump_vegbien: If first arg is "all", dump entire DB. Require a first arg so that Usage message will be displayed if run with no args.: Aaron Marcuse-Kubitza
08:03 PM Revision 3405: Always output Usage messages to stderr and word-wrap them using `fold -s`: Aaron Marcuse-Kubitza
07:37 PM Revision 3404: backups/Makefile: Factored backup command into $(backup) for later use by full DB backups. Made Backups, Archived imports sections subsections of Archived imports so Full DB backups can have its own section.: Aaron Marcuse-Kubitza
07:16 PM Revision 3403: backups/Makefile: Fixed bug where $(SHELL) needed to be set to bash so that $'\n' would be interpreted correctly: Aaron Marcuse-Kubitza
07:06 PM Revision 3402: backups/Makefile: Fixed bug where *.sql files needed to be restored using psql because pg_restore only supports "non-plain-text formats": Aaron Marcuse-Kubitza
06:35 PM Revision 3401: pg_dump_vegbien: For consistency with setting the --schema option, use `set -- "$@" args...` to append options to $@ which are then passed to pg_dump, instead of specifying several variables which are then included in the pg_dump command: Aaron Marcuse-Kubitza
06:26 PM Revision 3400: pg_dump_vegbien: Pass command line options directly to pg_dump after parsing out any schema name: Aaron Marcuse-Kubitza
06:19 PM Revision 3399: backups/Makefile: Don't log stderr or run the command verbosely and instead just output the command and run time to the terminal. This matches what we do for pg_dump, which works better because it just prints the useful information when it's done running.: Aaron Marcuse-Kubitza
05:28 PM Revision 3398: backups/Makefile: Remove log files after successful restore/extraction because they are only useful for tail -f when the restore operation is running in the background: Aaron Marcuse-Kubitza
05:14 PM Revision 3397: pg_dump_vegbien: Save owners when saving data (for full export): Aaron Marcuse-Kubitza
05:03 PM Revision 3396: pg_dump_vegbien: Use SET SESSION AUTHORIZATION to ensure that owners are always recorded in the same format. This will help make plain text backups comparable using diff.: Aaron Marcuse-Kubitza
04:39 PM Revision 3395: backups/Makefile: Backups: Fixed bug where `%.sql: %` needed to come before %.sql with no prerequisites to be matched first: Aaron Marcuse-Kubitza
04:14 PM Revision 3394: Moved archived imports and make targets to maintain them to new backups dir: Aaron Marcuse-Kubitza
04:08 PM Revision 3393: Moved archived imports and make targets to maintain them to new backups dir: Aaron Marcuse-Kubitza
03:29 PM Revision 3392: Added psql_script_vegbien: Aaron Marcuse-Kubitza
12:54 PM Revision 3391: root Makefile: VegBIEN DB: Schemas: Added schemas/%.sql to extract a compressed custom-format backup to plain SQL: Aaron Marcuse-Kubitza
12:33 PM Revision 3390: root Makefile: VegBIEN DB: Schemas: Added schemas/%.backup/uninstall so that a schema can be removed by its backup file name (with extension) as well as its name: Aaron Marcuse-Kubitza

07/13/2012

04:43 PM Revision 3389: README.TXT: Documented that files and directories can easily be remade with `make <file>-remake` and `make <dir>/remake`, which do the make clean or rm automatically: Aaron Marcuse-Kubitza
04:37 PM Revision 3388: pg_dump_vegbien: Added plain env var which creates plain text output instead, for use in debugging and validation: Aaron Marcuse-Kubitza
04:14 PM Revision 3387: input.Makefile: Mapping: $(map): Fixed bug where use_staged was causing in_schema, in_database, etc. to be set even when not outputting to vegbien, causing bin/map to try to output to a DB when it should output to stdout, by moving the use_staged test to $(map2db): Aaron Marcuse-Kubitza
03:07 PM Revision 3386: sql_io.py: put_table(): Creating full_in_table: If out_table is a SQL function, just use in_table because in_table is not modified: Aaron Marcuse-Kubitza
02:47 PM Revision 3385: input.Makefile: Documentation: import/steps.by_col.sql: $(steps): Fixed bug where stderr needed to be redirected to stdout so that it would be passed to debug2redmine and included in the file: Aaron Marcuse-Kubitza
02:26 PM Revision 3384: root Makefile: Datasources: import: Don't pause after "To import all inputs at once" message because this prevents running sequential import with & (as `make import &`): Aaron Marcuse-Kubitza
02:09 PM Revision 3383: sql.py: create_table(): Only version temp tables, so that e.g. staging tables are not created with a version # if they already exist. This will help make `make install` idempotent.: Aaron Marcuse-Kubitza
02:00 PM Revision 3382: input.Makefile: Import to VegBIEN: $(import): Only redirect stderr to stdout if outputting to a log file. This allows unlogged imports to redirect the "Inserted # new rows into database" message, which is output to stdout rather than stderr for use by validation and automated testing, to a separate location.: Aaron Marcuse-Kubitza
01:54 PM Revision 3381: README.TXT: Documented how to remove an archived import. Moved archived imports commands from Data import to new section Maintenance of archived imports. Changed "Empty the DB" to "Remove the last import" for clarity, and made the command apply just to the public schema and not also to the functions schemas (reinstalling those is documented under Schema changes).: Aaron Marcuse-Kubitza
01:44 PM Revision 3380: root Makefile: $(rmSchemaCmd) for uninstalling schemas: Escape schema name with "" so that schemas/%/uninstall can be used to remove archived imports (as well as aborted restores of them): Aaron Marcuse-Kubitza
01:14 PM Revision 3379: root Makefile: schemas/%.backup: Removed explicit check if file exists because now that the target name matches the name of the file created, the file will be rebuilt only if it doesn't exist: Aaron Marcuse-Kubitza
01:13 PM Revision 3378: schemas: Ignore log files (from restore operations) for svn: Aaron Marcuse-Kubitza
12:49 PM Revision 3377: main Makefile: Installation: install: Download input data using new inputs/download target: Aaron Marcuse-Kubitza
12:48 PM Revision 3376: main Makefile: VegBIEN DB: Schemas: schemas/%/install, schemas/py_functions/install: Ignore errors if schema exists, so that install can be idempotent: Aaron Marcuse-Kubitza
12:38 PM Revision 3375: inputs/Makefile: Added download to download input data from vegbiendev: Aaron Marcuse-Kubitza
10:05 AM Revision 3374: main Makefile: VegBIEN DB: Removing database warning message: Changed "public schema versions" to "archived imports" for clarity: Aaron Marcuse-Kubitza
10:01 AM Revision 3373: main Makefile: VegBIEN DB: Schemas: Added schemas/%.backup/restore and documented it in README.TXT: Aaron Marcuse-Kubitza
09:42 AM Revision 3372: inputs/import.stats.xls: Updated with stats from latest import. Marked partial imports affecting the slowest datasource with a comment when they might otherwise appear to be comparable with full imports due to having a similar # of rows.: Aaron Marcuse-Kubitza
08:58 AM Revision 3371: root Makefile: schemas/%.backup: Removed _always prerequisite because an existing backup should not be overwritten by running `make schemas/%.backup`: Aaron Marcuse-Kubitza
08:53 AM Revision 3370: README.TXT: Updated warning messages to match what's actually output by the commands in question: Aaron Marcuse-Kubitza
08:47 AM Revision 3369: README.TXT: Documented how to back up an archived import: Aaron Marcuse-Kubitza

07/12/2012

03:26 PM Revision 3368: root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Time the backup: Aaron Marcuse-Kubitza
03:25 PM Revision 3367: root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Renamed to the name of the file created (schemas/%.backup) so that the file will automatically be removed by make if you cancel the backup. If this didn't happen, the existing partial file would prevent make from backing it up again.: Aaron Marcuse-Kubitza
03:19 PM Revision 3366: schemas: Ignore *.backup files for svn: Aaron Marcuse-Kubitza
03:00 PM Revision 3365: root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Documented that it can't be used for the current (unrotated) public schema because pg_dump doesn't back up the CREATE SCHEMA statement for it: Aaron Marcuse-Kubitza
02:45 PM Revision 3364: root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Changed file extension to .backup to match what pgAdmin3 expects: Aaron Marcuse-Kubitza
02:40 PM Revision 3363: inputs/SALVIAS/maps/VegX.stems.csv: Fixed PlotObsID mapping to point to sourceAccessionCode so that it would map to the same place as organisms' PlotObsID and the entries would be merged. This will hopefully also fix a SALVIAS stems import bug where the specimenreplicate join on existing/inserted rows matched multiple rows, in spite of using the specimenreplicate_unique_catalognumber UNIQUE INDEX.: Aaron Marcuse-Kubitza
02:35 PM Revision 3362: root Makefile: VegBIEN DB: Schemas: Added schemas/%/backup for backing up an import's schema: Aaron Marcuse-Kubitza
02:34 PM Revision 3361: pg_dump_vegbien: Added data option for backing up an import's schema: Aaron Marcuse-Kubitza
01:32 PM Revision 3360: sql_io.py: cast(): Don't output warnings for errors that are stored in the errors table, because this clutters the log output unnecessarily. Warnings are still output when there is no errors table.: Aaron Marcuse-Kubitza
01:20 PM Revision 3359: schemas/vegbien.sql: party: party_unique UNIQUE INDEX: Moved surname before givenname, middlename to enable the index to be used when just the person's last name is known. (Note that if the organization is not known, and explicit NULL would need to be put in for that field to use the index.): Aaron Marcuse-Kubitza
01:14 PM Revision 3358: sql_io.py: put_table(): DuplicateKeyException: Uniquifying input table to avoid internal duplicate keys: Include all join columns (including literal values) in the unique index, not just table columns, to encourage PostgreSQL to use the index in a merge join.: Aaron Marcuse-Kubitza
01:11 PM Revision 3357: sql.py: distinct_table(): Support literal values as column expressions. Filter out untyped NULLs because the type unknown can't be indexed.: Aaron Marcuse-Kubitza
01:09 PM Revision 3356: sql.py: add_index(): Support literal values as expressions: Aaron Marcuse-Kubitza
01:06 PM Revision 3355: sql_gen.py: ensure_not_null(): Support Literal value inputs, which are passed through if not nullable: Aaron Marcuse-Kubitza
01:05 PM Revision 3354: sql_gen.py: Added cast_literal(): Aaron Marcuse-Kubitza
11:52 AM Revision 3353: sql.py: Added simplify_expr() and use it in parse_expr_col(). This will remove extraneous information from the condition that gets included in the errors table, so it's easy for the data provider to see what constraint is being violated as it applies to their data.: Aaron Marcuse-Kubitza
11:40 AM Revision 3352: sql_io.py: put_table(): Handle CheckExceptions. This fixes a REMIB bug where some rows violated the specimenreplicate.specimenreplicate_required_key CHECK constraint, but there was previously no way to filter out just these rows so all rows were removed instead.: Aaron Marcuse-Kubitza
11:33 AM Revision 3351: sql.py: Added map_expr(): Aaron Marcuse-Kubitza
11:33 AM Revision 3350: sql_gen.py: Added NotCond: Aaron Marcuse-Kubitza
10:06 AM Revision 3349: sql.py: run_query(): ConstraintException: Get and include the constraint condition in the exception: Aaron Marcuse-Kubitza
10:05 AM Revision 3348: sql.py: Added constraint_cond(): Aaron Marcuse-Kubitza
09:26 AM Revision 3347: sql.py: run_query(): ConstraintException: Include table in the constraint name because constraints are scoped by table: Aaron Marcuse-Kubitza
09:18 AM Revision 3346: sql.py: Added ConstraintException and parse it in run_query(): Aaron Marcuse-Kubitza
09:12 AM Revision 3345: sql.py: ConstraintException: Also store the constraint's condition, if any: Aaron Marcuse-Kubitza
08:24 AM Revision 3344: schemas/vegbien.sql: specimenreplicate: specimenreplicate_unique_catalognumber UNIQUE INDEX: Limited to rows where sourceaccessioncode IS NULL, so that it will only be used as an alternate if the datasource does not specify a sourceaccessioncode. This should fix the SpeciesLink bug where input rows with catalognumber_dwc IS NULL match multiple output rows when compared on the specimenreplicate_unique_catalognumber UNIQUE INDEX because they were excluded from the uniquifying by the index filter.: Aaron Marcuse-Kubitza
08:10 AM Revision 3343: schemas/vegbien.ERD.mwb: Synced with schema: Aaron Marcuse-Kubitza
08:06 AM Revision 3342: schemas/vegbien.sql: specimenreplicate: Added UNIQUE INDEX on plantobservation_id for direct vouchers in plots data: Aaron Marcuse-Kubitza
07:16 AM Revision 3341: inputs/import.stats.xls: Updated with stats from latest import: Aaron Marcuse-Kubitza

07/11/2012

10:36 PM Revision 3340: sql.py: Removed no longer used constraint_cols(). index_cols() can always be used for a UNIQUE constraint because a UNIQUE index is automatically created for it.: Aaron Marcuse-Kubitza
10:34 PM Revision 3339: bin/map: Fixed bug where errors table indexes could not be looked up using index_cols() because their schema was not in the search_path, by explicitly adding the in_schema at the end of the search_path. This is the only reason the in_schema needs to be in the search_path, but it's unavoidable because the "duplicate key value violates unique constraint" error does not included the constraint's schema.: Aaron Marcuse-Kubitza
10:24 PM Revision 3338: sql.py: run_query(): DuplicateKeyException: Don't parse table name out of constraint name because it isn't used and can't reliably be determined for table names containing "_": Aaron Marcuse-Kubitza
09:10 PM Revision 3337: mappings/VegX-VegBIEN.stems.csv: Reattached location.datasource_id :[] mappings directly to the location itself, rather than to the location via the locationevent: Aaron Marcuse-Kubitza
08:37 PM Revision 3336: mappings/VegX-VegBIEN.stems.csv: Mapped to taxonoccurrence.sourceaccessioncode to avoid underconstraining taxonoccurrence: Aaron Marcuse-Kubitza
08:02 PM Revision 3335: xml_func.py: process(): Merge mergeable siblings recursively so that newly-mergeable children are also merged: Aaron Marcuse-Kubitza
07:44 PM Revision 3334: xml_dom.py: Removed no longer used merge_same_name(): Aaron Marcuse-Kubitza
07:43 PM Revision 3333: xml_func.py: process(): Recombining pieces of nodes that were split apart in the mappings: Only combine nodes that are explicitly marked as mergeable, to avoid unwanted merges. Refactored to use new xml_dom.merge_by_name().: Aaron Marcuse-Kubitza
07:39 PM Revision 3332: mappings/VegX-VegBIEN.stems.csv: Marked aggregateoccurrence as mergeable, in preparation for switching to explicit merging only to avoid unwanted merges: Aaron Marcuse-Kubitza
07:33 PM Revision 3331: xml_dom.py: Moved merge_adjacent() recursion and name checking into merge() so that other callers of merge() can take advantage of it, too. Added merge_by_name() and use it in merge_same_name(). Removed now-unused merge_adjacent().: Aaron Marcuse-Kubitza
07:30 PM Revision 3330: xpath.py: get(): Support checking all children when the elem name is '*': Aaron Marcuse-Kubitza
07:01 PM Revision 3329: xml_dom.py: replace(): Assert that old node not removed from parent tree: Aaron Marcuse-Kubitza
06:44 PM Revision 3328: xpath.py: Added is_all(): Aaron Marcuse-Kubitza
06:17 PM Revision 3327: xml_func.py: process(): List out the xml_dom.NodeElemIter iterator so that it won't be affected by concurrent changes to the DOM tree: Aaron Marcuse-Kubitza
05:32 PM Revision 3326: xml_func.py: process(): Merging children: Support merging non-adjacent children by using new xml_dom.merge_same_name(). This should eliminate the need to sort children of the same name next to each other in the mappings so that they will be merged.: Aaron Marcuse-Kubitza
05:30 PM Revision 3325: xml_dom.py: Added merge_same_name(): Aaron Marcuse-Kubitza
05:15 PM Revision 3324: bin/map: ex_tracker: Don't add row_ct to iters count in column-based import (by_col) because errors are not done by row, so a % of rows affected is not meaningful: Aaron Marcuse-Kubitza
05:13 PM Revision 3323: sql_io.py: put_table(): NullValueException: Missing mapping for NOT NULL column: Raise error if there is no default input column to fill in the rows with. Missing mappings for required columns are a normal circumstance with path mappings (plantname, namedplace) because the mappings include intervening levels that may not be specified by the datasource. These mappings have default input columns for the intervening levels, so for them, this will continue to not be flagged as an error.: Aaron Marcuse-Kubitza
04:55 PM Revision 3322: sql.py: index_cols(): Fixed bug where index columns were sorted by their position in the table, not their position in the index. This was causing the order of columns in covering indexes not to match their order in the corresponding output table index, preventing merge joins and slowing down joins because of the need to do a seq scan. Refactored to use pg_get_indexdef() on each column position index and parsing the result with parse_expr_col(), which *greatly* simplifies the query and automatically provides the correct sort order.: Aaron Marcuse-Kubitza
04:50 PM Revision 3321: sql.py: Added parse_expr_col(): Aaron Marcuse-Kubitza
04:50 PM Revision 3320: sql_gen.py: Added unesc_name(): Aaron Marcuse-Kubitza
04:02 PM Revision 3319: sql.py: index_cols(): Removed unneeded table param, since the index name is unique within its schema. (The schema is looked up in the search_path if needed, by using ::regclass.): Aaron Marcuse-Kubitza
03:59 PM Revision 3318: sql.py: index_cols(): Fixed bug where index was looked up only by name, not by schema-qualified name, causing columns from other schemas (such as previous versions of public) to be included in the index columns list. This problem is only noticeable when a column in an index is deleted, as locationevent.datasource_id in locationevent_unique_accessioncode was recently.: Aaron Marcuse-Kubitza

07/10/2012

09:53 PM Revision 3317: input.Makefile: Name logs with extension .log.sql so they are syntax-highlighted for the SQL statements they contain: Aaron Marcuse-Kubitza
09:24 PM Revision 3316: schemas/vegbien.sql: taxonoccurrence: Added sourceaccessioncode so that taxonoccurrences can be uniquely identified in plots data: Aaron Marcuse-Kubitza
09:12 PM Revision 3315: schemas/vegbien.sql: locationevent: Removed datasource_id because locationevents are now scoped by their required location, which itself is scoped by datasource: Aaron Marcuse-Kubitza
09:07 PM Revision 3314: mappings/VegX-VegBIEN.stems.csv: Removed locationevent.datasource_id mappings because locationevents are now scoped by their required location, which itself is scoped by datasource: Aaron Marcuse-Kubitza
08:42 PM Revision 3313: sql.py: distinct_table(): Don't sort the inserted rows by pkey because they should stay in the table order that they were in. (The select order with no ORDER BY should be the table order. Even if it isn't, it doesn't matter what order they are in for our current application.): Aaron Marcuse-Kubitza
08:38 PM Revision 3312: sql_io.py: put_table(): Creating an empty pkeys table: Don't sort the inserted result by pkey because it's empty (limit=0): Aaron Marcuse-Kubitza
08:32 PM Revision 3311: sql_io.py: put_table(): ignore(): Fixed bug where in_col's table needed to be changed to insert_in_table, because it's *insert_in_table's* rows that are being modified but mapping (which in_col comes from) qualifies columns by *in_table*: Aaron Marcuse-Kubitza
08:28 PM Revision 3310: sql_io.py: put_table(): ignore(): Also add an index on in_col if mapping the value to NULL: Aaron Marcuse-Kubitza
08:28 PM Revision 3309: sql_io.py: put_table(): ignore(): Only delete from the insert_in_table, because the invalid rows only need to be removed from the rows that are actually being inserted into the DB. If there are invalid rows in the full (not uniquified) in_table, that's OK, as they can still get a valid output pkey if the first copy of a row they were considered a duplicate of *is* valid (this is a very unusual situation, so this change should not affect most real data).: Aaron Marcuse-Kubitza
08:22 PM Revision 3308: sql_io.py: put_table(): ignore(): Merged filter_ var into sql.delete() call because that's the only place it's used: Aaron Marcuse-Kubitza
08:18 PM Revision 3307: sql_io.py: put_table(): insert_into_pkeys(): Removed no longer used distinct param: Aaron Marcuse-Kubitza
08:16 PM Revision 3306: sql_io.py: put_table(): Getting output table pkeys of existing/inserted rows: Don't DISTINCT ON the joined rows by input pkey, because this adds sorting overhead. This should not be needed because there generally should not be any duplicate rows for the columns in a unique index (if there are, this is an index configuration problem and should be fixed in the schema). It's possible that partial indexes (with a filter condition) were causing this, but testing without it in place will be needed to determine the cause.: Aaron Marcuse-Kubitza
08:02 PM Revision 3305: sql.py: flatten(): Auto-add a pkey on the created temp table. This should be standard practice for most temp tables, and for sql_io.put_table() especially this will be useful if we ever want to add back sorting the in_table by row_num (possibly by CLUSTERing on the pkey to avoid pkey index scans).: Aaron Marcuse-Kubitza
07:54 PM Revision 3304: sql.run_query_into() calls: Use new add_pkey_ param instead of manually calling sql.add_pkey(): Aaron Marcuse-Kubitza
07:53 PM Revision 3303: sql.py: run_query_into(): Changed add_indexes_ param to add_pkey_ and add just a pkey if it's set. It's no longer necessary to create indexes on every column of a temp table, because the covering indexes for the join columns have been fixed to have columns in the same order as the output table's corresponding index so that they can be used for a merge join.: Aaron Marcuse-Kubitza
07:41 PM Revision 3302: sql_io.py: put_table(): Add pkey on pkeys table right when it's created, so that any duplicates are detected right away instead of at the end of the iteration. (Duplicates are created as a result of joins matching multiple rows, which often indicates a database misconfiguration.): Aaron Marcuse-Kubitza
07:34 PM Revision 3301: sql_io.py: put_table(): Adding pkey on pkeys table: Removed log message because adding an index is considered a low-level operation, which isn't included in the Redmine SQL: Aaron Marcuse-Kubitza
07:27 PM Revision 3300: schemas/tree_cross-links.sql: Ancestors table: Synced with current definition, which removes unneeded fki_* indexes. Note that the index on ancestor_id might be needed in the future if we ever want to get all the descendants of a plantname/namedplace or perform deletions on plantname/namedplace (which cascade to *_ancestor). For getting all the plantnames/namedplaces (of any rank) for a plantconcept/locationdetermination, though, the *_ancestor_pkey index is sufficient because plantname_id/namedplace_id is the first column in it.: Aaron Marcuse-Kubitza
07:20 PM Revision 3299: schemas/vegbien.sql: {plantname,namedplace}_update_ancestors(): Fixed slowdown due to removed index on {plantname,namedplace}.parent_id by adding COALESCE() to enable using the plantname_unique index for the lookup instead: Aaron Marcuse-Kubitza
06:26 PM Revision 3298: sql.mk_select() calls: Removed no longer needed start=0 to turn off missing WHERE, LIMIT, or OFFSET clause warning: Aaron Marcuse-Kubitza
06:21 PM Revision 3297: sql.py: mk_select(): Don't output warning if there is no WHERE, LIMIT, or OFFSET clause, because column-based import has many queries where this is the case and it's annoying to need to specify start=0 to turn off this warning: Aaron Marcuse-Kubitza
06:04 PM Revision 3296: sql.py: flatten(): Don't sort the input tables by the pkey because it doesn't matter what order the datasource's rows are inserted in. Note that PostgreSQL doesn't guarantee the order of rows in a table, so it is possible that the rows were being inserted in an unknown order before this change, as well.: Aaron Marcuse-Kubitza
05:58 PM Revision 3295: sql.py: delete(): Cache deletes by default: Aaron Marcuse-Kubitza
05:56 PM Revision 3294: sql_io.py: put_table(): Merged remove_rows() and invalid2null() into one ignore() function that chooses the action (map to NULL or delete) depending on the value and whether NULLs have been filtered out of the column: Aaron Marcuse-Kubitza
05:46 PM Revision 3293: sql_io.py: put_table(): remove_rows(): Delete the rows containing the invalid value instead of filtering them out of each select, so that the filtering can be profiled separately from the insertion. This also requires deleting rows with invalid non-NULL values instead of mapping them to NULL if NULLs have already been filtered out of the column in question.: Aaron Marcuse-Kubitza
05:34 PM Revision 3292: sql_io.py: put_table(): Main insert: Don't run it inside an extra savepoint, because this will cause the creation of any helper SQL functions to be rolled back if an exception is thrown. If those functions are later re-used, the cache will think they exist when they no longer do. (Calling a function on input rows is now run in recover mode, so that it doesn't need the outer savepoint anymore.): Aaron Marcuse-Kubitza
05:30 PM Revision 3291: sql_io.py: put_table(): main_insert(): Moved code that is not part of the main query outside the function, so it wouldn't be subject to the exception handling. Preparing to insert new rows: Only do the preparation code for insert_select() if the out_table is not a function.: Aaron Marcuse-Kubitza
05:19 PM Revision 3290: sql_io.py: put_table(): main_insert(): is_function: Run insert_into_pkeys() with recover=True so that errors in the function are properly rolled back: Aaron Marcuse-Kubitza
05:18 PM Revision 3289: sql_io.py: put_table(): insert_into_pkeys(): Support custom query kw_args, such as recover: Aaron Marcuse-Kubitza
04:41 PM Revision 3288: sql_io.py: put_table(): Use full_in_table in the into table row count assertion, since in_table may have rows deleted: Aaron Marcuse-Kubitza
04:36 PM Revision 3287: sql_io.py: put_table(): Save default values for all rows in new temp table full_in_table since in_table may have rows deleted: Aaron Marcuse-Kubitza
04:13 PM Revision 3286: sql.py: Added mk_delete() and delete(): Aaron Marcuse-Kubitza
03:36 PM Revision 3285: sql_io.py: put_table(): mk_main_select(): Turned off unnecessary ORDER BY to avoid sorting the entire table every time it's used. (PostgreSQL has no concept of reordering a table and re-using that ordering, so it just re-sorts the table each time. Index scans on the pkey do not appear to be used in practice, according to EXPLAIN results from live imports.) Document that we instead assume that identical SELECT queries retrieve rows in the same order.: Aaron Marcuse-Kubitza
01:56 PM Revision 3284: schemas/vegbien.sql: taxondetermination: Fixed bug where taxondetermination_taxonoccurrence_id_fkey trigger was applied before the NOT NULL constraint on taxonoccurrence_id was checked, causing the trigger to fail on NULL taxonoccurrence_ids, by making it an AFTER trigger. (An AFTER trigger will still roll back the entire insert if it fails, even though it runs after the insert itself.): Aaron Marcuse-Kubitza

07/09/2012

05:45 PM Revision 3283: schemas/vegbien.sql: specimenreplicate: institution_id: Fixed typo in comment: Aaron Marcuse-Kubitza
05:26 PM Revision 3282: inputs/import.stats.xls: Fixed date for most recent import: Aaron Marcuse-Kubitza
05:26 PM Revision 3281: sql.py: DbConn.run_query(): Put the data source comment on a separate line in the log file instead of using a carriage return, which sometimes had the desired effect of overwriting the src comment with the first line of the query but sometimes the line lengths weren't right and there wasn't enough overlap: Aaron Marcuse-Kubitza
04:53 PM Revision 3280: schemas/vegbien.ERD.mwb: Synced with schema: Aaron Marcuse-Kubitza
04:42 PM Revision 3279: schemas/vegbien.sql: Removed per-column indexes, which are no longer needed by either row-based or column-based import because they are able to do a merge join or lookup using the table's UNIQUE INDEX. Instead of forcing the database to build and maintain large indexes (15+ GB!) that are not used, optimization-only (non-UNIQUE) indexes should be added as needed only once the database is actually used for queries. In most cases it will not even be necessary to add additional indexes then, because most UNIQUE indexes can be reused for broad lookups (rather than just duplicate elimination). Even the foreign key covering indexes (fki_*) are not needed because we virtually never delete rows in the DB, and even if we were to start doing that regularly, the cost of maintaining the indexes on import is most likely not worth the speed improvements for cascading deletes.: Aaron Marcuse-Kubitza
04:32 PM Revision 3278: schemas/py_functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.): Aaron Marcuse-Kubitza
04:31 PM Revision 3277: schemas/functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.): Aaron Marcuse-Kubitza
04:26 PM Revision 3276: schemas/vegbien.sql: plantname: plantname_unique UNIQUE INDEX: Moved scope_id to the back so that the index can easily be used for lookup queries (not just column-based import) without having to explicitly specify NULL for that field. This takes advantage of a btree sorting feature where a broader lookup can be done using just the first n columns of the index.: Aaron Marcuse-Kubitza
04:15 PM Revision 3275: schemas/vegbien.sql: locationevent, specimenreplicate: Turned UNIQUE CONSTRAINTs and UNIQUE INDEXes with nullable fields into partial UNIQUE INDEXes with IS NOT NULL filter conditions, in order to work automatically with sql_gen without requiring a separate covering lookup index. Removed no longer needed covering lookup indexes.: Aaron Marcuse-Kubitza
03:07 PM Revision 3274: sql_io.py: put_table(): DuplicateKeyException: Fixed bug where combining multiple unique constraints was incorrectly allowed, when in fact the constraints need to be separately applied to the different rows that violate them, which is not currently supported: Aaron Marcuse-Kubitza
03:02 PM Revision 3273: sql.py: DbConn.with_savepoint(): Log transaction profiling info with level=4 like the rest of the transaction commands, so that it isn't output when the transaction itself should be hidden (e.g. for name versioning or internal commands): Aaron Marcuse-Kubitza
02:16 PM Revision 3272: sql.py: DbConn.with_savepoint(): Profile (nested) transactions so that the run time for groups of commands (e.g. csv2db INSERTs) is known: Aaron Marcuse-Kubitza
02:04 PM Revision 3271: csv2db: verbosity defaults to 3 so that detailed queries with profiling stats are included in the log file, to assist in optimization: Aaron Marcuse-Kubitza
02:01 PM Revision 3270: csv2db: Don't cache per-row INSERT queries because this bloats the cache (there aren't repeated identical INSERTs that shouldn't be re-run like in row-based import): Aaron Marcuse-Kubitza
01:57 PM Revision 3269: sql.py with_explain_comment(), DbConn: Fixed bug where with_explain_comment() was being run in per-row imports (row-based import and csv2db with INSERT), causing the overhead of an EXPLAIN query for every single INSERT and filling up the cache with EXPLAIN query results, by adding autoexplain mode, only running with_explain_comment() in autoexplain mode, and only enabling autoexplain mode for column-based import: Aaron Marcuse-Kubitza
01:11 PM Revision 3268: db_xml.py: put_table(): Turn on autoanalyze mode to help the query planner avoid sequential scans on tables that now contain data. (Don't do this in row-based import because it creates too much overhead per insert.): Aaron Marcuse-Kubitza
12:24 PM Revision 3267: sql.py: Run all EXPLAIN queries with log_level=4 since the EXPLAIN information is now usually generated when the query is generated rather than when it's run, so the log_level is not known: Aaron Marcuse-Kubitza
12:21 PM Revision 3266: sql.py: Added with_explain_comment() to query generating functions so that nested queries will also have EXPLAIN information: Aaron Marcuse-Kubitza
12:11 PM Revision 3265: sql.py: Added with_explain_comment() and use it in run_query(): Aaron Marcuse-Kubitza
12:01 PM Revision 3264: sql.py: run_query(): EXPLAIN output: Run explain() with log_level 1 higher than the query's log_level, so that low-level queries' EXPLAIN queries are not output when the queries themselves are not output. This also ensures that only level 2 (major) queries have the EXPLAIN logged (to introduce the query that is being run), to avoid cluttering the log output.: Aaron Marcuse-Kubitza
11:54 AM Revision 3263: sql.py: explain(): Support custom log_level: Aaron Marcuse-Kubitza
11:48 AM Revision 3262: schemas/vegbien.sql: taxondetermination: taxondetermination_taxonoccurrence_id_fkey manual fkey constraint: Fixed bug where needed to raise foreign_key_violation instead of unique_violation: Aaron Marcuse-Kubitza
11:23 AM Revision 3261: inputs/import.stats.xls: Updated with stats from latest import: Aaron Marcuse-Kubitza

07/06/2012

04:43 PM Revision 3260: debug2redmine.csv: Remove newline before EXPLAIN comment: Aaron Marcuse-Kubitza
04:33 PM Revision 3259: debug2redmine.csv: Filter out EXPLAIN comments: Aaron Marcuse-Kubitza
04:29 PM Revision 3258: sql.py: run_query(): EXPLAIN all explainable queries before they are run, to provide query plans for later profiling and index analysis. At verbosity 3+, this also effectively allows the user to see what query is being run before it's executed.: Aaron Marcuse-Kubitza
04:26 PM Revision 3257: sql.py: is_explainable(): Fixed bug where needed r'' syntax to escape \ in \b: Aaron Marcuse-Kubitza
04:23 PM Revision 3256: sql.py: Added explain() and is_explainable(): Aaron Marcuse-Kubitza
04:19 PM Revision 3255: strings.py: Added join_lines(): Aaron Marcuse-Kubitza
02:50 PM Revision 3254: mk_rm_indexes: Also include the search_path in the outputted commands: Aaron Marcuse-Kubitza
02:45 PM Revision 3253: schemas/vegbien.sql: commclass: Fixed bug where commclass_unique needed to be a UNIQUE INDEX: Aaron Marcuse-Kubitza
02:42 PM Revision 3252: schemas/vegbien.sql: plantname: Removed unneeded indexes on plantname and rank (plantname_unique takes care of joins): Aaron Marcuse-Kubitza
02:33 PM Revision 3251: pg_dump_vegbien: Enclose the schema name in "" because pg_dump requires this for schema names with special characters: Aaron Marcuse-Kubitza
02:09 PM Revision 3250: inputs/import.stats.xls: Updated with stats from 2012-7-3 and 2012-7-5 imports. Note that the 2012-7-5 import was partial, so its stats can't be directly compared.: Aaron Marcuse-Kubitza
01:28 PM Revision 3249: root Makefile: VegBIEN DB: Schemas: Added schemas/%/rm_indexes: Aaron Marcuse-Kubitza
01:27 PM Revision 3248: Added mk_rm_indexes: Aaron Marcuse-Kubitza
11:14 AM Revision 3247: sql.py: Added drop() and use it in drop_table(): Aaron Marcuse-Kubitza
10:59 AM Revision 3246: debug2redmine: Remove profiling information from the logging output: Aaron Marcuse-Kubitza
10:43 AM Revision 3245: sql.py: DbConn.run_query(): Only print notices in debug mode, because they are output with a log level higher than the debug verbosity threshold, and this avoid unnecessary overhead: Aaron Marcuse-Kubitza
10:41 AM Revision 3244: sql.py: DbConn: Added profile_row_ct setting, which is passed to profiler.stop() in run_query(): Aaron Marcuse-Kubitza
10:38 AM Revision 3243: bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead: Aaron Marcuse-Kubitza
10:34 AM Revision 3242: sql.py: DbConn.run_query(): Only profile queries in debug mode, to avoid unnecessary overhead when the run time will not be displayed: Aaron Marcuse-Kubitza
10:29 AM Revision 3241: sql.py: DbConn.run_query(): Profile using the profiling.ItersProfiler class, which pretty-prints the run time: Aaron Marcuse-Kubitza
10:22 AM Revision 3240: sql.py: DbConn.run_query(): Added profiling of query execution, which is logged with the query: Aaron Marcuse-Kubitza
09:26 AM Revision 3239: sql.py: DbConn.run_query(): Move log_msg() to where it's used, so that it runs after the query is run and can refer to profiling variables: Aaron Marcuse-Kubitza
09:21 AM Revision 3238: sql.py: DbConn.run_query(): Use else blocks to avoid applying exception handling to commands run after the main command: Aaron Marcuse-Kubitza
09:18 AM Revision 3237: sql.py: DbConn.run_query(): Always output or return the log message after the query is run, so that it can be output with profiling statistics in the log message header: Aaron Marcuse-Kubitza
09:05 AM Revision 3236: sql.py: run_query(): Always output the log message after the query is run, so that it can be output with profiling statistics in the log message header: Aaron Marcuse-Kubitza

07/05/2012

03:16 PM Revision 3235: Regenerated vegbien.ERD exports: Aaron Marcuse-Kubitza
03:13 PM Revision 3234: schemas/vegbien.sql: locationevent: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import. Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.: Aaron Marcuse-Kubitza
03:00 PM Revision 3233: schemas/vegbien.sql: specimenreplicate: Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.: Aaron Marcuse-Kubitza
02:57 PM Revision 3232: schemas/vegbien.sql: specimenreplicate: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import: Aaron Marcuse-Kubitza
02:48 PM Revision 3231: schemas/vegbien.sql: specimenreplicate: Added CHECK constraint which ensures that there is at least one key to sufficiently uniquely identify the specimenreplicate: Aaron Marcuse-Kubitza
02:44 PM Revision 3230: inputs/CTFS/maps/VegX.organisms.csv: Mapped VegX sourceAccessionCode = VegBIEN plantobservation,specimenreplicate.sourceaccessioncode so that specimenreplicate would have a required key: Aaron Marcuse-Kubitza
02:38 PM Revision 3229: mappings/VegX-VegBIEN.stems.csv: Sort the plantobservation.sourceaccessioncode/specimenreplicate.sourceaccessioncode mapping with the other _ifs so the adjacent node merging works properly and it gets created before _ignore removes voucherType: Aaron Marcuse-Kubitza
02:34 PM Revision 3228: mappings/VegX-VegBIEN.stems.csv: Also map plantobservation.sourceaccessioncode to specimenreplicate.sourceaccessioncode so specimenreplicate always has a key and will never be underconstrained: Aaron Marcuse-Kubitza
02:12 PM Revision 3227: xml_func.py: process(): Fixed bug where an evaluated XML function might create a node of the same name as an existing node, but these nodes would not be merged even though they referred to the same object, by merging siblings of a newly-evaluated (replaced) node if they have the same name: Aaron Marcuse-Kubitza
02:09 PM Revision 3226: xml_dom.py: Added merge() and merge_adjacent(): Aaron Marcuse-Kubitza
02:08 PM Revision 3225: xml_dom.py: replace_with_text(): Return the new node: Aaron Marcuse-Kubitza
12:33 PM Revision 3224: mappings/VegX-VegBIEN.stems.csv: Indirect voucher mappings: Removed no longer needed ":[*_id/taxonoccurrence]" because a specimenreplicate *is* a taxonoccurrence, so it doesn't need to *have* an empty taxonoccurrence: Aaron Marcuse-Kubitza
12:27 PM Revision 3223: mappings/VegX-VegBIEN.stems.csv: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead using the new fkey to plantobservation for direct vouchers. Note that a duplicate aggregateoccurrence is created, because the _if XML function runs after the XPaths have created the initial tree, and thus the nodes it pulls forward do not automatically get merged with adjacent nodes of the same name. This will eventually need to be fixed by auto-merging the nodes.: Aaron Marcuse-Kubitza
12:00 PM Revision 3222: schemas/vegbien.sql: specimenreplicate: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead adding an fkey to plantobservation for direct vouchers. Also, it makes more sense for a specimenreplicate to directly voucher the plant it came from rather than that plant's taxonoccurrence, because a direct voucher is a closer relationship to the plant.: Aaron Marcuse-Kubitza
11:22 AM Revision 3221: mappings/VegX-VegBIEN.stems.csv: Map collectiondate to specimenreplicate via voucher when the voucher is indirect, rather than always directly to the taxonoccurrence, because the collectiondate relates to the specimenreplicate, not the taxonoccurrence, and is not necessarily 1:1 with it: Aaron Marcuse-Kubitza
11:17 AM Revision 3220: mappings: Updated for_review VegX-VegBIEN mappings, which hadn't been auto-updated because of a modification time issue. (mappings/VegX-VegBIEN.stems.csv was replaced with an older version, which did not trigger make to remake files depending on it.): Aaron Marcuse-Kubitza
10:28 AM Revision 3219: schemas/vegbien.sql: locationevent: Added sql_gen-compatible indexes on all columns in the locationevent_unique_project_authorcode UNIQUE index: Changed locationevent_project_id index to use COALESCE(). Added index on obsstartdate.: Aaron Marcuse-Kubitza
10:19 AM Revision 3218: schemas/vegbien.sql: locationevent: Removed no longer needed COALESCE() index on location_id now that location_id is NOT NULL: Aaron Marcuse-Kubitza
10:16 AM Revision 3217: schemas/vegbien.sql: locationevent: Fixed bug where locationevent_unique_location index was overconstraining locationevent when a sourceaccessioncode or obsstartdate was specified, by combining the locationevent_unique_location, locationevent_unique_accessioncode, and locationevent_unique_location_date indexes into one COALESCE() index on the combined fields of those indexes: Aaron Marcuse-Kubitza
10:10 AM Revision 3216: schemas/vegbien.sql: locationevent: Made location_id required because every locationevent should have a location, even one with no locationdeterminations. This also avoids the creation of a parent locationevent when subplots are not being used.: Aaron Marcuse-Kubitza
09:48 AM Revision 3215: mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each locationevent will get an empty commclass, whether or not there are any commdeterminations. This can later be used to add new commdeterminations.: Aaron Marcuse-Kubitza
09:45 AM Revision 3214: schemas/vegbien.sql: commclass: Changed commclass_unique to COALESCE() classnotes so that there is only one commclass for a locationevent when the commclasses are not separately named. (Currently classnotes is used as the class name field, commname being the name of the community itself.): Aaron Marcuse-Kubitza
09:33 AM Revision 3213: schemas/vegbien.sql: commdetermination: Made commconcept_id NOT NULL because it doesn't make sense to have a commdetermination on nothing. Note that the commname field in commdetermination is not used for making determinations (and may need to be removed to avoid confusion); commname.commname is used instead.: Aaron Marcuse-Kubitza
09:28 AM Revision 3212: schemas/vegbien.sql: locationevent: Added COALESCE() index on location_id for use by column-based import: Aaron Marcuse-Kubitza
09:24 AM Revision 3211: mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each plantobservation will get an empty stemobservation, whether or not there are any stemtags. This can later be used to add further stemtags.: Aaron Marcuse-Kubitza
08:58 AM Revision 3210: mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now: Aaron Marcuse-Kubitza
08:31 AM Revision 3209: schemas/vegbien.sql: location: Made datasource_id, sourceaccessioncode NOT NULL to ensure that all locations are uniquely identifiable by their datasource's unique key (sourceaccessioncode): Aaron Marcuse-Kubitza
08:28 AM Revision 3208: sql_io.py: put(): Handle NullValueExceptions by returning a NULL pkey, just like put_table() (column-based import) does: Aaron Marcuse-Kubitza

07/03/2012

05:29 PM Revision 3207: VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is *not* a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for locationevent->location in plots data, by also mapping locationevent's unique keys to location.sourceaccessioncode and setting location.datasource_id.: Aaron Marcuse-Kubitza
03:59 PM Revision 3206: sql.py: DbConn.run_query(): Separate the data source comment from the query with a tab in the executed query but a \r in the logged query, so that the query will be shown on the same line as the data source comment in pg_stat_activity, but be hidden by the following line when cating the file and be put on a separate line when viewed in a text editor. This causes the first line of the query to be at the left edge when the log file is viewed, so that it looks more natural.: Aaron Marcuse-Kubitza
03:15 PM Revision 3205: README.TXT: Data import: Import data into VegBIEN: Added command to use for column-based import: Aaron Marcuse-Kubitza
02:10 PM Revision 3204: schemas/vegbien.sql: locationevent: Allow a locationevent to be uniquely specified by its location (which is now datasource-scoped) and start date: Aaron Marcuse-Kubitza
01:26 PM Revision 3203: VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is *not* a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for specimenreplicate->...->location, by also mapping specimenreplicate's unique keys to location.sourceaccessioncode and setting location.datasource_id.: Aaron Marcuse-Kubitza
12:47 PM Revision 3202: schemas/vegbien.sql: locationevent: Allow to be uniquely specified by location_id. This is useful for specimens data where there is one location for every locationevent.: Aaron Marcuse-Kubitza
12:27 PM Revision 3201: schemas/vegbien.sql: location: Added datasource_id and sourceaccessioncode so locations can be uniquely specified by the input datasource, rather than being created automatically for each locationevent: Aaron Marcuse-Kubitza
11:45 AM Revision 3200: schemas/filter_ERD.csv: Add back taxondetermination->taxonoccurrence fkey because that has been replaced by a trigger in the SQL: Aaron Marcuse-Kubitza
11:06 AM Revision 3199: VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is *not* a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for specimenreplicate->taxonoccurrence.: Aaron Marcuse-Kubitza
07:40 AM Revision 3198: inputs/import.stats.xls: Updated with remaining stats from most recent import: Aaron Marcuse-Kubitza

07/02/2012

03:52 PM Revision 3197: PostgreSQL-MySQL.csv: Remove INHERITS clauses: Aaron Marcuse-Kubitza
02:37 PM Revision 3196: schemas/vegbien.ERD.mwb: Fixed lines: Aaron Marcuse-Kubitza
02:24 PM Revision 3195: db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly: Aaron Marcuse-Kubitza
02:18 PM Revision 3194: db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes awhile for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.: Aaron Marcuse-Kubitza
01:58 PM Revision 3193: inputs/import.stats.xls: Updated with new stats from an independent import: Aaron Marcuse-Kubitza
01:20 PM Revision 3192: schemas/vegbien.sql: Fixed UNIQUE INDEXes that were still using COALESCE(..., 0) to use COALESCE(..., 2147483647) in order to match what sql_gen.EnsureNotNull uses: Aaron Marcuse-Kubitza
12:41 PM Revision 3191: schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added collectioncode_dwc so that datasources that specify it in addition to the institution_id (such as aggregators) will not need to have catalognumbers be unique within an institution: Aaron Marcuse-Kubitza
12:30 PM Revision 3190: inputs/import.stats.xls: Updated with more stats from latest import: Aaron Marcuse-Kubitza
11:44 AM Revision 3189: inputs/import.stats.xls: Updated with initial stats from latest import. Reformatted to put successive runs of column-based next to each other, so they could be directly compared and so that the row-based data wouldn't need to be duplicated. Added empty-value checks to formulas so that they don't need to be manually deleted when one of their inputs is empty.: Aaron Marcuse-Kubitza
10:32 AM Revision 3188: input.Makefile: Documentation: import/steps.by_col.sql: Fixed bug where needed to run import in test mode: Aaron Marcuse-Kubitza
10:12 AM Revision 3187: sql_io.py: put_table(): Don't set pkeys of missing rows to default value if out_table is a SQL function, because then there is already an entry for every row: Aaron Marcuse-Kubitza
10:03 AM Revision 3186: bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection: Aaron Marcuse-Kubitza
09:50 AM Revision 3185: sql.py: mk_select(): Replaced newlines with spaces when query is simple enough to fit on one line: Aaron Marcuse-Kubitza
09:40 AM Revision 3184: db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity: Aaron Marcuse-Kubitza
09:39 AM Revision 3183: sql.py: DbConn: Added src config param, which in autocommit mode, will be included in a comment in every query, to help identify the data source in pg_stat_activity: Aaron Marcuse-Kubitza
09:38 AM Revision 3182: sql_gen.py: Added lstrip() to remove comments: Aaron Marcuse-Kubitza
09:13 AM Revision 3181: sql.py: mk_insert_select(): Added src param to help identify the data source in pg_stat_activity: Aaron Marcuse-Kubitza
08:33 AM Revision 3180: mappings/DwC2-VegBIEN.specimens.csv: Mapped institutionCode. This will enable datasources to use specimenreplicate's institution_id index for duplicate elimination.: Aaron Marcuse-Kubitza
08:31 AM Revision 3179: input.Makefile: Prompt user to accept test, instead of providing command line func for doing so: Aaron Marcuse-Kubitza
07:45 AM Revision 3178: schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added institution_id so that datasources that specify it (such as aggregators) will not need to have catalognumbers be globally unique. Once the institution_id is mapped to, this will fix a bug where rows with the same catalognumber were assumed to be duplicates even though they were from different institutions. This should also avoid the need to do any duplicate elimination joins when importing specimenreplicate, speeding up column-based import.: Aaron Marcuse-Kubitza
07:32 AM Revision 3177: schemas/vegbien.sql: specimenreplicate: Renamed museum_id to institution_id to correspond with DwC's institutionCode, so that it would be more obvious where to map institutionCode fields to: Aaron Marcuse-Kubitza
07:16 AM Revision 3176: inputs/import.stats.xls: Updated to include run times for rest of datasources for most recent column-based import: Aaron Marcuse-Kubitza

06/29/2012

08:09 AM Revision 3175: db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from: Aaron Marcuse-Kubitza
07:46 AM Revision 3174: sql_io.py: cast_temp_col(): add_col()'s distinguishing comment param: Add the type in case the same input column is being cast to different types, and both types have the same first word (causing their new column names to be the same): Aaron Marcuse-Kubitza
07:42 AM Revision 3173: sql_io.py: cast_temp_col(): Name the new column with only the first word of the type, to save space in the limited identifier length: Aaron Marcuse-Kubitza
07:41 AM Revision 3172: strings.py: Added first_word(): Aaron Marcuse-Kubitza
07:35 AM Revision 3171: sql_io.py: cast_temp_col(): Use sql_gen.suffixed_col() to create the new column name: Aaron Marcuse-Kubitza
06:16 AM Revision 3170: inputs/import.stats.xls: Added run time for SALVIAS organisms, which just finished: Aaron Marcuse-Kubitza
06:14 AM Revision 3169: inputs/import.stats.xls: Use [1]-style footnotes because copying and pasting to Gmail doesn't preserve the superscripts: Aaron Marcuse-Kubitza
06:11 AM Revision 3168: inputs/import.stats.xls: Updated for latest simultaneous column-based import: Aaron Marcuse-Kubitza
04:42 AM Revision 3167: sql_io.py: cast_temp_col(): Don't automatically create an index on the new column, because it doesn't necessarily need an index and the main index used for the join is now added automatically by distinct_table(): Aaron Marcuse-Kubitza
04:39 AM Revision 3166: sql.py: flatten(): Don't automatically create indexes on all columns, because most columns don't need indexes and the main index used for the join is now added automatically by distinct_table(): Aaron Marcuse-Kubitza
04:35 AM Revision 3165: sql.py: Removed no longer needed add_index_col() and ensure_not_null() because we are not using index columns: Aaron Marcuse-Kubitza
04:33 AM Revision 3164: sql.py: add_index(): Don't create index columns for nullable columns, because they require indexes to be created on all columns in order to use a distinct_table() temp table. Also, now that we are no longer using LEFT JOINs, the COALESCE() call would only be evaluated once (in the plain JOIN) in the event that PostgreSQL doesn't use an index on a COALESCE() expression.: Aaron Marcuse-Kubitza
03:30 AM Revision 3163: schemas/vegbien.sql: location: Dropped unique constraint on lat/long because it covered only some rows, which interfered with column-based import's selection of different insert methods based on the presence or absence of duplicate keys. (With the constraint, locations with coordinates would have duplicates eliminated, but locations without coordinates would not be able to find which row was added for a particular location because there was no lookup key to join on, and would all just use the first inserted row.) The previous behavior didn't make much sense anyway, because it would assert that two locationevents occurred in the same place just because they had the same coordinates, which may not have been precise enough to make this determination. Asserting that two locationevents occurred in the same place is really part of the secondary validation, not the import process.: Aaron Marcuse-Kubitza
01:58 AM Revision 3162: sql.py: DbConn: Fixed bug where Exceptions did not have the query appended if the query was not run in cacheable mode, by moving _add_cursor_info() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper. Fixed bug where non-cacheable queries were not autocommitted, by moving self.do_autocommit() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper.: Aaron Marcuse-Kubitza
01:54 AM Revision 3161: sql.py: DbConn._db(): Record that a transaction is already open before setting the search_path so that a query is never run with an _savepoint value less than 1 (manual transactions are not supported yet): Aaron Marcuse-Kubitza
01:52 AM Revision 3160: sql.py: DbConn.with_savepoint(): Increment _savepoint before running queries so they don't get autocommitted: Aaron Marcuse-Kubitza
01:10 AM Revision 3159: sql.py: empty_temp(): Empty temp tables even in debug_temp mode, so that it can be seen which tables have been garbage collected and disk space leaks can be detected. This will not affect the external re-runnability of slow queries in debug_temp mode, as long as the user aborts the debug_temp import while the slow query is still running.: Aaron Marcuse-Kubitza
01:07 AM Revision 3158: sql_gen.py: ColDict: Use OrderedDict so that order of keys in input dict (if ordered) will be preserved. This should ensure that tempt table unique indexes have their columns in the same order as the output table, so that a merge join can be used.: Aaron Marcuse-Kubitza
01:01 AM Revision 3157: util.py: dict_subset(): Use OrderedDict so that order of keys in input dict (if ordered) will be preserved: Aaron Marcuse-Kubitza
12:55 AM Revision 3156: main Makefile: python-Darwin: Added pip installation instructions. python-Linux: Added ordereddict.: Aaron Marcuse-Kubitza
12:04 AM Revision 3155: sql.py: DbConn.col_info(): cacheable param defaults to True now that callers explicitly turn off cacheable when needed: Aaron Marcuse-Kubitza
12:00 AM Revision 3154: sql.py: add_index_col(): Explicitly set update()'s col_info caching depending on whether col_info will be changed later by add_not_null(): Aaron Marcuse-Kubitza

Also available in: Atom