/trunk/lib/db_xml.py - Changes - BIEN 3 - NCEAS Projects

root/trunk/lib/db_xml.py @ 12642

#	Date	Author	Comment
11970	01/20/2014 11:33 AM	Aaron Marcuse-Kubitza	moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
10829	08/30/2013 04:35 PM	Aaron Marcuse-Kubitza	bugfix: lib/db_xml.py: put_table(): turned off db.autoanalyze, since forcing an ANALYZE after every bulk insert is inefficient for small datasources. the default autovacuum settings in schemas/postgresql.conf should be fine; however, the frequency and/or threshold may need to be increased if autovacuum does not ANALYZE frequently enough to replace db.autoanalyze.
8265	04/01/2013 06:12 PM	Aaron Marcuse-Kubitza	lib/db_xml.py: put_table(): Fixed bug where command to advance start to fetch next set was unintentionally deleted when removing the is_view check
8221	03/28/2013 12:19 AM	Aaron Marcuse-Kubitza	lib/db_xml.py: put_table(): Fixed bug where also need to advance start to fetch next set when table is a view, because the views that are now being used with the import (inputs/FIA/occurrence_all/) are static rather than dynamic and do not return different rows after the previous set of rows has been imported
7243	01/16/2013 06:47 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Fixed bug where for views, shouldn't advance start (OFFSET clause) after each chunk, because views are typically dynamic and will contain a new set of rows after the first set is imported
6444	11/24/2012 02:40 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Support None in_table by calling put() directly
6413	11/24/2012 09:21 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): _setDefault(): Delay the evaluation of each col_default's value until the col_default is actually retrieved. This fixes a bug in the source table mappings where the explicit source entry was being created after the col_default source entry, causing the initial entry, which did not have the additional fields populated, to be used instead.
6411	11/24/2012 08:59 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): _setDefault(): Fixed bug where need to copy col_defaults before calling update() on it, to avoid modifying the input value (which may be reused by the caller, expecting it to be unmodified)
6410	11/24/2012 08:54 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): col_defaults param: Fixed bug where need to use None as default value, because col_defaults will be modified by put() and the {} default value is a global instance
6408	11/24/2012 08:16 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Pass through the values of nodes which are text nodes
6407	11/24/2012 08:15 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): put_(): Support setDefault() values which are text nodes, by passing text strings through when put() is run on all col_defaults entries
6406	11/24/2012 07:50 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): _setDefault(): Support setting multiple col_defaults at once by using the param names themselves as the column names
6402	11/24/2012 07:25 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Added _setDefault() built-in function, which adds an entry to col_defaults
5725	10/23/2012 07:05 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Indicate no parent_ids_loc using no_parent_ids_loc sentinel instead of None to support parent_ids_locs that are equal to None (e.g. if the parent node had an error). Always forward parent_ids_loc to children with fkeys to parent, even on error, because the parent table may not be required for the child tables to be valid, such as for taxonomic-data-only datasets that nevertheless have nodes for the non-taxonomic tables in their mappings.
5716	10/23/2012 04:39 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Inserting children with fkeys to parent: Don't do this if this node had an error and sql_io.put_table() returned None as the generated pkey. This fixes a bug where a node with an error will still try to create children with fkeys to parent, but pass None as the fkey to parent, which the recursive put() call will then incorrectly treat as there being no field with an fkey to parent at all rather than a field whose value is NULL. This causes function overload resolution to be unable to find the intended function, because it is missing a parameter.
5706	10/23/2012 03:57 AM	Aaron Marcuse-Kubitza	inputs/.NCBI/: Renamed higher_taxa to nodes because it currently doesn't just contain the higher taxa
5523	10/15/2012 02:36 PM	Aaron Marcuse-Kubitza	sql.select() calls: Removed order_by=None everywhere that a stable row order is required (i.e. consistent between selects, or consistent between table transformations). This causes several tests to return different inserted row counts, because the input table is now being accessed in pkey order instead of in table order. This fixes a bug where tables with more rows than ~100 would return different results for repeated calls of the same non-ordered select.
5514	10/15/2012 10:21 AM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Increased to 1,000,000 (>= NCBI.higher_taxa's size) so NCBI.higher_taxa can be imported completely in one partition. This is necessary because NCBI's taxonconcepts are not in dependency order (parents first), so a later partition cannot rely on the parents of its taxonconcepts having already been imported. Instead, all taxonconcepts must be imported at once and then separately, the parents of all taxonconcepts must be set.
5508	10/15/2012 09:35 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Treat a child node which is a function (starts with _) as a child with fkey to parent rather than as a field in the table. Such a function accepts the table's pkey as one of its arguments.
5388	10/10/2012 05:01 AM	Aaron Marcuse-Kubitza	sql.py: Renamed pkey() to pkey_name()
5385	10/10/2012 04:43 AM	Aaron Marcuse-Kubitza	db_xml.py: cleanup_table(): Inline sql.pkey_col ('row_num') because this is the only place it's used
5382	10/10/2012 04:16 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Fixed bug where needed to avoid truncating the pkeys_loc table, in case it's the same as one of the in_tables. This occurs now that sql_io.put_table() passes through the actual input column instead of the joined-together input table's column when ignoring all rows.
5087	09/28/2012 08:05 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Fixed bug where pkeys_loc needed to be initialized. Note that this bug was only triggered when importing a table with zero rows (in this case, the initial empty TNRS.tnrs table), because otherwise it would be set in the loop.
5075	09/27/2012 10:51 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Added _alt optimization that just returns the first arg if it's non-NULL
5071	09/27/2012 10:19 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Parse input columns and process values in separate loops, so that structural XML function optimization code can be inserted between them
5026	09/26/2012 11:49 PM	Aaron Marcuse-Kubitza	bin/map, db_xml.put_table() (row-based and column-based import): Don't sort the input table by its pkey, in order to support input tables with no pkey. Note that reading the input table in table order and having this match the input flat file's order is only possible with sql_io.import_csv()'s truncation of the table on a failed import, which ensures that the rows will be stored in inserted order.
5024	09/26/2012 10:53 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Add a row number column if in_table does not already have a pkey
5023	09/26/2012 10:48 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Copy all of in_table's structure, rather than just the column types, by using sql.copy_table_struct() and sql.insert_select(). This preserves pkeys and NOT NULL constraints, which are useful for column-based import.
5022	09/26/2012 10:47 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Create in_table as a completely new sql_gen.Table instead of copying full_in_table and relying on sql.run_query_into() to set is_temp and remove the schema
5019	09/26/2012 10:14 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Removed no longer accurate comment that full_in_table will be shadowed (hidden) by the created temp table. (The temp table is now named differently, so the shadowing does not occur.)
5018	09/26/2012 10:02 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Replaced no longer accurate Recurse comment with Import data. Rewrapped lines.
5007	09/26/2012 06:56 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Allow caller to specify custom partition_size
4491	09/06/2012 08:30 PM	Aaron Marcuse-Kubitza	Replaced str() with strings.ustr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors
4437	09/05/2012 02:49 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Create errors table if it doesn't exist
4242	08/27/2012 10:10 PM	Aaron Marcuse-Kubitza	db_xml.py: input_col_prefix: Use value of xml_func.var_name_prefix, which is now the place where this value is configured
4241	08/27/2012 10:09 PM	Aaron Marcuse-Kubitza	db_xml.py: Moved input_col_prefix above the put() function that uses it
4225	08/27/2012 06:51 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): _simplifyPath() built-in function: Removed `require` param, which is not used by this _simplifyPath() implementation because the database constraints handle this
3719	08/01/2012 06:26 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): wrap_e(): Call augment_error() to add the current node to the error message
3718	08/01/2012 06:14 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Raise an error if there are multiple fields with the same name, instead of silently overwriting the first with the second. This generally indicates the need to use `:[@merge=1]` on the fields in question.
3687	07/30/2012 05:59 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): Fixed bug where strings starting with "$" were interpreted as input columns in row-based mode (this should only apply to column-based mode). Explicitly store whether in row-based mode in is_literals var (similar to is_literals in sql_io.put_table()).
3685	07/30/2012 03:53 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): put_(): Removed no longer needed in_row_ct_ref param, which is only used by put_table(). Rewrapped function body.
3677	07/30/2012 12:11 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): Don't suppress exceptions thrown by sql_io.put_table() by passing them to on_error(), because some exceptions indicate unrecoverable database connection problems such as a broken connection, which should abort the import
3676	07/30/2012 11:52 AM	Aaron Marcuse-Kubitza	db_xml.py: put(): Support datasets with no rows, where root.firstChild == None. Documented that to use an entire XML document, you need to pass root.firstChild rather than root.
3661	07/27/2012 09:38 PM	Aaron Marcuse-Kubitza	Moved importing of col_defaults from db_xml.put_table() to bin/map, so that it also happens in row-based mode. Note that this causes a DB entry for the datasource to always be created, even if the datasource has no mappings or no rows.
3660	07/27/2012 09:13 PM	Aaron Marcuse-Kubitza	Use new exc.reraise() where exc.raise_() was used, so that the stack trace is preserved when the exception is rethrown
3657	07/27/2012 09:02 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): Inserting node: Wrap sql_io.put_table() call in catch-all exception handler that calls on_error_() (wrapper for error handler provided by caller) and returns None. This both adds additional debugging info to the exception (in on_error_()) and allows recovery from arbitrary exceptions that happen in sql_io.put_table(), so that an exception does not abort the import.
3652	07/27/2012 08:02 PM	Aaron Marcuse-Kubitza	db_xml.py: Renamed put_table_special_funcs to put_special_funcs because it is now used by put() as well
3651	07/27/2012 08:00 PM	Aaron Marcuse-Kubitza	db_xml.py: Moved put() before the functions that use it
3650	07/27/2012 07:58 PM	Aaron Marcuse-Kubitza	db_xml.py: Renamed _put_table_part() to put(), replacing the existing put() whose functionality it now performs
3649	07/27/2012 07:52 PM	Aaron Marcuse-Kubitza	db_xml.py: _put_table_part(): Reordered params to match put(), so that it can eventually be substituted for it
3648	07/27/2012 07:44 PM	Aaron Marcuse-Kubitza	db_xml.py: _put_table_part(): Allow being invoked directly by adding defaults for parameters
3647	07/27/2012 07:41 PM	Aaron Marcuse-Kubitza	db_xml.py: put(): Use _put_table_part(). This will ensure that all the put-related functionality is in one place, rather than duplicated.
3646	07/27/2012 07:30 PM	Aaron Marcuse-Kubitza	db_xml.py: _put_table_part(): Append the node to errors handled with on_error()
3628	07/26/2012 08:37 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Removed into param to set a custom into table name because put_table() now has all the info it needs to generate this name automatically, and callers are no longer providing it
3625	07/26/2012 07:45 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Import col_defaults to translate nodes to pkeys
3624	07/26/2012 07:44 PM	Aaron Marcuse-Kubitza	db_xml.py: _put_table_part(): Support no in_table, for iterations with only literal values
3622	07/26/2012 06:35 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Removed parent_ids_loc and next params since these are only used in the recursion
3621	07/26/2012 06:17 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Split into an outer function that sets up the database environment and subsets in_table, and a (recursive) inner function that imports the data
3620	07/26/2012 05:55 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting and partitioning in_table: Documented that it's OK to do this even if table already the right size because it takes <1 sec
3619	07/26/2012 05:43 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Use is_function where caller-provided is_func was used, since is_function determines whether something is a function based on whether it actually exists as a SQL function instead of just whether its name starts with "_". Removed now-unneeded is_func param.
3618	07/26/2012 05:36 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Added col_defaults param and use it if there's a missing mapping for a NOT NULL column. This requires callers passing arguments by position to add an empty value for this parameter.
3601	07/25/2012 02:06 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Adding fkey to parent: Fixed bug where should only add parent_ids_loc table to list of tables not to truncate if it's a column, because it is sometimes just a pkey value when that iteration contained only literals
3428	07/17/2012 08:44 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Divide fields into input columns and literal values: Translate values: Allow literal values other than strings or None (from the XML parsing), because sql_io.put_table() is getting an optimization for iterations containing only literal values, which just returns the pkey of the single row for these values (which is usually an integer) instead of a temp table with the same value in each row
3426	07/17/2012 08:23 PM	Aaron Marcuse-Kubitza	db_xml.py: Don't remove any explicit pkey because the output table may be a SQL function, which does not have a pkey. This feature only existed to support importing VegBank XML exports, which we don't use (and which would be incompatible with the schema anyway).
3304	07/10/2012 07:54 PM	Aaron Marcuse-Kubitza	sql.run_query_into() calls: Use new add_pkey_ param instead of manually calling sql.add_pkey()
3269	07/09/2012 01:57 PM	Aaron Marcuse-Kubitza	sql.py with_explain_comment(), DbConn: Fixed bug where with_explain_comment() was being run in per-row imports (row-based import and csv2db with INSERT), causing the overhead of an EXPLAIN query for every single INSERT and filling up the cache with EXPLAIN query results, by adding autoexplain mode, only running with_explain_comment() in autoexplain mode, and only enabling autoexplain mode for column-based import
3268	07/09/2012 01:11 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Turn on autoanalyze mode to help the query planner avoid sequential scans on tables that now contain data. (Don't do this in row-based import because it creates too much overhead per insert.)
3195	07/02/2012 02:24 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly
3194	07/02/2012 02:18 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes awhile for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.
3184	07/02/2012 09:40 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity
3175	06/29/2012 08:09 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from
3145	06/28/2012 10:21 PM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Turning partitioning back on (with a larger limit), since the largest datasources' temp tables are still too big
3143	06/28/2012 09:01 PM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Set to sys.maxint to disable partitioning. The last bugfix, which avoided returning a large result set to the client which was never read, seems to have fixed the disk space leak, so it's worth reattempting a full simultaneous import.
3142	06/28/2012 08:30 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Truncate in_table when finished with it, to avoid temp table disk space leaks
3130	06/27/2012 08:27 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Work around PostgreSQL's temp table disk space leak by reconnecting to the DB after every partition
3123	06/27/2012 07:31 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): put_table_(): Removed no longer used limit, start params
3122	06/27/2012 07:23 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Merged partitioning and subsetting into same section for simplicity, to avoid creating extra temp tables, and to later allow the connection to be closed and reopened between partitions. partition_size: Expressed value without exponent notation to ensure that it's an integer.
3121	06/27/2012 07:11 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Partitioning in_table: Adjust bounds of last partition to actual row #s included
3115	06/27/2012 06:07 PM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Set to just more than the size of the largest data source that was successfully imported in simultaneous import
3114	06/27/2012 05:32 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.
3113	06/27/2012 05:27 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.
3107	06/27/2012 03:44 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)
3077	06/26/2012 04:46 PM	Aaron Marcuse-Kubitza	Moved Heuristic queries from sql.py to new sql_io.py
2975	06/20/2012 07:11 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Removed `if not db.debug_temp` check because that is done by sql.empty_temp()
2966	06/20/2012 02:16 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Empty unneeded temp tables to free up memory
2928	06/18/2012 05:38 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Removed no longer needed commit param
2925	06/18/2012 05:13 PM	Aaron Marcuse-Kubitza	Removed unnecessary db.db.commit() calls because commits are now done automatically by DbConn's autocommit mode
2813	06/13/2012 03:02 PM	Aaron Marcuse-Kubitza	db_xml.py: Removed into_table_name() because this functionality is now handled by sql.into_table_name()
2807	06/12/2012 10:08 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Pass on_error to sql.put_table()
2806	06/12/2012 10:07 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Take on_error param like row-based put()
2786	06/12/2012 08:22 PM	Aaron Marcuse-Kubitza	sql.py: mk_select(): Return just the query instead of the query plus empty params
2780	06/12/2012 07:57 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Add pkey to created temp table to facilitate joining it with intermediate tables
2714	06/08/2012 08:19 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Track in_table's source so its original schema can be obtained and auxiliary tables located
2611	06/05/2012 07:09 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Removed no longer accurate comment about handling _simplifyPath
2605	06/04/2012 03:06 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Fixed bug where in_table was not being ordered by the row_num, because order_by was set to None when it should have been omitted so it would default to the pkey
2581	06/04/2012 08:32 AM	Aaron Marcuse-Kubitza	db_xml.py: into_table_name(): Removed no longer necessary handling of simple functions, which is now done by sql.into_table_name(). Ensure that rank params in functions (not tables) are not treated specially as hierarchical.
2553	06/01/2012 04:33 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Pass is_func to sql.put_table()
2551	06/01/2012 04:09 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Treat every node name that starts with "_" as a function, not just members of put_table_special_funcs. This ensures that DB function args are always treated as values, not children with fkeys to parent.
2549	06/01/2012 03:39 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Make special_funcs externally available as module constant put_table_special_funcs
2515	05/31/2012 09:20 AM	Aaron Marcuse-Kubitza	sql.py, db_xml.py: Removed unnecessary calls to sql_gen.clean_name() now that str() handles this automatically

Project

General

Profile