moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
bugfix: lib/db_xml.py: put_table(): turned off db.autoanalyze, since forcing an ANALYZE after every bulk insert is inefficient for small datasources. the default autovacuum settings in schemas/postgresql.conf should be fine; however, the frequency and/or threshold may need to be increased if autovacuum does not ANALYZE frequently enough to replace db.autoanalyze.
lib/db_xml.py: put_table(): Fixed bug where command to advance start to fetch next set was unintentionally deleted when removing the is_view check
lib/db_xml.py: put_table(): Fixed bug where also need to advance start to fetch next set when table is a view, because the views that are now being used with the import (inputs/FIA/occurrence_all/) are static rather than dynamic and do not return different rows after the previous set of rows has been imported
db_xml.py: put_table(): Fixed bug where for views, shouldn't advance start (OFFSET clause) after each chunk, because views are typically dynamic and will contain a new set of rows after the first set is imported
db_xml.py: put_table(): Support None in_table by calling put() directly
db_xml.py: put(): _setDefault(): Delay the evaluation of each col_default's value until the col_default is actually retrieved. This fixes a bug in the source table mappings where the explicit source entry was being created after the col_default source entry, causing the initial entry, which did not have the additional fields populated, to be used instead.
db_xml.py: put(): _setDefault(): Fixed bug where need to copy col_defaults before calling update() on it, to avoid modifying the input value (which may be reused by the caller, expecting it to be unmodified)
db_xml.py: put(): col_defaults param: Fixed bug where need to use None as default value, because col_defaults will be modified by put() and the {} default value is a global instance
db_xml.py: put(): Pass through the values of nodes which are text nodes
db_xml.py: put(): put_(): Support setDefault() values which are text nodes, by passing text strings through when put() is run on all col_defaults entries
db_xml.py: put(): _setDefault(): Support setting multiple col_defaults at once by using the param names themselves as the column names
db_xml.py: put(): Added _setDefault() built-in function, which adds an entry to col_defaults
db_xml.py: put(): Indicate no parent_ids_loc using no_parent_ids_loc sentinel instead of None to support parent_ids_locs that are equal to None (e.g. if the parent node had an error). Always forward parent_ids_loc to children with fkeys to parent, even on error, because the parent table may not be required for the child tables to be valid, such as for taxonomic-data-only datasets that nevertheless have nodes for the non-taxonomic tables in their mappings.
db_xml.py: put(): Inserting children with fkeys to parent: Don't do this if this node had an error and sql_io.put_table() returned None as the generated pkey. This fixes a bug where a node with an error will still try to create children with fkeys to parent, but pass None as the fkey to parent, which the recursive put() call will then incorrectly treat as there being no field with an fkey to parent at all rather than a field whose value is NULL. This causes function overload resolution to be unable to find the intended function, because it is missing a parameter.
inputs/.NCBI/: Renamed higher_taxa to nodes because it currently doesn't just contain the higher taxa
sql.select() calls: Removed order_by=None everywhere that a stable row order is required (i.e. consistent between selects, or consistent between table transformations). This causes several tests to return different inserted row counts, because the input table is now being accessed in pkey order instead of in table order. This fixes a bug where tables with more rows than ~100 would return different results for repeated calls of the same non-ordered select.
db_xml.py: partition_size: Increased to 1,000,000 (>= NCBI.higher_taxa's size) so NCBI.higher_taxa can be imported completely in one partition. This is necessary because NCBI's taxonconcepts are not in dependency order (parents first), so a later partition cannot rely on the parents of its taxonconcepts having already been imported. Instead, all taxonconcepts must be imported at once and then separately, the parents of all taxonconcepts must be set.
db_xml.py: put(): Treat a child node which is a function (starts with _) as a child with fkey to parent rather than as a field in the table. Such a function accepts the table's pkey as one of its arguments.
sql.py: Renamed pkey() to pkey_name()
db_xml.py: cleanup_table(): Inline sql.pkey_col ('row_num') because this is the only place it's used
db_xml.py: put(): Fixed bug where needed to avoid truncating the pkeys_loc table, in case it's the same as one of the in_tables. This occurs now that sql_io.put_table() passes through the actual input column instead of the joined-together input table's column when ignoring all rows.
db_xml.py: put_table(): Fixed bug where pkeys_loc needed to be initialized. Note that this bug was only triggered when importing a table with zero rows (in this case, the initial empty TNRS.tnrs table), because otherwise it would be set in the loop.
db_xml.py: put(): Added _alt optimization that just returns the first arg if it's non-NULL
db_xml.py: put(): Parse input columns and process values in separate loops, so that structural XML function optimization code can be inserted between them
bin/map, db_xml.put_table() (row-based and column-based import): Don't sort the input table by its pkey, in order to support input tables with no pkey. Note that reading the input table in table order and having this match the input flat file's order is only possible with sql_io.import_csv()'s truncation of the table on a failed import, which ensures that the rows will be stored in inserted order.
db_xml.py: put_table(): Subsetting in_table: Add a row number column if in_table does not already have a pkey
db_xml.py: put_table(): Subsetting in_table: Copy all of in_table's structure, rather than just the column types, by using sql.copy_table_struct() and sql.insert_select(). This preserves pkeys and NOT NULL constraints, which are useful for column-based import.
db_xml.py: put_table(): Subsetting in_table: Create in_table as a completely new sql_gen.Table instead of copying full_in_table and relying on sql.run_query_into() to set is_temp and remove the schema
db_xml.py: put_table(): Removed no longer accurate comment that full_in_table will be shadowed (hidden) by the created temp table. (The temp table is now named differently, so the shadowing does not occur.)
db_xml.py: put_table(): Replaced no longer accurate Recurse comment with Import data. Rewrapped lines.
db_xml.py: put_table(): Allow caller to specify custom partition_size
Replaced str() with strings.ustr() (or equivalent) everywhere needed, to avoid future UnicodeEncodeErrors
db_xml.py: put_table(): Create errors table if it doesn't exist
db_xml.py: input_col_prefix: Use value of xml_func.var_name_prefix, which is now the place where this value is configured
db_xml.py: Moved input_col_prefix above the put() function that uses it
db_xml.py: put(): _simplifyPath() built-in function: Removed `require` param, which is not used by this _simplifyPath() implementation because the database constraints handle this
db_xml.py: put(): wrap_e(): Call augment_error() to add the current node to the error message
db_xml.py: put(): Raise an error if there are multiple fields with the same name, instead of silently overwriting the first with the second. This generally indicates the need to use `:[@merge=1]` on the fields in question.
db_xml.py: put(): Fixed bug where strings starting with "$" were interpreted as input columns in row-based mode (this should only apply to column-based mode). Explicitly store whether in row-based mode in is_literals var (similar to is_literals in sql_io.put_table()).
db_xml.py: put(): put_(): Removed no longer needed in_row_ct_ref param, which is only used by put_table(). Rewrapped function body.
db_xml.py: put(): Don't suppress exceptions thrown by sql_io.put_table() by passing them to on_error(), because some exceptions indicate unrecoverable database connection problems such as a broken connection, which should abort the import
db_xml.py: put(): Support datasets with no rows, where root.firstChild == None. Documented that to use an entire XML document, you need to pass root.firstChild rather than root.
Moved importing of col_defaults from db_xml.put_table() to bin/map, so that it also happens in row-based mode. Note that this causes a DB entry for the datasource to always be created, even if the datasource has no mappings or no rows.
Use new exc.reraise() where exc.raise_() was used, so that the stack trace is preserved when the exception is rethrown
db_xml.py: put(): Inserting node: Wrap sql_io.put_table() call in catch-all exception handler that calls on_error_() (wrapper for error handler provided by caller) and returns None. This both adds additional debugging info to the exception (in on_error_()) and allows recovery from arbitrary exceptions that happen in sql_io.put_table(), so that an exception does not abort the import.
db_xml.py: Renamed put_table_special_funcs to put_special_funcs because it is now used by put() as well
db_xml.py: Moved put() before the functions that use it
db_xml.py: Renamed _put_table_part() to put(), replacing the existing put() whose functionality it now performs
db_xml.py: _put_table_part(): Reordered params to match put(), so that it can eventually be substituted for it
db_xml.py: _put_table_part(): Allow being invoked directly by adding defaults for parameters
db_xml.py: put(): Use _put_table_part(). This will ensure that all the put-related functionality is in one place, rather than duplicated.
db_xml.py: _put_table_part(): Append the node to errors handled with on_error()
sql_io.py: put_table(): Removed into param to set a custom into table name because put_table() now has all the info it needs to generate this name automatically, and callers are no longer providing it
db_xml.py: put_table(): Import col_defaults to translate nodes to pkeys
db_xml.py: _put_table_part(): Support no in_table, for iterations with only literal values
db_xml.py: put_table(): Removed parent_ids_loc and next params since these are only used in the recursion
db_xml.py: put_table(): Split into an outer function that sets up the database environment and subsets in_table, and a (recursive) inner function that imports the data
db_xml.py: put_table(): Subsetting and partitioning in_table: Documented that it's OK to do this even if table already the right size because it takes <1 sec
sql_io.py: put_table(): Use is_function where caller-provided is_func was used, since is_function determines whether something is a function based on whether it actually exists as a SQL function instead of just whether its name starts with "_". Removed now-unneeded is_func param.
sql_io.py: put_table(): Added col_defaults param and use it if there's a missing mapping for a NOT NULL column. This requires callers passing arguments by position to add an empty value for this parameter.
db_xml.py: put_table(): Adding fkey to parent: Fixed bug where should only add parent_ids_loc table to list of tables not to truncate if it's a column, because it is sometimes just a pkey value when that iteration contained only literals
db_xml.py: put_table(): Divide fields into input columns and literal values: Translate values: Allow literal values other than strings or None (from the XML parsing), because sql_io.put_table() is getting an optimization for iterations containing only literal values, which just returns the pkey of the single row for these values (which is usually an integer) instead of a temp table with the same value in each row
db_xml.py: Don't remove any explicit pkey because the output table may be a SQL function, which does not have a pkey. This feature only existed to support importing VegBank XML exports, which we don't use (and which would be incompatible with the schema anyway).
sql.run_query_into() calls: Use new add_pkey_ param instead of manually calling sql.add_pkey()
sql.py with_explain_comment(), DbConn: Fixed bug where with_explain_comment() was being run in per-row imports (row-based import and csv2db with INSERT), causing the overhead of an EXPLAIN query for every single INSERT and filling up the cache with EXPLAIN query results, by adding autoexplain mode, only running with_explain_comment() in autoexplain mode, and only enabling autoexplain mode for column-based import
db_xml.py: put_table(): Turn on autoanalyze mode to help the query planner avoid sequential scans on tables that now contain data. (Don't do this in row-based import because it creates too much overhead per insert.)
db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly
db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes awhile for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.
db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity
db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from
db_xml.py: partition_size: Turning partitioning back on (with a larger limit), since the largest datasources' temp tables are still too big
db_xml.py: partition_size: Set to sys.maxint to disable partitioning. The last bugfix, which avoided returning a large result set to the client which was never read, seems to have fixed the disk space leak, so it's worth reattempting a full simultaneous import.
db_xml.py: put_table(): Subsetting in_table: Truncate in_table when finished with it, to avoid temp table disk space leaks
db_xml.py: put_table(): Work around PostgreSQL's temp table disk space leak by reconnecting to the DB after every partition
db_xml.py: put_table(): put_table_(): Removed no longer used limit, start params
db_xml.py: put_table(): Merged partitioning and subsetting into same section for simplicity, to avoid creating extra temp tables, and to later allow the connection to be closed and reopened between partitions. partition_size: Expressed value without exponent notation to ensure that it's an integer.
db_xml.py: put_table(): Partitioning in_table: Adjust bounds of last partition to actual row #s included
db_xml.py: partition_size: Set to just more than the size of the largest data source that was successfully imported in simultaneous import
db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.
db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.
db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)
Moved Heuristic queries from sql.py to new sql_io.py
db_xml.py: put_table(): Removed `if not db.debug_temp` check because that is done by sql.empty_temp()
db_xml.py: put_table(): Empty unneeded temp tables to free up memory
db_xml.py: put_table(): Removed no longer needed commit param
Removed unnecessary db.db.commit() calls because commits are now done automatically by DbConn's autocommit mode
db_xml.py: Removed into_table_name() because this functionality is now handled by sql.into_table_name()
db_xml.py: put_table(): Pass on_error to sql.put_table()
db_xml.py: put_table(): Take on_error param like row-based put()
sql.py: mk_select(): Return just the query instead of the query plus empty params
db_xml.py: put_table(): Subsetting in_table: Add pkey to created temp table to facilitate joining it with intermediate tables
db_xml.py: put_table(): Track in_table's source so its original schema can be obtained and auxiliary tables located
db_xml.py: put_table(): Removed no longer accurate comment about handling _simplifyPath
db_xml.py: put_table(): Subsetting in_table: Fixed bug where in_table was not being ordered by the row_num, because order_by was set to None when it should have been omitted so it would default to the pkey
db_xml.py: into_table_name(): Removed no longer necessary handling of simple functions, which is now done by sql.into_table_name(). Ensure that rank params in functions (not tables) are not treated specially as hierarchical.
db_xml.py: put_table(): Pass is_func to sql.put_table()
db_xml.py: put_table(): Treat every node name that starts with "_" as a function, not just members of put_table_special_funcs. This ensures that DB function args are always treated as values, not children with fkeys to parent.
db_xml.py: put_table(): Make special_funcs externally available as module constant put_table_special_funcs
sql.py, db_xml.py: Removed unnecessary calls to sql_gen.clean_name() now that str() handles this automatically