bugfix: lib/sql_io.py: put_table(): handle_MissingCastException(): when updating join_cols, don't add new entry for join_cols[out_col], only update existing one. this fixes #902 (import bug), and with #902 fixed, #887 (disk space leak) should no longer occur.
fix: lib/sql_io.py: put_table(): don't warn if can't create pkey, because this just indicates that a set-returning function was used. this should get rid of the last of the confusing benign warnings in the test output.
bugfix: lib/sql_io.py: put_table(): is_literals: `return sql.value(cur): need to use sql.value_or_none() instead to support multi-row functions, such as _split() used in specimens data`
lib/sql_io.py: automatic handling of input/output column type mismatches: also do this for identifying columns, which first cause an error in a join in sql.distinct_table() rather than in the main insert (and thus were not handled by the existing error handling). previously, the user would have had to manually cast the input column in postprocess.sql. this involves getting handle_MissingCastException() to update join_cols as well as mapping.
lib/sql_io.py: put_table(): main loop MissingCastException handler: factored out into nested function so that it can also be used elsewhere
moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
lib/sql_io.py: put_table(): default param: documented that this will be used for all missing rows, regardless of which error caused them not to be inserted. this means that auto-forwarding (wiki.vegpath.org/Auto-forwarding) can be used with any type of constraint violation, not just NOT NULL constraints (which it is typically used with).
lib/sql_io.py: put_table(): added link to new INSERT ON DUPLICATE SELECT wiki page, which now contains the explanation in the doc comment
bugfix: lib/sql_io.py: put_table(): Getting output table pkeys of existing/inserted rows: need to include the index cond in the join condition here, too (using var join_custom_cond), so that an index scan can be used instead of a much slower full-table sort
bugfix: lib/sql_io.py: put_table(): DuplicateKeyException: need to include any index cond in the join condition, so that an index scan can be used instead of a much slower full-table sort (otherwise the query planner will not know that it can restrict results to rows satisfying the index cond)
lib/sql_io.py: ensure_cond(): documented meaning of passed, failed params (at least one row passed/failed the constraint)
lib/sql_io.py: put_table(): documented that PostgreSQL 9.1+ now provides a way to implement insert/on duplicate select just once for each table (instead of dynamically for each insert) using the new INSTEAD OF triggers (http://www.postgresql.org/docs/9.1/static/plpgsql-trigger.html). INSTEAD OF triggers were not used when put_table() was developed, because it was necessary to support PostgreSQL 9.0, which was installed on the Mac and not easily upgradeable. it was eventually upgraded to add PostGIS, which required a complete reinstall of the DB from the staging tables, with the associated staging table reload bugs, as well as complete removal of the old Postgres version.
lib/sql_io.py: cleanup_table(): added assertion that the table exists, so that if it doesn't, the error will occur as part of an assertion rather than as part of the util.table_nulls_mapped__get() call, which might confusingly lead users to believe that this is a bug in util.table_nulls_mapped__get() when in fact the problem is that the table is not installed
lib/sql_io.py: cleanup_table(): don't run the slow ALTER TABLE statement again if the table has already been cleaned up. documented that it is idempotent (and actually was before this change as well).
lib/sql_io.py: added table_nulls_mapped__set(), "__get() wrappers around the corresponding util schema functions
lib/sql_io.py: put_table() (column-based import): complexity note: clarified that INSERT RETURNING throws an error on duplicate instead of returning the existing row. added blank line after ΒΆ for readability.
lib/sql_io.py: put_table() (column-based import): warning about triggers populating unique constraint-covered columns: corrected limitation to include only the unique constraint used to do the DISTINCT ON, since other unique constraints are not affected by column-based import. note that the primary key will normally not be the DISTINCT ON constraint, so trigger-populated natural keys are supported unless the input table contains duplicate rows for some generated keys.
lib/sql_io.py: append_csv(): support importing CSVs whose columns are a subset of the full table and/or in a different order. when the header exactly matches the columns, the explicit column list will still be omitted as an optimization. this uses code from r4927.
lib/sql_io.py: put_table(): Calling wrapper function: adding pkey or index on the resulting table: don't display warning if a pkey can't be added, because this is actually a legitimate situation when the called function is set-returning and can return multiple rows for one input. having this as a warning results in spurious warnings in the automated tests (which look confusingly like ignored errors because Python warnings include debugging context information). e.g. `make inputs/Madidi/IndividualObservation/test.by_col.xml` causes this error in the sourcelist->sourcename splitting step (which of course can produce multiple specimenholder institutions)....
lib/sql_io.py: mk_errors_table(): Create a unique index on the MD5 of the value and error instead of on the values directly, because some strings are too long to index (e.g. row 2537268 of MO.Specimen causes an error "index row size 3032 exceeds maximum 2712 for index [...] Values larger than 1/3 of a buffer page cannot be indexed")
sql_io.py: put_table(): Documented that much of the complexity of the normalizing algorithm is due to PostgreSQL not having a native command for insert/on duplicate select
sql_io.py: put_table(): Corrected "insert/if not exists get" to "insert/on duplicate select"
sql_io.py: put_table(): Removed no longer applicable requirement that it be run at the beginning of a transaction, which was only required when the output table was locked during the function call
sql_io.py: put_table(): Documented that the function's insert/if not exists get algorithm does not support database triggers that populate fields covered by a unique constraint
sql_io.py: put_table(): DuplicateKeyException: Uniquifying input table to avoid internal duplicate keys: Also filter out duplicate rows in the out_table, so that they don't create duplicate key errors and the resulting index holes
sql_io.py: put_table(): ensure_cond(): Fixed bug where need to wrap strings used in the tracked error message in strings.ustr()
sql_io.py: put_table(): is_function: Fixed bug where need to add the pkeys table's test pkey constraint after the data is added rather than when the empty table is created, to avoid adding a pkey constraint that will later be violated by data which returns multiple output rows for an input row (such as calls to _split())
sql_io.py: put_table(): insert_into_pkeys(): Allow callers to override run_query_into()'s add_pkey_ param in case the initial version of the pkeys table should not yet have the test pkey constraint (e.g. because data is added after the table is created)
sql_io.py: cast(): Use sql_gen.Cast() to generate the cast, in order to take advantage of its support for casts to unknown
sql_io.py: put_table(): Special handling for functions with hstore params: Fixed bug where need to unwrap literal values of mapping, which might be sql_gen.Literal objects
sql_io.py: put_table(): Added special handling for functions with hstore params. Note that although _map() doesn't exist yet as a DB function, this code must be in place before _map() is created to avoid param type mismatch errors.
sql_io.py: put_table(): Removed assertion that into's row count be at least full_in_table's row count, because now that DISTINCT ON is used to satisfy the into table pkey, this is no longer necessarily true
sql_io.py: put_table(): Setting pkeys of missing rows: Fixed bug in column-based import where when input rows match multiple output rows in one of this iteration's input tables, the into table's pkey constraint is violated because full_in_table contains multiple entries for an input pkey
sql_io.py: put_table(): Switched back to using run_query_into()'s add_pkey_ option now that it uses sql.add_pkey_or_index() instead of sql.add_pkey()
sql_io.py: put_table(): ensure_cond(): When deleting rows rows that do not satisfy the condition, handle sql.DoesNotExistExceptions caused by columns in the condition that were not replaced with NULL. These occur when out_table is a function, and the columns of the table the condition relates to therefore can't be found using out_table.
sql_io.py: put_table(): Calling function: Do not cache the function call, because it may be retried after error handling
sql_io.py: put_table(): sql.DatabaseErrors: Factored exception-handling code out into handle_unknown_exc(), for use by other exception handlers
sql_io.py: put_table(): ensure_cond(): is_function: Fixed bug where can't replace out_table_cols with NULL because out_table is a function, not a table
sql_io.py: put_table(): ensure_cond(): Fixed bug where test if any rows failed cond did not check if cur != None (which is the case when cond == sql_gen.true_expr) before checking cur.rowcount
sql_io.py: put_table(): Assertion that into and full_in_table have the same row count: Allow into to have more rows than full_in_table, in case an input row matched multiple output rows. This should not happen for a properly-configured database, but seems to happen periodically nevertheless (currently, to the MO datasource) and should not abort the import when it does.
sql_io.py: put_table(): ensure_cond(): When adding the failed condition to the errors table, also include the original, untranslated condition from the DB schema in addition to the translation of the condition into the input schema
sql_io.py: track_data_error(): Fixed bug where errors whose column had no srcs (indicated by () ) were incorrectly being ignored. This affected NOT NULL exceptions where the column was not provided by the dataset.
sql_io.py: track_data_error(): Support errors with no columns by inserting a single entry with column set to NULL
sql_io.py: mk_errors_table(): Made "column" column nullable, because some errors (such as check constraint violations) don't have any corresponding columns if its columns weren't provided in the input data
sql_io.py: cast(): Set the created function's value param type to anyelement to support any input type, not just text
sql_io.py: put_table(): insert_into_pkeys(): Use new sql.add_pkey_or_index() instead of sql.add_pkey() in order to just print a warning if for some reason there were duplicate entries for an input row in the iteration's pkeys table. This should provide a workaround for bugs (often in the schema itself, related to its unique indexes) that cause an input row to match multiple output rows when joining on the output table using the unique constraint's columns.
sql_io.py: put_table(): is_function: Moved definition of wrapper function inside try block of main loop because the creation of the empty pkeys table (whose row type is needed for the wrapper function) can itself produce MissingCastExceptions, which must be thrown inside the loop in order to be handled properly
sql_io.py: put(): Pass on_error through to put_table()
sql_io.py: put_table(): log_exc(): Return False if removing all rows and have callers break the main loop so that no further exception-handling code is processed before the main loop is exited
sql_io.py: import_csv(): Add a row_num column at the beginning of the table, which is autopopulated by csvs.RowNumFilter (it cannot be autopopulated by the serial datatype, because this does not support COPY FROM with a NULL-equivalent value in the serial field). This fixes a bug in csv2db where rows would not stay in inserted order upon querying the table, and would be returned in a different order each query, which prevented LIMIT/OFFSET based subsetting from returning consistent, nonoverlapping results. This occurs because PostgreSQL unfortunately does not return rows in inserted order (or any stable order: "If sorting is not chosen, the rows will be returned in an unspecified order [which] must not be relied on" <http://www.postgresql.org/docs/8.3/static/queries-order.html>), so an explicit ORDER BY is always needed to ensure staging table rows are retrievable in the order they were inserted.
sql_io.py: import_csv(): Take a reader and header rather than a stream to allow callers to pass in a wrapped CSV reader for filtering, etc.
sql_io.py: append_csv(): Take a reader and header rather than a stream_info and stream to allow callers to use the simpler csvs.reader_and_header() function. This also allows callers to pass in a wrapped CSV reader for filtering, etc.
sql_io.py: append_csv(): Wrap input stream in a ProgressInputStream that reports rows (rather than lines) read
sql_io.py: append_csv(): Removed no longer used INSERT mode, since all callers now use the default COPY FROM
sql_io.py: import_csv(): Removed no longer needed manual setting of use_copy_from, which defaults to True in append_csv()
sql_io.py: append_csv(): Parse any exceptions generated by the COPY FROM using new sql.parse_exception()
sql_io.py: append_csv(): Don't disable COPY FROM for TSVs, which are now supported using csvs.InputRewriter
sql_io.py: append_csv(): COPY FROM: Wrap provided stream in standardizing stream to fix ragged rows (with unequal # columns) and nonstandard CSV dialects (such as TSV with \-escaped newlines)
sql_io.py: row_num_col_def: Changed type to integer so the row_num can be populated directly by the insert process
sql_io.py: Added row_num_col_def for use by import_csv(). The row_num column will be necessary again because PostgreSQL unfortunately does not return rows in inserted order (or any stable order: "If sorting is not chosen, the rows will be returned in an unspecified order [which] must not be relied on" <http://www.postgresql.org/docs/8.3/static/queries-order.html>), so an explicit ORDER BY is always needed to ensure staging table rows are retrievable in the order they were inserted.
sql_io.py: put_table(): Ensuring into's out_pkey is different from in_pkey: Prepend "out." instead of out_table to avoid long column names for the output pkey
sql_io.py: put_table(): full_in_table: Create it using new sql.copy_table() instead of sql.run_query_into()
sql.mk_select() calls: Removed no longer needed order_by=None when limit=0
sql.select() calls: Removed order_by=None everywhere that a stable row order is required (i.e. consistent between selects, or consistent between table transformations). This causes several tests to return different inserted row counts, because the input table is now being accessed in pkey order instead of in table order. This fixes a bug where tables with more rows than ~100 would return different results for repeated calls of the same non-ordered select.
sql_io.py: put_table(): ensure_cond(): track_data_error(): Concatenate the columns in the constraint together using , rather than adding a separate entry for each column, because the constraint is applicable to all columns together rather than to each column separately
sql_io.py: put_table(): Renamed ignore_cond() to ensure_cond() for clarity
sql_io.py: put_table(): DuplicateKeyException: Fixed bug where indexes with conditions needed to have the input rows filtered by the condition, to prevent trying to retrieve an existing/inserted row using a join on the index columns when the index in fact does not apply. This fixes a bug in the import of taxonconcept where the taxonconcept_0_unique_identifying_name unique index has a condition which was not satisfied for input rows with no identifyingtaxonomicname, causing any input row with NULL in this column to match all taxonconcepts with a NULL identifyingtaxonomicname. This uses ignore_cond()'s new support for constraints that did not fail at least once.
sql_io.py: put_table(): ignore_cond(): Added support for constraints that did not fail at least once, and therefore should not be required to simplify to a non-false value. As part of this, only track the failed constraint in the errors table if it actually failed at least once based on the deleted row count or the `failed` param.
sql_io.py: put_table(): insert_into_pkeys(): Take a query as the param instead of sql.mk_select()'s params, to allow the caller to pass in any query without needing insert_into_pkeys() to manually pass through those args
sql_io.py: put_table(): ignore_cond(): Log message: Replaced don't with do not so it wouldn't mess up syntax highlighting when viewing the log file in a text editor
sql_io.py: cleanup_table(): Use sql.table_pkey_col() instead of sql.pkey_col() so that only an actual pkey column is removed from the list of columns to clean. This fixes a bug where the first column in the table was not cleaned up if there was no pkey. Note that this bug only affected newly re-created staging tables, because staging tables previously had a special row_num pkey column added if they did not already have a pkey. The row_num column is now added by column-based import instead.
sql.py: Renamed pkey() to pkey_name()
sql.py: Renamed pkey_col_() to pkey_col()
cleanup_table(): Use new sql.table_cols() instead of sql.table_col_names()
sql_io.py: put_table(): Resolving default value column: If ignoring all rows, use input cols directly instead of cols from joined-together input table. In addition to being simpler, this prevents the returned column's name from growing longer and longer as each iteration prepends its input table table name to the default value column name.
sql_io.py: put_table(): Moved changing the table of the default value column from Resolving the default value column to Setting pkeys of missing rows, because the table change is only needed in this section
sql_io.py: put_table(): Resolving default value column: Always call sql_gen.remove_col_rename() because it will just pass the value through if it's not a column
sql_io.py: put_table(): Replaced limit_ref integer with ignore_all_ref boolean, because it is no longer used as a select statement limit
sql_io.py: put_table(): remove_all_rows(): Corrected "just create an empty pkeys table" comment to "just return the default value column"
sql_io.py: put_table(): mk_main_select(): Removed setting limit to limit_ref0, because an empty pkeys table is no longer created when ignoring all rows
sql_io.py: put_table(): Setting pkeys of missing rows: Removed "limit_ref0 == 0" check because this code is never reached in that case
sql_io.py: put_table(): Ignoring all rows for unrecoverable errors: Even in multi-row mode, just return whatever the default value or column was, instead of creating an output table containing the default value filled in for every row. This also assists the optimization to skip empty levels of taxonconcepts, because it folds the empty level to that level's parent level rather than creating a whole new temp table with ultimately the same contents.
sql_io.py: put_table(): ignore_cond(): Changed "Ignoring rows where" message with the negated (filter-out) condition to "Ignoring rows that don't satisfy" with the filter condition for clarity
sql_io.py: put_table(): ignore_cond(): If cond simplifies to false, remove all rows instead of filtering out individual rows which will all be filtered out. This optimization should improve import times of tables, such as taxonconcept, which use a check constraint instead of NOT NULL constraints to prevent empty rows. The taxonomic schema refactoring caused the creation of many more levels of taxonconcepts, many of which (such as variety, forma, cultivar) are empty for most datasources, so this optimization should also reduce overall import times for datasources that have any empty levels of taxonconcept. Note that this optimization is only possible now that sql_gen.simplify_expr() is able to simplify all the way to a single boolean value for the taxonconcept_required_key constraint.
Moved expression transforming functions from sql.py to sql_gen.py because they do not manipulate an actual database and merely generate SQL
sql.py: Renamed table_cols() to table_col_names() for clarity, because it does not return sql_gen.Col objects
sql_io.py: put_table(): Resolving default value column: Fixed bug where the default value col needed to have its table changed from in_table to full_in_table if it's a table column, and needed to have any column rename removed if it's a literal value
sql_io.py: put_table(): Resolve default value column after the main loop (inserts and selects), so that the default value column can refer to an output column that is not in the original mapping but is added to the mapping from a col_defaults entry. This requires deferring the "Missing mapping for NOT NULL column" warning until the default value column is resolved, and including all columns in the full_in_table since the default value input column is not yet known.
sql_io.py: put_table(): Fixed bug where row_ct_ref was incorrectly being incremented when the iteration is a function call. This bug only occurred in row-based mode, because the DB cursor for a function call is not stored in column-based mode.
sql_io.py: append_csv(): In INSERT mode, print # rows read (different from # lines read if some fields contained embedded newlines) and # rows inserted (different from # rows read if some violated a constraint)
sql_io.py: cleanup_table(): Don't clean up the pkey, because the canonicalization involved may produce collisions (as it does for TNRS.tnrs)
sql_io.py: put_table(): Removed comment that can support in_tables of any fixed-size iterable type, because the iterable must be ordered so that the first table can be treated specially
sql_io.py: put_table(): Support in_tables of any fixed-size iterable type
sql_io.py: append_csv(): Using INSERT: Use ignore mode to support inserting rows into a table with a unique constraint
sql_io.py: append_csv(): Check that the CSV's header matches the table's columns
sql_io.py: import_csv(): Don't add a row number column to the created table because it is now added automatically to the temp table by column-based import (row-based import now also does not require a pkey for DB inputs)
sql_io.py: import_csv(): Only do the import in a savepoint if using COPY FROM, to allow autocommits after each insert and thus make rows visible immediately after they are inserted
sql_io.py: import_csv(): Factored insertion code out into new append_csv()
sql_io.py: import_csv(): Documented that sql.truncate() MUST be run so that the rows will be stored in inserted order, and the row_num added after import will match up with the CSV's row order
sql_io.py: import_csv(): Free memory used by deleted rows from any failed import. Documented that sql.create_table() is not rolled back if the import fails, but instead is cached, and will not be re-run if the import is retried.