input.Makefile: Added import/steps.by_col.sql to generate a Redmine-formatted list of steps for column-based import
bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress
sql_io.py: put(): Run data import queries with log_level=3.5 so they don't clutter the output at the normal import verbosity of 3
db_xml.py: put_table(): Work around PostgreSQL's temp table disk space leak by reconnecting to the DB after every partition
sql.py: mk_select(): Also support limit and start values of type long
sql_gen.py: suffixed_table(): Fixed bug where needed to copy all table attrs, such as is_temp status
sql.py: create_table(): Fixed bug where needed to run query in recover mode in case the table exists and was created before the current connection, such that the CREATE TABLE statement would not have been cached
sql.py: create_table(): Removed final newline after query because that's added by the logging mechanism
sql.py: Added reconnect()
sql.py: DbConn._reset(): Assert that _savepoint is 0 instead of setting it to 0
db_xml.py: put_table(): put_table_(): Removed no longer used limit, start params
db_xml.py: put_table(): Merged partitioning and subsetting into same section for simplicity, to avoid creating extra temp tables, and to later allow the connection to be closed and reopened between partitions. partition_size: Expressed value without exponent notation to ensure that it's an integer.
db_xml.py: put_table(): Partitioning in_table: Adjust bounds of last partition to actual row #s included
sql.py: DbConn: Added _ to reset() to indicate that it's a protected method and users should not call it directly
sql.py: DbConn.close(): Reset the connection completely using reset()
sql.py: DbConn: Added clear_cache() and reset() and use reset() in init()
bin/map: Use new DbConn.close()
sql.py: DbConn: Added close()
db_xml.py: partition_size: Set to just more than the size of the largest data source that was successfully imported in simultaneous import
db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.
db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.
sql_io.py: cast(): Always convert exceptions to warnings if the input is a column or expression, even if there is no place to save the errors, so that invalid data does not need to be handled by the caller in a (much slower) extra exception-handling loop
sql_io.py: put_table(): MissingCastException: When casting, handle InvalidValueException by filtering out invalid values with invalid2null() in a loop
sql_io.py: cast_temp_col(): Run sql.update() in recover mode in case expr produces errors. Don't cache sql.update() in case this function will be called again after error recovery.
sql.py: Generalized FunctionValueException to InvalidValueException so that it will match all invalid-value errors, not just those occurring in user-defined functions
sql_io.py: put_table(): Removed no longer used sql.FunctionValueException handling, because type casting functions now do their own invalid value handling
db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)
sql_io.py: put_table(): mk_main_select(): Fixed bug where the table of each cond needed to be changed to insert_in_table because mk_main_select() uses the distinct table rather than the full input table
sql_gen.py: with_table(): Support columns that are wrapped in a FunctionCall object
sql_gen.py: index_cols: Store just the name of the index column, and add the table in index_col(), in case the table is ever copied and renamed
Moved error tracking from sql.py to sql_io.py
sql_io.py: put_table(): Use sql.distinct_table() to uniquify input table, instead of DISTINCT ON. This avoids letting PostgreSQL create a sort temp table to store the output of the DISTINCT ON, which is not automatically removed until the end of the connection, causing database bloat that can use up the available disk space.
sql_gen.py: suffixed_table(): Use concat()
sql_gen.py: with_default_table(): Remove no longer used overwrite param
sql.py: distinct_table(): Return new table instead of renaming input table so that columns that use input table will continue to work correctly
sql_gen.py: Moved NamedCol check from with_default_table() to with_table()
sql.py: distinct_table(): Fixed bug where empty distinct_on cols needed to create a table with one sample row, instead of returning the original table, because this indicates that the full set of distinct_on columns are all literal values and should only occur once
sql.py: run_query(): DuplicateKeyException: Fixed bug where only constraint names matching a certain format were interpreted as DuplicateKeyExceptions. Support constraint names with the name and table separated by ".", not just "_".
sql.py: run_query(): Exception parsing: Match patterns only at the beginning of the exception message to avoid matching embedded messages in causes and literal values
sql.py: Added distinct_table()
sql_gen.py: Added with_table() and use it in with_default_table()
sql.py: mk_insert_select(): ignore mode: Support inserting all columns when cols == None
sql_gen.py: Col, Table: Support non-string names
sql_gen.py: row_count: Use new all_cols
sql_gen.py: Added all_cols
sql_gen.py: Use new as_Name() instead of db.esc_name()
sql_gen.py: Name: Truncate the input name
sql_gen.py: Added Name class and associated functions
sql.py: create_table(): Support creating temp tables. This fixes a bug in copy_table_struct() where the created table was not a temp table if the source table was. copy_table_struct(): Removed no longer needed versioning because that is now handled by create_table().
sql.py: Added copy_table_struct()
sql.py: Moved add_indexes() to Indexes subsection
sql.py: create_table(): Support LIKE table
Moved Data cleanup from sql.py to sql_io.py
sql.py: Organized Database structure introspection and Structural changes functions into subsections
Moved Heuristic queries from sql.py to new sql_io.py
Added top-level analysis dir for range modeling
sql.py: run_query_into(): Documented why analyze() must be run manually on newly populated temp tables
sql.py: DbConn: Added autoanalyze mode. Added autoanalyze() which runs analyze() only if in autoanalyze mode. Use new autoanalyze() in functions that change a table's contents.
sql.py: run_query_into(): analyze() the created table to ensure the query planner's initial stats are accurate
inputs/SpeciesLink/src: Added custom header that overwrites existing header so that column names will not be too long for the staging table
cat_csv: Support overwriting the existing header using a separate header file
schemas/vegbien.sql: Added location.location_coords index to speed up large imports by providing an index for merge joins
csv2db: Reanalyze table, so that query planner stats are up to date even though the table doesn't need to be vacuumed anymore
sql.py: Added analyze()
csv2db: Removed no longer needed table vacuum (cleanup_table() now avoids creating dead rows)
sql.py: cleanup_table(): Use update()'s new in_place mode to avoid needing to vacuum the table
sql.py: mk_update(): in_place: Support updating multiple columns at once
sql.py: update() calls: Use in_place where possible to avoid creating dead rows, which bloats table size
sql.py: DbConn.col_info(): Support user-defined types
sql_gen.py: Added Nullif
sql_gen.py: Added Coalesce class and use it in EnsureNotNull
sql_gen.py: Added coalesce and use it in EnsureNotNull
sql.py: DbConn.col_info(): Don't cache the structure query because the column type, etc. may change between calls. This fixes a bug in mk_update() where the column type would be retrieved before a NOT NULL constraint was added, causing the NOT NULL constraint not to be in the cache info about the column.
sql.py: mk_update(): Implemented in_place mode
sql.py: mk_update(): Factored out filtering of input values so only `.to_str(db)` is used inline in the creation of the query
sql.py: mk_update(): Added in_place param
csvs.py: TsvReader: Prevent "new-line character seen in unquoted field" errors by replacing '\r' with '\n'
csv2db: Adding indexes: Fixed bug where sql.add_index()'s ensure_not_null param needed to be renamed to ensure_not_null_
sql.py: cast(): columns values clause: Use start=0 to avoid "SELECT statement missing a WHERE, LIMIT, or OFFSET clause" warning
inputs/import.stats.xls: Updated for most recent run
sql.py: Removed no longer needed mk_track_data_error()
sql.py: track_data_error(): Use for loop and insert() (ignoring DuplicateKeyException) to insert entries into the errors table, to get the same optimization benefits this change provides in other filter-out contexts, and to improve clarity
sql.py: cast(): Use FOR loop with EXCEPTION block instead of CROSS JOIN with LEFT JOIN to insert entries into the errors table, to get the same optimization benefits this change provides in other filter-out contexts, and to improve clarity
sql_gen.py: NamedValues: Support None cols param for no named columns
sql.py: ensure_not_null(): Made the use of index columns configurable based on the # of rows in the table, because for small datasources, they seem to add 6-25% to the total import time
xml_func.py: _noCV: Fixed bug where assumed items was an iterator when it's now a list
sql.py: add_index_col(), cast_temp_col(): Cache add_col() by providing new comment param to distinguish new columns of the same (colliding) suffixed name but from different source columns
sql.py: add_index_col(), cast_temp_col(): Cache the update that fills in the new column, since it's idempotent
sql.py: update(): Pass cacheable to run_query()
sql.py: add_col(): Added comment param which can be used to distinguish columns of the same name from each other when they contain different data, to allow the ADD COLUMN query (and related queries, such as adding indexes) to be cached
sql_gen.py: Added esc_comment()
sql.py: DbConn.DbCursor.execute(): Allow ADD COLUMN to be cached if it has a distinguishing comment, because then the rest of query will be unique in the face of name collisions
sql.py: add_col(): Document that name may be versioned, so caller needs to propagate any renaming back to any source column for the TypedCol
sql.py: add_col() callers: Removed column name versioning because that is now handled by add_col()
sql.py: add_col() callers: Fixed bug where needed to propagate any renaming of typed column back to regular column
sql.py: add_col(): Version column names to avoid collisions. (Previously, callers were required to do this themselves.)
sql.py: cast_temp_col(): Handle column name collisions like add_index_col()
sql.py: mk_insert_select(): INSERT IGNORE: Switched to using FOR loop rather than cursors because cursors are only needed if you want to process multiple rows in the same EXCEPTION block (which you can't do because then all the previous inserts will be rolled back if one row is a duplicate key)