/ - Changes - BIEN 3 - NCEAS Projects

root @ 3114

#	Date	Author	Comment
3114	06/27/2012 05:32 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.
3113	06/27/2012 05:27 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.
3112	06/27/2012 05:17 PM	Aaron Marcuse-Kubitza	sql_io.py: cast(): Always convert exceptions to warnings if the input is a column or expression, even if there is no place to save the errors, so that invalid data does not need to be handled by the caller in a (much slower) extra exception-handling loop
3111	06/27/2012 04:47 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): MissingCastException: When casting, handle InvalidValueException by filtering out invalid values with invalid2null() in a loop
3110	06/27/2012 04:45 PM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Run sql.update() in recover mode in case expr produces errors. Don't cache sql.update() in case this function will be called again after error recovery.
3109	06/27/2012 04:40 PM	Aaron Marcuse-Kubitza	sql.py: Generalized FunctionValueException to InvalidValueException so that it will match all invalid-value errors, not just those occurring in user-defined functions
3108	06/27/2012 04:22 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Removed no longer used sql.FunctionValueException handling, because type casting functions now do their own invalid value handling
3107	06/27/2012 03:44 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)
3106	06/27/2012 03:14 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): mk_main_select(): Fixed bug where the table of each cond needed to be changed to insert_in_table because mk_main_select() uses the distinct table rather than the full input table
3105	06/27/2012 03:12 PM	Aaron Marcuse-Kubitza	sql_gen.py: with_table(): Support columns that are wrapped in a FunctionCall object
3104	06/27/2012 02:47 PM	Aaron Marcuse-Kubitza	sql_gen.py: index_cols: Store just the name of the index column, and add the table in index_col(), in case the table is ever copied and renamed
3103	06/26/2012 11:06 PM	Aaron Marcuse-Kubitza	Moved error tracking from sql.py to sql_io.py
3102	06/26/2012 11:04 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Use sql.distinct_table() to uniquify input table, instead of DISTINCT ON. This avoids letting PostgreSQL create a sort temp table to store the output of the DISTINCT ON, which is not automatically removed until the end of the connection, causing database bloat that can use up the available disk space.
3101	06/26/2012 10:36 PM	Aaron Marcuse-Kubitza	sql_gen.py: suffixed_table(): Use concat()
3100	06/26/2012 10:34 PM	Aaron Marcuse-Kubitza	sql_gen.py: with_default_table(): Remove no longer used overwrite param
3099	06/26/2012 10:33 PM	Aaron Marcuse-Kubitza	sql.py: distinct_table(): Return new table instead of renaming input table so that columns that use input table will continue to work correctly
3098	06/26/2012 10:31 PM	Aaron Marcuse-Kubitza	sql_gen.py: Moved NamedCol check from with_default_table() to with_table()
3097	06/26/2012 09:39 PM	Aaron Marcuse-Kubitza	sql.py: distinct_table(): Fixed bug where empty distinct_on cols needed to create a table with one sample row, instead of returning the original table, because this indicates that the full set of distinct_on columns are all literal values and should only occur once
3096	06/26/2012 09:12 PM	Aaron Marcuse-Kubitza	sql.py: run_query(): DuplicateKeyException: Fixed bug where only constraint names matching a certain format were interpreted as DuplicateKeyExceptions. Support constraint names with the name and table separated by ".", not just "_".
3095	06/26/2012 09:10 PM	Aaron Marcuse-Kubitza	sql.py: run_query(): Exception parsing: Match patterns only at the beginning of the exception message to avoid matching embedded messages in causes and literal values
3094	06/26/2012 08:46 PM	Aaron Marcuse-Kubitza	sql.py: Added distinct_table()
3093	06/26/2012 08:46 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added with_table() and use it in with_default_table()
3092	06/26/2012 07:52 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): ignore mode: Support inserting all columns when cols == None
3091	06/26/2012 07:47 PM	Aaron Marcuse-Kubitza	sql_gen.py: Col, Table: Support non-string names
3090	06/26/2012 07:25 PM	Aaron Marcuse-Kubitza	sql_gen.py: row_count: Use new all_cols
3089	06/26/2012 07:24 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added all_cols
3088	06/26/2012 07:17 PM	Aaron Marcuse-Kubitza	sql_gen.py: Use new as_Name() instead of db.esc_name()
3087	06/26/2012 07:12 PM	Aaron Marcuse-Kubitza	sql_gen.py: Name: Truncate the input name
3086	06/26/2012 07:11 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added Name class and associated functions
3085	06/26/2012 06:46 PM	Aaron Marcuse-Kubitza	sql.py: create_table(): Support creating temp tables. This fixes a bug in copy_table_struct() where the created table was not a temp table if the source table was. copy_table_struct(): Removed no longer needed versioning because that is now handled by create_table().
3084	06/26/2012 06:33 PM	Aaron Marcuse-Kubitza	sql.py: Added copy_table_struct()
3083	06/26/2012 06:32 PM	Aaron Marcuse-Kubitza	sql.py: Moved add_indexes() to Indexes subsection
3082	06/26/2012 06:30 PM	Aaron Marcuse-Kubitza	sql.py: create_table(): Support LIKE table
3081	06/26/2012 05:18 PM	Aaron Marcuse-Kubitza	Moved Data cleanup from sql.py to sql_io.py
3080	06/26/2012 05:18 PM	Aaron Marcuse-Kubitza	Moved error tracking from sql.py to sql_io.py
3079	06/26/2012 05:12 PM	Aaron Marcuse-Kubitza	sql.py: Organized Database structure introspection and Structural changes functions into subsections
3078	06/26/2012 04:56 PM	Aaron Marcuse-Kubitza	Moved error tracking from sql.py to sql_io.py
3077	06/26/2012 04:46 PM	Aaron Marcuse-Kubitza	Moved Heuristic queries from sql.py to new sql_io.py
3076	06/26/2012 04:32 PM	Aaron Marcuse-Kubitza	Added top-level analysis dir for range modeling
3075	06/26/2012 04:02 PM	Aaron Marcuse-Kubitza	sql.py: run_query_into(): Documented why analyze() must be run manually on newly populated temp tables
3074	06/26/2012 03:57 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added autoanalyze mode. Added autoanalyze() which runs analyze() only if in autoanalyze mode. Use new autoanalyze() in functions that change a table's contents.
3073	06/26/2012 03:52 PM	Aaron Marcuse-Kubitza	sql.py: run_query_into(): analyze() the created table to ensure the query planner's initial stats are accurate
3072	06/25/2012 09:46 PM	Aaron Marcuse-Kubitza	inputs/SpeciesLink/src: Added custom header that overwrites existing header so that column names will not be too long for the staging table
3071	06/25/2012 09:35 PM	Aaron Marcuse-Kubitza	cat_csv: Support overwriting the existing header using a separate header file
3070	06/25/2012 08:49 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Added location.location_coords index to speed up large imports by providing an index for merge joins
3069	06/25/2012 08:43 PM	Aaron Marcuse-Kubitza	csv2db: Reanalyze table, so that query planner stats are up to date even though the table doesn't need to be vacuumed anymore
3068	06/25/2012 08:42 PM	Aaron Marcuse-Kubitza	sql.py: Added analyze()
3067	06/25/2012 08:11 PM	Aaron Marcuse-Kubitza	csv2db: Removed no longer needed table vacuum (cleanup_table() now avoids creating dead rows)
3066	06/25/2012 08:10 PM	Aaron Marcuse-Kubitza	sql.py: cleanup_table(): Use update()'s new in_place mode to avoid needing to vacuum the table
3065	06/25/2012 08:02 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): in_place: Support updating multiple columns at once
3064	06/25/2012 07:44 PM	Aaron Marcuse-Kubitza	sql.py: update() calls: Use in_place where possible to avoid creating dead rows, which bloats table size
3063	06/25/2012 07:37 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.col_info(): Support user-defined types
3062	06/25/2012 07:33 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added Nullif
3061	06/25/2012 07:27 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added Coalesce class and use it in EnsureNotNull
3060	06/25/2012 07:15 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added coalesce and use it in EnsureNotNull
3059	06/25/2012 07:00 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.col_info(): Don't cache the structure query because the column type, etc. may change between calls. This fixes a bug in mk_update() where the column type would be retrieved before a NOT NULL constraint was added, causing the NOT NULL constraint not to be in the cache info about the column.
3058	06/25/2012 06:56 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Implemented in_place mode
3057	06/25/2012 06:39 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Factored out filtering of input values so only `.to_str(db)` is used inline in the creation of the query
3056	06/25/2012 06:32 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Added in_place param
3055	06/25/2012 06:13 PM	Aaron Marcuse-Kubitza	csvs.py: TsvReader: Prevent "new-line character seen in unquoted field" errors by replacing '\r' with '\n'
3054	06/25/2012 06:12 PM	Aaron Marcuse-Kubitza	csv2db: Adding indexes: Fixed bug where sql.add_index()'s ensure_not_null param needed to be renamed to ensure_not_null_
3053	06/25/2012 05:16 PM	Aaron Marcuse-Kubitza	sql.py: cast(): columns values clause: Use start=0 to avoid "SELECT statement missing a WHERE, LIMIT, or OFFSET clause" warning
3052	06/25/2012 02:45 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated for most recent run
3051	06/22/2012 07:46 PM	Aaron Marcuse-Kubitza	sql.py: Removed no longer needed mk_track_data_error()
3050	06/22/2012 07:45 PM	Aaron Marcuse-Kubitza	sql.py: track_data_error(): Use for loop and insert() (ignoring DuplicateKeyException) to insert entries into the errors table, to get the same optimization benefits this change provides in other filter-out contexts, and to improve clarity
3049	06/22/2012 07:35 PM	Aaron Marcuse-Kubitza	sql.py: cast(): Use FOR loop with EXCEPTION block instead of CROSS JOIN with LEFT JOIN to insert entries into the errors table, to get the same optimization benefits this change provides in other filter-out contexts, and to improve clarity
3048	06/22/2012 07:31 PM	Aaron Marcuse-Kubitza	sql_gen.py: NamedValues: Support None cols param for no named columns
3047	06/22/2012 06:55 PM	Aaron Marcuse-Kubitza	sql.py: ensure_not_null(): Made the use of index columns configurable based on the # of rows in the table, because for small datasources, they seem to add 6-25% to the total import time
3046	06/22/2012 06:12 PM	Aaron Marcuse-Kubitza	xml_func.py: _noCV: Fixed bug where assumed items was an iterator when it's now a list
3045	06/22/2012 06:02 PM	Aaron Marcuse-Kubitza	sql.py: add_index_col(), cast_temp_col(): Cache add_col() by providing new comment param to distinguish new columns of the same (colliding) suffixed name but from different source columns
3044	06/22/2012 05:54 PM	Aaron Marcuse-Kubitza	sql.py: add_index_col(), cast_temp_col(): Cache the update that fills in the new column, since it's idempotent
3043	06/22/2012 05:52 PM	Aaron Marcuse-Kubitza	sql.py: update(): Pass cacheable to run_query()
3042	06/22/2012 05:48 PM	Aaron Marcuse-Kubitza	sql.py: add_col(): Added comment param which can be used to distinguish columns of the same name from each other when they contain different data, to allow the ADD COLUMN query (and related queries, such as adding indexes) to be cached
3041	06/22/2012 05:46 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added esc_comment()
3040	06/22/2012 05:37 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.DbCursor.execute(): Allow ADD COLUMN to be cached if it has a distinguishing comment, because then the rest of query will be unique in the face of name collisions
3039	06/22/2012 05:29 PM	Aaron Marcuse-Kubitza	sql.py: add_col(): Document that name may be versioned, so caller needs to propagate any renaming back to any source column for the TypedCol
3038	06/22/2012 05:27 PM	Aaron Marcuse-Kubitza	sql.py: add_col() callers: Removed column name versioning because that is now handled by add_col()
3037	06/22/2012 05:22 PM	Aaron Marcuse-Kubitza	sql.py: add_col() callers: Fixed bug where needed to propagate any renaming of typed column back to regular column
3036	06/22/2012 05:19 PM	Aaron Marcuse-Kubitza	sql.py: add_col(): Version column names to avoid collisions. (Previously, callers were required to do this themselves.)
3035	06/22/2012 05:13 PM	Aaron Marcuse-Kubitza	sql.py: cast_temp_col(): Handle column name collisions like add_index_col()
3034	06/22/2012 04:49 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): INSERT IGNORE: Switched to using FOR loop rather than cursors because cursors are only needed if you want to process multiple rows in the same EXCEPTION block (which you can't do because then all the previous inserts will be rolled back if one row is a duplicate key)
3033	06/22/2012 04:40 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): INSERT IGNORE: Moved FETCH FROM cursor outside EXCEPTION block since only the insert needs to have unique_violations ignored
3032	06/22/2012 04:33 PM	Aaron Marcuse-Kubitza	sql.py: put_table(): Removed no longer needed EXCLUSIVE lock on the output table
3031	06/22/2012 04:25 PM	Aaron Marcuse-Kubitza	add_index_col(): Run all operations with log_level=3 because it is the equivalent of creating an index, and that has log_level=3 to be excluded from the primary algorithm steps. add_not_null(): Run query with log_level=3 because it's an internal operation and for add_index_col() above.
3030	06/22/2012 04:23 PM	Aaron Marcuse-Kubitza	sql.py: update(): Pass any log_level to run_query()
3029	06/22/2012 04:11 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Added pass-through optimization for aggregating functions with one arg
3028	06/22/2012 04:02 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Call list() on the items' iterator so that its length can be determined wherever it's used without needing to call list() on it separately
3027	06/22/2012 03:43 PM	Aaron Marcuse-Kubitza	sql.py: mk_select(): Put '*' on same line as 'SELECT' if there is no DISTINCT ON, to avoid extra newlines
3026	06/22/2012 03:38 PM	Aaron Marcuse-Kubitza	sql_gen.py: NamedTable.to_str(): Put AS clause on same line as code being named if it didn't contain a newline, to avoid extra newlines
3025	06/22/2012 03:29 PM	Aaron Marcuse-Kubitza	sql.py: run_query(): Always parse exceptions, whether recover is set or not, to avoid making the caller set recover just to parse exceptions. If recover is not set, just don't run any queries when generating the parsed exception and return an empty value for the needed information. (A partial parsed exception is better than an unparsed one.)
3024	06/22/2012 03:18 PM	Aaron Marcuse-Kubitza	sql_gen.py: is_temp_col(): Use is_table_col() so that the type of the parameter is also checked
3023	06/22/2012 03:15 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Bolded the ms/row totals. Added note that non-bolded totals are not directly comparable because different methods were run with different numbers of rows.
3022	06/22/2012 03:10 PM	Aaron Marcuse-Kubitza	sql.py: put_table(): Inserting new rows: Removed no longer needed filter-out LEFT JOIN, since the INSERT IGNORE replacement handles duplicate keys
3021	06/22/2012 03:02 PM	Aaron Marcuse-Kubitza	sql.py: put_table(): Inserting new rows: Use insert_select()'s (mk_insert_select()'s) new ignore option to ignore duplicate keys
3020	06/22/2012 02:59 PM	Aaron Marcuse-Kubitza	sql_gen.py: to_name_only_col(): Consider any non-table column, including NamedCol, to be already name-only. This fixes a bug in sql.mk_insert_select() where the value of a NamedCol was removed by to_name_only_col() even though it was needed.
3019	06/22/2012 02:30 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): INSERT IGNORE: Use an EXCEPTION block for each individual row because "When an error is caught by an EXCEPTION clause, [...] all changes to persistent database state within the block are rolled back." (http://www.postgresql.org/docs/8.3/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING). Documented that cursor stays at current position in spite of automatic ROLLBACK on EXCEPTION.
3018	06/22/2012 02:11 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): INSERT IGNORE: Use RETURN QUERY on the insert query instead of RETURN NEXT on a local var to forward the RETURNING value
3017	06/22/2012 02:08 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): INSERT IGNORE: Support RETURNING clause. Always return something to set the correct rowcount.
3016	06/22/2012 02:06 PM	Aaron Marcuse-Kubitza	sql_gen.py: Use an as_*() function instead of manually checking the type wherever possible
3015	06/22/2012 01:18 PM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): INSERT IGNORE: Added duplicate key handling by using EXCEPTION block to catch unique_violation

Project

General

Profile

root @ 3114