# Date Author Comment
3161 06/29/2012 01:54 AM Aaron Marcuse-Kubitza

sql.py: DbConn._db(): Record that a transaction is already open before setting the search_path so that a query is never run with a _savepoint value less than 1 (manual transactions are not supported yet)

3160 06/29/2012 01:52 AM Aaron Marcuse-Kubitza

sql.py: DbConn.with_savepoint(): Increment _savepoint before running queries so they don't get autocommitted
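
The ordering described above (bump the savepoint counter first, then run queries) can be illustrated with a small stand-in. This is only a sketch under stated assumptions: SavepointCounter, its depth attribute, its log list, and the savepoint naming are hypothetical and are not taken from sql.py. The class records queries instead of sending them to a database.

```python
import contextlib


class SavepointCounter:
    """Hypothetical stand-in for a connection that tracks savepoint depth."""

    def __init__(self):
        # Assumption: the driver is not in autocommit mode, so a transaction
        # is already open and the depth starts at 1.
        self.depth = 1
        self.log = []  # (depth, query) pairs, recorded instead of executed

    def run(self, query):
        # A depth below 1 would mean the query runs outside any transaction.
        assert self.depth >= 1, "query would run outside a transaction"
        self.log.append((self.depth, query))

    @contextlib.contextmanager
    def with_savepoint(self):
        name = "level_%d" % self.depth
        self.depth += 1  # increment BEFORE running any query
        self.run("SAVEPOINT " + name)
        try:
            yield
        except Exception:
            self.run("ROLLBACK TO SAVEPOINT " + name)
            raise
        else:
            self.run("RELEASE SAVEPOINT " + name)
        finally:
            self.depth -= 1  # restore the depth on every exit path
```

Every query issued inside the block, including the SAVEPOINT statement itself, is recorded at the incremented depth, and the depth returns to its previous value on both the success and the failure path.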

3159 06/29/2012 01:10 AM Aaron Marcuse-Kubitza

sql.py: empty_temp(): Empty temp tables even in debug_temp mode, so that it can be seen which tables have been garbage collected and disk space leaks can be detected. This will not affect the external re-runnability of slow queries in debug_temp mode, as long as the user aborts the debug_temp import while the slow query is still running.

3158 06/29/2012 01:07 AM Aaron Marcuse-Kubitza

sql_gen.py: ColDict: Use OrderedDict so that order of keys in input dict (if ordered) will be preserved. This should ensure that temp table unique indexes have their columns in the same order as the output table, so that a merge join can be used.

3157 06/29/2012 01:01 AM Aaron Marcuse-Kubitza

util.py: dict_subset(): Use OrderedDict so that order of keys in input dict (if ordered) will be preserved
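
One possible reading of this change, as a minimal sketch: the function name comes from the log entry, but the signature and body below are assumptions and not the actual util.py code. The subset is built by walking the input dict, so an ordered input yields its keys in the same relative order.

```python
from collections import OrderedDict


def dict_subset(dict_, keys):
    """Return the entries of dict_ whose keys are in keys.

    Assumed behavior: iterate over dict_ itself, so that if dict_ is
    ordered, the result keeps dict_'s key order rather than the order
    of the keys argument.
    """
    wanted = set(keys)
    return OrderedDict((k, v) for k, v in dict_.items() if k in wanted)
```

With an input ordered as b, a, c and requested keys c, b, the result lists b before c, following the input dict.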

3156 06/29/2012 12:55 AM Aaron Marcuse-Kubitza

main Makefile: python-Darwin: Added pip installation instructions. python-Linux: Added ordereddict.

3155 06/29/2012 12:04 AM Aaron Marcuse-Kubitza

sql.py: DbConn.col_info(): cacheable param defaults to True now that callers explicitly turn off cacheable when needed

3154 06/29/2012 12:00 AM Aaron Marcuse-Kubitza

sql.py: add_index_col(): Explicitly set update()'s col_info caching depending on whether col_info will be changed later by add_not_null()

3153 06/28/2012 11:55 PM Aaron Marcuse-Kubitza

sql.py: mk_update(): Renamed cacheable param to cacheable_ so it wouldn't conflict with update()'s cacheable param

3152 06/28/2012 11:54 PM Aaron Marcuse-Kubitza

sql.py: mk_update(): Added cacheable param to set whether column structure information used to generate the query can be cached

3151 06/28/2012 11:40 PM Aaron Marcuse-Kubitza

sql.py: add_index_col(): Explicitly set col_info()'s caching depending on whether col_info will be changed later by add_not_null()

3150 06/28/2012 11:35 PM Aaron Marcuse-Kubitza

sql.py: DbConn.col_info(): Allow caller to specify whether query is cacheable
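
The idea of a caller-controlled cacheable flag can be sketched as follows. QueryCache and its execute callback are hypothetical names used for illustration only; they do not reflect how DbConn actually stores cached results.

```python
class QueryCache:
    """Hypothetical query runner whose callers decide what may be cached."""

    def __init__(self, execute):
        self._execute = execute  # callable that actually runs a query string
        self._results = {}       # query string -> cached result

    def run(self, query, cacheable):
        # Only reuse a stored result when the caller marked the query
        # as safe to cache.
        if cacheable and query in self._results:
            return self._results[query]
        result = self._execute(query)
        if cacheable:
            self._results[query] = result
        return result
```

A query whose result can change later, or a statement that must be re-executed after a rollback, would be run with cacheable=False so that it always reaches the database.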

3149 06/28/2012 11:22 PM Aaron Marcuse-Kubitza

csv2db: Fixed bug where CREATE TABLE statement was cached, causing it not to be re-executed after a rollback due to a failed COPY FROM. Avoid re-creating the table after a failed COPY FROM, and instead just remove any existing rows.

3148 06/28/2012 11:09 PM Aaron Marcuse-Kubitza

sql.py: add_index(): Don't generate a unique name for the index because the database does that automatically

3147 06/28/2012 11:00 PM Aaron Marcuse-Kubitza

csv2db: Vacuum table instead of just reanalyzing it because for some reason reanalyzing it isn't enough to fix the cached row count (causing pgAdmin3 to report that the table needs to be vacuumed)

3146 06/28/2012 10:54 PM Aaron Marcuse-Kubitza

csv2db: Don't add indexes on the created table because they use up more disk space than the table itself and currently aren't used. (The import process adds indexes on each iteration's column subset instead.)

3145 06/28/2012 10:21 PM Aaron Marcuse-Kubitza

db_xml.py: partition_size: Turning partitioning back on (with a larger limit), since the largest datasources' temp tables are still too big

3144 06/28/2012 10:20 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Fixed bug where if there were multiple unique constraints that were violated, only the distinct temp table for the last one would get garbage-collected

3143 06/28/2012 09:01 PM Aaron Marcuse-Kubitza

db_xml.py: partition_size: Set to sys.maxint to disable partitioning. The last bugfix, which avoided returning a large result set to the client which was never read, seems to have fixed the disk space leak, so it's worth reattempting a full simultaneous import.

3142 06/28/2012 08:30 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Subsetting in_table: Truncate in_table when finished with it, to avoid temp table disk space leaks

3141 06/28/2012 07:56 PM Aaron Marcuse-Kubitza

sql.py: insert_select(): If caller is only interested in the rowcount (if returning == None), keep the NULL rows for each insert on the server using CREATE TABLE AS. (CREATE TABLE AS sets rowcount to # rows in query, so rowcount will still be set correctly.)

3140 06/28/2012 04:59 PM Aaron Marcuse-Kubitza

top-level map: Added support for custom public schema, to be able to run imports and tests simultaneously (e.g. on a dev machine)

3139 06/27/2012 10:56 PM Aaron Marcuse-Kubitza

csv2db: Fixed bug where table needed to be a sql_gen.Table object with the proper schema, so that errors_table would be created in the correct schema. Removed the search_path change, which is no longer needed.

3138 06/27/2012 10:55 PM Aaron Marcuse-Kubitza

csv2db: Fixed bug where table needed to be a sql_gen.Table object with the proper schema, so that errors_table would be created in the correct schema. Removed the search_path change, which is no longer needed.

3137 06/27/2012 10:12 PM Aaron Marcuse-Kubitza

sql.py: DbConn.with_savepoint(): Open a new transaction if one is not already open

3136 06/27/2012 10:11 PM Aaron Marcuse-Kubitza

sql.py: DbConn: _savepoint starts at 1 because the driver is not in autocommit mode, so a transaction is already open

3135 06/27/2012 10:05 PM Aaron Marcuse-Kubitza

sql.py: DbConn: _savepoint starts at 1 because the driver is not in autocommit mode, so a transaction is already open

3134 06/27/2012 09:31 PM Aaron Marcuse-Kubitza

csv2db: Create errors table first, so that imports can start using it right away

3133 06/27/2012 09:25 PM Aaron Marcuse-Kubitza

input.Makefile: Added import/steps.by_col.sql to generate a Redmine-formatted list of steps for column-based import

3132 06/27/2012 08:56 PM Aaron Marcuse-Kubitza

bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress

3131 06/27/2012 08:43 PM Aaron Marcuse-Kubitza

sql_io.py: put(): Run data import queries with log_level=3.5 so they don't clutter the output at the normal import verbosity of 3

3130 06/27/2012 08:27 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Work around PostgreSQL's temp table disk space leak by reconnecting to the DB after every partition

3129 06/27/2012 08:26 PM Aaron Marcuse-Kubitza

sql.py: mk_select(): Also support limit and start values of type long

3128 06/27/2012 08:13 PM Aaron Marcuse-Kubitza

sql_gen.py: suffixed_table(): Fixed bug where all table attrs, such as is_temp status, needed to be copied

3127 06/27/2012 08:05 PM Aaron Marcuse-Kubitza

sql.py: create_table(): Fixed bug where the query needed to be run in recover mode in case the table exists and was created before the current connection, such that the CREATE TABLE statement would not have been cached

3126 06/27/2012 07:50 PM Aaron Marcuse-Kubitza

sql.py: create_table(): Removed final newline after query because that's added by the logging mechanism

3125 06/27/2012 07:43 PM Aaron Marcuse-Kubitza

sql.py: Added reconnect()

3124 06/27/2012 07:37 PM Aaron Marcuse-Kubitza

sql.py: DbConn._reset(): Assert that _savepoint is 0 instead of setting it to 0

3123 06/27/2012 07:31 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): put_table_(): Removed no longer used limit, start params

3122 06/27/2012 07:23 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Merged partitioning and subsetting into same section for simplicity, to avoid creating extra temp tables, and to later allow the connection to be closed and reopened between partitions. partition_size: Expressed value without exponent notation to ensure that it's an integer.

3121 06/27/2012 07:11 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Partitioning in_table: Adjust bounds of last partition to actual row #s included
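
Adjusting the last partition to the rows it actually contains can be sketched like this. The helper name partition_bounds and its (start, limit) return shape are assumptions made for illustration; they are not taken from db_xml.py.

```python
def partition_bounds(row_count, partition_size):
    """Split row_count rows into (start, limit) pairs.

    The last pair's limit is clipped to the rows that remain, so it never
    claims rows past the end of the table.
    """
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    bounds = []
    start = 0
    while start < row_count:
        limit = min(partition_size, row_count - start)
        bounds.append((start, limit))
        start += limit
    return bounds
```

For 10 rows and a partition size of 4, the partitions cover 4, 4, and 2 rows.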

3120 06/27/2012 06:43 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Added _ to reset() to indicate that it's a protected method and users should not call it directly

3119 06/27/2012 06:41 PM Aaron Marcuse-Kubitza

sql.py: DbConn.close(): Reset the connection completely using reset()

3118 06/27/2012 06:40 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Added clear_cache() and reset() and use reset() in init()

3117 06/27/2012 06:31 PM Aaron Marcuse-Kubitza

bin/map: Use new DbConn.close()

3116 06/27/2012 06:31 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Added close()

3115 06/27/2012 06:07 PM Aaron Marcuse-Kubitza

db_xml.py: partition_size: Set to just more than the size of the largest data source that was successfully imported in simultaneous import

3114 06/27/2012 05:32 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.

3113 06/27/2012 05:27 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.

3112 06/27/2012 05:17 PM Aaron Marcuse-Kubitza

sql_io.py: cast(): Always convert exceptions to warnings if the input is a column or expression, even if there is no place to save the errors, so that invalid data does not need to be handled by the caller in a (much slower) extra exception-handling loop

3111 06/27/2012 04:47 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): MissingCastException: When casting, handle InvalidValueException by filtering out invalid values with invalid2null() in a loop

3110 06/27/2012 04:45 PM Aaron Marcuse-Kubitza

sql_io.py: cast_temp_col(): Run sql.update() in recover mode in case expr produces errors. Don't cache sql.update() in case this function will be called again after error recovery.

3109 06/27/2012 04:40 PM Aaron Marcuse-Kubitza

sql.py: Generalized FunctionValueException to InvalidValueException so that it will match all invalid-value errors, not just those occurring in user-defined functions

3108 06/27/2012 04:22 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Removed no longer used sql.FunctionValueException handling, because type casting functions now do their own invalid value handling

3107 06/27/2012 03:44 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)

3106 06/27/2012 03:14 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): mk_main_select(): Fixed bug where the table of each cond needed to be changed to insert_in_table because mk_main_select() uses the distinct table rather than the full input table

3105 06/27/2012 03:12 PM Aaron Marcuse-Kubitza

sql_gen.py: with_table(): Support columns that are wrapped in a FunctionCall object

3104 06/27/2012 02:47 PM Aaron Marcuse-Kubitza

sql_gen.py: index_cols: Store just the name of the index column, and add the table in index_col(), in case the table is ever copied and renamed

3103 06/26/2012 11:06 PM Aaron Marcuse-Kubitza

Moved error tracking from sql.py to sql_io.py

3102 06/26/2012 11:04 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Use sql.distinct_table() to uniquify input table, instead of DISTINCT ON. This avoids letting PostgreSQL create a sort temp table to store the output of the DISTINCT ON, which is not automatically removed until the end of the connection, causing database bloat that can use up the available disk space.

3101 06/26/2012 10:36 PM Aaron Marcuse-Kubitza

sql_gen.py: suffixed_table(): Use concat()

3100 06/26/2012 10:34 PM Aaron Marcuse-Kubitza

sql_gen.py: with_default_table(): Remove no longer used overwrite param

3099 06/26/2012 10:33 PM Aaron Marcuse-Kubitza

sql.py: distinct_table(): Return new table instead of renaming input table so that columns that use input table will continue to work correctly

3098 06/26/2012 10:31 PM Aaron Marcuse-Kubitza

sql_gen.py: Moved NamedCol check from with_default_table() to with_table()

3097 06/26/2012 09:39 PM Aaron Marcuse-Kubitza

sql.py: distinct_table(): Fixed bug where empty distinct_on cols needed to create a table with one sample row, instead of returning the original table, because this indicates that the full set of distinct_on columns are all literal values and should only occur once

3096 06/26/2012 09:12 PM Aaron Marcuse-Kubitza

sql.py: run_query(): DuplicateKeyException: Fixed bug where only constraint names matching a certain format were interpreted as DuplicateKeyExceptions. Support constraint names with the name and table separated by ".", not just "_".

3095 06/26/2012 09:10 PM Aaron Marcuse-Kubitza

sql.py: run_query(): Exception parsing: Match patterns only at the beginning of the exception message to avoid matching embedded messages in causes and literal values
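
Anchoring a pattern at the start of the message can be sketched with re.match(), which only matches at the beginning of the string. The function name and the exact pattern below are assumptions; the message text is the standard PostgreSQL wording for a unique-constraint violation, and the real run_query() patterns may differ.

```python
import re

# Assumed pattern for illustration only.
DUPLICATE_KEY = re.compile(
    r'duplicate key value violates unique constraint "(.+?)"')


def duplicate_key_constraint(msg):
    """Return the constraint name if msg starts with a duplicate-key error.

    Pattern.match() anchors at the start of msg, so the same text embedded
    later in the message (for example inside a quoted literal value or a
    nested cause) is ignored.
    """
    match = DUPLICATE_KEY.match(msg)
    return match.group(1) if match else None
```

Using search() instead of match() would also accept messages that merely quote the duplicate-key text somewhere in the middle.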

3094 06/26/2012 08:46 PM Aaron Marcuse-Kubitza

sql.py: Added distinct_table()

3093 06/26/2012 08:46 PM Aaron Marcuse-Kubitza

sql_gen.py: Added with_table() and use it in with_default_table()

3092 06/26/2012 07:52 PM Aaron Marcuse-Kubitza

sql.py: mk_insert_select(): ignore mode: Support inserting all columns when cols == None

3091 06/26/2012 07:47 PM Aaron Marcuse-Kubitza

sql_gen.py: Col, Table: Support non-string names

3090 06/26/2012 07:25 PM Aaron Marcuse-Kubitza

sql_gen.py: row_count: Use new all_cols

3089 06/26/2012 07:24 PM Aaron Marcuse-Kubitza

sql_gen.py: Added all_cols

3088 06/26/2012 07:17 PM Aaron Marcuse-Kubitza

sql_gen.py: Use new as_Name() instead of db.esc_name()

3087 06/26/2012 07:12 PM Aaron Marcuse-Kubitza

sql_gen.py: Name: Truncate the input name

3086 06/26/2012 07:11 PM Aaron Marcuse-Kubitza

sql_gen.py: Added Name class and associated functions

3085 06/26/2012 06:46 PM Aaron Marcuse-Kubitza

sql.py: create_table(): Support creating temp tables. This fixes a bug in copy_table_struct() where the created table was not a temp table if the source table was. copy_table_struct(): Removed no longer needed versioning because that is now handled by create_table().

3084 06/26/2012 06:33 PM Aaron Marcuse-Kubitza

sql.py: Added copy_table_struct()

3083 06/26/2012 06:32 PM Aaron Marcuse-Kubitza

sql.py: Moved add_indexes() to Indexes subsection

3082 06/26/2012 06:30 PM Aaron Marcuse-Kubitza

sql.py: create_table(): Support LIKE table

3081 06/26/2012 05:18 PM Aaron Marcuse-Kubitza

Moved Data cleanup from sql.py to sql_io.py

3080 06/26/2012 05:18 PM Aaron Marcuse-Kubitza

Moved error tracking from sql.py to sql_io.py

3079 06/26/2012 05:12 PM Aaron Marcuse-Kubitza

sql.py: Organized Database structure introspection and Structural changes functions into subsections

3078 06/26/2012 04:56 PM Aaron Marcuse-Kubitza

Moved error tracking from sql.py to sql_io.py

3077 06/26/2012 04:46 PM Aaron Marcuse-Kubitza

Moved Heuristic queries from sql.py to new sql_io.py

3076 06/26/2012 04:32 PM Aaron Marcuse-Kubitza

Added top-level analysis dir for range modeling

3075 06/26/2012 04:02 PM Aaron Marcuse-Kubitza

sql.py: run_query_into(): Documented why analyze() must be run manually on newly populated temp tables

3074 06/26/2012 03:57 PM Aaron Marcuse-Kubitza

sql.py: DbConn: Added autoanalyze mode. Added autoanalyze() which runs analyze() only if in autoanalyze mode. Use new autoanalyze() in functions that change a table's contents.

3073 06/26/2012 03:52 PM Aaron Marcuse-Kubitza

sql.py: run_query_into(): analyze() the created table to ensure the query planner's initial stats are accurate

3072 06/25/2012 09:46 PM Aaron Marcuse-Kubitza

inputs/SpeciesLink/src: Added custom header that overwrites existing header so that column names will not be too long for the staging table

3071 06/25/2012 09:35 PM Aaron Marcuse-Kubitza

cat_csv: Support overwriting the existing header using a separate header file

3070 06/25/2012 08:49 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: Added location.location_coords index to speed up large imports by providing an index for merge joins

3069 06/25/2012 08:43 PM Aaron Marcuse-Kubitza

csv2db: Reanalyze table, so that query planner stats are up to date even though the table doesn't need to be vacuumed anymore

3068 06/25/2012 08:42 PM Aaron Marcuse-Kubitza

sql.py: Added analyze()

3067 06/25/2012 08:11 PM Aaron Marcuse-Kubitza

csv2db: Removed no longer needed table vacuum (cleanup_table() now avoids creating dead rows)

3066 06/25/2012 08:10 PM Aaron Marcuse-Kubitza

sql.py: cleanup_table(): Use update()'s new in_place mode to avoid needing to vacuum the table

3065 06/25/2012 08:02 PM Aaron Marcuse-Kubitza

sql.py: mk_update(): in_place: Support updating multiple columns at once

3064 06/25/2012 07:44 PM Aaron Marcuse-Kubitza

sql.py: update() calls: Use in_place where possible to avoid creating dead rows, which bloats table size

3063 06/25/2012 07:37 PM Aaron Marcuse-Kubitza

sql.py: DbConn.col_info(): Support user-defined types

3062 06/25/2012 07:33 PM Aaron Marcuse-Kubitza

sql_gen.py: Added Nullif