Activity
From 06/11/2012 to 07/10/2012
07/10/2012
- 09:53 PM Revision 3317: input.Makefile: Name logs with extension .log.sql so they are syntax-highlighted for the SQL statements they contain
- 09:24 PM Revision 3316: schemas/vegbien.sql: taxonoccurrence: Added sourceaccessioncode so that taxonoccurrences can be uniquely identified in plots data
- 09:12 PM Revision 3315: schemas/vegbien.sql: locationevent: Removed datasource_id because locationevents are now scoped by their required location, which itself is scoped by datasource
- 09:07 PM Revision 3314: mappings/VegX-VegBIEN.stems.csv: Removed locationevent.datasource_id mappings because locationevents are now scoped by their required location, which itself is scoped by datasource
- 08:42 PM Revision 3313: sql.py: distinct_table(): Don't sort the inserted rows by pkey because they should stay in the table order that they were in. (The select order with no ORDER BY should be the table order. Even if it isn't, it doesn't matter what order they are in for our current application.)
- 08:38 PM Revision 3312: sql_io.py: put_table(): Creating an empty pkeys table: Don't sort the inserted result by pkey because it's empty (limit=0)
- 08:32 PM Revision 3311: sql_io.py: put_table(): ignore(): Fixed bug where in_col's table needed to be changed to insert_in_table, because it's *insert_in_table's* rows that are being modified but mapping (which in_col comes from) qualifies columns by *in_table*
- 08:28 PM Revision 3310: sql_io.py: put_table(): ignore(): Also add an index on in_col if mapping the value to NULL
- 08:28 PM Revision 3309: sql_io.py: put_table(): ignore(): Only delete from the insert_in_table, because the invalid rows only need to be removed from the rows that are actually being inserted into the DB. If there are invalid rows in the full (not uniquified) in_table, that's OK, as they can still get a valid output pkey if the first copy of a row they were considered a duplicate of *is* valid (this is a very unusual situation, so this change should not affect most real data).
- 08:22 PM Revision 3308: sql_io.py: put_table(): ignore(): Merged filter_ var into sql.delete() call because that's the only place it's used
- 08:18 PM Revision 3307: sql_io.py: put_table(): insert_into_pkeys(): Removed no longer used distinct param
- 08:16 PM Revision 3306: sql_io.py: put_table(): Getting output table pkeys of existing/inserted rows: Don't DISTINCT ON the joined rows by input pkey, because this adds sorting overhead. This should not be needed because there generally should not be any duplicate rows for the columns in a unique index (if there are, this is an index configuration problem and should be fixed in the schema). It's possible that partial indexes (with a filter condition) were causing this, but testing without it in place will be needed to determine the cause.
- 08:02 PM Revision 3305: sql.py: flatten(): Auto-add a pkey on the created temp table. This should be standard practice for most temp tables, and for sql_io.put_table() especially this will be useful if we ever want to add back sorting the in_table by row_num (possibly by CLUSTERing on the pkey to avoid pkey index scans).
- 07:54 PM Revision 3304: sql.run_query_into() calls: Use new add_pkey_ param instead of manually calling sql.add_pkey()
- 07:53 PM Revision 3303: sql.py: run_query_into(): Changed add_indexes_ param to add_pkey_ and add just a pkey if it's set. It's no longer necessary to create indexes on every column of a temp table, because the covering indexes for the join columns have been fixed to have columns in the same order as the output table's corresponding index so that they can be used for a merge join.
- 07:41 PM Revision 3302: sql_io.py: put_table(): Add pkey on pkeys table right when it's created, so that any duplicates are detected right away instead of at the end of the iteration. (Duplicates are created as a result of joins matching multiple rows, which often indicates a database misconfiguration.)
- 07:34 PM Revision 3301: sql_io.py: put_table(): Adding pkey on pkeys table: Removed log message because adding an index is considered a low-level operation, which isn't included in the Redmine SQL
- 07:27 PM Revision 3300: schemas/tree_cross-links.sql: Ancestors table: Synced with current definition, which removes unneeded fki_* indexes. Note that the index on ancestor_id might be needed in the future if we ever want to get all the descendants of a plantname/namedplace or perform deletions on plantname/namedplace (which cascade to *_ancestor). For getting all the plantnames/namedplaces (of any rank) for a plantconcept/locationdetermination, though, the *_ancestor_pkey index is sufficient because plantname_id/namedplace_id is the first column in it.
- 07:20 PM Revision 3299: schemas/vegbien.sql: {plantname,namedplace}_update_ancestors(): Fixed slowdown due to removed index on {plantname,namedplace}.parent_id by adding COALESCE() to enable using the plantname_unique index for the lookup instead
- 06:26 PM Revision 3298: sql.mk_select() calls: Removed no longer needed start=0 to turn off missing WHERE, LIMIT, or OFFSET clause warning
- 06:21 PM Revision 3297: sql.py: mk_select(): Don't output warning if there is no WHERE, LIMIT, or OFFSET clause, because column-based import has many queries where this is the case and it's annoying to need to specify start=0 to turn off this warning
- 06:04 PM Revision 3296: sql.py: flatten(): Don't sort the input tables by the pkey because it doesn't matter what order the datasource's rows are inserted in. Note that PostgreSQL doesn't guarantee the order of rows in a table, so it is possible that the rows were being inserted in an unknown order before this change, as well.
- 05:58 PM Revision 3295: sql.py: delete(): Cache deletes by default
- 05:56 PM Revision 3294: sql_io.py: put_table(): Merged remove_rows() and invalid2null() into one ignore() function that chooses the action (map to NULL or delete) depending on the value and whether NULLs have been filtered out of the column
- 05:46 PM Revision 3293: sql_io.py: put_table(): remove_rows(): Delete the rows containing the invalid value instead of filtering them out of each select, so that the filtering can be profiled separately from the insertion. This also requires deleting rows with invalid non-NULL values instead of mapping them to NULL if NULLs have already been filtered out of the column in question.
- 05:34 PM Revision 3292: sql_io.py: put_table(): Main insert: Don't run it inside an extra savepoint, because this will cause the creation of any helper SQL functions to be rolled back if an exception is thrown. If those functions are later re-used, the cache will think they exist when they no longer do. (Calling a function on input rows is now run in recover mode, so that it doesn't need the outer savepoint anymore.)
- 05:30 PM Revision 3291: sql_io.py: put_table(): main_insert(): Moved code that is not part of the main query outside the function, so it wouldn't be subject to the exception handling. Preparing to insert new rows: Only do the preparation code for insert_select() if the out_table is not a function.
- 05:19 PM Revision 3290: sql_io.py: put_table(): main_insert(): is_function: Run insert_into_pkeys() with recover=True so that errors in the function are properly rolled back
- 05:18 PM Revision 3289: sql_io.py: put_table(): insert_into_pkeys(): Support custom query kw_args, such as recover
- 04:41 PM Revision 3288: sql_io.py: put_table(): Use full_in_table in the into table row count assertion, since in_table may have rows deleted
- 04:36 PM Revision 3287: sql_io.py: put_table(): Save default values for all rows in new temp table full_in_table since in_table may have rows deleted
- 04:13 PM Revision 3286: sql.py: Added mk_delete() and delete()
- 03:36 PM Revision 3285: sql_io.py: put_table(): mk_main_select(): Turned off unnecessary ORDER BY to avoid sorting the entire table every time it's used. (PostgreSQL has no concept of reordering a table and re-using that ordering, so it just re-sorts the table each time. Index scans on the pkey do not appear to be used in practice, according to EXPLAIN results from live imports.) Document that we instead assume that identical SELECT queries retrieve rows in the same order.
- 01:56 PM Revision 3284: schemas/vegbien.sql: taxondetermination: Fixed bug where taxondetermination_taxonoccurrence_id_fkey trigger was applied before the NOT NULL constraint on taxonoccurrence_id was checked, causing the trigger to fail on NULL taxonoccurrence_ids, by making it an AFTER trigger. (An AFTER trigger will still roll back the entire insert if it fails, even though it runs after the insert itself.)
07/09/2012
- 05:45 PM Revision 3283: schemas/vegbien.sql: specimenreplicate: institution_id: Fixed typo in comment
- 05:26 PM Revision 3282: inputs/import.stats.xls: Fixed date for most recent import
- 05:26 PM Revision 3281: sql.py: DbConn.run_query(): Put the data source comment on a separate line in the log file instead of using a carriage return, which sometimes had the desired effect of overwriting the src comment with the first line of the query but sometimes the line lengths weren't right and there wasn't enough overlap
- 04:53 PM Revision 3280: schemas/vegbien.ERD.mwb: Synced with schema
- 04:42 PM Revision 3279: schemas/vegbien.sql: Removed per-column indexes, which are no longer needed by either row-based or column-based import because they are able to do a merge join or lookup using the table's UNIQUE INDEX. Instead of forcing the database to build and maintain large indexes (15+ GB!) that are not used, optimization-only (non-UNIQUE) indexes should be added as needed only once the database is actually used for queries. In most cases it will not even be necessary to add additional indexes then, because most UNIQUE indexes can be reused for broad lookups (rather than just duplicate elimination). Even the foreign key covering indexes (fki_*) are not needed because we virtually never delete rows in the DB, and even if we were to start doing that regularly, the cost of maintaining the indexes on import is most likely not worth the speed improvements for cascading deletes.
- 04:32 PM Revision 3278: schemas/py_functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.)
- 04:31 PM Revision 3277: schemas/functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.)
- 04:26 PM Revision 3276: schemas/vegbien.sql: plantname: plantname_unique UNIQUE INDEX: Moved scope_id to the back so that the index can easily be used for lookup queries (not just column-based import) without having to explicitly specify NULL for that field. This takes advantage of a btree sorting feature where a broader lookup can be done using just the first n columns of the index.
- 04:15 PM Revision 3275: schemas/vegbien.sql: locationevent, specimenreplicate: Turned UNIQUE CONSTRAINTs and UNIQUE INDEXes with nullable fields into partial UNIQUE INDEXes with IS NOT NULL filter conditions, in order to work automatically with sql_gen without requiring a separate covering lookup index. Removed no longer needed covering lookup indexes.
- 03:07 PM Revision 3274: sql_io.py: put_table(): DuplicateKeyException: Fixed bug where combining multiple unique constraints was incorrectly allowed, when in fact the constraints need to be separately applied to the different rows that violate them, which is not currently supported
- 03:02 PM Revision 3273: sql.py: DbConn.with_savepoint(): Log transaction profiling info with level=4 like the rest of the transaction commands, so that it isn't output when the transaction itself should be hidden (e.g. for name versioning or internal commands)
- 02:16 PM Revision 3272: sql.py: DbConn.with_savepoint(): Profile (nested) transactions so that the run time for groups of commands (e.g. csv2db INSERTs) is known
- 02:04 PM Revision 3271: csv2db: verbosity defaults to 3 so that detailed queries with profiling stats are included in the log file, to assist in optimization
- 02:01 PM Revision 3270: csv2db: Don't cache per-row INSERT queries because this bloats the cache (there aren't repeated identical INSERTs that shouldn't be re-run like in row-based import)
- 01:57 PM Revision 3269: sql.py with_explain_comment(), DbConn: Fixed bug where with_explain_comment() was being run in per-row imports (row-based import and csv2db with INSERT), causing the overhead of an EXPLAIN query for every single INSERT and filling up the cache with EXPLAIN query results, by adding autoexplain mode, only running with_explain_comment() in autoexplain mode, and only enabling autoexplain mode for column-based import
- 01:11 PM Revision 3268: db_xml.py: put_table(): Turn on autoanalyze mode to help the query planner avoid sequential scans on tables that now contain data. (Don't do this in row-based import because it creates too much overhead per insert.)
- 12:24 PM Revision 3267: sql.py: Run all EXPLAIN queries with log_level=4 since the EXPLAIN information is now usually generated when the query is generated rather than when it's run, so the log_level is not known
- 12:21 PM Revision 3266: sql.py: Added with_explain_comment() to query generating functions so that nested queries will also have EXPLAIN information
- 12:11 PM Revision 3265: sql.py: Added with_explain_comment() and use it in run_query()
- 12:01 PM Revision 3264: sql.py: run_query(): EXPLAIN output: Run explain() with log_level 1 higher than the query's log_level, so that low-level queries' EXPLAIN queries are not output when the queries themselves are not output. This also ensures that only level 2 (major) queries have the EXPLAIN logged (to introduce the query that is being run), to avoid cluttering the log output.
- 11:54 AM Revision 3263: sql.py: explain(): Support custom log_level
- 11:48 AM Revision 3262: schemas/vegbien.sql: taxondetermination: taxondetermination_taxonoccurrence_id_fkey manual fkey constraint: Fixed bug where needed to raise foreign_key_violation instead of unique_violation
- 11:23 AM Revision 3261: inputs/import.stats.xls: Updated with stats from latest import
07/06/2012
- 04:43 PM Revision 3260: debug2redmine.csv: Remove newline before EXPLAIN comment
- 04:33 PM Revision 3259: debug2redmine.csv: Filter out EXPLAIN comments
- 04:29 PM Revision 3258: sql.py: run_query(): EXPLAIN all explainable queries before they are run, to provide query plans for later profiling and index analysis. At verbosity 3+, this also effectively allows the user to see what query is being run before it's executed.
- 04:26 PM Revision 3257: sql.py: is_explainable(): Fixed bug where needed r'' syntax to escape \ in \b
- 04:23 PM Revision 3256: sql.py: Added explain() and is_explainable()
- 04:19 PM Revision 3255: strings.py: Added join_lines()
- 02:50 PM Revision 3254: mk_rm_indexes: Also include the search_path in the outputted commands
- 02:45 PM Revision 3253: schemas/vegbien.sql: commclass: Fixed bug where commclass_unique needed to be a UNIQUE INDEX
- 02:42 PM Revision 3252: schemas/vegbien.sql: plantname: Removed unneeded indexes on plantname and rank (plantname_unique takes care of joins)
- 02:33 PM Revision 3251: pg_dump_vegbien: Enclose the schema name in "" because pg_dump requires this for schema names with special characters
- 02:09 PM Revision 3250: inputs/import.stats.xls: Updated with stats from 2012-7-3 and 2012-7-5 imports. Note that the 2012-7-5 import was partial, so its stats can't be directly compared.
- 01:28 PM Revision 3249: root Makefile: VegBIEN DB: Schemas: Added schemas/%/rm_indexes
- 01:27 PM Revision 3248: Added mk_rm_indexes
- 11:14 AM Revision 3247: sql.py: Added drop() and use it in drop_table()
- 10:59 AM Revision 3246: debug2redmine: Remove profiling information from the logging output
- 10:43 AM Revision 3245: sql.py: DbConn.run_query(): Only print notices in debug mode, because they are output with a log level higher than the debug verbosity threshold, and this avoid unnecessary overhead
- 10:41 AM Revision 3244: sql.py: DbConn: Added profile_row_ct setting, which is passed to profiler.stop() in run_query()
- 10:38 AM Revision 3243: bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead
- 10:34 AM Revision 3242: sql.py: DbConn.run_query(): Only profile queries in debug mode, to avoid unnecessary overhead when the run time will not be displayed
- 10:29 AM Revision 3241: sql.py: DbConn.run_query(): Profile using the profiling.ItersProfiler class, which pretty-prints the run time
- 10:22 AM Revision 3240: sql.py: DbConn.run_query(): Added profiling of query execution, which is logged with the query
- 09:26 AM Revision 3239: sql.py: DbConn.run_query(): Move log_msg() to where it's used, so that it runs after the query is run and can refer to profiling variables
- 09:21 AM Revision 3238: sql.py: DbConn.run_query(): Use else blocks to avoid applying exception handling to commands run after the main command
- 09:18 AM Revision 3237: sql.py: DbConn.run_query(): Always output or return the log message after the query is run, so that it can be output with profiling statistics in the log message header
- 09:05 AM Revision 3236: sql.py: run_query(): Always output the log message after the query is run, so that it can be output with profiling statistics in the log message header
07/05/2012
- 03:16 PM Revision 3235: Regenerated vegbien.ERD exports
- 03:13 PM Revision 3234: schemas/vegbien.sql: locationevent: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import. Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.
- 03:00 PM Revision 3233: schemas/vegbien.sql: specimenreplicate: Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.
- 02:57 PM Revision 3232: schemas/vegbien.sql: specimenreplicate: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import
- 02:48 PM Revision 3231: schemas/vegbien.sql: specimenreplicate: Added CHECK constraint which ensures that there is at least one key to sufficiently uniquely identify the specimenreplicate
- 02:44 PM Revision 3230: inputs/CTFS/maps/VegX.organisms.csv: Mapped VegX sourceAccessionCode = VegBIEN plantobservation,specimenreplicate.sourceaccessioncode so that specimenreplicate would have a required key
- 02:38 PM Revision 3229: mappings/VegX-VegBIEN.stems.csv: Sort the plantobservation.sourceaccessioncode/specimenreplicate.sourceaccessioncode mapping with the other _ifs so the adjacent node merging works properly and it gets created before _ignore removes voucherType
- 02:34 PM Revision 3228: mappings/VegX-VegBIEN.stems.csv: Also map plantobservation.sourceaccessioncode to specimenreplicate.sourceaccessioncode so specimenreplicate always has a key and will never be underconstrained
- 02:12 PM Revision 3227: xml_func.py: process(): Fixed bug where an evaluated XML function might create a node of the same name as an existing node, but these nodes would not be merged even though they referred to the same object, by merging siblings of a newly-evaluated (replaced) node if they have the same name
- 02:09 PM Revision 3226: xml_dom.py: Added merge() and merge_adjacent()
- 02:08 PM Revision 3225: xml_dom.py: replace_with_text(): Return the new node
- 12:33 PM Revision 3224: mappings/VegX-VegBIEN.stems.csv: Indirect voucher mappings: Removed no longer needed ":[*_id/taxonoccurrence]" because a specimenreplicate *is* a taxonoccurrence, so it doesn't need to *have* an empty taxonoccurrence
- 12:27 PM Revision 3223: mappings/VegX-VegBIEN.stems.csv: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead using the new fkey to plantobservation for direct vouchers. Note that a duplicate aggregateoccurrence is created, because the _if XML function runs after the XPaths have created the initial tree, and thus the nodes it pulls forward do not automatically get merged with adjacent nodes of the same name. This will eventually need to be fixed by auto-merging the nodes.
- 12:00 PM Revision 3222: schemas/vegbien.sql: specimenreplicate: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead adding an fkey to plantobservation for direct vouchers. Also, it makes more sense for a specimenreplicate to directly voucher the plant it came from rather than that plant's taxonoccurrence, because a direct voucher is a closer relationship to the plant.
- 11:22 AM Revision 3221: mappings/VegX-VegBIEN.stems.csv: Map collectiondate to specimenreplicate via voucher when the voucher is indirect, rather than always directly to the taxonoccurrence, because the collectiondate relates to the specimenreplicate, not the taxonoccurrence, and is not necessarily 1:1 with it
- 11:17 AM Revision 3220: mappings: Updated for_review VegX-VegBIEN mappings, which hadn't been auto-updated because of a modification time issue. (mappings/VegX-VegBIEN.stems.csv was replaced with an older version, which did not trigger make to remake files depending on it.)
- 10:28 AM Revision 3219: schemas/vegbien.sql: locationevent: Added sql_gen-compatible indexes on all columns in the locationevent_unique_project_authorcode UNIQUE index: Changed locationevent_project_id index to use COALESCE(). Added index on obsstartdate.
- 10:19 AM Revision 3218: schemas/vegbien.sql: locationevent: Removed no longer needed COALESCE() index on location_id now that location_id is NOT NULL
- 10:16 AM Revision 3217: schemas/vegbien.sql: locationevent: Fixed bug where locationevent_unique_location index was overconstraining locationevent when a sourceaccessioncode or obsstartdate was specified, by combining the locationevent_unique_location, locationevent_unique_accessioncode, and locationevent_unique_location_date indexes into one COALESCE() index on the combined fields of those indexes
- 10:10 AM Revision 3216: schemas/vegbien.sql: locationevent: Made location_id required because every locationevent should have a location, even one with no locationdeterminations. This also avoids the creation of a parent locationevent when subplots are not being used.
- 09:48 AM Revision 3215: mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each locationevent will get an empty commclass, whether or not there are any commdeterminations. This can later be used to add new commdeterminations.
- 09:45 AM Revision 3214: schemas/vegbien.sql: commclass: Changed commclass_unique to COALESCE() classnotes so that there is only one commclass for a locationevent when the commclasses are not separately named. (Currently classnotes is used as the class name field, commname being the name of the community itself.)
- 09:33 AM Revision 3213: schemas/vegbien.sql: commdetermination: Made commconcept_id NOT NULL because it doesn't make sense to have a commdetermination on nothing. Note that the commname field in commdetermination is not used for making determinations (and may need to be removed to avoid confusion); commname.commname is used instead.
- 09:28 AM Revision 3212: schemas/vegbien.sql: locationevent: Added COALESCE() index on location_id for use by column-based import
- 09:24 AM Revision 3211: mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each plantobservation will get an empty stemobservation, whether or not there are any stemtags. This can later be used to add further stemtags.
- 08:58 AM Revision 3210: mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now
- 08:31 AM Revision 3209: schemas/vegbien.sql: location: Made datasource_id, sourceaccessioncode NOT NULL to ensure that all locations are uniquely identifiable by their datasource's unique key (sourceaccessioncode)
- 08:28 AM Revision 3208: sql_io.py: put(): Handle NullValueExceptions by returning a NULL pkey, just like put_table() (column-based import) does
07/03/2012
- 05:29 PM Revision 3207: VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is *not* a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for locationevent->location in plots data, by also mapping locationevent's unique keys to location.sourceaccessioncode and setting location.datasource_id.
- 03:59 PM Revision 3206: sql.py: DbConn.run_query(): Separate the data source comment from the query with a tab in the executed query but a \r in the logged query, so that the query will be shown on the same line as the data source comment in pg_stat_activity, but be hidden by the following line when cating the file and be put on a separate line when viewed in a text editor. This causes the first line of the query to be at the left edge when the log file is viewed, so that it looks more natural.
- 03:15 PM Revision 3205: README.TXT: Data import: Import data into VegBIEN: Added command to use for column-based import
- 02:10 PM Revision 3204: schemas/vegbien.sql: locationevent: Allow a locationevent to be uniquely specified by its location (which is now datasource-scoped) and start date
- 01:26 PM Revision 3203: VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is *not* a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for specimenreplicate->...->location, by also mapping specimenreplicate's unique keys to location.sourceaccessioncode and setting location.datasource_id.
- 12:47 PM Revision 3202: schemas/vegbien.sql: locationevent: Allow to be uniquely specified by location_id. This is useful for specimens data where there is one location for every locationevent.
- 12:27 PM Revision 3201: schemas/vegbien.sql: location: Added datasource_id and sourceaccessioncode so locations can be uniquely specified by the input datasource, rather than being created automatically for each locationevent
- 11:45 AM Revision 3200: schemas/filter_ERD.csv: Add back taxondetermination->taxonoccurrence fkey because that has been replaced by a trigger in the SQL
- 11:06 AM Revision 3199: VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is *not* a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for specimenreplicate->taxonoccurrence.
- 07:40 AM Revision 3198: inputs/import.stats.xls: Updated with remaining stats from most recent import
07/02/2012
- 03:52 PM Revision 3197: PostgreSQL-MySQL.csv: Remove INHERITS clauses
- 02:37 PM Revision 3196: schemas/vegbien.ERD.mwb: Fixed lines
- 02:24 PM Revision 3195: db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly
- 02:18 PM Revision 3194: db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes awhile for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.
- 01:58 PM Revision 3193: inputs/import.stats.xls: Updated with new stats from an independent import
- 01:20 PM Revision 3192: schemas/vegbien.sql: Fixed UNIQUE INDEXes that were still using COALESCE(..., 0) to use COALESCE(..., 2147483647) in order to match what sql_gen.EnsureNotNull uses
- 12:41 PM Revision 3191: schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added collectioncode_dwc so that datasources that specify it in addition to the institution_id (such as aggregators) will not need to have catalognumbers be unique within an institution
- 12:30 PM Revision 3190: inputs/import.stats.xls: Updated with more stats from latest import
- 11:44 AM Revision 3189: inputs/import.stats.xls: Updated with initial stats from latest import. Reformatted to put successive runs of column-based next to each other, so they could be directly compared and so that the row-based data wouldn't need to be duplicated. Added empty-value checks to formulas so that they don't need to be manually deleted when one of their inputs is empty.
- 10:32 AM Revision 3188: input.Makefile: Documentation: import/steps.by_col.sql: Fixed bug where needed to run import in test mode
- 10:12 AM Revision 3187: sql_io.py: put_table(): Don't set pkeys of missing rows to default value if out_table is a SQL function, because then there is already an entry for every row
- 10:03 AM Revision 3186: bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection
- 09:50 AM Revision 3185: sql.py: mk_select(): Replaced newlines with spaces when query is simple enough to fit on one line
- 09:40 AM Revision 3184: db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity
- 09:39 AM Revision 3183: sql.py: DbConn: Added src config param, which in autocommit mode, will be included in a comment in every query, to help identify the data source in pg_stat_activity
- 09:38 AM Revision 3182: sql_gen.py: Added lstrip() to remove comments
- 09:13 AM Revision 3181: sql.py: mk_insert_select(): Added src param to help identify the data source in pg_stat_activity
- 08:33 AM Revision 3180: mappings/DwC2-VegBIEN.specimens.csv: Mapped institutionCode. This will enable datasources to use specimenreplicate's institution_id index for duplicate elimination.
- 08:31 AM Revision 3179: input.Makefile: Prompt user to accept test, instead of providing command line func for doing so
- 07:45 AM Revision 3178: schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added institution_id so that datasources that specify it (such as aggregators) will not need to have catalognumbers be globally unique. Once the institution_id is mapped to, this will fix a bug where rows with the same catalognumber were assumed to be duplicates even though they were from different institutions. This should also avoid the need to do any duplicate elimination joins when importing specimenreplicate, speeding up column-based import.
- 07:32 AM Revision 3177: schemas/vegbien.sql: specimenreplicate: Renamed museum_id to institution_id to correspond with DwC's institutionCode, so that it would be more obvious where to map institutionCode fields to
- 07:16 AM Revision 3176: inputs/import.stats.xls: Updated to include run times for rest of datasources for most recent column-based import
06/29/2012
- 08:09 AM Revision 3175: db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from
- 07:46 AM Revision 3174: sql_io.py: cast_temp_col(): add_col()'s distinguishing comment param: Add the type in case the same input column is being cast to different types, and both types have the same first word (causing their new column names to be the same)
- 07:42 AM Revision 3173: sql_io.py: cast_temp_col(): Name the new column with only the first word of the type, to save space in the limited identifier length
- 07:41 AM Revision 3172: strings.py: Added first_word()
- 07:35 AM Revision 3171: sql_io.py: cast_temp_col(): Use sql_gen.suffixed_col() to create the new column name
- 06:16 AM Revision 3170: inputs/import.stats.xls: Added run time for SALVIAS organisms, which just finished
- 06:14 AM Revision 3169: inputs/import.stats.xls: Use [1]-style footnotes because copying and pasting to Gmail doesn't preserve the superscripts
- 06:11 AM Revision 3168: inputs/import.stats.xls: Updated for latest simultaneous column-based import
- 04:42 AM Revision 3167: sql_io.py: cast_temp_col(): Don't automatically create an index on the new column, because it doesn't necessarily need an index and the main index used for the join is now added automatically by distinct_table()
- 04:39 AM Revision 3166: sql.py: flatten(): Don't automatically create indexes on all columns, because most columns don't need indexes and the main index used for the join is now added automatically by distinct_table()
- 04:35 AM Revision 3165: sql.py: Removed no longer needed add_index_col() and ensure_not_null() because we are not using index columns
- 04:33 AM Revision 3164: sql.py: add_index(): Don't create index columns for nullable columns, because they require indexes to be created on all columns in order to use a distinct_table() temp table. Also, now that we are no longer using LEFT JOINs, the COALESCE() call would only be evaluated once (in the plain JOIN) in the event that PostgreSQL doesn't use an index on a COALESCE() expression.
- 03:30 AM Revision 3163: schemas/vegbien.sql: location: Dropped unique constraint on lat/long because it covered only some rows, which interfered with column-based import's selection of different insert methods based on the presence or absence of duplicate keys. (With the constraint, locations with coordinates would have duplicates eliminated, but locations without coordinates would not be able to find which row was added for a particular location because there was no lookup key to join on, and would all just use the first inserted row.) The previous behavior didn't make much sense anyway, because it would assert that two locationevents occurred in the same place just because they had the same coordinates, which may not have been precise enough to make this determination. Asserting that two locationevents occurred in the same place is really part of the secondary validation, not the import process.
- 01:58 AM Revision 3162: sql.py: DbConn: Fixed bug where Exceptions did not have the query appended if the query was not run in cacheable mode, by moving _add_cursor_info() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper. Fixed bug where non-cacheable queries were not autocommitted, by moving self.do_autocommit() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper.
- 01:54 AM Revision 3161: sql.py: DbConn._db(): Record that a transaction is already open before setting the search_path so that a query is never run with an _savepoint value less than 1 (manual transactions are not supported yet)
- 01:52 AM Revision 3160: sql.py: DbConn.with_savepoint(): Increment _savepoint before running queries so they don't get autocommitted
- 01:10 AM Revision 3159: sql.py: empty_temp(): Empty temp tables even in debug_temp mode, so that it can be seen which tables have been garbage collected and disk space leaks can be detected. This will not affect the external re-runnability of slow queries in debug_temp mode, as long as the user aborts the debug_temp import while the slow query is still running.
- 01:07 AM Revision 3158: sql_gen.py: ColDict: Use OrderedDict so that order of keys in input dict (if ordered) will be preserved. This should ensure that tempt table unique indexes have their columns in the same order as the output table, so that a merge join can be used.
- 01:01 AM Revision 3157: util.py: dict_subset(): Use OrderedDict so that order of keys in input dict (if ordered) will be preserved
- 12:55 AM Revision 3156: main Makefile: python-Darwin: Added pip installation instructions. python-Linux: Added ordereddict.
- 12:04 AM Revision 3155: sql.py: DbConn.col_info(): cacheable param defaults to True now that callers explicitly turn off cacheable when needed
- 12:00 AM Revision 3154: sql.py: add_index_col(): Explicitly set update()'s col_info caching depending on whether col_info will be changed later by add_not_null()
06/28/2012
- 11:55 PM Revision 3153: sql.py: mk_update(): Renamed cacheable param to cacheable_ so it wouldn't conflict with update()'s cacheable param
- 11:54 PM Revision 3152: sql.py: mk_update(): Added cacheable param to set whether column structure information used to generate the query can be cached
- 11:40 PM Revision 3151: sql.py: add_index_col(): Explicitly set col_info()'s caching depending on whether col_info will be changed later by add_not_null()
- 11:35 PM Revision 3150: sql.py: DbConn.col_info(): Allow caller to specify whether query is cacheable
- 11:22 PM Revision 3149: csv2db: Fixed bug where CREATE TABLE statement was cached, causing it not to be re-executed after a rollback due to a failed COPY FROM. Avoid re-creating the table after a failed COPY FROM, and instead just remove any existing rows.
- 11:09 PM Revision 3148: sql.py: add_index(): Don't generate a unique name for the index because the database does that automatically
- 11:00 PM Revision 3147: csv2db: Vacuum table instead of just reanalyzing it because for some reason reanalyzing it isn't enough to fix the cached row count (causing pgAdmin3 to report that the table needs to be vacuumed)
- 10:54 PM Revision 3146: csv2db: Don't add indexes on the created table because they use up more disk space than the table itself and currently aren't used. (The import process adds indexes on each iteration's column subset instead.)
- 10:21 PM Revision 3145: db_xml.py: partition_size: Turning partitioning back on (with a larger limit), since the largest datasources' temp tables are still too big
- 10:20 PM Revision 3144: sql_io.py: put_table(): Fixed bug where if there were multiple unique constraints that were violated, only the distinct temp table for the last one would get garbage-collected
- 09:01 PM Revision 3143: db_xml.py: partition_size: Set to sys.maxint to disable partitioning. The last bugfix, which avoided returning a large result set to the client which was never read, seems to have fixed the disk space leak, so it's worth reattempting a full simultaneous import.
- 08:30 PM Revision 3142: db_xml.py: put_table(): Subsetting in_table: Truncate in_table when finished with it, to avoid temp table disk space leaks
- 07:56 PM Revision 3141: sql.py: insert_select(): If caller is only interested in the rowcount (if returning == None), keep the NULL rows for each insert on the server using CREATE TABLE AS. (CREATE TABLE AS sets rowcount to # rows in query, so rowcount will still be set correctly.)
- 04:59 PM Revision 3140: top-level map: Added support for custom public schema, to be able to run imports and tests simultaneously (e.g. on a dev machine)
06/27/2012
- 10:56 PM Revision 3139: csv2db: Fixed bug where table needed to be a sql_gen.Table object with the proper schema, so that errors_table would be created in the correct schema. Removed no longer needed changing of the search_path.
- 10:55 PM Revision 3138: csv2db: Fixed bug where table needed to be a sql_gen.Table object with the proper schema, so that errors_table would be created in the correct schema. Removed no longer needed changing of the search_path.
- 10:12 PM Revision 3137: sql.py: DbConn.with_savepoint(): Open a new transaction if one is not already open
- 10:11 PM Revision 3136: sql.py: DbConn: _savepoint starts at 1 because the driver is not in autocommit mode, so a transaction is already open
- 10:05 PM Revision 3135: sql.py: DbConn: _savepoint starts at 1 because the driver is not in autocommit mode, so a transaction is already open
- 09:31 PM Revision 3134: csv2db: Create errors table first, so that imports can start using it right away
- 09:25 PM Revision 3133: input.Makefile: Added import/steps.by_col.sql to generate a Redmine-formatted list of steps for column-based import
- 08:56 PM Revision 3132: bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress
- 08:43 PM Revision 3131: sql_io.py: put(): Run data import queries with log_level=3.5 so they don't clutter the output at the normal import verbosity of 3
- 08:27 PM Revision 3130: db_xml.py: put_table(): Work around PostgreSQL's temp table disk space leak by reconnecting to the DB after every partition
- 08:26 PM Revision 3129: sql.py: mk_select(): Also support limit and start values of type long
- 08:13 PM Revision 3128: sql_gen.py: suffixed_table(): Fixed bug where needed to copy *all* table attrs, such as is_temp status
- 08:05 PM Revision 3127: sql.py: create_table(): Fixed bug where needed to run query in recover mode in case the table exists and was created before the current connection, such that the CREATE TABLE statement would not have been cached
- 07:50 PM Revision 3126: sql.py: create_table(): Removed final newline after query because that's added by the logging mechanism
- 07:43 PM Revision 3125: sql.py: Added reconnect()
- 07:37 PM Revision 3124: sql.py: DbConn._reset(): Assert that _savepoint is 0 instead of setting it to 0
- 07:31 PM Revision 3123: db_xml.py: put_table(): put_table_(): Removed no longer used limit, start params
- 07:23 PM Revision 3122: db_xml.py: put_table(): Merged partitioning and subsetting into same section for simplicity, to avoid creating extra temp tables, and to later allow the connection to be closed and reopened between partitions. partition_size: Expressed value without exponent notation to ensure that it's an integer.
- 07:11 PM Revision 3121: db_xml.py: put_table(): Partitioning in_table: Adjust bounds of last partition to actual row #s included
- 06:43 PM Revision 3120: sql.py: DbConn: Added _ to reset() to indicate that it's a protected method and users should not call it directly
- 06:41 PM Revision 3119: sql.py: DbConn.close(): Reset the connection completely using reset()
- 06:40 PM Revision 3118: sql.py: DbConn: Added clear_cache() and reset() and use reset() in __init__()
- 06:31 PM Revision 3117: bin/map: Use new DbConn.close()
- 06:31 PM Revision 3116: sql.py: DbConn: Added close()
- 06:07 PM Revision 3115: db_xml.py: partition_size: Set to just more than the size of the largest data source that was successfully imported in simultaneous import
- 05:32 PM Revision 3114: db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.
- 05:27 PM Revision 3113: db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.
- 05:17 PM Revision 3112: sql_io.py: cast(): Always convert exceptions to warnings if the input is a column or expression, even if there is no place to save the errors, so that invalid data does not need to be handled by the caller in a (much slower) extra exception-handling loop
- 04:47 PM Revision 3111: sql_io.py: put_table(): MissingCastException: When casting, handle InvalidValueException by filtering out invalid values with invalid2null() in a loop
- 04:45 PM Revision 3110: sql_io.py: cast_temp_col(): Run sql.update() in recover mode in case expr produces errors. Don't cache sql.update() in case this function will be called again after error recovery.
- 04:40 PM Revision 3109: sql.py: Generalized FunctionValueException to InvalidValueException so that it will match all invalid-value errors, not just those occurring in user-defined functions
- 04:22 PM Revision 3108: sql_io.py: put_table(): Removed no longer used sql.FunctionValueException handling, because type casting functions now do their own invalid value handling
- 03:44 PM Revision 3107: db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)
- 03:14 PM Revision 3106: sql_io.py: put_table(): mk_main_select(): Fixed bug where the table of each cond needed to be changed to insert_in_table because mk_main_select() uses the distinct table rather than the full input table
- 03:12 PM Revision 3105: sql_gen.py: with_table(): Support columns that are wrapped in a FunctionCall object
- 02:47 PM Revision 3104: sql_gen.py: index_cols: Store just the name of the index column, and add the table in index_col(), in case the table is ever copied and renamed
06/26/2012
- 11:06 PM Revision 3103: Moved error tracking from sql.py to sql_io.py
- 11:04 PM Revision 3102: sql_io.py: put_table(): Use sql.distinct_table() to uniquify input table, instead of DISTINCT ON. This avoids letting PostgreSQL create a sort temp table to store the output of the DISTINCT ON, which is not automatically removed until the end of the connection, causing database bloat that can use up the available disk space.
- 10:36 PM Revision 3101: sql_gen.py: suffixed_table(): Use concat()
- 10:34 PM Revision 3100: sql_gen.py: with_default_table(): Remove no longer used overwrite param
- 10:33 PM Revision 3099: sql.py: distinct_table(): Return new table instead of renaming input table so that columns that use input table will continue to work correctly
- 10:31 PM Revision 3098: sql_gen.py: Moved NamedCol check from with_default_table() to with_table()
- 09:39 PM Revision 3097: sql.py: distinct_table(): Fixed bug where empty distinct_on cols needed to create a table with one sample row, instead of returning the original table, because this indicates that the full set of distinct_on columns are all literal values and should only occur once
- 09:12 PM Revision 3096: sql.py: run_query(): DuplicateKeyException: Fixed bug where only constraint names matching a certain format were interpreted as DuplicateKeyExceptions. Support constraint names with the name and table separated by ".", not just "_".
- 09:10 PM Revision 3095: sql.py: run_query(): Exception parsing: Match patterns only at the beginning of the exception message to avoid matching embedded messages in causes and literal values
- 08:46 PM Revision 3094: sql.py: Added distinct_table()
- 08:46 PM Revision 3093: sql_gen.py: Added with_table() and use it in with_default_table()
- 07:52 PM Revision 3092: sql.py: mk_insert_select(): ignore mode: Support inserting all columns when cols == None
- 07:47 PM Revision 3091: sql_gen.py: Col, Table: Support non-string names
- 07:25 PM Revision 3090: sql_gen.py: row_count: Use new all_cols
- 07:24 PM Revision 3089: sql_gen.py: Added all_cols
- 07:17 PM Revision 3088: sql_gen.py: Use new as_Name() instead of db.esc_name()
- 07:12 PM Revision 3087: sql_gen.py: Name: Truncate the input name
- 07:11 PM Revision 3086: sql_gen.py: Added Name class and associated functions
- 06:46 PM Revision 3085: sql.py: create_table(): Support creating temp tables. This fixes a bug in copy_table_struct() where the created table was not a temp table if the source table was. copy_table_struct(): Removed no longer needed versioning because that is now handled by create_table().
- 06:33 PM Revision 3084: sql.py: Added copy_table_struct()
- 06:32 PM Revision 3083: sql.py: Moved add_indexes() to Indexes subsection
- 06:30 PM Revision 3082: sql.py: create_table(): Support LIKE table
- 05:18 PM Revision 3081: Moved Data cleanup from sql.py to sql_io.py
- 05:18 PM Revision 3080: Moved error tracking from sql.py to sql_io.py
- 05:12 PM Revision 3079: sql.py: Organized Database structure introspection and Structural changes functions into subsections
- 04:56 PM Revision 3078: Moved error tracking from sql.py to sql_io.py
- 04:46 PM Revision 3077: Moved Heuristic queries from sql.py to new sql_io.py
- 04:32 PM Revision 3076: Added top-level analysis dir for range modeling
- 04:02 PM Revision 3075: sql.py: run_query_into(): Documented why analyze() must be run manually on newly populated temp tables
- 03:57 PM Revision 3074: sql.py: DbConn: Added autoanalyze mode. Added autoanalyze() which runs analyze() only if in autoanalyze mode. Use new autoanalyze() in functions that change a table's contents.
- 03:52 PM Revision 3073: sql.py: run_query_into(): analyze() the created table to ensure the query planner's initial stats are accurate
06/25/2012
- 09:46 PM Revision 3072: inputs/SpeciesLink/src: Added custom header that overwrites existing header so that column names will not be too long for the staging table
- 09:35 PM Revision 3071: cat_csv: Support overwriting the existing header using a separate header file
- 08:49 PM Revision 3070: schemas/vegbien.sql: Added location.location_coords index to speed up large imports by providing an index for merge joins
- 08:43 PM Revision 3069: csv2db: Reanalyze table, so that query planner stats are up to date even though the table doesn't need to be vacuumed anymore
- 08:42 PM Revision 3068: sql.py: Added analyze()
- 08:11 PM Revision 3067: csv2db: Removed no longer needed table vacuum (cleanup_table() now avoids creating dead rows)
- 08:10 PM Revision 3066: sql.py: cleanup_table(): Use update()'s new in_place mode to avoid needing to vacuum the table
- 08:02 PM Revision 3065: sql.py: mk_update(): in_place: Support updating multiple columns at once
- 07:44 PM Revision 3064: sql.py: update() calls: Use in_place where possible to avoid creating dead rows, which bloats table size
- 07:37 PM Revision 3063: sql.py: DbConn.col_info(): Support user-defined types
- 07:33 PM Revision 3062: sql_gen.py: Added Nullif
- 07:27 PM Revision 3061: sql_gen.py: Added Coalesce class and use it in EnsureNotNull
- 07:15 PM Revision 3060: sql_gen.py: Added coalesce and use it in EnsureNotNull
- 07:00 PM Revision 3059: sql.py: DbConn.col_info(): Don't cache the structure query because the column type, etc. may change between calls. This fixes a bug in mk_update() where the column type would be retrieved before a NOT NULL constraint was added, causing the NOT NULL constraint not to be in the cache info about the column.
- 06:56 PM Revision 3058: sql.py: mk_update(): Implemented in_place mode
- 06:39 PM Revision 3057: sql.py: mk_update(): Factored out filtering of input values so only `.to_str(db)` is used inline in the creation of the query
- 06:32 PM Revision 3056: sql.py: mk_update(): Added in_place param
- 06:13 PM Revision 3055: csvs.py: TsvReader: Prevent "new-line character seen in unquoted field" errors by replacing '\r' with '\n'
- 06:12 PM Revision 3054: csv2db: Adding indexes: Fixed bug where sql.add_index()'s ensure_not_null param needed to be renamed to ensure_not_null_
- 05:16 PM Revision 3053: sql.py: cast(): columns values clause: Use start=0 to avoid "SELECT statement missing a WHERE, LIMIT, or OFFSET clause" warning
- 02:45 PM Revision 3052: inputs/import.stats.xls: Updated for most recent run
06/22/2012
- 07:46 PM Revision 3051: sql.py: Removed no longer needed mk_track_data_error()
- 07:45 PM Revision 3050: sql.py: track_data_error(): Use for loop and insert() (ignoring DuplicateKeyException) to insert entries into the errors table, to get the same optimization benefits this change provides in other filter-out contexts, and to improve clarity
- 07:35 PM Revision 3049: sql.py: cast(): Use FOR loop with EXCEPTION block instead of CROSS JOIN with LEFT JOIN to insert entries into the errors table, to get the same optimization benefits this change provides in other filter-out contexts, and to improve clarity
- 07:31 PM Revision 3048: sql_gen.py: NamedValues: Support None cols param for no named columns
- 06:55 PM Revision 3047: sql.py: ensure_not_null(): Made the use of index columns configurable based on the # of rows in the table, because for small datasources, they seem to add 6-25% to the total import time
- 06:12 PM Revision 3046: xml_func.py: _noCV: Fixed bug where assumed items was an iterator when it's now a list
- 06:02 PM Revision 3045: sql.py: add_index_col(), cast_temp_col(): Cache add_col() by providing new comment param to distinguish new columns of the same (colliding) suffixed name but from different source columns
- 05:54 PM Revision 3044: sql.py: add_index_col(), cast_temp_col(): Cache the update that fills in the new column, since it's idempotent
- 05:52 PM Revision 3043: sql.py: update(): Pass cacheable to run_query()
- 05:48 PM Revision 3042: sql.py: add_col(): Added comment param which can be used to distinguish columns of the same name from each other when they contain different data, to allow the ADD COLUMN query (and related queries, such as adding indexes) to be cached
- 05:46 PM Revision 3041: sql_gen.py: Added esc_comment()
- 05:37 PM Revision 3040: sql.py: DbConn.DbCursor.execute(): Allow ADD COLUMN to be cached if it has a distinguishing comment, because then the rest of query will be unique in the face of name collisions
- 05:29 PM Revision 3039: sql.py: add_col(): Document that name may be versioned, so caller needs to propagate any renaming back to any source column for the TypedCol
- 05:27 PM Revision 3038: sql.py: add_col() callers: Removed column name versioning because that is now handled by add_col()
- 05:22 PM Revision 3037: sql.py: add_col() callers: Fixed bug where needed to propagate any renaming of typed column back to regular column
- 05:19 PM Revision 3036: sql.py: add_col(): Version column names to avoid collisions. (Previously, callers were required to do this themselves.)
- 05:13 PM Revision 3035: sql.py: cast_temp_col(): Handle column name collisions like add_index_col()
- 04:49 PM Revision 3034: sql.py: mk_insert_select(): INSERT IGNORE: Switched to using FOR loop rather than cursors because cursors are only needed if you want to process multiple rows in the same EXCEPTION block (which you can't do because then all the previous inserts will be rolled back if one row is a duplicate key)
- 04:40 PM Revision 3033: sql.py: mk_insert_select(): INSERT IGNORE: Moved FETCH FROM cursor outside EXCEPTION block since only the insert needs to have unique_violations ignored
- 04:33 PM Revision 3032: sql.py: put_table(): Removed no longer needed EXCLUSIVE lock on the output table
- 04:25 PM Revision 3031: add_index_col(): Run all operations with log_level=3 because it is the equivalent of creating an index, and that has log_level=3 to be excluded from the primary algorithm steps. add_not_null(): Run query with log_level=3 because it's an internal operation and for add_index_col() above.
- 04:23 PM Revision 3030: sql.py: update(): Pass any log_level to run_query()
- 04:11 PM Revision 3029: xml_func.py: process(): Added pass-through optimization for aggregating functions with one arg
- 04:02 PM Revision 3028: xml_func.py: process(): Call list() on the items' iterator so that its length can be determined wherever it's used without needing to call list() on it separately
- 03:43 PM Revision 3027: sql.py: mk_select(): Put '*' on same line as 'SELECT' if there is no DISTINCT ON, to avoid extra newlines
- 03:38 PM Revision 3026: sql_gen.py: NamedTable.to_str(): Put AS clause on same line as code being named if it didn't contain a newline, to avoid extra newlines
- 03:29 PM Revision 3025: sql.py: run_query(): Always parse exceptions, whether recover is set or not, to avoid making the caller set recover just to parse exceptions. If recover is not set, just don't run any queries when generating the parsed exception and return an empty value for the needed information. (A partial parsed exception is better than an unparsed one.)
- 03:18 PM Revision 3024: sql_gen.py: is_temp_col(): Use is_table_col() so that the type of the parameter is also checked
- 03:15 PM Revision 3023: inputs/import.stats.xls: Bolded the ms/row totals. Added note that non-bolded totals are not directly comparable because different methods were run with different numbers of rows.
- 03:10 PM Revision 3022: sql.py: put_table(): Inserting new rows: Removed no longer needed filter-out LEFT JOIN, since the INSERT IGNORE replacement handles duplicate keys
- 03:02 PM Revision 3021: sql.py: put_table(): Inserting new rows: Use insert_select()'s (mk_insert_select()'s) new ignore option to ignore duplicate keys
- 02:59 PM Revision 3020: sql_gen.py: to_name_only_col(): Consider any non-table column, including NamedCol, to be already name-only. This fixes a bug in sql.mk_insert_select() where the value of a NamedCol was removed by to_name_only_col() even though it was needed.
- 02:30 PM Revision 3019: sql.py: mk_insert_select(): INSERT IGNORE: Use an EXCEPTION block for each individual row because "When an error is caught by an EXCEPTION clause, [...] all changes to persistent database state within the block are rolled back." (http://www.postgresql.org/docs/8.3/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING). Documented that cursor stays at current position in spite of automatic ROLLBACK on EXCEPTION.
- 02:11 PM Revision 3018: sql.py: mk_insert_select(): INSERT IGNORE: Use RETURN QUERY on the insert query instead of RETURN NEXT on a local var to forward the RETURNING value
- 02:08 PM Revision 3017: sql.py: mk_insert_select(): INSERT IGNORE: Support RETURNING clause. Always return something to set the correct rowcount.
- 02:06 PM Revision 3016: sql_gen.py: Use an as_*() function instead of manually checking the type wherever possible
- 01:18 PM Revision 3015: sql.py: mk_insert_select(): INSERT IGNORE: Added duplicate key handling by using EXCEPTION block to catch unique_violation
- 12:58 PM Revision 3014: sql.py: mk_insert_select(): INSERT IGNORE: Fixed bug where user-defined types were not supported correctly, by referencing the column's type directly in PL/pgSQL instead of retrieving it from the information_schema. Use a row variable of the output table's row type to store the select query data, so that you don't need to create a separate local var for each output column.
06/21/2012
- 10:24 PM Revision 3013: import.stats.xls: Changed \/row (ms)" to "ms/row"
- 09:52 PM Revision 3012: sql.py: mk_insert_select(): INSERT IGNORE: Used typed vars to store each column of the select query because the select query's column names are not known, so it would be unknown which record var column to use for each output column. (Add types so there are no unknown-type values that PostgreSQL will say can't be cast.)
- 09:21 PM Revision 3011: sql.py: insert_select(): In ignore mode, always recover from errors
- 09:14 PM Revision 3010: sql.py: mk_insert_select(): INSERT IGNORE: Implemented simple cursor loop that just inserts each row, without ignoring duplicate keys
- 08:26 PM Revision 3009: sql.py: mk_insert_select(): Added ignore param to do an INSERT IGNORE equivalent operation. Added basic structure for INSERT IGNORE. Support embeddable queries without RETURNING columns (which is the case for most INSERT IGNORE queries).
- 06:50 PM Revision 3008: sql.py: run_query_into(): Always set into.schema to None, so that table is created in the temp schema in debug_temp mode
- 06:25 PM Revision 3007: sql.py: DbConn.DbCursor.execute(): Don't cache ADD COLUMN queries because the rest of the query must be unique in order to use caching
- 06:24 PM Revision 3006: sql.py: add_col(): Don't ignore already-existing columns because sometimes name truncation causes collisions, requiring the caller to version the name. add_index_col(): Version the column name to avoid collisions.
- 06:08 PM Revision 3005: sql.py: add_index(): Version index names to avoid collisions
- 05:58 PM Revision 3004: sql.py: DbConn.DbCursor.execute(): Always cache structural changes, not just if they throw an exception
- 05:40 PM Revision 3003: sql.py: add_index(): Fixed bug where expr could not be deep-copied until ensure_not_null() had been run on it, because ensure_not_null() modifies the index_cols of the table and this needs to be modified on the original table
- 05:39 PM Revision 3002: sql.py: add_index(): Fixed bug where expr could not be deep-copied until ensure_not_null() had been run on it, because ensure_not_null() modifies the index_cols of the table and this needs to be modified on the original table
- 05:12 PM Revision 3001: sql_gen.py: EnsureNotNull: Added to_str() which uses the column's index col if available. This works around a PostgreSQL bug where indexes on expressions are sometimes not used, but indexes on plain columns are used.
- 05:09 PM Revision 3000: sql_gen.py: Store index cols in the table instead of in the column, because columns are often recreated from the table and name but tables are generally not copied
- 04:21 PM Revision 2999: sql_gen.py: Added is_indexed_col()
- 04:15 PM Revision 2998: sql.py: add_index(): Use new sql.ensure_not_null(), which creates a separate column to store the index expr where possible
- 04:12 PM Revision 2997: sql.py: Added add_not_null(), add_index_col(), ensure_not_null()
- 04:10 PM Revision 2996: sql_gen.py: Added is_temp_col()
- 03:57 PM Revision 2995: sql_gen.py: concat(): Also preserve cast suffix
- 03:44 PM Revision 2994: sql_gen.py: Col: Added index_col attr
- 03:19 PM Revision 2993: sql_gen.py: Added suffixed_col()
- 02:59 PM Revision 2992: sql.py: run_query_into(): Set the into table to be a temp table
- 02:59 PM Revision 2991: sql_gen.py: Table: Store whether table is temp table
- 01:50 PM Revision 2990: sql_gen.py: concat(): Also preserve appended column names
- 01:48 PM Revision 2989: sql_gen.py: Col.to_str(): Take for_str param which does both concat() and str()/clean_name() instead of use_concat param which doesn't remove quotes from the strings before concatenating, causing strings to be incorrectly truncated
06/20/2012
- 09:46 PM Revision 2988: sql_gen.py: EnsureNotNull: Cast the null sentinel to the type it's the sentinel for, so that COALESCE() returns the same type as the provided type, instead of potentially incorrectly assuming the constant's type. This fixes a bug where COALESCE() returned an incorrect type.
- 09:38 PM Revision 2987: sql.py: cast(): Use new sql_gen.Cast
- 09:38 PM Revision 2986: sql_gen.py: Added Cast
- 09:37 PM Revision 2985: sql_gen.py: concat(): Consider ")"s part of the existing suffix, so that parentheses match up
- 09:05 PM Revision 2984: Removed extra trailing whitespace
- 08:36 PM Revision 2983: main Makefile: Added explicit schemas/temp/reinstall target because vegbiendev (Ubuntu 12.04) seems to ignore %/reinstall
- 08:24 PM Revision 2982: vegbien_dest: Added temp to schemas list so the intermediate tables will be created there, where they are easy to remove with `make schemas/temp/reinstall`
- 08:23 PM Revision 2981: main Makefile: Added %/reinstall, which calls the corresponding uninstall and install targets
- 08:11 PM Revision 2980: Added temp schema to store intermediate tables permanently when desired for debugging. It is not part of schemas/all because it should not be remade to include current intermediate tables when remaking other schema exports.
- 08:04 PM Revision 2979: schemas: Moved *.sql.make into Makefile to take advantage of % pattern matching
- 07:47 PM Revision 2978: Removed no longer used schemas/vegbien_empty.sql
- 07:46 PM Revision 2977: main Makefile: Removed empty_db, because `make schemas/reinstall` has the same effect and is simpler
- 07:40 PM Revision 2976: README.TXT: Changed documentation to use make schemas/reinstall to empty the DB, since that command is simpler. Added how to archive the last import.
- 07:11 PM Revision 2975: db_xml.py: put_table(): Removed `if not db.debug_temp` check because that is done by sql.empty_temp()
- 07:10 PM Revision 2974: sql.py: put_table(): Use new empty_temp()
- 07:06 PM Revision 2973: import.stats.xls: Added comments for estimated numbers. Added "," separators to large numbers.
- 06:21 PM Revision 2972: sql.py: empty_temp(): In debug_temp mode, leave temp tables there for debugging
- 06:06 PM Revision 2971: sql.py: empty_temp(): Don't output at log_level 2 because it's an internal query, not part of the core algorithm
- 06:06 PM Revision 2970: sql.py: truncate(): Added kw_args to pass to run_query()
- 05:52 PM Revision 2969: Added inputs/import.stats.xls, which compares row-based and column-based import. This shows that column-based import is slowed down by table locking when run simultaneously, so we will need a new INSERT IGNORE replacement that doesn't lock tables.
- 03:14 PM Revision 2968: inputs: Ignore OpenOffice.org lock files
- 02:19 PM Revision 2967: sql.py: empty_temp(): Don't print log message if not emptying any tables
- 02:16 PM Revision 2966: db_xml.py: put_table(): Empty unneeded temp tables to free up memory
- 02:14 PM Revision 2965: sql.py: Added empty_temp()
- 02:14 PM Revision 2964: sql.py: Use new lists.mk_seq()
- 02:13 PM Revision 2963: lists.py: Added mk_seq()
- 02:11 PM Revision 2962: lists.py: is_seq(): Also return true for sets
06/19/2012
- 03:02 PM Revision 2961: schemas/vegbien.sql: specimenreplicate: Added indexes using COALESCE() to match what sql_gen does
- 02:08 PM Revision 2960: sql.py: put_table(): Getting output table pkeys of existing/inserted rows: Do a DISTINCT ON the input pkey (row_num) in case the plain JOIN matched multiple output table rows for one input table row
- 01:44 PM Revision 2959: sql.py: put_table(): Empty unneeded temp tables to free up memory and avoid running out of memory (the temp tables seem to be in-memory only)
- 01:30 PM Revision 2958: sql_gen.py: null_sentinels: Added value for type timestamp with time zone. Put each type on its own line for clarity.
- 01:03 PM Revision 2957: schemas/vegbien.sql: locationdetermination: Changed indexes to use COALESCE() to match what sql_gen now does
- 12:23 PM Revision 2956: schemas/vegbien.sql: location: Added indexes using COALESCE() to match what sql_gen does
- 12:06 PM Revision 2955: sql.py: cast_temp_col(): Add an index on the created column
- 11:55 AM Revision 2954: sql_gen.py: null_sentinels: Added value for type double precision
- 11:52 AM Revision 2953: sql_gen.py: ensure_not_null(): Warn of no null sentinel for type, even if caller catches error
- 10:11 AM Revision 2952: schemas/py_functions.sql: Added plain function _namePart() and use it in trigger function _namePart()
- 09:42 AM Revision 2951: schemas/py_functions.sql: Added plain functions _dateRangeStart() and _dateRangeEnd() and use them in trigger functions _dateRangeStart() and _dateRangeEnd()
- 09:28 AM Revision 2950: schemas/functions.sql: _label(): Ensure that label is NOT NULL so it doesn't NULL out the entire string
- 09:23 AM Revision 2949: schemas/functions.sql: Added plain function _nullIf() and use it in trigger function _nullIf()
- 08:56 AM Revision 2948: sql.py: DbConn.DbCursor._cache_result(): Corrected comment to reflect why different types of queries are cached differently
- 08:46 AM Revision 2947: sql.py: add_col(): Catch DuplicateExceptions so that columns that already existed are ignored
- 08:43 AM Revision 2946: sql.py: run_query(): DuplicateException: Also match "column already exists" errors
- 08:20 AM Revision 2945: sql.py: Merged DuplicateTableException and DuplicateFunctionException into one exception DuplicateException, with a type variable for the type of duplicate item. Added ExceptionWithNameType.
- 08:05 AM Revision 2944: schemas/functions.sql: Fixed bug where external function calls needed to be schema-qualified in case functions schema is not in the search_path
- 07:59 AM Revision 2943: schemas/functions.sql: Added plain function _label() and use it in trigger function _label()
- 07:50 AM Revision 2942: sql.py: put_table(): Support plain SQL functions in addition to relational functions
06/18/2012
- 11:08 PM Revision 2941: sql_gen.py: Added NamedArg. FunctionCall: Support named arguments (http://www.postgresql.org/docs/9.0/static/sql-syntax-calling-funcs.html).
- 10:54 PM Revision 2940: schemas/functions.sql: Added plain function _merge() and use it in trigger function _merge()
- 10:49 PM Revision 2939: schemas/functions.sql: Added plain function _alt() and use it in trigger function _alt()
- 10:37 PM Revision 2938: schemas/functions.sql: Removed no longer used ensure_not_null()
- 10:22 PM Revision 2937: sql.py: put_table(): MissingCastException: Use cast_temp_col() so that cast will occur before any main insert, which locks the output table and should take as little time as possible
- 10:18 PM Revision 2936: sql.py: Added cast_temp_col()
- 10:17 PM Revision 2935: sql.py: add_col(): Support additional run_query() kw_args. add_row_num(): Use new add_col().
- 10:09 PM Revision 2934: sql.py: Added add_col()
- 08:17 PM Revision 2933: sql_gen.py: Col.__str__(): Truncate any table name using concat() to ensure that the full column name is included in the string
- 07:59 PM Revision 2932: strings.py, sql_gen.py: Renamed add_suffix() to concat() to reflect that this is a fixed-length replacement for +
- 07:49 PM Revision 2931: sql.py: put_table(): Moved MissingCastException to the top of the exceptions list because it's more of a core exception than the others, and will be raised before any rows are even inserted
- 06:20 PM Revision 2930: sql.py: DbConn.with_savepoint(): Always release savepoint, because after ROLLBACK TO SAVEPOINT, "The savepoint remains valid and can be rolled back to again" (http://www.postgresql.org/docs/8.3/static/sql-rollback-to.html). Moved `self._savepoint -= 1` to the main try block's new finally block.
- 05:59 PM Revision 2929: sql.py: put_table(): Lock output table right before, and in the same nested transaction as, the insert statement that needs lock, so that it is not released in a prior autocommit and is held for as little time as possible
- 05:38 PM Revision 2928: db_xml.py: put_table(): Removed no longer needed commit param
- 05:16 PM Revision 2927: bin/map: Removed rollback() call before closing the connection because PostgreSQL does this automatically
- 05:14 PM Revision 2926: csv2db: Log inserts with log_level=5 so they are not shown for verbosity 4, which is used to see the savepoints and autocommits
- 05:13 PM Revision 2925: Removed unnecessary db.db.commit() calls because commits are now done automatically by DbConn's autocommit mode
- 04:54 PM Revision 2924: sql.py: DbConn.do_autocommit(): Output the "Autocommitting" debug message with level=4 so that it doesn't clutter up the logging output for normal verbosities
- 04:50 PM Revision 2923: DbConn: autocommit mode defaults to True so that all scripts get the benefit of automatic commits
- 04:49 PM Revision 2922: input.Makefile: Staging tables: import/install-%: Include the table name in the log file name so that successive tables for the same datasource don't overwrite the same log file
- 04:39 PM Revision 2921: sql.py: DbConn: Don't always autocommit in debug_temp mode, because this could cause autocommit mode to be turned on when the user does not expect it
- 04:36 PM Revision 2920: bin/map: connect_db(): Autocommit in commit mode to avoid the need for manual commits. This should also reduce the time that table locks are held, to avoid unnecessary contention when multiple processes are trying to insert into the same output table. (The program always uses nested transactions to support rollbacks, so there is no problem autocommitting whenever a top-level nested transaction or top-level query completes.)
- 04:29 PM Revision 2919: sql_gen.py: Removed TempFunction because that functionality is now provided by DbConn.TempFunction()
- 04:28 PM Revision 2918: sql.py: Use new DbConn.TempFunction()
- 04:28 PM Revision 2917: sql.py: DbConn: Added TempFunction()
- 04:25 PM Revision 2916: sql.py: Use new DbConn.debug_temp config option to control whether temporary objects should instead be permanent
- 04:20 PM Revision 2915: sql.py: DbConn: Added config option debug_temp
- 04:12 PM Revision 2914: sql.py: function_exists(): Fixed bug where trigger functions needed to be excluded, since they cannot be called directly
- 03:49 PM Revision 2913: sql.py: Added function_exists()
- 03:49 PM Revision 2912: sql_gen.py: Made Function an alias of Table so that isinstance(..., Function) will always work correctly
- 03:45 PM Revision 2911: sql_gen.py: Added as_Function()
06/15/2012
- 06:16 AM Revision 2910: sql.py: put_table(): Lock the output table in EXCLUSIVE mode *before* getting its pkey so that an ACCESS SHARE lock is not acquired before EXCLUSIVE (causing a lock upgrade and deadlock). This race condition may not have been previously noticeable because pkey() is cached, so calling it doesn't necessarily execute a query or acquire an ACCESS SHARE lock.
- 05:52 AM Revision 2909: sql.py: put_table(): Document that must be run at the *beginning* of a transaction
- 05:49 AM Revision 2908: sql.py: put_table(), mk_select(): Switched back to having put_table() acquire the EXCLUSIVE locks, but right at the beginning of the transaction, in order to avoid lock upgrades which cause deadlocks
- 05:35 AM Revision 2907: sql.py: with_autocommit(): Only allow turning autocommit on, because the opposite is not meaningful and may conflict with the session-global isolation level
- 05:33 AM Revision 2906: sql.py: DbConn: Set the transaction isolation level to READ COMMITTED using set_isolation_level() so that the isolation level affects all transactions in the session, not just the current one
- 05:21 AM Revision 2905: sql.py: DbConn: Always set the transaction isolation level to READ COMMITTED so that when a table is locked for update, its contents are frozen at that point rather than earlier. This ensures that no concurrent duplicate keys were inserted between the time the table was snapshotted (at the beginning of the transaction for SERIALIZABLE) and the time it was locked for update.
- 05:02 AM Revision 2904: sql.py: put_table(): Removed locking output tables to prevent concurrent duplicate keys because that is now done automatically by mk_select()
- 05:01 AM Revision 2903: sql.py: mk_select(): Filtering on no match: Lock the joined table in EXCLUSIVE mode to prevent concurrent duplicate keys when used with INSERT SELECT
- 04:59 AM Revision 2902: sql_gen.py: Added underlying_table() and use it in underlying_col()
- 04:39 AM Revision 2901: main Makefile: schemas/rotate: Fixed bug where needed to run schemas/public/install, not full schemas/install, after renaming public schema
- 04:32 AM Revision 2900: sql.py: put_table(): Lock output tables to prevent concurrent duplicate keys
- 04:31 AM Revision 2899: sql.py: Added lock_table()
- 03:53 AM Revision 2898: bin/map: connect_db(): Only use autocommit mode if verbosity > 3, to avoid accidentally activating it if you want debug output in normal import mode
- 03:45 AM Revision 2897: bin/map: connect_db(): Only use autocommit mode if verbosity > 2, because it causes the intermediate tables to be created as permanent tables, which you don't want unless you're actually debugging (verbosity = 2 is normal for column-based import)
- 03:25 AM Revision 2896: sql.py: put_table(): remove_all_rows(): Changed log message to "Ignoring all rows" because NULL is not necessarily the pkey value that will be returned for the rows
- 03:17 AM Revision 2895: sql.py: put_table(): Don't add index on columns that will have values filtered out, because indexes have already been added on all columns in the iteration's input table by flatten()
- 03:12 AM Revision 2894: sql.py: DbConn._db(): Setting serializable isolation level: Always set this (if self.serializable is set), even in autocommit mode, because autocommit mode is implemented by manual commits in the DbConn wrapper object rather than using the underlying connection's autocommit mode (which does not allow setting the isolation level)
- 03:08 AM Revision 2893: sql.py: DbConn._db(): Setting search_path: Use `SET search_path` and `SHOW search_path` instead of combining the old and new search_paths in SQL itself using `SELECT set_config('search_path', ...)`
- 02:47 AM Revision 2892: csv2db: ProgressInputStream: Use default progress message 'Read %d line(s)' because there is not necessarily one CSV row per line, due to embedded newlines
- 01:47 AM Revision 2891: input.Makefile: Staging tables: import/install-%: Only output to the log file if log option is non-empty (which it is by default)
- 01:46 AM Revision 2890: csv2db: Support reinstalling just the errors table using new errors_table_only option
- 01:45 AM Revision 2889: sql.py: Added drop_table()
- 01:20 AM Revision 2888: schemas/vegbien.sql: method: Changed indexes to use `COALESCE(..., E'\\N')` to match what sql_gen now does
- 01:16 AM Revision 2887: schemas/vegbien.sql: specimenreplicate: Added indexes using COALESCE() to match what sql_gen does
- 01:12 AM Revision 2886: schemas/vegbien.sql: locationevent: Added indexes using COALESCE() to match what sql_gen does
- 12:57 AM Revision 2885: schemas/vegbien.ERD.mwb: Synced with schema
- 12:54 AM Revision 2884: schemas/vegbien.sql: party: Changed indexes to use `COALESCE(..., E'\\N')` to match what sql_gen now does
- 12:38 AM Revision 2883: Wrap sys.stderr.write() calls in strings.to_raw_str() to avoid UnicodeEncodeErrors when stderr is to a file and the default encoding is ASCII
- 12:37 AM Revision 2882: strings.py: Added to_raw_str()
- 12:12 AM Revision 2881: bin/map: When logging the row # being processed, add 1 because row # is interally 0-based, but 1-based to the user
- 12:05 AM Revision 2880: bin/map: Log the row # being processed with level=1.1 so that the user can see a status report if desired
06/14/2012
- 11:35 PM Revision 2879: exc.py: str_(): Fixed bug where UnicodeEncodeError would be raised when msg contains non-ASCII chars, by wrapping e.args[0] in strings.ustr()
- 11:23 PM Revision 2878: exc.py: print_ex(): Wrap msg in strings.to_unicode() to try to avoid UnicodeEncodeError when msg contains non-ASCII chars
- 11:20 PM Revision 2877: sql.py: create_table(): Don't set pkey.nullable to False because the caller should make sure the pkey has the appropriate type
- 11:20 PM Revision 2876: csv2db: Use sql_gen.TypedCol.nullable instead of manually adding 'NOT NULL' to the type. Ensure that pkeys are properly NOT NULL.
- 11:15 PM Revision 2875: csv2db: Adding indexes: Create plain indexes using ensure_not_null=False because the indexes will primarily be used by the user to search for specific values, rather than by the mapping script which uses the ensure_not_null
- 11:13 PM Revision 2874: sql.py: DbConn.col_info(): Run query with log_level=4 because it gathers information about database structure, and should have the same log_level as other queries that do that
- 11:09 PM Revision 2873: csv2db: Adding indexes: Fixed bug where col.to_Col() could not be used because sql.add_index() does not support name-only columns (plain strings are OK, though)
- 11:00 PM Revision 2872: sql.py: create_table(): has_pkey: Use new TypedCol.constraints to store 'PRIMARY KEY'
- 10:59 PM Revision 2871: sql_gen.py: TypedCol: Added constraints instance var
- 10:38 PM Revision 2870: sql_gen.py: EnsureNotNull: Made coalesce() all uppercase to match how pg_dump spells it
- 10:36 PM Revision 2869: schemas/vegbien.sql: namedplace: Fixed bug where parent_id needed to be included in UNIQUE CONSTRAINT (now UNIQUE INDEX), since there can be more than one e.g. city of the same name if they are in different states
- 10:32 PM Revision 2868: schemas/vegbien.sql: plantname: Changed indexes to use `COALESCE(..., 2147483647)` to match what sql_gen now does
- 10:22 PM Revision 2867: schemas/py_functions.sql: _dateRangeStart, _dateRangeEnd: Changed indexes to use `COALESCE(..., E'\N')` to match what sql_gen now does
- 10:19 PM Revision 2866: schemas/py_functions.sql: _namePart: Changed indexes to use `COALESCE(..., E'\N')` to match what sql_gen now does
- 10:16 PM Revision 2865: schemas/functions.sql: _nullIf: Changed indexes to use `COALESCE(..., E'\N')` to match what sql_gen now does
- 10:15 PM Revision 2864: schemas/functions.sql: _nullIf: Require a non-NULL null-equivalent value
- 10:12 PM Revision 2863: schemas/functions.sql: _label: Changed indexes to use `COALESCE(..., E'\N')` to match what sql_gen now does
- 10:09 PM Revision 2862: schemas/functions.sql: _label: Require a non-NULL label
- 10:04 PM Revision 2861: sql_gen.py: null_sentinels: Removed types where a sentinel doesn't make sense (unknown types, boolean) because types with no sentinel are now handled gracefully by users of ensure_not_null()
- 10:00 PM Revision 2860: sql.py: add_index(): ensure_not_null: Handle unknown types gracefully
- 09:52 PM Revision 2859: sql_gen.py: MockDb: Added col_info()
- 09:48 PM Revision 2858: sql_gen.py: Use as_*() functions where the auto-wrapping was previously done manually
- 09:44 PM Revision 2857: sql_gen.py: CompareCond.to_str(): Use ensure_not_null()'s new type_ param to apply same function to both sides but not if the right side is already NOT NULL
- 09:26 PM Revision 2856: sql_gen.py: null_sentinels: Added value for character varying type
- 08:45 PM Revision 2855: sql_gen.py: ensure_not_null(): Support non-column inputs if type_ is set
- 08:32 PM Revision 2854: sql_gen.py: null_sentinels: Added value for USER-DEFINED type
- 08:31 PM Revision 2853: sql.py: mk_select(): Joins: Filtering on no match: Use '~=' sql_gen.CompareCond operator so that IS NULL is always used, regardless of the not-null column's nullability
- 08:24 PM Revision 2852: sql_gen.py: null_sentinels: Added value for boolean type
- 08:21 PM Revision 2851: sql_gen.py: ensure_not_null(): Added type_ param to override the underlying column's type
- 08:09 PM Revision 2850: sql_gen.py: EnsureNotNull: Take a type param instead of a null param so that the EnsureNotNull object stores the underlying column's type
- 08:00 PM Revision 2849: sql_gen.py: underlying_col(): Support non-Col inputs
- 07:42 PM Revision 2848: sql_gen.py: EnsureNotNull: Removed default value for null param to remind user that default value depends on value's type and will not always be a string
- 07:39 PM Revision 2847: sql.py: add_index(): Added ensure_not_null param to disable the ensure_not_null functionality to force a plain index
- 07:14 PM Revision 2846: sql.py: flatten(): Add indexes on the created table so its columns can be used in an O(n) merge join
- 07:12 PM Revision 2845: sql_gen.py: null_sentinels: Added value for integer type
- 06:52 PM Revision 2844: sql_gen.py: CompareCond.to_str(): Always wrap the left-side column if it's nullable. Wrap the right-side value if the left side was wrapped, rather than if both the left and right side are nullable. This causes coalesce() indexes to be used to look up NULL values using the value NULL gets coalesced to, rather than doing a sequential scan.
- 06:43 PM Revision 2843: sql_gen.py: Run truncate() on all identifiers so that literal-string-based lookups for an identifier (such as in db.col_info()) don't use the untruncated value
- 06:40 PM Revision 2842: sql_gen.py: Added truncate()
- 06:40 PM Revision 2841: sql.py: put_table(): Resolving default value column: Fixed bug where default value of None was used as a key for mapping, even though this is an invalid Col name
- 05:30 PM Revision 2840: sql_gen.py: ensure_not_null(): If input column cannot be ensured to be NULL, pass any raised exception through rather than suppressing it and leaving the column in a nullable state
06/13/2012
- 11:11 PM Revision 2839: schemas/functions.sql: _merge: Changed indexes to use `COALESCE(..., E'\\N')` to match what sql_gen now does
- 11:08 PM Revision 2838: schemas/functions.sql: _alt: Changed indexes to use `COALESCE(..., E'\\N')` to match what sql_gen now does
- 10:59 PM Revision 2837: sql_gen.py: CompareCond.to_str(): Handle nullable columns using ensure_not_null()
- 10:46 PM Revision 2836: sql_gen.py: ensure_not_null(): Raise NoUnderlyingTableException if can't ensure not null for that reason
- 10:20 PM Revision 2835: sql_gen.py: is_underlying_table(): Support non-Table inputs
- 10:15 PM Revision 2834: sql_gen.py: NamedValues: Call set_cols_table() with the created table, not just the name, so that is_underlying_table() works properly
- 09:49 PM Revision 2833: sql_gen.py: underlying_col(): If no underlying table, raise NoUnderlyingTableException
- 09:45 PM Revision 2832: sql_gen.py: Added is_underlying_table()
- 09:34 PM Revision 2831: sql_gen.py: ensure_not_null(): Call underlying_col() on the column to remove all renamings
- 09:33 PM Revision 2830: sql_gen.py: Added underlying_col()
- 09:24 PM Revision 2829: sql_gen.py: Join.to_str(): join(): Removed no longer needed `*_table = *_table.to_Table()`
- 09:22 PM Revision 2828: sql_gen.py: Col: Support Table objects that are not just names, by calling `.to_Table()` on the table before stringifying it
- 09:22 PM Revision 2827: sql_gen.py: Col: Support Table objects that are not just names, by calling `.to_Table()` on the table before stringifying it
- 08:59 PM Revision 2826: sql_gen.py: ensure_not_null(): Added ignore_unknown_type param
- 08:55 PM Revision 2825: sql_gen.py: CompareCond.to_str(): Put handling nullable columns as a separate step so it can be expanded
- 08:36 PM Revision 2824: csv2db: Errors table: Removed no longer needed sql_gen.EnsureNotNull() because this is now added automatically
- 08:35 PM Revision 2823: sql.py: add_index(): Handle nullable columns by using sql_gen.ensure_not_null() to automatically add a coalesce() wrapper where needed
- 08:33 PM Revision 2822: sql_gen.py: Added ensure_not_null()
- 08:29 PM Revision 2821: sql.py: DbConn.col_info(): Fixed bug where is_nullable needed to be cast to a boolean
- 08:06 PM Revision 2820: sql.py: cast(): Support string column name inputs
- 07:12 PM Revision 2819: sql.py: DbConn: Renamed col_default() to col_info() and have it return a sql_gen.TypedCol object containing all the TypedCol info about the column, not just the default value
- 06:55 PM Revision 2818: sql_gen.py: TypedCol: Added default and nullable params
- 06:53 PM Revision 2817: dicts.py: Import util after items that util depends on have been defined, to avoid unsatisfied circular dependency
- 06:30 PM Revision 2816: sql.py: DbConn.col_default(): Pass the connection to sql_gen.as_Code() so it fixes the syntax on values returned by PostgreSQL
- 06:29 PM Revision 2815: sql_gen.py: as_Code(): Added optional db param, which causes the function to run db.std_code() on the value to fix the syntax
- 06:27 PM Revision 2814: sql.py: DbConn: Added std_code()
- 03:02 PM Revision 2813: db_xml.py: Removed into_table_name() because this functionality is now handled by sql.into_table_name()
- 02:59 PM Revision 2812: sql.py: into_table_name(): Also parse hierarchical tables (mappings with a rank column) using a special syntax
- 02:44 PM Revision 2811: sql.py: put_table(): Fixed bug where distinct_on included columns that were not in the input table, and were thus incorrectly taken from the LEFT JOINed output table
06/12/2012
- 10:32 PM Revision 2810: sql.py: track_data_error(): Do nothing if cols are empty, because mk_track_data_error() requires at least one col. mk_track_data_error(): Assert that cols are not empty because VALUES clause requires at least one row.
- 10:26 PM Revision 2809: bin/map: by_col: Pass on_error to db_xml.put_table() that calls ex_tracker.track()
- 10:14 PM Revision 2808: sql.py: put_table(): No handler for exception: Pass exception to on_error() instead of raising a warning, so that error message can be formatted
- 10:08 PM Revision 2807: db_xml.py: put_table(): Pass on_error to sql.put_table()
- 10:07 PM Revision 2806: db_xml.py: put_table(): Take on_error param like row-based put()
- 10:06 PM Revision 2805: sql.py: put_table(): Take on_error param
- 09:48 PM Revision 2804: sql.py: get_cur_query(): Removed no longer used input_params parameter
- 09:46 PM Revision 2803: sql.py: Removed unused mogrify()
- 09:42 PM Revision 2802: sql.py: DbConn.DbCursor.execute(): Removed no longer used params parameter
- 09:32 PM Revision 2801: sql.py: with_autocommit(): Use isolation_level attr and set_isolation_level() method of connection instead of autocommit attr to support older versions of psycopg2
- 09:18 PM Revision 2800: sql.py: DbConn.DbCursor.execute(): Only fetch all rows for empty SELECT query, to support older versions of Python that would give a "no results to fetch" error for other types of queries
- 09:13 PM Revision 2799: csv2db: When reraising exception, use `raise` instead of `raise e` to preserve whole stack trace
- 09:12 PM Revision 2798: sql.py: Removed no longer used _query_lookup()
- 09:05 PM Revision 2797: sql.py: DbConn: Cache queries without params, as params are no longer used
- 09:03 PM Revision 2796: sql.py: DbConn.is_cached(): Removed no longer used params parameter
- 09:01 PM Revision 2795: sql.py: Removed no longer used run_raw_query()
- 09:00 PM Revision 2794: sql.py: run_query(): Call db.run_query() directly instead of via run_raw_query()
- 08:56 PM Revision 2793: sql.py: DbConn.run_query(): Removed no longer used params parameter
- 08:50 PM Revision 2792: sql.py: DbConn._db(): Setting search_path: Use esc_value() instead of params
- 08:43 PM Revision 2791: sql.py: run_query(): Removed no longer used params parameter
- 08:37 PM Revision 2790: sql.py: run_query_into(): Moved main case (into != None) outside of if statement because the special-case if statement contains `return`
- 08:35 PM Revision 2789: sql.py: run_query_into(): Removed no longer used params parameter
- 08:32 PM Revision 2788: sql.py: mk_insert_select(): Removed no longer used params parameter
- 08:25 PM Revision 2787: sql.py: mk_insert_select(): Return just the query instead of the query plus empty params
- 08:22 PM Revision 2786: sql.py: mk_select(): Return just the query instead of the query plus empty params
- 08:12 PM Revision 2785: sql.py: tables(): Use select() instead of a custom run_query() to avoid using params, which will be deprecated to make it easier to support old versions of Python
- 08:11 PM Revision 2784: sql.py: DbConn.DbCursor.execute(): Require that params are empty, to ensure that code uses db.esc_value() instead. This keeps literal values in the same place as the rest of the query, so that they do not need to be maintained and passed around separately in a params list.
- 08:06 PM Revision 2783: sql.py: constraint_cols(): Use db.esc_value() instead of params
- 08:05 PM Revision 2782: sql.py: index_cols(): Use db.esc_value() instead of params
- 08:00 PM Revision 2781: sql.py: add_pkey(): Use simpler `ADD PRIMARY KEY` syntax to avoid having to create a name for the primary key
- 07:57 PM Revision 2780: db_xml.py: put_table(): Subsetting in_table: Add pkey to created temp table to facilitate joining it with intermediate tables
- 07:47 PM Revision 2779: sql.py: tables(): Use select() instead of a custom run_query() to avoid using params, which will be deprecated to make it easier to support old versions of Python
- 07:14 PM Revision 2778: schemas/postgresql.nimoy.conf: shared_buffers: Fixed syntax error where decimals were not supported
- 07:03 PM Revision 2777: sql.py: truncate(): Re-added support for string tables using sql_gen.as_Table(). This fixes empty_db(), which relied on this functionality.
- 07:01 PM Revision 2776: sql_gen.py: as_Table(): Added schema param to use as default schema
- 06:54 PM Revision 2775: inputs/SALVIAS: Switched to using CSV exports of the DB, so that staging tables could be created for column-based import
- 06:48 PM Revision 2774: sql.py: run_query_into(): Added add_indexes_ param which causes the function to add indexes on the created table
- 06:18 PM Revision 2773: sql.py: create_table(): Use new add_indexes()
- 06:15 PM Revision 2772: sql.py: Added add_indexes()
- 05:38 PM Revision 2771: sql.py: get_cur_query(): Fixed bug where strings.ustr() needed to be used instead of str() when ensuring that get_cur_query() returns a string
- 05:35 PM Revision 2770: sql.py: cast(): Removed conditional checks for save_errors, since it's now always true if the function got passed the `not save_errors` special case
- 05:32 PM Revision 2769: sql.py: cast(): Only convert errors to warnings if errors will be saved in errors_table, so that import will always be aborted if user supplied invalid values in the mappings, even if these values are passed through a relational function
- 04:58 PM Revision 2768: sql.py: put_table(): Support inserting tables with all default values, by providing the pkey's default value for all rows so that the SELECT query has at least one column
- 04:57 PM Revision 2767: sql_gen.py: is_table_col(): Check that input is a Col object
- 04:31 PM Revision 2766: sql.py: put_table(): Assert that mapping is non-empty
- 04:23 PM Revision 2765: sql.py: mk_select(): Assert that fields list is non-empty
- 04:18 PM Revision 2764: sql.py: DbConn.DbCursor.execute(): Set _is_insert only if query *starts with* INSERT, so that function definitions containing INSERT are not cached as INSERT statements (exceptions only) themselves
- 04:10 PM Revision 2763: sql.py: DbConn.DbCursor.execute(): Fixed bug where params == None would apparently turn off the mogrifier completely, causing "%"s to be excessively escaped, by just setting params to None if it was [] or () and not using strings.esc_for_mogrify() at all
- 03:51 PM Revision 2762: sql.py: DbConn.DbCursor.execute(): If not using params, escape the query using strings.esc_for_mogrify() in case any literals contained "%"s
- 03:49 PM Revision 2761: strings.py: Added esc_for_mogrify()
- 03:05 PM Revision 2760: sql.py: create_table(): Add indexes on all non-pkey columns, unless turned off or deferred using new param col_indexes
- 02:46 PM Revision 2759: csv2db: Add column indexes on errors table. Use typed_cols and `.to_Col()` to iterate over columns to add indexes on, for the main and errors tables.
06/11/2012
- 11:13 PM Revision 2758: sql.py: Added track_data_error(). put_table(): ignore(): Take extra e param for the exception. Use track_data_error() to store the invalid value in the errors table.
- 11:11 PM Revision 2757: sql_gen.py: Join.to_str(): Add newline before and after right table if it's been renamed (and therefore takes up multiple lines)
- 11:10 PM Revision 2756: exc.py: ExceptionWithCause: Store the cause in an instance variable for later use
- 10:47 PM Revision 2755: sql.py: mk_track_data_error(): Rename the errors_table to make the generated SQL less verbose
- 10:42 PM Revision 2754: sql.py: mk_insert_select(): Run sql_gen.remove_table_rename() on table to get just the actual name in the DB
- 10:42 PM Revision 2753: sql_gen.py: Added remove_table_rename()
- 10:40 PM Revision 2752: sql_gen.py: Col: Run `.to_Table()` on table to get just the reference to the table, not any SQL code that defines it
- 10:10 PM Revision 2751: sql.py: Added mk_track_data_error() and use it in cast(). This also ensures that if only one source column's row in the CROSS JOIN violates a unique constraint, other source columns' rows are still inserted.
- 09:59 PM Revision 2750: sql_gen.py: with_default_table(): Added overwrite param to overwrite the table (if it isn't a NamedCol)
- 09:57 PM Revision 2749: sql_gen.py: Join.to_str(): join(): Get just the table name of left_table and right_table using `.to_Table()`. Moved order switching of tables inside join() because the order reversing only applies to an individual condition.
- 09:34 PM Revision 2748: sql_gen.py: Renamed set_default_table() to with_default_table() and copy col before modifying it so don't modify input
- 09:14 PM Revision 2747: sql_gen.py: Added set_default_table(). as_ValueCond(): Use set_default_table() instead of as_Col() so that any name-only column also gets its table set. Join.to_str(): Parse left side using set_default_table() instead of as_Col() so that any name-only column also gets its table set.
- 07:31 PM Revision 2746: sql_gen.py: Join: mapping param defaults to {} for e.g. CROSS JOINs. to_str(): Omit join_cond if mapping is empty, rather than if join is a specific type.
- 07:24 PM Revision 2745: sql_gen.py: NamedValues: Change cols to Col objects with the table set to `name`
- 07:23 PM Revision 2744: sql_gen.py: Added set_cols_table()
- 07:00 PM Revision 2743: sql.py: mk_insert_select(): returning: Use sql_gen.to_name_only_col()
- 06:52 PM Revision 2742: sql_gen.py: NamedTable: cols: Use sql_gen.Col objects or name strings instead of pre-rendered SQL code
- 06:35 PM Revision 2741: sql_gen.py: NamedTable: Wrap nested code in Expr if needed
- 06:27 PM Revision 2740: sql_gen.py: Added NamedValues
- 06:04 PM Revision 2739: sql_gen.py: Values: Support multiple rows
- 05:54 PM Revision 2738: sql.py: insert(): Use new sql_gen.Values
- 05:53 PM Revision 2737: sql_gen.py: Added Values and default
- 05:26 PM Revision 2736: sql_gen.py: Join.to_str(): Don't add join condition for CROSS JOINs
- 05:03 PM Revision 2735: sql.py: put_table(): Factored out errors_table name setting so it can be used by ignore()
- 04:38 PM Revision 2734: bin/map: If doing full import, clear errors table
- 04:37 PM Revision 2733: sql.py: truncate(): Support sql_gen.Table objects
- 04:21 PM Revision 2732: sql.py: Moved truncate() to Database structure queries section
- 04:11 PM Revision 2731: sql.py: tables(): Run query with log_level=4 because it's a low-level structure-determining query
- 04:07 PM Revision 2730: sql.py: table_exists(): Use new tables() exact param so that LIKE special chars in table name are not interpreted specially
- 04:07 PM Revision 2729: sql.py: tables(): Added exact param to check for exact matches only
- 04:03 PM Revision 2728: sql.py: put_table(): MissingCastException: Use new errors_table()
- 04:02 PM Revision 2727: csv2db: Use new sql.errors_table()
- 04:02 PM Revision 2726: sql.py: Added table_exists() and errors_table()
Also available in: Atom