Project

General

Profile

Statistics
| Revision:

# Date Author Comment
3335 07/11/2012 08:02 PM Aaron Marcuse-Kubitza

xml_func.py: process(): Merge mergeable siblings recursively so that newly-mergeable children are also merged

3334 07/11/2012 07:44 PM Aaron Marcuse-Kubitza

xml_dom.py: Removed no longer used merge_same_name()

3333 07/11/2012 07:43 PM Aaron Marcuse-Kubitza

xml_func.py: process(): Recombining pieces of nodes that were split apart in the mappings: Only combine nodes that are explicitly marked as mergeable, to avoid unwanted merges. Refactored to use new xml_dom.merge_by_name().

3332 07/11/2012 07:39 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Marked aggregateoccurrence as mergeable, in preparation for switching to explicit merging only to avoid unwanted merges

3331 07/11/2012 07:33 PM Aaron Marcuse-Kubitza

xml_dom.py: Moved merge_adjacent() recursion and name checking into merge() so that other callers of merge() can take advantage of it, too. Added merge_by_name() and use it in merge_same_name(). Removed now-unused merge_adjacent().

3330 07/11/2012 07:30 PM Aaron Marcuse-Kubitza

xpath.py: get(): Support checking all children when the elem name is '*'

3329 07/11/2012 07:01 PM Aaron Marcuse-Kubitza

xml_dom.py: replace(): Assert that old node not removed from parent tree

3328 07/11/2012 06:44 PM Aaron Marcuse-Kubitza

xpath.py: Added is_all()

3327 07/11/2012 06:17 PM Aaron Marcuse-Kubitza

xml_func.py: process(): List out the xml_dom.NodeElemIter iterator so that it won't be affected by concurrent changes to the DOM tree

3326 07/11/2012 05:32 PM Aaron Marcuse-Kubitza

xml_func.py: process(): Merging children: Support merging non-adjacent children by using new xml_dom.merge_same_name(). This should eliminate the need to sort children of the same name next to each other in the mappings so that they will be merged.

3325 07/11/2012 05:30 PM Aaron Marcuse-Kubitza

xml_dom.py: Added merge_same_name()

3324 07/11/2012 05:15 PM Aaron Marcuse-Kubitza

bin/map: ex_tracker: Don't add row_ct to iters count in column-based import (by_col) because errors are not done by row, so a % of rows affected is not meaningful

3323 07/11/2012 05:13 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): NullValueException: Missing mapping for NOT NULL column: Raise error if there is no default input column to fill in the rows with. Missing mappings for required columns are a normal circumstance with path mappings (plantname, namedplace) because the mappings include intervening levels that may not be specified by the datasource. These mappings have default input columns for the intervening levels, so for them, this will continue to not be flagged as an error.

3322 07/11/2012 04:55 PM Aaron Marcuse-Kubitza

sql.py: index_cols(): Fixed bug where index columns were sorted by their position in the table, not their position in the index. This was causing the order of columns in covering indexes not to match their order in the corresponding output table index, preventing merge joins and slowing down joins because of the need to do a seq scan. Refactored to use pg_get_indexdef() on each column position index and parsing the result with parse_expr_col(), which greatly simplifies the query and automatically provides the correct sort order.

3321 07/11/2012 04:50 PM Aaron Marcuse-Kubitza

sql.py: Added parse_expr_col()

3320 07/11/2012 04:50 PM Aaron Marcuse-Kubitza

sql_gen.py: Added unesc_name()

3319 07/11/2012 04:02 PM Aaron Marcuse-Kubitza

sql.py: index_cols(): Removed unneeded table param, since the index name is unique within its schema. (The schema is looked up in the search_path if needed, by using ::regclass.)

3318 07/11/2012 03:59 PM Aaron Marcuse-Kubitza

sql.py: index_cols(): Fixed bug where index was looked up only by name, not by schema-qualified name, causing columns from other schemas (such as previous versions of public) to be included in the index columns list. This problem is only noticeable when a column in an index is deleted, as locationevent.datasource_id in locationevent_unique_accessioncode was recently.

3317 07/10/2012 09:53 PM Aaron Marcuse-Kubitza

input.Makefile: Name logs with extension .log.sql so they are syntax-highlighted for the SQL statements they contain

3316 07/10/2012 09:24 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: taxonoccurrence: Added sourceaccessioncode so that taxonoccurrences can be uniquely identified in plots data

3315 07/10/2012 09:12 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Removed datasource_id because locationevents are now scoped by their required location, which itself is scoped by datasource

3314 07/10/2012 09:07 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Removed locationevent.datasource_id mappings because locationevents are now scoped by their required location, which itself is scoped by datasource

3313 07/10/2012 08:42 PM Aaron Marcuse-Kubitza

sql.py: distinct_table(): Don't sort the inserted rows by pkey because they should stay in the table order that they were in. (The select order with no ORDER BY should be the table order. Even if it isn't, it doesn't matter what order they are in for our current application.)

3312 07/10/2012 08:38 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Creating an empty pkeys table: Don't sort the inserted result by pkey because it's empty (limit=0)

3311 07/10/2012 08:32 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): ignore(): Fixed bug where in_col's table needed to be changed to insert_in_table, because it's insert_in_table's rows that are being modified but mapping (which in_col comes from) qualifies columns by in_table

3310 07/10/2012 08:28 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): ignore(): Also add an index on in_col if mapping the value to NULL

3309 07/10/2012 08:28 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): ignore(): Only delete from the insert_in_table, because the invalid rows only need to be removed from the rows that are actually being inserted into the DB. If there are invalid rows in the full (not uniquified) in_table, that's OK, as they can still get a valid output pkey if the first copy of a row they were considered a duplicate of is valid (this is a very unusual situation, so this change should not affect most real data).

3308 07/10/2012 08:22 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): ignore(): Merged filter_ var into sql.delete() call because that's the only place it's used

3307 07/10/2012 08:18 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): insert_into_pkeys(): Removed no longer used distinct param

3306 07/10/2012 08:16 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Getting output table pkeys of existing/inserted rows: Don't DISTINCT ON the joined rows by input pkey, because this adds sorting overhead. This should not be needed because there generally should not be any duplicate rows for the columns in a unique index (if there are, this is an index configuration problem and should be fixed in the schema). It's possible that partial indexes (with a filter condition) were causing this, but testing without it in place will be needed to determine the cause.

3305 07/10/2012 08:02 PM Aaron Marcuse-Kubitza

sql.py: flatten(): Auto-add a pkey on the created temp table. This should be standard practice for most temp tables, and for sql_io.put_table() especially this will be useful if we ever want to add back sorting the in_table by row_num (possibly by CLUSTERing on the pkey to avoid pkey index scans).

3304 07/10/2012 07:54 PM Aaron Marcuse-Kubitza

sql.run_query_into() calls: Use new add_pkey_ param instead of manually calling sql.add_pkey()

3303 07/10/2012 07:53 PM Aaron Marcuse-Kubitza

sql.py: run_query_into(): Changed add_indexes_ param to add_pkey_ and add just a pkey if it's set. It's no longer necessary to create indexes on every column of a temp table, because the covering indexes for the join columns have been fixed to have columns in the same order as the output table's corresponding index so that they can be used for a merge join.

3302 07/10/2012 07:41 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Add pkey on pkeys table right when it's created, so that any duplicates are detected right away instead of at the end of the iteration. (Duplicates are created as a result of joins matching multiple rows, which often indicates a database misconfiguration.)

3301 07/10/2012 07:34 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Adding pkey on pkeys table: Removed log message because adding an index is considered a low-level operation, which isn't included in the Redmine SQL

3300 07/10/2012 07:27 PM Aaron Marcuse-Kubitza

schemas/tree_cross-links.sql: Ancestors table: Synced with current definition, which removes unneeded fki_* indexes. Note that the index on ancestor_id might be needed in the future if we ever want to get all the descendants of a plantname/namedplace or perform deletions on plantname/namedplace (which cascade to *_ancestor). For getting all the plantnames/namedplaces (of any rank) for a plantconcept/locationdetermination, though, the *_ancestor_pkey index is sufficient because plantname_id/namedplace_id is the first column in it.

3299 07/10/2012 07:20 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: {plantname,namedplace}_update_ancestors(): Fixed slowdown due to removed index on {plantname,namedplace}.parent_id by adding COALESCE to enable using the plantname_unique index for the lookup instead

3298 07/10/2012 06:26 PM Aaron Marcuse-Kubitza

sql.mk_select() calls: Removed no longer needed start=0 to turn off missing WHERE, LIMIT, or OFFSET clause warning

3297 07/10/2012 06:21 PM Aaron Marcuse-Kubitza

sql.py: mk_select(): Don't output warning if there is no WHERE, LIMIT, or OFFSET clause, because column-based import has many queries where this is the case and it's annoying to need to specify start=0 to turn off this warning

3296 07/10/2012 06:04 PM Aaron Marcuse-Kubitza

sql.py: flatten(): Don't sort the input tables by the pkey because it doesn't matter what order the datasource's rows are inserted in. Note that PostgreSQL doesn't guarantee the order of rows in a table, so it is possible that the rows were being inserted in an unknown order before this change, as well.

3295 07/10/2012 05:58 PM Aaron Marcuse-Kubitza

sql.py: delete(): Cache deletes by default

3294 07/10/2012 05:56 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Merged remove_rows() and invalid2null() into one ignore() function that chooses the action (map to NULL or delete) depending on the value and whether NULLs have been filtered out of the column

3293 07/10/2012 05:46 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): remove_rows(): Delete the rows containing the invalid value instead of filtering them out of each select, so that the filtering can be profiled separately from the insertion. This also requires deleting rows with invalid non-NULL values instead of mapping them to NULL if NULLs have already been filtered out of the column in question.

3292 07/10/2012 05:34 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Main insert: Don't run it inside an extra savepoint, because this will cause the creation of any helper SQL functions to be rolled back if an exception is thrown. If those functions are later re-used, the cache will think they exist when they no longer do. (Calling a function on input rows is now run in recover mode, so that it doesn't need the outer savepoint anymore.)

3291 07/10/2012 05:30 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): main_insert(): Moved code that is not part of the main query outside the function, so it wouldn't be subject to the exception handling. Preparing to insert new rows: Only do the preparation code for insert_select() if the out_table is not a function.

3290 07/10/2012 05:19 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): main_insert(): is_function: Run insert_into_pkeys() with recover=True so that errors in the function are properly rolled back

3289 07/10/2012 05:18 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): insert_into_pkeys(): Support custom query kw_args, such as recover

3288 07/10/2012 04:41 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Use full_in_table in the into table row count assertion, since in_table may have rows deleted

3287 07/10/2012 04:36 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Save default values for all rows in new temp table full_in_table since in_table may have rows deleted

3286 07/10/2012 04:13 PM Aaron Marcuse-Kubitza

sql.py: Added mk_delete() and delete()

3285 07/10/2012 03:36 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): mk_main_select(): Turned off unnecessary ORDER BY to avoid sorting the entire table every time it's used. (PostgreSQL has no concept of reordering a table and re-using that ordering, so it just re-sorts the table each time. Index scans on the pkey do not appear to be used in practice, according to EXPLAIN results from live imports.) Document that we instead assume that identical SELECT queries retrieve rows in the same order.

3284 07/10/2012 01:56 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: taxondetermination: Fixed bug where taxondetermination_taxonoccurrence_id_fkey trigger was applied before the NOT NULL constraint on taxonoccurrence_id was checked, causing the trigger to fail on NULL taxonoccurrence_ids, by making it an AFTER trigger. (An AFTER trigger will still roll back the entire insert if it fails, even though it runs after the insert itself.)

3283 07/09/2012 05:45 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: institution_id: Fixed typo in comment

3282 07/09/2012 05:26 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Fixed date for most recent import

3281 07/09/2012 05:26 PM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Put the data source comment on a separate line in the log file instead of using a carriage return, which sometimes had the desired effect of overwriting the src comment with the first line of the query but sometimes the line lengths weren't right and there wasn't enough overlap

3280 07/09/2012 04:53 PM Aaron Marcuse-Kubitza

schemas/vegbien.ERD.mwb: Synced with schema

3279 07/09/2012 04:42 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: Removed per-column indexes, which are no longer needed by either row-based or column-based import because they are able to do a merge join or lookup using the table's UNIQUE INDEX. Instead of forcing the database to build and maintain large indexes (15+ GB!) that are not used, optimization-only (non-UNIQUE) indexes should be added as needed only once the database is actually used for queries. In most cases it will not even be necessary to add additional indexes then, because most UNIQUE indexes can be reused for broad lookups (rather than just duplicate elimination). Even the foreign key covering indexes (fki_*) are not needed because we virtually never delete rows in the DB, and even if we were to start doing that regularly, the cost of maintaining the indexes on import is most likely not worth the speed improvements for cascading deletes.

3278 07/09/2012 04:32 PM Aaron Marcuse-Kubitza

schemas/py_functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.)

3277 07/09/2012 04:31 PM Aaron Marcuse-Kubitza

schemas/functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.)

3276 07/09/2012 04:26 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: plantname: plantname_unique UNIQUE INDEX: Moved scope_id to the back so that the index can easily be used for lookup queries (not just column-based import) without having to explicitly specify NULL for that field. This takes advantage of a btree sorting feature where a broader lookup can be done using just the first n columns of the index.

3275 07/09/2012 04:15 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent, specimenreplicate: Turned UNIQUE CONSTRAINTs and UNIQUE INDEXes with nullable fields into partial UNIQUE INDEXes with IS NOT NULL filter conditions, in order to work automatically with sql_gen without requiring a separate covering lookup index. Removed no longer needed covering lookup indexes.

3274 07/09/2012 03:07 PM Aaron Marcuse-Kubitza

sql_io.py: put_table(): DuplicateKeyException: Fixed bug where combining multiple unique constraints was incorrectly allowed, when in fact the constraints need to be separately applied to the different rows that violate them, which is not currently supported

3273 07/09/2012 03:02 PM Aaron Marcuse-Kubitza

sql.py: DbConn.with_savepoint(): Log transaction profiling info with level=4 like the rest of the transaction commands, so that it isn't output when the transaction itself should be hidden (e.g. for name versioning or internal commands)

3272 07/09/2012 02:16 PM Aaron Marcuse-Kubitza

sql.py: DbConn.with_savepoint(): Profile (nested) transactions so that the run time for groups of commands (e.g. csv2db INSERTs) is known

3271 07/09/2012 02:04 PM Aaron Marcuse-Kubitza

csv2db: verbosity defaults to 3 so that detailed queries with profiling stats are included in the log file, to assist in optimization

3270 07/09/2012 02:01 PM Aaron Marcuse-Kubitza

csv2db: Don't cache per-row INSERT queries because this bloats the cache (there aren't repeated identical INSERTs that shouldn't be re-run like in row-based import)

3269 07/09/2012 01:57 PM Aaron Marcuse-Kubitza

sql.py with_explain_comment(), DbConn: Fixed bug where with_explain_comment() was being run in per-row imports (row-based import and csv2db with INSERT), causing the overhead of an EXPLAIN query for every single INSERT and filling up the cache with EXPLAIN query results, by adding autoexplain mode, only running with_explain_comment() in autoexplain mode, and only enabling autoexplain mode for column-based import

3268 07/09/2012 01:11 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Turn on autoanalyze mode to help the query planner avoid sequential scans on tables that now contain data. (Don't do this in row-based import because it creates too much overhead per insert.)

3267 07/09/2012 12:24 PM Aaron Marcuse-Kubitza

sql.py: Run all EXPLAIN queries with log_level=4 since the EXPLAIN information is now usually generated when the query is generated rather than when it's run, so the log_level is not known

3266 07/09/2012 12:21 PM Aaron Marcuse-Kubitza

sql.py: Added with_explain_comment() to query generating functions so that nested queries will also have EXPLAIN information

3265 07/09/2012 12:11 PM Aaron Marcuse-Kubitza

sql.py: Added with_explain_comment() and use it in run_query()

3264 07/09/2012 12:01 PM Aaron Marcuse-Kubitza

sql.py: run_query(): EXPLAIN output: Run explain() with log_level 1 higher than the query's log_level, so that low-level queries' EXPLAIN queries are not output when the queries themselves are not output. This also ensures that only level 2 (major) queries have the EXPLAIN logged (to introduce the query that is being run), to avoid cluttering the log output.

3263 07/09/2012 11:54 AM Aaron Marcuse-Kubitza

sql.py: explain(): Support custom log_level

3262 07/09/2012 11:48 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: taxondetermination: taxondetermination_taxonoccurrence_id_fkey manual fkey constraint: Fixed bug where needed to raise foreign_key_violation instead of unique_violation

3261 07/09/2012 11:23 AM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with stats from latest import

3260 07/06/2012 04:43 PM Aaron Marcuse-Kubitza

debug2redmine.csv: Remove newline before EXPLAIN comment

3259 07/06/2012 04:33 PM Aaron Marcuse-Kubitza

debug2redmine.csv: Filter out EXPLAIN comments

3258 07/06/2012 04:29 PM Aaron Marcuse-Kubitza

sql.py: run_query(): EXPLAIN all explainable queries before they are run, to provide query plans for later profiling and index analysis. At verbosity 3+, this also effectively allows the user to see what query is being run before it's executed.

3257 07/06/2012 04:26 PM Aaron Marcuse-Kubitza

sql.py: is_explainable(): Fixed bug where needed r'' syntax to escape \ in \b

3256 07/06/2012 04:23 PM Aaron Marcuse-Kubitza

sql.py: Added explain() and is_explainable()

3255 07/06/2012 04:19 PM Aaron Marcuse-Kubitza

strings.py: Added join_lines()

3254 07/06/2012 02:50 PM Aaron Marcuse-Kubitza

mk_rm_indexes: Also include the search_path in the outputted commands

3253 07/06/2012 02:45 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: commclass: Fixed bug where commclass_unique needed to be a UNIQUE INDEX

3252 07/06/2012 02:42 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: plantname: Removed unneeded indexes on plantname and rank (plantname_unique takes care of joins)

3251 07/06/2012 02:33 PM Aaron Marcuse-Kubitza

pg_dump_vegbien: Enclose the schema name in "" because pg_dump requires this for schema names with special characters

3250 07/06/2012 02:09 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with stats from 2012-7-3 and 2012-7-5 imports. Note that the 2012-7-5 import was partial, so its stats can't be directly compared.

3249 07/06/2012 01:28 PM Aaron Marcuse-Kubitza

root Makefile: VegBIEN DB: Schemas: Added schemas/%/rm_indexes

3248 07/06/2012 01:27 PM Aaron Marcuse-Kubitza

Added mk_rm_indexes

3247 07/06/2012 11:14 AM Aaron Marcuse-Kubitza

sql.py: Added drop() and use it in drop_table()

3246 07/06/2012 10:59 AM Aaron Marcuse-Kubitza

debug2redmine: Remove profiling information from the logging output

3245 07/06/2012 10:43 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Only print notices in debug mode, because they are output with a log level higher than the debug verbosity threshold, and this avoid unnecessary overhead

3244 07/06/2012 10:41 AM Aaron Marcuse-Kubitza

sql.py: DbConn: Added profile_row_ct setting, which is passed to profiler.stop() in run_query()

3243 07/06/2012 10:38 AM Aaron Marcuse-Kubitza

bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead

3242 07/06/2012 10:34 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Only profile queries in debug mode, to avoid unnecessary overhead when the run time will not be displayed

3241 07/06/2012 10:29 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Profile using the profiling.ItersProfiler class, which pretty-prints the run time

3240 07/06/2012 10:22 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Added profiling of query execution, which is logged with the query

3239 07/06/2012 09:26 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Move log_msg() to where it's used, so that it runs after the query is run and can refer to profiling variables

3238 07/06/2012 09:21 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Use else blocks to avoid applying exception handling to commands run after the main command

3237 07/06/2012 09:18 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Always output or return the log message after the query is run, so that it can be output with profiling statistics in the log message header

3236 07/06/2012 09:05 AM Aaron Marcuse-Kubitza

sql.py: run_query(): Always output the log message after the query is run, so that it can be output with profiling statistics in the log message header