/ - Changes - BIEN 3 - NCEAS Projects

root @ 3375

#	Date	Author	Comment
3375	07/13/2012 12:38 PM	Aaron Marcuse-Kubitza	inputs/Makefile: Added download to download input data from vegbiendev
3374	07/13/2012 10:05 AM	Aaron Marcuse-Kubitza	main Makefile: VegBIEN DB: Removing database warning message: Changed "public schema versions" to "archived imports" for clarity
3373	07/13/2012 10:01 AM	Aaron Marcuse-Kubitza	main Makefile: VegBIEN DB: Schemas: Added schemas/%.backup/restore and documented it in README.TXT
3372	07/13/2012 09:42 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with stats from latest import. Marked partial imports affecting the slowest datasource with a comment when they might otherwise appear to be comparable with full imports due to having a similar # of rows.
3371	07/13/2012 08:58 AM	Aaron Marcuse-Kubitza	root Makefile: schemas/%.backup: Removed _always prerequisite because an existing backup should not be overwritten by running `make schemas/%.backup`
3370	07/13/2012 08:53 AM	Aaron Marcuse-Kubitza	README.TXT: Updated warning messages to match what's actually output by the commands in question
3369	07/13/2012 08:47 AM	Aaron Marcuse-Kubitza	README.TXT: Documented how to back up an archived import
3368	07/12/2012 03:26 PM	Aaron Marcuse-Kubitza	root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Time the backup
3367	07/12/2012 03:25 PM	Aaron Marcuse-Kubitza	root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Renamed to the name of the file created (schemas/%.backup) so that the file will automatically be removed by make if you cancel the backup. If this didn't happen, the existing partial file would prevent make from backing it up again.
3366	07/12/2012 03:19 PM	Aaron Marcuse-Kubitza	schemas: Ignore *.backup files for svn
3365	07/12/2012 03:00 PM	Aaron Marcuse-Kubitza	root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Documented that it can't be used for the current (unrotated) public schema because pg_dump doesn't back up the CREATE SCHEMA statement for it
3364	07/12/2012 02:45 PM	Aaron Marcuse-Kubitza	root Makefile: VegBIEN DB: Schemas: schemas/%/backup: Changed file extension to .backup to match what pgAdmin3 expects
3363	07/12/2012 02:40 PM	Aaron Marcuse-Kubitza	inputs/SALVIAS/maps/VegX.stems.csv: Fixed PlotObsID mapping to point to sourceAccessionCode so that it would map to the same place as organisms' PlotObsID and the entries would be merged. This will hopefully also fix a SALVIAS stems import bug where the specimenreplicate join on existing/inserted rows matched multiple rows, in spite of using the specimenreplicate_unique_catalognumber UNIQUE INDEX.
3362	07/12/2012 02:35 PM	Aaron Marcuse-Kubitza	root Makefile: VegBIEN DB: Schemas: Added schemas/%/backup for backing up an import's schema
3361	07/12/2012 02:34 PM	Aaron Marcuse-Kubitza	pg_dump_vegbien: Added data option for backing up an import's schema
3360	07/12/2012 01:32 PM	Aaron Marcuse-Kubitza	sql_io.py: cast(): Don't output warnings for errors that are stored in the errors table, because this clutters the log output unnecessarily. Warnings are still output when there is no errors table.
3359	07/12/2012 01:20 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: party: party_unique UNIQUE INDEX: Moved surname before givenname, middlename to enable the index to be used when just the person's last name is known. (Note that if the organization is not known, and explicit NULL would need to be put in for that field to use the index.)
3358	07/12/2012 01:14 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): DuplicateKeyException: Uniquifying input table to avoid internal duplicate keys: Include all join columns (including literal values) in the unique index, not just table columns, to encourage PostgreSQL to use the index in a merge join.
3357	07/12/2012 01:11 PM	Aaron Marcuse-Kubitza	sql.py: distinct_table(): Support literal values as column expressions. Filter out untyped NULLs because the type unknown can't be indexed.
3356	07/12/2012 01:09 PM	Aaron Marcuse-Kubitza	sql.py: add_index(): Support literal values as expressions
3355	07/12/2012 01:06 PM	Aaron Marcuse-Kubitza	sql_gen.py: ensure_not_null(): Support Literal value inputs, which are passed through if not nullable
3354	07/12/2012 01:05 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added cast_literal()
3353	07/12/2012 11:52 AM	Aaron Marcuse-Kubitza	sql.py: Added simplify_expr() and use it in parse_expr_col(). This will remove extraneous information from the condition that gets included in the errors table, so it's easy for the data provider to see what constraint is being violated as it applies to their data.
3352	07/12/2012 11:40 AM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Handle CheckExceptions. This fixes a REMIB bug where some rows violated the specimenreplicate.specimenreplicate_required_key CHECK constraint, but there was previously no way to filter out just these rows so all rows were removed instead.
3351	07/12/2012 11:33 AM	Aaron Marcuse-Kubitza	sql.py: Added map_expr()
3350	07/12/2012 11:33 AM	Aaron Marcuse-Kubitza	sql_gen.py: Added NotCond
3349	07/12/2012 10:06 AM	Aaron Marcuse-Kubitza	sql.py: run_query(): ConstraintException: Get and include the constraint condition in the exception
3348	07/12/2012 10:05 AM	Aaron Marcuse-Kubitza	sql.py: Added constraint_cond()
3347	07/12/2012 09:26 AM	Aaron Marcuse-Kubitza	sql.py: run_query(): ConstraintException: Include table in the constraint name because constraints are scoped by table
3346	07/12/2012 09:18 AM	Aaron Marcuse-Kubitza	sql.py: Added ConstraintException and parse it in run_query()
3345	07/12/2012 09:12 AM	Aaron Marcuse-Kubitza	sql.py: ConstraintException: Also store the constraint's condition, if any
3344	07/12/2012 08:24 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: specimenreplicate_unique_catalognumber UNIQUE INDEX: Limited to rows where sourceaccessioncode IS NULL, so that it will only be used as an alternate if the datasource does not specify a sourceaccessioncode. This should fix the SpeciesLink bug where input rows with catalognumber_dwc IS NULL match multiple output rows when compared on the specimenreplicate_unique_catalognumber UNIQUE INDEX because they were excluded from the uniquifying by the index filter.
3343	07/12/2012 08:10 AM	Aaron Marcuse-Kubitza	schemas/vegbien.ERD.mwb: Synced with schema
3342	07/12/2012 08:06 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Added UNIQUE INDEX on plantobservation_id for direct vouchers in plots data
3341	07/12/2012 07:16 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with stats from latest import
3340	07/11/2012 10:36 PM	Aaron Marcuse-Kubitza	sql.py: Removed no longer used constraint_cols(). index_cols() can always be used for a UNIQUE constraint because a UNIQUE index is automatically created for it.
3339	07/11/2012 10:34 PM	Aaron Marcuse-Kubitza	bin/map: Fixed bug where errors table indexes could not be looked up using index_cols() because their schema was not in the search_path, by explicitly adding the in_schema at the end of the search_path. This is the only reason the in_schema needs to be in the search_path, but it's unavoidable because the "duplicate key value violates unique constraint" error does not included the constraint's schema.
3338	07/11/2012 10:24 PM	Aaron Marcuse-Kubitza	sql.py: run_query(): DuplicateKeyException: Don't parse table name out of constraint name because it isn't used and can't reliably be determined for table names containing "_"
3337	07/11/2012 09:10 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Reattached location.datasource_id :[] mappings directly to the location itself, rather than to the location via the locationevent
3336	07/11/2012 08:37 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Mapped to taxonoccurrence.sourceaccessioncode to avoid underconstraining taxonoccurrence
3335	07/11/2012 08:02 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Merge mergeable siblings recursively so that newly-mergeable children are also merged
3334	07/11/2012 07:44 PM	Aaron Marcuse-Kubitza	xml_dom.py: Removed no longer used merge_same_name()
3333	07/11/2012 07:43 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Recombining pieces of nodes that were split apart in the mappings: Only combine nodes that are explicitly marked as mergeable, to avoid unwanted merges. Refactored to use new xml_dom.merge_by_name().
3332	07/11/2012 07:39 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Marked aggregateoccurrence as mergeable, in preparation for switching to explicit merging only to avoid unwanted merges
3331	07/11/2012 07:33 PM	Aaron Marcuse-Kubitza	xml_dom.py: Moved merge_adjacent() recursion and name checking into merge() so that other callers of merge() can take advantage of it, too. Added merge_by_name() and use it in merge_same_name(). Removed now-unused merge_adjacent().
3330	07/11/2012 07:30 PM	Aaron Marcuse-Kubitza	xpath.py: get(): Support checking all children when the elem name is '*'
3329	07/11/2012 07:01 PM	Aaron Marcuse-Kubitza	xml_dom.py: replace(): Assert that old node not removed from parent tree
3328	07/11/2012 06:44 PM	Aaron Marcuse-Kubitza	xpath.py: Added is_all()
3327	07/11/2012 06:17 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): List out the xml_dom.NodeElemIter iterator so that it won't be affected by concurrent changes to the DOM tree
3326	07/11/2012 05:32 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Merging children: Support merging non-adjacent children by using new xml_dom.merge_same_name(). This should eliminate the need to sort children of the same name next to each other in the mappings so that they will be merged.
3325	07/11/2012 05:30 PM	Aaron Marcuse-Kubitza	xml_dom.py: Added merge_same_name()
3324	07/11/2012 05:15 PM	Aaron Marcuse-Kubitza	bin/map: ex_tracker: Don't add row_ct to iters count in column-based import (by_col) because errors are not done by row, so a % of rows affected is not meaningful
3323	07/11/2012 05:13 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): NullValueException: Missing mapping for NOT NULL column: Raise error if there is no default input column to fill in the rows with. Missing mappings for required columns are a normal circumstance with path mappings (plantname, namedplace) because the mappings include intervening levels that may not be specified by the datasource. These mappings have default input columns for the intervening levels, so for them, this will continue to not be flagged as an error.
3322	07/11/2012 04:55 PM	Aaron Marcuse-Kubitza	sql.py: index_cols(): Fixed bug where index columns were sorted by their position in the table, not their position in the index. This was causing the order of columns in covering indexes not to match their order in the corresponding output table index, preventing merge joins and slowing down joins because of the need to do a seq scan. Refactored to use pg_get_indexdef() on each column position index and parsing the result with parse_expr_col(), which greatly simplifies the query and automatically provides the correct sort order.
3321	07/11/2012 04:50 PM	Aaron Marcuse-Kubitza	sql.py: Added parse_expr_col()
3320	07/11/2012 04:50 PM	Aaron Marcuse-Kubitza	sql_gen.py: Added unesc_name()
3319	07/11/2012 04:02 PM	Aaron Marcuse-Kubitza	sql.py: index_cols(): Removed unneeded table param, since the index name is unique within its schema. (The schema is looked up in the search_path if needed, by using ::regclass.)
3318	07/11/2012 03:59 PM	Aaron Marcuse-Kubitza	sql.py: index_cols(): Fixed bug where index was looked up only by name, not by schema-qualified name, causing columns from other schemas (such as previous versions of public) to be included in the index columns list. This problem is only noticeable when a column in an index is deleted, as locationevent.datasource_id in locationevent_unique_accessioncode was recently.
3317	07/10/2012 09:53 PM	Aaron Marcuse-Kubitza	input.Makefile: Name logs with extension .log.sql so they are syntax-highlighted for the SQL statements they contain
3316	07/10/2012 09:24 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: taxonoccurrence: Added sourceaccessioncode so that taxonoccurrences can be uniquely identified in plots data
3315	07/10/2012 09:12 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Removed datasource_id because locationevents are now scoped by their required location, which itself is scoped by datasource
3314	07/10/2012 09:07 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Removed locationevent.datasource_id mappings because locationevents are now scoped by their required location, which itself is scoped by datasource
3313	07/10/2012 08:42 PM	Aaron Marcuse-Kubitza	sql.py: distinct_table(): Don't sort the inserted rows by pkey because they should stay in the table order that they were in. (The select order with no ORDER BY should be the table order. Even if it isn't, it doesn't matter what order they are in for our current application.)
3312	07/10/2012 08:38 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Creating an empty pkeys table: Don't sort the inserted result by pkey because it's empty (limit=0)
3311	07/10/2012 08:32 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): ignore(): Fixed bug where in_col's table needed to be changed to insert_in_table, because it's insert_in_table's rows that are being modified but mapping (which in_col comes from) qualifies columns by in_table
3310	07/10/2012 08:28 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): ignore(): Also add an index on in_col if mapping the value to NULL
3309	07/10/2012 08:28 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): ignore(): Only delete from the insert_in_table, because the invalid rows only need to be removed from the rows that are actually being inserted into the DB. If there are invalid rows in the full (not uniquified) in_table, that's OK, as they can still get a valid output pkey if the first copy of a row they were considered a duplicate of is valid (this is a very unusual situation, so this change should not affect most real data).
3308	07/10/2012 08:22 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): ignore(): Merged filter_ var into sql.delete() call because that's the only place it's used
3307	07/10/2012 08:18 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): insert_into_pkeys(): Removed no longer used distinct param
3306	07/10/2012 08:16 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Getting output table pkeys of existing/inserted rows: Don't DISTINCT ON the joined rows by input pkey, because this adds sorting overhead. This should not be needed because there generally should not be any duplicate rows for the columns in a unique index (if there are, this is an index configuration problem and should be fixed in the schema). It's possible that partial indexes (with a filter condition) were causing this, but testing without it in place will be needed to determine the cause.
3305	07/10/2012 08:02 PM	Aaron Marcuse-Kubitza	sql.py: flatten(): Auto-add a pkey on the created temp table. This should be standard practice for most temp tables, and for sql_io.put_table() especially this will be useful if we ever want to add back sorting the in_table by row_num (possibly by CLUSTERing on the pkey to avoid pkey index scans).
3304	07/10/2012 07:54 PM	Aaron Marcuse-Kubitza	sql.run_query_into() calls: Use new add_pkey_ param instead of manually calling sql.add_pkey()
3303	07/10/2012 07:53 PM	Aaron Marcuse-Kubitza	sql.py: run_query_into(): Changed add_indexes_ param to add_pkey_ and add just a pkey if it's set. It's no longer necessary to create indexes on every column of a temp table, because the covering indexes for the join columns have been fixed to have columns in the same order as the output table's corresponding index so that they can be used for a merge join.
3302	07/10/2012 07:41 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Add pkey on pkeys table right when it's created, so that any duplicates are detected right away instead of at the end of the iteration. (Duplicates are created as a result of joins matching multiple rows, which often indicates a database misconfiguration.)
3301	07/10/2012 07:34 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Adding pkey on pkeys table: Removed log message because adding an index is considered a low-level operation, which isn't included in the Redmine SQL
3300	07/10/2012 07:27 PM	Aaron Marcuse-Kubitza	schemas/tree_cross-links.sql: Ancestors table: Synced with current definition, which removes unneeded fki_* indexes. Note that the index on ancestor_id might be needed in the future if we ever want to get all the descendants of a plantname/namedplace or perform deletions on plantname/namedplace (which cascade to _ancestor). For getting all the plantnames/namedplaces (of any rank) for a plantconcept/locationdetermination, though, the _ancestor_pkey index is sufficient because plantname_id/namedplace_id is the first column in it.
3299	07/10/2012 07:20 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: {plantname,namedplace}_update_ancestors(): Fixed slowdown due to removed index on {plantname,namedplace}.parent_id by adding COALESCE to enable using the plantname_unique index for the lookup instead
3298	07/10/2012 06:26 PM	Aaron Marcuse-Kubitza	sql.mk_select() calls: Removed no longer needed start=0 to turn off missing WHERE, LIMIT, or OFFSET clause warning
3297	07/10/2012 06:21 PM	Aaron Marcuse-Kubitza	sql.py: mk_select(): Don't output warning if there is no WHERE, LIMIT, or OFFSET clause, because column-based import has many queries where this is the case and it's annoying to need to specify start=0 to turn off this warning
3296	07/10/2012 06:04 PM	Aaron Marcuse-Kubitza	sql.py: flatten(): Don't sort the input tables by the pkey because it doesn't matter what order the datasource's rows are inserted in. Note that PostgreSQL doesn't guarantee the order of rows in a table, so it is possible that the rows were being inserted in an unknown order before this change, as well.
3295	07/10/2012 05:58 PM	Aaron Marcuse-Kubitza	sql.py: delete(): Cache deletes by default
3294	07/10/2012 05:56 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Merged remove_rows() and invalid2null() into one ignore() function that chooses the action (map to NULL or delete) depending on the value and whether NULLs have been filtered out of the column
3293	07/10/2012 05:46 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): remove_rows(): Delete the rows containing the invalid value instead of filtering them out of each select, so that the filtering can be profiled separately from the insertion. This also requires deleting rows with invalid non-NULL values instead of mapping them to NULL if NULLs have already been filtered out of the column in question.
3292	07/10/2012 05:34 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Main insert: Don't run it inside an extra savepoint, because this will cause the creation of any helper SQL functions to be rolled back if an exception is thrown. If those functions are later re-used, the cache will think they exist when they no longer do. (Calling a function on input rows is now run in recover mode, so that it doesn't need the outer savepoint anymore.)
3291	07/10/2012 05:30 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): main_insert(): Moved code that is not part of the main query outside the function, so it wouldn't be subject to the exception handling. Preparing to insert new rows: Only do the preparation code for insert_select() if the out_table is not a function.
3290	07/10/2012 05:19 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): main_insert(): is_function: Run insert_into_pkeys() with recover=True so that errors in the function are properly rolled back
3289	07/10/2012 05:18 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): insert_into_pkeys(): Support custom query kw_args, such as recover
3288	07/10/2012 04:41 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Use full_in_table in the into table row count assertion, since in_table may have rows deleted
3287	07/10/2012 04:36 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Save default values for all rows in new temp table full_in_table since in_table may have rows deleted
3286	07/10/2012 04:13 PM	Aaron Marcuse-Kubitza	sql.py: Added mk_delete() and delete()
3285	07/10/2012 03:36 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): mk_main_select(): Turned off unnecessary ORDER BY to avoid sorting the entire table every time it's used. (PostgreSQL has no concept of reordering a table and re-using that ordering, so it just re-sorts the table each time. Index scans on the pkey do not appear to be used in practice, according to EXPLAIN results from live imports.) Document that we instead assume that identical SELECT queries retrieve rows in the same order.
3284	07/10/2012 01:56 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: taxondetermination: Fixed bug where taxondetermination_taxonoccurrence_id_fkey trigger was applied before the NOT NULL constraint on taxonoccurrence_id was checked, causing the trigger to fail on NULL taxonoccurrence_ids, by making it an AFTER trigger. (An AFTER trigger will still roll back the entire insert if it fails, even though it runs after the insert itself.)
3283	07/09/2012 05:45 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: institution_id: Fixed typo in comment
3282	07/09/2012 05:26 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Fixed date for most recent import
3281	07/09/2012 05:26 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Put the data source comment on a separate line in the log file instead of using a carriage return, which sometimes had the desired effect of overwriting the src comment with the first line of the query but sometimes the line lengths weren't right and there wasn't enough overlap
3280	07/09/2012 04:53 PM	Aaron Marcuse-Kubitza	schemas/vegbien.ERD.mwb: Synced with schema
3279	07/09/2012 04:42 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Removed per-column indexes, which are no longer needed by either row-based or column-based import because they are able to do a merge join or lookup using the table's UNIQUE INDEX. Instead of forcing the database to build and maintain large indexes (15+ GB!) that are not used, optimization-only (non-UNIQUE) indexes should be added as needed only once the database is actually used for queries. In most cases it will not even be necessary to add additional indexes then, because most UNIQUE indexes can be reused for broad lookups (rather than just duplicate elimination). Even the foreign key covering indexes (fki_*) are not needed because we virtually never delete rows in the DB, and even if we were to start doing that regularly, the cost of maintaining the indexes on import is most likely not worth the speed improvements for cascading deletes.
3278	07/09/2012 04:32 PM	Aaron Marcuse-Kubitza	schemas/py_functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.)
3277	07/09/2012 04:31 PM	Aaron Marcuse-Kubitza	schemas/functions.sql: Removed per-column indexes on relational functions, which are no longer needed by row-based import because it is able to do a merge join-style lookup using the table's UNIQUE INDEX. (Note that column-based import doesn't use the (slower) relational functions at all anymore, and instead calls the corresponding SQL function directly using named arguments.)
3276	07/09/2012 04:26 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: plantname: plantname_unique UNIQUE INDEX: Moved scope_id to the back so that the index can easily be used for lookup queries (not just column-based import) without having to explicitly specify NULL for that field. This takes advantage of a btree sorting feature where a broader lookup can be done using just the first n columns of the index.

Project

General

Profile