/ - Changes - BIEN 3 - NCEAS Projects

root @ 3206

#	Date	Author	Comment
3206	07/03/2012 03:59 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Separate the data source comment from the query with a tab in the executed query but a \r in the logged query, so that the query will be shown on the same line as the data source comment in pg_stat_activity, but be hidden by the following line when cating the file and be put on a separate line when viewed in a text editor. This causes the first line of the query to be at the left edge when the log file is viewed, so that it looks more natural.
3205	07/03/2012 03:15 PM	Aaron Marcuse-Kubitza	README.TXT: Data import: Import data into VegBIEN: Added command to use for column-based import
3204	07/03/2012 02:10 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Allow a locationevent to be uniquely specified by its location (which is now datasource-scoped) and start date
3203	07/03/2012 01:26 PM	Aaron Marcuse-Kubitza	VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for specimenreplicate->...->location, by also mapping specimenreplicate's unique keys to location.sourceaccessioncode and setting location.datasource_id.
3202	07/03/2012 12:47 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Allow to be uniquely specified by location_id. This is useful for specimens data where there is one location for every locationevent.
3201	07/03/2012 12:27 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: location: Added datasource_id and sourceaccessioncode so locations can be uniquely specified by the input datasource, rather than being created automatically for each locationevent
3200	07/03/2012 11:45 AM	Aaron Marcuse-Kubitza	schemas/filter_ERD.csv: Add back taxondetermination->taxonoccurrence fkey because that has been replaced by a trigger in the SQL
3199	07/03/2012 11:06 AM	Aaron Marcuse-Kubitza	VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug in column-based import, it applies to row-based import as well. This commit fixes the issue for specimenreplicate->taxonoccurrence.
3198	07/03/2012 07:40 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with remaining stats from most recent import
3197	07/02/2012 03:52 PM	Aaron Marcuse-Kubitza	PostgreSQL-MySQL.csv: Remove INHERITS clauses
3196	07/02/2012 02:37 PM	Aaron Marcuse-Kubitza	schemas/vegbien.ERD.mwb: Fixed lines
3195	07/02/2012 02:24 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly
3194	07/02/2012 02:18 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes awhile for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.
3193	07/02/2012 01:58 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with new stats from an independent import
3192	07/02/2012 01:20 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Fixed UNIQUE INDEXes that were still using COALESCE to use COALESCE in order to match what sql_gen.EnsureNotNull uses
3191	07/02/2012 12:41 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added collectioncode_dwc so that datasources that specify it in addition to the institution_id (such as aggregators) will not need to have catalognumbers be unique within an institution
3190	07/02/2012 12:30 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with more stats from latest import
3189	07/02/2012 11:44 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with initial stats from latest import. Reformatted to put successive runs of column-based next to each other, so they could be directly compared and so that the row-based data wouldn't need to be duplicated. Added empty-value checks to formulas so that they don't need to be manually deleted when one of their inputs is empty.
3188	07/02/2012 10:32 AM	Aaron Marcuse-Kubitza	input.Makefile: Documentation: import/steps.by_col.sql: Fixed bug where needed to run import in test mode
3187	07/02/2012 10:12 AM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Don't set pkeys of missing rows to default value if out_table is a SQL function, because then there is already an entry for every row
3186	07/02/2012 10:03 AM	Aaron Marcuse-Kubitza	bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection
3185	07/02/2012 09:50 AM	Aaron Marcuse-Kubitza	sql.py: mk_select(): Replaced newlines with spaces when query is simple enough to fit on one line
3184	07/02/2012 09:40 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity
3183	07/02/2012 09:39 AM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added src config param, which in autocommit mode, will be included in a comment in every query, to help identify the data source in pg_stat_activity
3182	07/02/2012 09:38 AM	Aaron Marcuse-Kubitza	sql_gen.py: Added lstrip() to remove comments
3181	07/02/2012 09:13 AM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): Added src param to help identify the data source in pg_stat_activity
3180	07/02/2012 08:33 AM	Aaron Marcuse-Kubitza	mappings/DwC2-VegBIEN.specimens.csv: Mapped institutionCode. This will enable datasources to use specimenreplicate's institution_id index for duplicate elimination.
3179	07/02/2012 08:31 AM	Aaron Marcuse-Kubitza	input.Makefile: Prompt user to accept test, instead of providing command line func for doing so
3178	07/02/2012 07:45 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added institution_id so that datasources that specify it (such as aggregators) will not need to have catalognumbers be globally unique. Once the institution_id is mapped to, this will fix a bug where rows with the same catalognumber were assumed to be duplicates even though they were from different institutions. This should also avoid the need to do any duplicate elimination joins when importing specimenreplicate, speeding up column-based import.
3177	07/02/2012 07:32 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Renamed museum_id to institution_id to correspond with DwC's institutionCode, so that it would be more obvious where to map institutionCode fields to
3176	07/02/2012 07:16 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated to include run times for rest of datasources for most recent column-based import
3175	06/29/2012 08:09 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from
3174	06/29/2012 07:46 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): add_col()'s distinguishing comment param: Add the type in case the same input column is being cast to different types, and both types have the same first word (causing their new column names to be the same)
3173	06/29/2012 07:42 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Name the new column with only the first word of the type, to save space in the limited identifier length
3172	06/29/2012 07:41 AM	Aaron Marcuse-Kubitza	strings.py: Added first_word()
3171	06/29/2012 07:35 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Use sql_gen.suffixed_col() to create the new column name
3170	06/29/2012 06:16 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Added run time for SALVIAS organisms, which just finished
3169	06/29/2012 06:14 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Use [1]-style footnotes because copying and pasting to Gmail doesn't preserve the superscripts
3168	06/29/2012 06:11 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated for latest simultaneous column-based import
3167	06/29/2012 04:42 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Don't automatically create an index on the new column, because it doesn't necessarily need an index and the main index used for the join is now added automatically by distinct_table()
3166	06/29/2012 04:39 AM	Aaron Marcuse-Kubitza	sql.py: flatten(): Don't automatically create indexes on all columns, because most columns don't need indexes and the main index used for the join is now added automatically by distinct_table()
3165	06/29/2012 04:35 AM	Aaron Marcuse-Kubitza	sql.py: Removed no longer needed add_index_col() and ensure_not_null() because we are not using index columns
3164	06/29/2012 04:33 AM	Aaron Marcuse-Kubitza	sql.py: add_index(): Don't create index columns for nullable columns, because they require indexes to be created on all columns in order to use a distinct_table() temp table. Also, now that we are no longer using LEFT JOINs, the COALESCE call would only be evaluated once (in the plain JOIN) in the event that PostgreSQL doesn't use an index on a COALESCE expression.
3163	06/29/2012 03:30 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: location: Dropped unique constraint on lat/long because it covered only some rows, which interfered with column-based import's selection of different insert methods based on the presence or absence of duplicate keys. (With the constraint, locations with coordinates would have duplicates eliminated, but locations without coordinates would not be able to find which row was added for a particular location because there was no lookup key to join on, and would all just use the first inserted row.) The previous behavior didn't make much sense anyway, because it would assert that two locationevents occurred in the same place just because they had the same coordinates, which may not have been precise enough to make this determination. Asserting that two locationevents occurred in the same place is really part of the secondary validation, not the import process.
3162	06/29/2012 01:58 AM	Aaron Marcuse-Kubitza	sql.py: DbConn: Fixed bug where Exceptions did not have the query appended if the query was not run in cacheable mode, by moving _add_cursor_info() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper. Fixed bug where non-cacheable queries were not autocommitted, by moving self.do_autocommit() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper.
3161	06/29/2012 01:54 AM	Aaron Marcuse-Kubitza	sql.py: DbConn._db(): Record that a transaction is already open before setting the search_path so that a query is never run with an _savepoint value less than 1 (manual transactions are not supported yet)
3160	06/29/2012 01:52 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.with_savepoint(): Increment _savepoint before running queries so they don't get autocommitted
3159	06/29/2012 01:10 AM	Aaron Marcuse-Kubitza	sql.py: empty_temp(): Empty temp tables even in debug_temp mode, so that it can be seen which tables have been garbage collected and disk space leaks can be detected. This will not affect the external re-runnability of slow queries in debug_temp mode, as long as the user aborts the debug_temp import while the slow query is still running.
3158	06/29/2012 01:07 AM	Aaron Marcuse-Kubitza	sql_gen.py: ColDict: Use OrderedDict so that order of keys in input dict (if ordered) will be preserved. This should ensure that tempt table unique indexes have their columns in the same order as the output table, so that a merge join can be used.
3157	06/29/2012 01:01 AM	Aaron Marcuse-Kubitza	util.py: dict_subset(): Use OrderedDict so that order of keys in input dict (if ordered) will be preserved
3156	06/29/2012 12:55 AM	Aaron Marcuse-Kubitza	main Makefile: python-Darwin: Added pip installation instructions. python-Linux: Added ordereddict.
3155	06/29/2012 12:04 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.col_info(): cacheable param defaults to True now that callers explicitly turn off cacheable when needed
3154	06/29/2012 12:00 AM	Aaron Marcuse-Kubitza	sql.py: add_index_col(): Explicitly set update()'s col_info caching depending on whether col_info will be changed later by add_not_null()
3153	06/28/2012 11:55 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Renamed cacheable param to cacheable_ so it wouldn't conflict with update()'s cacheable param
3152	06/28/2012 11:54 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Added cacheable param to set whether column structure information used to generate the query can be cached
3151	06/28/2012 11:40 PM	Aaron Marcuse-Kubitza	sql.py: add_index_col(): Explicitly set col_info()'s caching depending on whether col_info will be changed later by add_not_null()
3150	06/28/2012 11:35 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.col_info(): Allow caller to specify whether query is cacheable
3149	06/28/2012 11:22 PM	Aaron Marcuse-Kubitza	csv2db: Fixed bug where CREATE TABLE statement was cached, causing it not to be re-executed after a rollback due to a failed COPY FROM. Avoid re-creating the table after a failed COPY FROM, and instead just remove any existing rows.
3148	06/28/2012 11:09 PM	Aaron Marcuse-Kubitza	sql.py: add_index(): Don't generate a unique name for the index because the database does that automatically
3147	06/28/2012 11:00 PM	Aaron Marcuse-Kubitza	csv2db: Vacuum table instead of just reanalyzing it because for some reason reanalyzing it isn't enough to fix the cached row count (causing pgAdmin3 to report that the table needs to be vacuumed)
3146	06/28/2012 10:54 PM	Aaron Marcuse-Kubitza	csv2db: Don't add indexes on the created table because they use up more disk space than the table itself and currently aren't used. (The import process adds indexes on each iteration's column subset instead.)
3145	06/28/2012 10:21 PM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Turning partitioning back on (with a larger limit), since the largest datasources' temp tables are still too big
3144	06/28/2012 10:20 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Fixed bug where if there were multiple unique constraints that were violated, only the distinct temp table for the last one would get garbage-collected
3143	06/28/2012 09:01 PM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Set to sys.maxint to disable partitioning. The last bugfix, which avoided returning a large result set to the client which was never read, seems to have fixed the disk space leak, so it's worth reattempting a full simultaneous import.
3142	06/28/2012 08:30 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Truncate in_table when finished with it, to avoid temp table disk space leaks
3141	06/28/2012 07:56 PM	Aaron Marcuse-Kubitza	sql.py: insert_select(): If caller is only interested in the rowcount (if returning == None), keep the NULL rows for each insert on the server using CREATE TABLE AS. (CREATE TABLE AS sets rowcount to # rows in query, so rowcount will still be set correctly.)
3140	06/28/2012 04:59 PM	Aaron Marcuse-Kubitza	top-level map: Added support for custom public schema, to be able to run imports and tests simultaneously (e.g. on a dev machine)
3139	06/27/2012 10:56 PM	Aaron Marcuse-Kubitza	csv2db: Fixed bug where table needed to be a sql_gen.Table object with the proper schema, so that errors_table would be created in the correct schema. Removed no longer needed changing of the search_path.
3138	06/27/2012 10:55 PM	Aaron Marcuse-Kubitza	csv2db: Fixed bug where table needed to be a sql_gen.Table object with the proper schema, so that errors_table would be created in the correct schema. Removed no longer needed changing of the search_path.
3137	06/27/2012 10:12 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.with_savepoint(): Open a new transaction if one is not already open
3136	06/27/2012 10:11 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: _savepoint starts at 1 because the driver is not in autocommit mode, so a transaction is already open
3135	06/27/2012 10:05 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: _savepoint starts at 1 because the driver is not in autocommit mode, so a transaction is already open
3134	06/27/2012 09:31 PM	Aaron Marcuse-Kubitza	csv2db: Create errors table first, so that imports can start using it right away
3133	06/27/2012 09:25 PM	Aaron Marcuse-Kubitza	input.Makefile: Added import/steps.by_col.sql to generate a Redmine-formatted list of steps for column-based import
3132	06/27/2012 08:56 PM	Aaron Marcuse-Kubitza	bin/map: Optimized default verbosities for the mode: automated tests should not be verbose, column-based import should show all queries to assist profiling, and row-based import should just show row progress
3131	06/27/2012 08:43 PM	Aaron Marcuse-Kubitza	sql_io.py: put(): Run data import queries with log_level=3.5 so they don't clutter the output at the normal import verbosity of 3
3130	06/27/2012 08:27 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Work around PostgreSQL's temp table disk space leak by reconnecting to the DB after every partition
3129	06/27/2012 08:26 PM	Aaron Marcuse-Kubitza	sql.py: mk_select(): Also support limit and start values of type long
3128	06/27/2012 08:13 PM	Aaron Marcuse-Kubitza	sql_gen.py: suffixed_table(): Fixed bug where needed to copy all table attrs, such as is_temp status
3127	06/27/2012 08:05 PM	Aaron Marcuse-Kubitza	sql.py: create_table(): Fixed bug where needed to run query in recover mode in case the table exists and was created before the current connection, such that the CREATE TABLE statement would not have been cached
3126	06/27/2012 07:50 PM	Aaron Marcuse-Kubitza	sql.py: create_table(): Removed final newline after query because that's added by the logging mechanism
3125	06/27/2012 07:43 PM	Aaron Marcuse-Kubitza	sql.py: Added reconnect()
3124	06/27/2012 07:37 PM	Aaron Marcuse-Kubitza	sql.py: DbConn._reset(): Assert that _savepoint is 0 instead of setting it to 0
3123	06/27/2012 07:31 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): put_table_(): Removed no longer used limit, start params
3122	06/27/2012 07:23 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Merged partitioning and subsetting into same section for simplicity, to avoid creating extra temp tables, and to later allow the connection to be closed and reopened between partitions. partition_size: Expressed value without exponent notation to ensure that it's an integer.
3121	06/27/2012 07:11 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Partitioning in_table: Adjust bounds of last partition to actual row #s included
3120	06/27/2012 06:43 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added _ to reset() to indicate that it's a protected method and users should not call it directly
3119	06/27/2012 06:41 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.close(): Reset the connection completely using reset()
3118	06/27/2012 06:40 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added clear_cache() and reset() and use reset() in init()
3117	06/27/2012 06:31 PM	Aaron Marcuse-Kubitza	bin/map: Use new DbConn.close()
3116	06/27/2012 06:31 PM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added close()
3115	06/27/2012 06:07 PM	Aaron Marcuse-Kubitza	db_xml.py: partition_size: Set to just more than the size of the largest data source that was successfully imported in simultaneous import
3114	06/27/2012 05:32 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Partition in_table if larger than a threshold. The threshold is initially set to disable partitioning. Partitioning will hopefully eliminate the excessive disk usage for large input tables, which has caused the system to run out of disk space due to what may be a bug in PostgreSQL.
3113	06/27/2012 05:27 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Set in_table's default srcs to in_table itself instead of sql_gen.src_self, so that any copies of in_table will inherit the same srcs instead of being treated as a top-level table. This ensures that the top-level table's errors table will always be used.
3112	06/27/2012 05:17 PM	Aaron Marcuse-Kubitza	sql_io.py: cast(): Always convert exceptions to warnings if the input is a column or expression, even if there is no place to save the errors, so that invalid data does not need to be handled by the caller in a (much slower) extra exception-handling loop
3111	06/27/2012 04:47 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): MissingCastException: When casting, handle InvalidValueException by filtering out invalid values with invalid2null() in a loop
3110	06/27/2012 04:45 PM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Run sql.update() in recover mode in case expr produces errors. Don't cache sql.update() in case this function will be called again after error recovery.
3109	06/27/2012 04:40 PM	Aaron Marcuse-Kubitza	sql.py: Generalized FunctionValueException to InvalidValueException so that it will match all invalid-value errors, not just those occurring in user-defined functions
3108	06/27/2012 04:22 PM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Removed no longer used sql.FunctionValueException handling, because type casting functions now do their own invalid value handling
3107	06/27/2012 03:44 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Call put_table() recursively using put_table_() to ensure that limit and start are reset to their default values, in case the table gets partitioned (which needs up-to-date limit and start values)

Project

General

Profile

root @ 3206