/ - Changes - BIEN 3 - NCEAS Projects

root @ 3251

#	Date	Author	Comment
3251	07/06/2012 02:33 PM	Aaron Marcuse-Kubitza	pg_dump_vegbien: Enclose the schema name in "" because pg_dump requires this for schema names with special characters
3250	07/06/2012 02:09 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with stats from 2012-7-3 and 2012-7-5 imports. Note that the 2012-7-5 import was partial, so its stats can't be directly compared.
3249	07/06/2012 01:28 PM	Aaron Marcuse-Kubitza	root Makefile: VegBIEN DB: Schemas: Added schemas/%/rm_indexes
3248	07/06/2012 01:27 PM	Aaron Marcuse-Kubitza	Added mk_rm_indexes
3247	07/06/2012 11:14 AM	Aaron Marcuse-Kubitza	sql.py: Added drop() and use it in drop_table()
3246	07/06/2012 10:59 AM	Aaron Marcuse-Kubitza	debug2redmine: Remove profiling information from the logging output
3245	07/06/2012 10:43 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Only print notices in debug mode, because they are output with a log level higher than the debug verbosity threshold, and this avoids unnecessary overhead
3244	07/06/2012 10:41 AM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added profile_row_ct setting, which is passed to profiler.stop() in run_query()
3243	07/06/2012 10:38 AM	Aaron Marcuse-Kubitza	bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead
3242	07/06/2012 10:34 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Only profile queries in debug mode, to avoid unnecessary overhead when the run time will not be displayed
3241	07/06/2012 10:29 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Profile using the profiling.ItersProfiler class, which pretty-prints the run time
3240	07/06/2012 10:22 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Added profiling of query execution, which is logged with the query
3239	07/06/2012 09:26 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Move log_msg() to where it's used, so that it runs after the query is run and can refer to profiling variables
3238	07/06/2012 09:21 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Use else blocks to avoid applying exception handling to commands run after the main command
3237	07/06/2012 09:18 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Always output or return the log message after the query is run, so that it can be output with profiling statistics in the log message header
3236	07/06/2012 09:05 AM	Aaron Marcuse-Kubitza	sql.py: run_query(): Always output the log message after the query is run, so that it can be output with profiling statistics in the log message header
3235	07/05/2012 03:16 PM	Aaron Marcuse-Kubitza	Regenerated vegbien.ERD exports
3234	07/05/2012 03:13 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import. Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.
3233	07/05/2012 03:00 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.
3232	07/05/2012 02:57 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import
3231	07/05/2012 02:48 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Added CHECK constraint which ensures that there is at least one key to sufficiently uniquely identify the specimenreplicate
3230	07/05/2012 02:44 PM	Aaron Marcuse-Kubitza	inputs/CTFS/maps/VegX.organisms.csv: Mapped VegX sourceAccessionCode = VegBIEN plantobservation,specimenreplicate.sourceaccessioncode so that specimenreplicate would have a required key
3229	07/05/2012 02:38 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Sort the plantobservation.sourceaccessioncode/specimenreplicate.sourceaccessioncode mapping with the other _ifs so the adjacent node merging works properly and it gets created before _ignore removes voucherType
3228	07/05/2012 02:34 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Also map plantobservation.sourceaccessioncode to specimenreplicate.sourceaccessioncode so specimenreplicate always has a key and will never be underconstrained
3227	07/05/2012 02:12 PM	Aaron Marcuse-Kubitza	xml_func.py: process(): Fixed bug where an evaluated XML function might create a node of the same name as an existing node, but these nodes would not be merged even though they referred to the same object, by merging siblings of a newly-evaluated (replaced) node if they have the same name
3226	07/05/2012 02:09 PM	Aaron Marcuse-Kubitza	xml_dom.py: Added merge() and merge_adjacent()
3225	07/05/2012 02:08 PM	Aaron Marcuse-Kubitza	xml_dom.py: replace_with_text(): Return the new node
3224	07/05/2012 12:33 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Indirect voucher mappings: Removed no longer needed ":[_id/taxonoccurrence]" because a specimenreplicate is a taxonoccurrence, so it doesn't need to have an empty taxonoccurrence
3223	07/05/2012 12:27 PM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead using the new fkey to plantobservation for direct vouchers. Note that a duplicate aggregateoccurrence is created, because the _if XML function runs after the XPaths have created the initial tree, and thus the nodes it pulls forward do not automatically get merged with adjacent nodes of the same name. This will eventually need to be fixed by auto-merging the nodes.
3222	07/05/2012 12:00 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead adding an fkey to plantobservation for direct vouchers. Also, it makes more sense for a specimenreplicate to directly voucher the plant it came from rather than that plant's taxonoccurrence, because a direct voucher is a closer relationship to the plant.
3221	07/05/2012 11:22 AM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Map collectiondate to specimenreplicate via voucher when the voucher is indirect, rather than always directly to the taxonoccurrence, because the collectiondate relates to the specimenreplicate, not the taxonoccurrence, and is not necessarily 1:1 with it
3220	07/05/2012 11:17 AM	Aaron Marcuse-Kubitza	mappings: Updated for_review VegX-VegBIEN mappings, which hadn't been auto-updated because of a modification time issue. (mappings/VegX-VegBIEN.stems.csv was replaced with an older version, which did not trigger make to remake files depending on it.)
3219	07/05/2012 10:28 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Added sql_gen-compatible indexes on all columns in the locationevent_unique_project_authorcode UNIQUE index: Changed locationevent_project_id index to use COALESCE. Added index on obsstartdate.
3218	07/05/2012 10:19 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Removed no longer needed COALESCE index on location_id now that location_id is NOT NULL
3217	07/05/2012 10:16 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Fixed bug where locationevent_unique_location index was overconstraining locationevent when a sourceaccessioncode or obsstartdate was specified, by combining the locationevent_unique_location, locationevent_unique_accessioncode, and locationevent_unique_location_date indexes into one COALESCE index on the combined fields of those indexes
3216	07/05/2012 10:10 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Made location_id required because every locationevent should have a location, even one with no locationdeterminations. This also avoids the creation of a parent locationevent when subplots are not being used.
3215	07/05/2012 09:48 AM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each locationevent will get an empty commclass, whether or not there are any commdeterminations. This can later be used to add new commdeterminations.
3214	07/05/2012 09:45 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: commclass: Changed commclass_unique to COALESCE classnotes so that there is only one commclass for a locationevent when the commclasses are not separately named. (Currently classnotes is used as the class name field, commname being the name of the community itself.)
3213	07/05/2012 09:33 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: commdetermination: Made commconcept_id NOT NULL because it doesn't make sense to have a commdetermination on nothing. Note that the commname field in commdetermination is not used for making determinations (and may need to be removed to avoid confusion); commname.commname is used instead.
3212	07/05/2012 09:28 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Added COALESCE index on location_id for use by column-based import
3211	07/05/2012 09:24 AM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each plantobservation will get an empty stemobservation, whether or not there are any stemtags. This can later be used to add further stemtags.
3210	07/05/2012 08:58 AM	Aaron Marcuse-Kubitza	mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now
3209	07/05/2012 08:31 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: location: Made datasource_id, sourceaccessioncode NOT NULL to ensure that all locations are uniquely identifiable by their datasource's unique key (sourceaccessioncode)
3208	07/05/2012 08:28 AM	Aaron Marcuse-Kubitza	sql_io.py: put(): Handle NullValueExceptions by returning a NULL pkey, just like put_table() (column-based import) does
3207	07/03/2012 05:29 PM	Aaron Marcuse-Kubitza	VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug specific to column-based import; it applies to row-based import as well. This commit fixes the issue for locationevent->location in plots data, by also mapping locationevent's unique keys to location.sourceaccessioncode and setting location.datasource_id.
3206	07/03/2012 03:59 PM	Aaron Marcuse-Kubitza	sql.py: DbConn.run_query(): Separate the data source comment from the query with a tab in the executed query but a \r in the logged query, so that the query will be shown on the same line as the data source comment in pg_stat_activity, but be hidden by the following line when cating the file and be put on a separate line when viewed in a text editor. This causes the first line of the query to be at the left edge when the log file is viewed, so that it looks more natural.
3205	07/03/2012 03:15 PM	Aaron Marcuse-Kubitza	README.TXT: Data import: Import data into VegBIEN: Added command to use for column-based import
3204	07/03/2012 02:10 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Allow a locationevent to be uniquely specified by its location (which is now datasource-scoped) and start date
3203	07/03/2012 01:26 PM	Aaron Marcuse-Kubitza	VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug specific to column-based import; it applies to row-based import as well. This commit fixes the issue for specimenreplicate->...->location, by also mapping specimenreplicate's unique keys to location.sourceaccessioncode and setting location.datasource_id.
3202	07/03/2012 12:47 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: locationevent: Allow a locationevent to be uniquely specified by location_id. This is useful for specimens data where there is one location for every locationevent.
3201	07/03/2012 12:27 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: location: Added datasource_id and sourceaccessioncode so locations can be uniquely specified by the input datasource, rather than being created automatically for each locationevent
3200	07/03/2012 11:45 AM	Aaron Marcuse-Kubitza	schemas/filter_ERD.csv: Add back taxondetermination->taxonoccurrence fkey because that has been replaced by a trigger in the SQL
3199	07/03/2012 11:06 AM	Aaron Marcuse-Kubitza	VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug specific to column-based import; it applies to row-based import as well. This commit fixes the issue for specimenreplicate->taxonoccurrence.
3198	07/03/2012 07:40 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with remaining stats from most recent import
3197	07/02/2012 03:52 PM	Aaron Marcuse-Kubitza	PostgreSQL-MySQL.csv: Remove INHERITS clauses
3196	07/02/2012 02:37 PM	Aaron Marcuse-Kubitza	schemas/vegbien.ERD.mwb: Fixed lines
3195	07/02/2012 02:24 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly
3194	07/02/2012 02:18 PM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes a while for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.
3193	07/02/2012 01:58 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with new stats from an independent import
3192	07/02/2012 01:20 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: Fixed UNIQUE INDEXes that were still using COALESCE to use COALESCE in order to match what sql_gen.EnsureNotNull uses
3191	07/02/2012 12:41 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added collectioncode_dwc so that datasources that specify it in addition to the institution_id (such as aggregators) will not need to have catalognumbers be unique within an institution
3190	07/02/2012 12:30 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with more stats from latest import
3189	07/02/2012 11:44 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated with initial stats from latest import. Reformatted to put successive runs of column-based next to each other, so they could be directly compared and so that the row-based data wouldn't need to be duplicated. Added empty-value checks to formulas so that they don't need to be manually deleted when one of their inputs is empty.
3188	07/02/2012 10:32 AM	Aaron Marcuse-Kubitza	input.Makefile: Documentation: import/steps.by_col.sql: Fixed bug where needed to run import in test mode
3187	07/02/2012 10:12 AM	Aaron Marcuse-Kubitza	sql_io.py: put_table(): Don't set pkeys of missing rows to default value if out_table is a SQL function, because then there is already an entry for every row
3186	07/02/2012 10:03 AM	Aaron Marcuse-Kubitza	bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection
3185	07/02/2012 09:50 AM	Aaron Marcuse-Kubitza	sql.py: mk_select(): Replaced newlines with spaces when query is simple enough to fit on one line
3184	07/02/2012 09:40 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity
3183	07/02/2012 09:39 AM	Aaron Marcuse-Kubitza	sql.py: DbConn: Added src config param, which in autocommit mode, will be included in a comment in every query, to help identify the data source in pg_stat_activity
3182	07/02/2012 09:38 AM	Aaron Marcuse-Kubitza	sql_gen.py: Added lstrip() to remove comments
3181	07/02/2012 09:13 AM	Aaron Marcuse-Kubitza	sql.py: mk_insert_select(): Added src param to help identify the data source in pg_stat_activity
3180	07/02/2012 08:33 AM	Aaron Marcuse-Kubitza	mappings/DwC2-VegBIEN.specimens.csv: Mapped institutionCode. This will enable datasources to use specimenreplicate's institution_id index for duplicate elimination.
3179	07/02/2012 08:31 AM	Aaron Marcuse-Kubitza	input.Makefile: Prompt user to accept test, instead of providing command line func for doing so
3178	07/02/2012 07:45 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added institution_id so that datasources that specify it (such as aggregators) will not need to have catalognumbers be globally unique. Once the institution_id is mapped to, this will fix a bug where rows with the same catalognumber were assumed to be duplicates even though they were from different institutions. This should also avoid the need to do any duplicate elimination joins when importing specimenreplicate, speeding up column-based import.
3177	07/02/2012 07:32 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: specimenreplicate: Renamed museum_id to institution_id to correspond with DwC's institutionCode, so that it would be more obvious where to map institutionCode fields to
3176	07/02/2012 07:16 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated to include run times for rest of datasources for most recent column-based import
3175	06/29/2012 08:09 AM	Aaron Marcuse-Kubitza	db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from
3174	06/29/2012 07:46 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): add_col()'s distinguishing comment param: Add the type in case the same input column is being cast to different types, and both types have the same first word (causing their new column names to be the same)
3173	06/29/2012 07:42 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Name the new column with only the first word of the type, to save space in the limited identifier length
3172	06/29/2012 07:41 AM	Aaron Marcuse-Kubitza	strings.py: Added first_word()
3171	06/29/2012 07:35 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Use sql_gen.suffixed_col() to create the new column name
3170	06/29/2012 06:16 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Added run time for SALVIAS organisms, which just finished
3169	06/29/2012 06:14 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Use [1]-style footnotes because copying and pasting to Gmail doesn't preserve the superscripts
3168	06/29/2012 06:11 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated for latest simultaneous column-based import
3167	06/29/2012 04:42 AM	Aaron Marcuse-Kubitza	sql_io.py: cast_temp_col(): Don't automatically create an index on the new column, because it doesn't necessarily need an index and the main index used for the join is now added automatically by distinct_table()
3166	06/29/2012 04:39 AM	Aaron Marcuse-Kubitza	sql.py: flatten(): Don't automatically create indexes on all columns, because most columns don't need indexes and the main index used for the join is now added automatically by distinct_table()
3165	06/29/2012 04:35 AM	Aaron Marcuse-Kubitza	sql.py: Removed no longer needed add_index_col() and ensure_not_null() because we are not using index columns
3164	06/29/2012 04:33 AM	Aaron Marcuse-Kubitza	sql.py: add_index(): Don't create index columns for nullable columns, because they require indexes to be created on all columns in order to use a distinct_table() temp table. Also, now that we are no longer using LEFT JOINs, the COALESCE call would only be evaluated once (in the plain JOIN) in the event that PostgreSQL doesn't use an index on a COALESCE expression.
3163	06/29/2012 03:30 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: location: Dropped unique constraint on lat/long because it covered only some rows, which interfered with column-based import's selection of different insert methods based on the presence or absence of duplicate keys. (With the constraint, locations with coordinates would have duplicates eliminated, but locations without coordinates would not be able to find which row was added for a particular location because there was no lookup key to join on, and would all just use the first inserted row.) The previous behavior didn't make much sense anyway, because it would assert that two locationevents occurred in the same place just because they had the same coordinates, which may not have been precise enough to make this determination. Asserting that two locationevents occurred in the same place is really part of the secondary validation, not the import process.
3162	06/29/2012 01:58 AM	Aaron Marcuse-Kubitza	sql.py: DbConn: Fixed bug where Exceptions did not have the query appended if the query was not run in cacheable mode, by moving _add_cursor_info() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper. Fixed bug where non-cacheable queries were not autocommitted, by moving self.do_autocommit() from DbCursor.execute() to run_query() so it would also get called for non-cacheable queries that use a native cursor rather than a wrapper.
3161	06/29/2012 01:54 AM	Aaron Marcuse-Kubitza	sql.py: DbConn._db(): Record that a transaction is already open before setting the search_path so that a query is never run with an _savepoint value less than 1 (manual transactions are not supported yet)
3160	06/29/2012 01:52 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.with_savepoint(): Increment _savepoint before running queries so they don't get autocommitted
3159	06/29/2012 01:10 AM	Aaron Marcuse-Kubitza	sql.py: empty_temp(): Empty temp tables even in debug_temp mode, so that it can be seen which tables have been garbage collected and disk space leaks can be detected. This will not affect the external re-runnability of slow queries in debug_temp mode, as long as the user aborts the debug_temp import while the slow query is still running.
3158	06/29/2012 01:07 AM	Aaron Marcuse-Kubitza	sql_gen.py: ColDict: Use OrderedDict so that order of keys in input dict (if ordered) will be preserved. This should ensure that temp table unique indexes have their columns in the same order as the output table, so that a merge join can be used.
3157	06/29/2012 01:01 AM	Aaron Marcuse-Kubitza	util.py: dict_subset(): Use OrderedDict so that order of keys in input dict (if ordered) will be preserved
3156	06/29/2012 12:55 AM	Aaron Marcuse-Kubitza	main Makefile: python-Darwin: Added pip installation instructions. python-Linux: Added ordereddict.
3155	06/29/2012 12:04 AM	Aaron Marcuse-Kubitza	sql.py: DbConn.col_info(): cacheable param defaults to True now that callers explicitly turn off cacheable when needed
3154	06/29/2012 12:00 AM	Aaron Marcuse-Kubitza	sql.py: add_index_col(): Explicitly set update()'s col_info caching depending on whether col_info will be changed later by add_not_null()
3153	06/28/2012 11:55 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Renamed cacheable param to cacheable_ so it wouldn't conflict with update()'s cacheable param
3152	06/28/2012 11:54 PM	Aaron Marcuse-Kubitza	sql.py: mk_update(): Added cacheable param to set whether column structure information used to generate the query can be cached
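
Revision 3194 above describes subsetting an input table without first counting its rows, by ending the loop when a partial or empty partition is encountered. The following is only a minimal sketch of that idea, not the code in db_xml.py or digir_client; the names iter_chunks and fetch are hypothetical, and an in-memory list stands in for the input table.

```python
def iter_chunks(fetch, chunk_size):
    """Yield successive partitions without knowing the total row count.

    fetch(offset, limit) is a hypothetical callback that returns a list
    of at most `limit` rows starting at `offset`.
    """
    offset = 0
    while True:
        rows = fetch(offset, chunk_size)
        if rows:
            yield rows
        # A partial or empty partition means there is no more data,
        # so the loop ends without ever counting the rows up front.
        if len(rows) < chunk_size:
            break
        offset += chunk_size


# Usage: an in-memory list stands in for the input table (assumption).
data = list(range(7))
fetch = lambda offset, limit: data[offset:offset + limit]
chunks = list(iter_chunks(fetch, 3))
```

When the row count is an exact multiple of the chunk size, this approach costs one extra, empty fetch, which is usually much cheaper than counting all rows of a large table.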
