# Date Author Comment
3272 07/09/2012 02:16 PM Aaron Marcuse-Kubitza

sql.py: DbConn.with_savepoint(): Profile (nested) transactions so that the run time for groups of commands (e.g. csv2db INSERTs) is known
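
As an illustration only, timing a savepoint-wrapped group of commands might look like the sketch below. It uses Python's sqlite3 module rather than the project's DbConn class; the function signature, the `name` and `log` parameters, and the table `t` are all hypothetical.

```python
import sqlite3
import time

def with_savepoint(conn, func, name='sp', log=print):
    """Run func() inside a savepoint and log how long the whole group took.

    Hypothetical stand-in for DbConn.with_savepoint(); not the project's code.
    """
    start = time.perf_counter()
    conn.execute('SAVEPOINT ' + name)
    try:
        result = func()
    except Exception:
        # Undo the group's changes, then discard the savepoint itself
        conn.execute('ROLLBACK TO SAVEPOINT ' + name)
        conn.execute('RELEASE SAVEPOINT ' + name)
        raise
    else:
        conn.execute('RELEASE SAVEPOINT ' + name)
        return result
    finally:
        # Runs on success and on failure, so every group gets a timing line
        log('%s: %.6f s' % (name, time.perf_counter() - start))

# isolation_level=None leaves transaction control to the explicit statements
conn = sqlite3.connect(':memory:', isolation_level=None)
conn.execute('CREATE TABLE t (x INTEGER)')

def insert_rows():
    for x in range(3):
        conn.execute('INSERT INTO t VALUES (?)', (x,))

timings = []
with_savepoint(conn, insert_rows, name='inserts', log=timings.append)
```

The timing covers the whole group rather than each INSERT, which is the point of profiling at the savepoint level.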

3271 07/09/2012 02:04 PM Aaron Marcuse-Kubitza

csv2db: verbosity defaults to 3 so that detailed queries with profiling stats are included in the log file, to assist in optimization

3270 07/09/2012 02:01 PM Aaron Marcuse-Kubitza

csv2db: Don't cache per-row INSERT queries because this bloats the cache (there aren't repeated identical INSERTs that shouldn't be re-run like in row-based import)

3269 07/09/2012 01:57 PM Aaron Marcuse-Kubitza

sql.py with_explain_comment(), DbConn: Fixed bug where with_explain_comment() was being run in per-row imports (row-based import and csv2db with INSERT), causing the overhead of an EXPLAIN query for every single INSERT and filling up the cache with EXPLAIN query results, by adding autoexplain mode, only running with_explain_comment() in autoexplain mode, and only enabling autoexplain mode for column-based import

3268 07/09/2012 01:11 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Turn on autoanalyze mode to help the query planner avoid sequential scans on tables that now contain data. (Don't do this in row-based import because it creates too much overhead per insert.)

3267 07/09/2012 12:24 PM Aaron Marcuse-Kubitza

sql.py: Run all EXPLAIN queries with log_level=4 since the EXPLAIN information is now usually generated when the query is generated rather than when it's run, so the log_level is not known

3266 07/09/2012 12:21 PM Aaron Marcuse-Kubitza

sql.py: Added with_explain_comment() to query generating functions so that nested queries will also have EXPLAIN information

3265 07/09/2012 12:11 PM Aaron Marcuse-Kubitza

sql.py: Added with_explain_comment() and use it in run_query()

3264 07/09/2012 12:01 PM Aaron Marcuse-Kubitza

sql.py: run_query(): EXPLAIN output: Run explain() with log_level 1 higher than the query's log_level, so that low-level queries' EXPLAIN queries are not output when the queries themselves are not output. This also ensures that only level 2 (major) queries have the EXPLAIN logged (to introduce the query that is being run), to avoid cluttering the log output.
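
The log-level rule described above can be sketched as follows. This is a simplified model, not the code in sql.py: `log_debug`, the list-based log, and the verbosity threshold of 3 are assumptions made for illustration.

```python
def log_debug(out, verbosity, msg, level):
    """Append msg to out only if its level is within the verbosity threshold."""
    if level <= verbosity:
        out.append(msg)

def run_query(out, verbosity, query, log_level):
    # The EXPLAIN output is logged one level higher than the query itself,
    # so it can never appear when the query is hidden
    log_debug(out, verbosity, 'EXPLAIN ' + query, log_level + 1)
    log_debug(out, verbosity, query, log_level)

log = []
run_query(log, 3, 'SELECT major', 2)  # EXPLAIN is level 3: logged
run_query(log, 3, 'SELECT minor', 3)  # EXPLAIN is level 4: not logged
```

At a verbosity of 3, only the level 2 query is introduced by its EXPLAIN line; the level 3 query is logged on its own.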

3263 07/09/2012 11:54 AM Aaron Marcuse-Kubitza

sql.py: explain(): Support custom log_level

3262 07/09/2012 11:48 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: taxondetermination: taxondetermination_taxonoccurrence_id_fkey manual fkey constraint: Fixed bug where needed to raise foreign_key_violation instead of unique_violation

3261 07/09/2012 11:23 AM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with stats from latest import

3260 07/06/2012 04:43 PM Aaron Marcuse-Kubitza

debug2redmine.csv: Remove newline before EXPLAIN comment

3259 07/06/2012 04:33 PM Aaron Marcuse-Kubitza

debug2redmine.csv: Filter out EXPLAIN comments

3258 07/06/2012 04:29 PM Aaron Marcuse-Kubitza

sql.py: run_query(): EXPLAIN all explainable queries before they are run, to provide query plans for later profiling and index analysis. At verbosity 3+, this also effectively allows the user to see what query is being run before it's executed.

3257 07/06/2012 04:26 PM Aaron Marcuse-Kubitza

sql.py: is_explainable(): Fixed bug where needed r'' syntax to escape \ in \b
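
The class of bug fixed here is easy to reproduce. In the sketch below, the list of statement types and the name `EXPLAINABLE_RE` are hypothetical; only the raw-string behavior is standard Python.

```python
import re

# In a plain string literal, '\b' is the backspace character (0x08);
# in a raw string, r'\b' reaches the regex engine as a word boundary
assert '\b' == '\x08'
assert r'\b' == '\\b'

# Hypothetical list of explainable statement types, for illustration only
EXPLAINABLE_RE = re.compile(r'^(?:SELECT|INSERT|UPDATE|DELETE|VALUES)\b')

def is_explainable(query):
    """Return True if the query starts with one of the listed keywords."""
    return EXPLAINABLE_RE.match(query) is not None
```

Without the `r` prefix the pattern would look for a literal backspace after the keyword, so no ordinary query would match.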

3256 07/06/2012 04:23 PM Aaron Marcuse-Kubitza

sql.py: Added explain() and is_explainable()

3255 07/06/2012 04:19 PM Aaron Marcuse-Kubitza

strings.py: Added join_lines()

3254 07/06/2012 02:50 PM Aaron Marcuse-Kubitza

mk_rm_indexes: Also include the search_path in the outputted commands

3253 07/06/2012 02:45 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: commclass: Fixed bug where commclass_unique needed to be a UNIQUE INDEX

3252 07/06/2012 02:42 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: plantname: Removed unneeded indexes on plantname and rank (plantname_unique takes care of joins)

3251 07/06/2012 02:33 PM Aaron Marcuse-Kubitza

pg_dump_vegbien: Enclose the schema name in "" because pg_dump requires this for schema names with special characters

3250 07/06/2012 02:09 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with stats from 2012-7-3 and 2012-7-5 imports. Note that the 2012-7-5 import was partial, so its stats can't be directly compared.

3249 07/06/2012 01:28 PM Aaron Marcuse-Kubitza

root Makefile: VegBIEN DB: Schemas: Added schemas/%/rm_indexes

3248 07/06/2012 01:27 PM Aaron Marcuse-Kubitza

Added mk_rm_indexes

3247 07/06/2012 11:14 AM Aaron Marcuse-Kubitza

sql.py: Added drop() and use it in drop_table()

3246 07/06/2012 10:59 AM Aaron Marcuse-Kubitza

debug2redmine: Remove profiling information from the logging output

3245 07/06/2012 10:43 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Only print notices in debug mode, because they are output with a log level higher than the debug verbosity threshold, and this avoids unnecessary overhead

3244 07/06/2012 10:41 AM Aaron Marcuse-Kubitza

sql.py: DbConn: Added profile_row_ct setting, which is passed to profiler.stop() in run_query()

3243 07/06/2012 10:38 AM Aaron Marcuse-Kubitza

bin/map: Logging: Raised debug-mode verbosity threshold to 1.5 so that in row-based imports, which have a default verbosity of 1.1, sql.DbConn.run_query() will not profile the query, to avoid unnecessary overhead

3242 07/06/2012 10:34 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Only profile queries in debug mode, to avoid unnecessary overhead when the run time will not be displayed

3241 07/06/2012 10:29 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Profile using the profiling.ItersProfiler class, which pretty-prints the run time

3240 07/06/2012 10:22 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Added profiling of query execution, which is logged with the query

3239 07/06/2012 09:26 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Move log_msg() to where it's used, so that it runs after the query is run and can refer to profiling variables

3238 07/06/2012 09:21 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Use else blocks to avoid applying exception handling to commands run after the main command
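
A minimal sketch of the pattern, with hypothetical names (`run_then_process`, `run`, `process`) and KeyError standing in for whatever exception the real code handles:

```python
def run_then_process(run, process):
    """Handle failures of run() only; let failures of process() propagate."""
    try:
        result = run()
    except KeyError:
        return None  # only an error raised by run() itself lands here
    else:
        # Exceptions raised here are NOT caught by the except clause above
        return process(result)
```

Placing `process(result)` inside the `try` body instead would silently swallow a KeyError raised by the follow-up step.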

3237 07/06/2012 09:18 AM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Always output or return the log message after the query is run, so that it can be output with profiling statistics in the log message header

3236 07/06/2012 09:05 AM Aaron Marcuse-Kubitza

sql.py: run_query(): Always output the log message after the query is run, so that it can be output with profiling statistics in the log message header

3235 07/05/2012 03:16 PM Aaron Marcuse-Kubitza

Regenerated vegbien.ERD exports

3234 07/05/2012 03:13 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import. Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.

3233 07/05/2012 03:00 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: Removed no longer needed individual-column lookup indexes because the constraint-covering lookup indexes now handle lookups. This also avoids index bloat.

3232 07/05/2012 02:57 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: Added covering lookup indexes on the unique constraints to enable fast merge joins in column-based import

3231 07/05/2012 02:48 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: Added CHECK constraint which ensures that there is at least one key to sufficiently uniquely identify the specimenreplicate

3230 07/05/2012 02:44 PM Aaron Marcuse-Kubitza

inputs/CTFS/maps/VegX.organisms.csv: Mapped VegX sourceAccessionCode = VegBIEN plantobservation,specimenreplicate.sourceaccessioncode so that specimenreplicate would have a required key

3229 07/05/2012 02:38 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Sort the plantobservation.sourceaccessioncode/specimenreplicate.sourceaccessioncode mapping with the other _ifs so the adjacent node merging works properly and it gets created before _ignore removes voucherType

3228 07/05/2012 02:34 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Also map plantobservation.sourceaccessioncode to specimenreplicate.sourceaccessioncode so specimenreplicate always has a key and will never be underconstrained

3227 07/05/2012 02:12 PM Aaron Marcuse-Kubitza

xml_func.py: process(): Fixed bug where an evaluated XML function might create a node of the same name as an existing node, but these nodes would not be merged even though they referred to the same object, by merging siblings of a newly-evaluated (replaced) node if they have the same name
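
Merging same-named siblings could look roughly like the sketch below. It uses the standard library's xml.dom.minidom, and `merge_same_name_siblings` is a hypothetical name; the project's own merge() and merge_adjacent() in xml_dom.py may work differently.

```python
from xml.dom import minidom

def merge_same_name_siblings(parent):
    """Fold each later same-named child element into the first one."""
    first_by_name = {}
    for child in list(parent.childNodes):  # copy: the list is mutated below
        if child.nodeType != child.ELEMENT_NODE:
            continue
        first = first_by_name.setdefault(child.tagName, child)
        if first is child:
            continue  # first element with this name: keep it as the target
        for grandchild in list(child.childNodes):
            first.appendChild(grandchild)  # appendChild moves the node
        parent.removeChild(child)

doc = minidom.parseString('<a><b><x/></b><b><y/></b><c/></a>')
merge_same_name_siblings(doc.documentElement)
merged = doc.documentElement.toxml()
```

This sketch merges all same-named siblings under one parent, not only adjacent ones, and it ignores attributes.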

3226 07/05/2012 02:09 PM Aaron Marcuse-Kubitza

xml_dom.py: Added merge() and merge_adjacent()

3225 07/05/2012 02:08 PM Aaron Marcuse-Kubitza

xml_dom.py: replace_with_text(): Return the new node

3224 07/05/2012 12:33 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Indirect voucher mappings: Removed no longer needed ":[*_id/taxonoccurrence]" because a specimenreplicate is a taxonoccurrence, so it doesn't need to have an empty taxonoccurrence

3223 07/05/2012 12:27 PM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead using the new fkey to plantobservation for direct vouchers. Note that a duplicate aggregateoccurrence is created, because the _if XML function runs after the XPaths have created the initial tree, and thus the nodes it pulls forward do not automatically get merged with adjacent nodes of the same name. This will eventually need to be fixed by auto-merging the nodes.

3222 07/05/2012 12:00 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: Fixing specimenreplicate->taxonoccurrence mapping bug where taxonoccurrence_id is no longer used as an fkey because it's instead a pkey inherited from taxonoccurrence, by instead adding an fkey to plantobservation for direct vouchers. Also, it makes more sense for a specimenreplicate to directly voucher the plant it came from rather than that plant's taxonoccurrence, because a direct voucher is a closer relationship to the plant.

3221 07/05/2012 11:22 AM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Map collectiondate to specimenreplicate via voucher when the voucher is indirect, rather than always directly to the taxonoccurrence, because the collectiondate relates to the specimenreplicate, not the taxonoccurrence, and is not necessarily 1:1 with it

3220 07/05/2012 11:17 AM Aaron Marcuse-Kubitza

mappings: Updated for_review VegX-VegBIEN mappings, which hadn't been auto-updated because of a modification time issue. (mappings/VegX-VegBIEN.stems.csv was replaced with an older version, which did not trigger make to remake files depending on it.)

3219 07/05/2012 10:28 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Added sql_gen-compatible indexes on all columns in the locationevent_unique_project_authorcode UNIQUE index: Changed locationevent_project_id index to use COALESCE. Added index on obsstartdate.

3218 07/05/2012 10:19 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Removed no longer needed COALESCE index on location_id now that location_id is NOT NULL

3217 07/05/2012 10:16 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Fixed bug where locationevent_unique_location index was overconstraining locationevent when a sourceaccessioncode or obsstartdate was specified, by combining the locationevent_unique_location, locationevent_unique_accessioncode, and locationevent_unique_location_date indexes into one COALESCE index on the combined fields of those indexes

3216 07/05/2012 10:10 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Made location_id required because every locationevent should have a location, even one with no locationdeterminations. This also avoids the creation of a parent locationevent when subplots are not being used.

3215 07/05/2012 09:48 AM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each locationevent will get an empty commclass, whether or not there are any commdeterminations. This can later be used to add new commdeterminations.

3214 07/05/2012 09:45 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: commclass: Changed commclass_unique to COALESCE classnotes so that there is only one commclass for a locationevent when the commclasses are not separately named. (Currently classnotes is used as the class name field, commname being the name of the community itself.)
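
The effect of a COALESCE index can be shown with a small sketch. It runs against SQLite through Python's sqlite3 module rather than PostgreSQL, the two-column table is a simplified stand-in for commclass, and the empty-string sentinel is an assumption. In both databases, a plain UNIQUE index treats NULLs as distinct, so wrapping the column in COALESCE is what makes two unnamed rows collide.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE commclass (locationevent_id INTEGER NOT NULL, classnotes TEXT)')
# Index the COALESCEd value so that NULL classnotes compare equal to each other
conn.execute(
    "CREATE UNIQUE INDEX commclass_unique"
    " ON commclass (locationevent_id, COALESCE(classnotes, ''))")

def try_insert(locationevent_id, classnotes):
    """Return True if the row was inserted, False on a uniqueness violation."""
    try:
        conn.execute('INSERT INTO commclass VALUES (?, ?)',
                     (locationevent_id, classnotes))
        return True
    except sqlite3.IntegrityError:
        return False

first = try_insert(1, None)        # unnamed commclass for locationevent 1
duplicate = try_insert(1, None)    # second unnamed commclass: rejected
named = try_insert(1, 'forest')    # separately named commclass: allowed
```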

3213 07/05/2012 09:33 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: commdetermination: Made commconcept_id NOT NULL because it doesn't make sense to have a commdetermination on nothing. Note that the commname field in commdetermination is not used for making determinations (and may need to be removed to avoid confusion); commname.commname is used instead.

3212 07/05/2012 09:28 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Added COALESCE index on location_id for use by column-based import

3211 07/05/2012 09:24 AM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now. Note that each plantobservation will get an empty stemobservation, whether or not there are any stemtags. This can later be used to add further stemtags.

3210 07/05/2012 08:58 AM Aaron Marcuse-Kubitza

mappings/VegX-VegBIEN.stems.csv: Removed _collapse where it's no longer needed because sql_io.put() handles that now

3209 07/05/2012 08:31 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: location: Made datasource_id, sourceaccessioncode NOT NULL to ensure that all locations are uniquely identifiable by their datasource's unique key (sourceaccessioncode)

3208 07/05/2012 08:28 AM Aaron Marcuse-Kubitza

sql_io.py: put(): Handle NullValueExceptions by returning a NULL pkey, just like put_table() (column-based import) does

3207 07/03/2012 05:29 PM Aaron Marcuse-Kubitza

VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug in column-based import; it applies to row-based import as well. This commit fixes the issue for locationevent->location in plots data, by also mapping locationevent's unique keys to location.sourceaccessioncode and setting location.datasource_id.

3206 07/03/2012 03:59 PM Aaron Marcuse-Kubitza

sql.py: DbConn.run_query(): Separate the data source comment from the query with a tab in the executed query but a \r in the logged query, so that the query will be shown on the same line as the data source comment in pg_stat_activity, but be hidden by the following line when cating the file and be put on a separate line when viewed in a text editor. This causes the first line of the query to be at the left edge when the log file is viewed, so that it looks more natural.
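
The two separators can be illustrated with a short sketch. The `/* ... */` comment format and the function name `format_query` are assumptions; only the tab versus \r distinction comes from the commit message.

```python
def format_query(src, query, for_log):
    """Prefix query with a data source comment.

    The executed form uses a tab, which keeps everything on one line.
    The logged form uses a carriage return, so a terminal overwrites the
    comment with the query while a text editor shows them on separate lines.
    """
    comment = '/* %s */' % src  # hypothetical comment format
    separator = '\r' if for_log else '\t'
    return comment + separator + query

executed = format_query('CTFS', 'SELECT 1', for_log=False)
logged = format_query('CTFS', 'SELECT 1', for_log=True)
```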

3205 07/03/2012 03:15 PM Aaron Marcuse-Kubitza

README.TXT: Data import: Import data into VegBIEN: Added command to use for column-based import

3204 07/03/2012 02:10 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Allow a locationevent to be uniquely specified by its location (which is now datasource-scoped) and start date

3203 07/03/2012 01:26 PM Aaron Marcuse-Kubitza

VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug in column-based import; it applies to row-based import as well. This commit fixes the issue for specimenreplicate->...->location, by also mapping specimenreplicate's unique keys to location.sourceaccessioncode and setting location.datasource_id.

3202 07/03/2012 12:47 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: locationevent: Allow to be uniquely specified by location_id. This is useful for specimens data where there is one location for every locationevent.

3201 07/03/2012 12:27 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: location: Added datasource_id and sourceaccessioncode so locations can be uniquely specified by the input datasource, rather than being created automatically for each locationevent

3200 07/03/2012 11:45 AM Aaron Marcuse-Kubitza

schemas/filter_ERD.csv: Add back taxondetermination->taxonoccurrence fkey because that has been replaced by a trigger in the SQL

3199 07/03/2012 11:06 AM Aaron Marcuse-Kubitza

VegBIEN: Fixing import issue related to duplicate entries in tables with children, where when a new table entry duplicates an existing entry, the 1:1 tables of that table and those tables' children are not merged, causing them to become orphaned. It is described in detail at <https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/Import_issues#Merging-duplicates-with-children>, including the rationale for this solution. Note that this is not a bug in column-based import; it applies to row-based import as well. This commit fixes the issue for specimenreplicate->taxonoccurrence.

3198 07/03/2012 07:40 AM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with remaining stats from most recent import

3197 07/02/2012 03:52 PM Aaron Marcuse-Kubitza

PostgreSQL-MySQL.csv: Remove INHERITS clauses

3196 07/02/2012 02:37 PM Aaron Marcuse-Kubitza

schemas/vegbien.ERD.mwb: Fixed lines

3195 07/02/2012 02:24 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Moved in_row_ct updating to Subsetting section so the cursor's rowcount can be used directly

3194 07/02/2012 02:18 PM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Subsetting in_table: Don't count # rows because this takes a while for large datasets. Instead, use the chunking algorithm in digir_client, which ends the loop when a partial or empty partition is encountered.
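
A chunking loop of this kind can be sketched as below. The names `iter_chunks` and `fetch` are hypothetical, and a Python list stands in for the input table; the actual code in digir_client and put_table() is not shown in this log.

```python
def iter_chunks(fetch, limit):
    """Yield chunks from fetch(offset, limit) until a partial or empty one.

    No up-front row count is needed: a chunk shorter than limit means
    the input has been exhausted.
    """
    offset = 0
    while True:
        rows = fetch(offset, limit)
        if rows:
            yield rows
        if len(rows) < limit:
            break  # partial or empty chunk: stop without counting rows
        offset += limit

data = list(range(7))

def fetch(offset, limit):
    # Stand-in for a "SELECT ... LIMIT limit OFFSET offset" query
    return data[offset:offset + limit]

chunks = list(iter_chunks(fetch, 3))
```

When the row count is an exact multiple of the chunk size, the loop issues one extra fetch that returns an empty chunk, which is usually far cheaper than counting the rows of a large table.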

3193 07/02/2012 01:58 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with new stats from an independent import

3192 07/02/2012 01:20 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: Fixed UNIQUE INDEXes that were still using COALESCE to use COALESCE in order to match what sql_gen.EnsureNotNull uses

3191 07/02/2012 12:41 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added collectioncode_dwc so that datasources that specify it in addition to the institution_id (such as aggregators) will not need to have catalognumbers be unique within an institution

3190 07/02/2012 12:30 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with more stats from latest import

3189 07/02/2012 11:44 AM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated with initial stats from latest import. Reformatted to put successive runs of column-based next to each other, so they could be directly compared and so that the row-based data wouldn't need to be duplicated. Added empty-value checks to formulas so that they don't need to be manually deleted when one of their inputs is empty.

3188 07/02/2012 10:32 AM Aaron Marcuse-Kubitza

input.Makefile: Documentation: import/steps.by_col.sql: Fixed bug where needed to run import in test mode

3187 07/02/2012 10:12 AM Aaron Marcuse-Kubitza

sql_io.py: put_table(): Don't set pkeys of missing rows to default value if out_table is a SQL function, because then there is already an entry for every row

3186 07/02/2012 10:03 AM Aaron Marcuse-Kubitza

bin/map: by_col: Reuse existing out_db connection for in_db instead of opening separate connection

3185 07/02/2012 09:50 AM Aaron Marcuse-Kubitza

sql.py: mk_select(): Replaced newlines with spaces when query is simple enough to fit on one line

3184 07/02/2012 09:40 AM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Set db.src to help identify the data source in pg_stat_activity

3183 07/02/2012 09:39 AM Aaron Marcuse-Kubitza

sql.py: DbConn: Added src config param, which in autocommit mode, will be included in a comment in every query, to help identify the data source in pg_stat_activity

3182 07/02/2012 09:38 AM Aaron Marcuse-Kubitza

sql_gen.py: Added lstrip() to remove comments

3181 07/02/2012 09:13 AM Aaron Marcuse-Kubitza

sql.py: mk_insert_select(): Added src param to help identify the data source in pg_stat_activity

3180 07/02/2012 08:33 AM Aaron Marcuse-Kubitza

mappings/DwC2-VegBIEN.specimens.csv: Mapped institutionCode. This will enable datasources to use specimenreplicate's institution_id index for duplicate elimination.

3179 07/02/2012 08:31 AM Aaron Marcuse-Kubitza

input.Makefile: Prompt user to accept test, instead of providing command line func for doing so

3178 07/02/2012 07:45 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: UNIQUE INDEX on catalognumber_dwc: Added institution_id so that datasources that specify it (such as aggregators) will not need to have catalognumbers be globally unique. Once the institution_id is mapped to, this will fix a bug where rows with the same catalognumber were assumed to be duplicates even though they were from different institutions. This should also avoid the need to do any duplicate elimination joins when importing specimenreplicate, speeding up column-based import.

3177 07/02/2012 07:32 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: specimenreplicate: Renamed museum_id to institution_id to correspond with DwC's institutionCode, so that it would be more obvious where to map institutionCode fields to

3176 07/02/2012 07:16 AM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated to include run times for rest of datasources for most recent column-based import

3175 06/29/2012 08:09 AM Aaron Marcuse-Kubitza

db_xml.py: put_table(): Subsetting in_table: Prepend schema to subset table name so that in pg_stat_activity, it's clear which datasource a particular query is from

3174 06/29/2012 07:46 AM Aaron Marcuse-Kubitza

sql_io.py: cast_temp_col(): add_col()'s distinguishing comment param: Add the type in case the same input column is being cast to different types, and both types have the same first word (causing their new column names to be the same)

3173 06/29/2012 07:42 AM Aaron Marcuse-Kubitza

sql_io.py: cast_temp_col(): Name the new column with only the first word of the type, to save space in the limited identifier length