Project

General

Profile

# Date Author Comment
12516 02/27/2014 01:27 PM Aaron Marcuse-Kubitza

bugfix: *.sql: public.source_by_shortname(): need to wrap it in a nested SELECT because Postgres incorrectly does not constant-fold (inline) it, leading to a slowdown when it is therefore run many times. this is done using the steps at wiki.vegpath.org/Postgres_queries#wrap-function-call-in-nested-SELECT .

11970 01/20/2014 11:33 AM Aaron Marcuse-Kubitza

moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).

10377 07/20/2013 05:09 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: documented total runtime (7.5 min on vegbiendev)

10376 07/20/2013 05:07 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: updated runtimes for map_nulls() inlining, which created a speed improvement of 7x for the numeric columns and 2.5x for the text columns (292563.362->41929.772 ms and 83640.424->35690.797 ms, respectively). note that the map_nulls__coord__*() calls could be optimized further by combining the successive map_nulls() calls into one, with the hstores merged.

10361 07/20/2013 01:27 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: map_nulls__*(): turned off STRICT to allow dynamic inlining, which speeds up the mk_derived_col() statements by 5x (342799.823 ms -> 71533.252 ms (6 min -> 1 min) for latitude_sec)

10360 07/19/2013 07:23 PM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: runtimes: updated for vegbiendev, before dynamic inlining. the times are about twice as fast as on starscream, so vegbiendev is faster at whatever is the limiting speed factor (probably not CPU, based on other benchmarks).

10350 07/19/2013 02:26 PM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: runtimes: documented the machine the times are from

10349 07/19/2013 01:52 PM Aaron Marcuse-Kubitza

inputs/REMIB/: switched to new-style import, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "run the following for each datasource"

10340 07/19/2013 08:40 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed out-of-date rerun time, which applied to doing all the deletes in the same statement (however, the current rerun time is approximately the same). note that index scans are not actually used (as the previous comment incorrectly stated) because the conditions for this filter are prefix-less regexps.

10339 07/19/2013 08:32 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns". null-mapping filters now use wrappers around new util.map_nulls(). note that the verbatim columns input to the filters need to be renamed to avoid name collisions with their filtered columns, which must be VegCore terms for new-style import.

10338 07/19/2013 07:53 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: also filter out non-numbers for long_sec, lat_min, lat_sec

10337 07/19/2013 07:18 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: remove rows where long_min is not a number

10336 07/19/2013 07:15 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: change E'' to regular '' to avoid the need to double \ (instead ' would be doubled). E'' used to be necessary in previous versions of PostgreSQL to avoid a warning about escape string syntax.

10335 07/19/2013 07:09 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed unnecessary () around `DELETE FROM :table WHERE long_deg ...`

10334 07/19/2013 07:03 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: removed coll_year, country, long_deg indexes because the frameshift filter conditions on these columns do not use index scans (because their regexp patterns do not contain a fixed prefix). eventually, some regexp patterns may be able to be modified to use prefixes.

10333 07/19/2013 07:01 AM Aaron Marcuse-Kubitza

bugfix: inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: can't OR together conditions to determine rows to delete, because if any condition is NULL instead of true/false, this will NULL out the entire WHERE condition and prevent any other true conditions from causing a deletion. the best way to fix this is to use a separate DELETE statement for each condition, so that NULLs only impact that particular condition's DELETE. unlike using a modified, NULL-insensitive OR, which would prevent the use of index scans, this allows indexes to be used for conditions that support them.

10332 07/19/2013 06:05 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: removed duplicate CREATE INDEX for the acronym column

10331 07/19/2013 05:59 AM Aaron Marcuse-Kubitza

bugfix: inputs/REMIB/Specimen/postprocess.sql: switched back to the input column names, since the renaming to *_verbatim is part of a later step

10330 07/19/2013 05:26 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/create.sql: moved filtering out of frameshifted rows to postprocess.sql, where it can happen in an idempotent DELETE. this allows filters to remove additional rows to easily be added on top of the existing filters, without needing to remake Specimen (which takes a long time, because of the many stage I derived columns that get added). the logical inversion inherent in the DELETE condition has been factored through rather than wrapped in NOT (...), because removal of frameshifted rows is more accurately specified as the detection of specific patterns that indicate frameshifting rather than the validation of all fields.

10245 07/11/2013 12:55 AM Aaron Marcuse-Kubitza

bugfix: inputs/*/*/postprocess.sql: made all operations idempotent, so that postprocess.sql can be run repeatedly (e.g. by new-style import)

9502 05/21/2013 11:18 PM Aaron Marcuse-Kubitza

inputs/GBIF/Specimen/postprocess.sql, inputs/REMIB/Specimen/postprocess.sql: updated for providers in r9459, which adds TEX

9501 05/21/2013 11:10 PM Aaron Marcuse-Kubitza

inputs/*/*/postprocess.sql: Remove institutions that we have direct data for: query to obtain list: updated for current schema

7250 01/16/2013 09:21 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: Added back ARIZ, NY because some REMIB specimens for these datasources are not yet in the datasources themselves

7249 01/16/2013 08:43 AM Aaron Marcuse-Kubitza

Added inputs/REMIB/Specimen/postprocess.sql to remove institutions that we have direct data for