/trunk/inputs/REMIB/Specimen/postprocess.sql - Changes - BIEN 3 - NCEAS Projects

root/trunk/inputs/REMIB/Specimen/postprocess.sql @ 12193

#	Date	Author	Comment
11970	01/20/2014 11:33 AM	Aaron Marcuse-Kubitza	moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (e.g. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
10377	07/20/2013 05:09 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: documented total runtime (7.5 min on vegbiendev)
10376	07/20/2013 05:07 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: updated runtimes for map_nulls() inlining, which created a speed improvement of 7x for the numeric columns and 2.5x for the text columns (292563.362->41929.772 ms and 83640.424->35690.797 ms, respectively). note that the map_nulls__coord__*() calls could be optimized further by combining the successive map_nulls() calls into one, with the hstores merged.
10361	07/20/2013 01:27 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: map_nulls__*(): turned off STRICT to allow dynamic inlining, which speeds up the mk_derived_col() statements by 5x (342799.823 ms -> 71533.252 ms (6 min -> 1 min) for latitude_sec)
10360	07/19/2013 07:23 PM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: runtimes: updated for vegbiendev, before dynamic inlining. the times are about twice as fast as on starscream, so vegbiendev is faster at whatever is the limiting speed factor (probably not CPU, based on other benchmarks).
10350	07/19/2013 02:26 PM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: runtimes: documented the machine the times are from
10349	07/19/2013 01:52 PM	Aaron Marcuse-Kubitza	inputs/REMIB/: switched to new-style import, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "run the following for each datasource"
10340	07/19/2013 08:40 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed out-of-date rerun time, which applied to doing all the deletes in the same statement (however, the current rerun time is approximately the same). note that index scans are not actually used (as the previous comment incorrectly stated) because the conditions for this filter are prefix-less regexps.
10339	07/19/2013 08:32 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns". null-mapping filters now use wrappers around new util.map_nulls(). note that the verbatim columns input to the filters need to be renamed to avoid name collisions with their filtered columns, which must be VegCore terms for new-style import.
10338	07/19/2013 07:53 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: also filter out non-numbers for long_sec, lat_min, lat_sec
10337	07/19/2013 07:18 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: remove rows where long_min is not a number
10336	07/19/2013 07:15 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: change E'' to regular '' to avoid the need to double \ (instead ' would be doubled). E'' used to be necessary in previous versions of PostgreSQL to avoid a warning about escape string syntax.
10335	07/19/2013 07:09 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed unnecessary () around `DELETE FROM :table WHERE long_deg ...`
10334	07/19/2013 07:03 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: removed coll_year, country, long_deg indexes because the frameshift filter conditions on these columns do not use index scans (because their regexp patterns do not contain a fixed prefix). eventually, some regexp patterns may be able to be modified to use prefixes.
10333	07/19/2013 07:01 AM	Aaron Marcuse-Kubitza	bugfix: inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: can't OR together conditions to determine rows to delete, because if any condition is NULL instead of true/false, this will NULL out the entire WHERE condition and prevent any other true conditions from causing a deletion. the best way to fix this is to use a separate DELETE statement for each condition, so that NULLs only impact that particular condition's DELETE. unlike using a modified, NULL-insensitive OR, which would prevent the use of index scans, this allows indexes to be used for conditions that support them.
10332	07/19/2013 06:05 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: removed duplicate CREATE INDEX for the acronym column
10331	07/19/2013 05:59 AM	Aaron Marcuse-Kubitza	bugfix: inputs/REMIB/Specimen/postprocess.sql: switched back to the input column names, since the renaming to *_verbatim is part of a later step
10330	07/19/2013 05:26 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/create.sql: moved filtering out of frameshifted rows to postprocess.sql, where it can happen in an idempotent DELETE. this allows filters that remove additional rows to be added easily on top of the existing filters, without needing to remake Specimen (which takes a long time, because of the many stage I derived columns that get added). the logical inversion inherent in the DELETE condition has been factored through rather than wrapped in NOT (...), because removal of frameshifted rows is more accurately specified as the detection of specific patterns that indicate frameshifting rather than the validation of all fields.
10245	07/11/2013 12:55 AM	Aaron Marcuse-Kubitza	bugfix: inputs/*/*/postprocess.sql: made all operations idempotent, so that postprocess.sql can be run repeatedly (e.g. by new-style import)
9502	05/21/2013 11:18 PM	Aaron Marcuse-Kubitza	inputs/GBIF/Specimen/postprocess.sql, inputs/REMIB/Specimen/postprocess.sql: updated for providers in r9459, which adds TEX
9501	05/21/2013 11:10 PM	Aaron Marcuse-Kubitza	inputs/*/*/postprocess.sql: Remove institutions that we have direct data for: query to obtain list: updated for current schema
7250	01/16/2013 09:21 AM	Aaron Marcuse-Kubitza	inputs/REMIB/Specimen/postprocess.sql: Added back ARIZ, NY because some REMIB specimens for these datasources are not yet in the datasources themselves
7249	01/16/2013 08:43 AM	Aaron Marcuse-Kubitza	Added inputs/REMIB/Specimen/postprocess.sql to remove institutions that we have direct data for
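
The idempotent, one-DELETE-per-condition approach described in r10330 and r10333 can be sketched roughly as follows. This is only an illustration, not the contents of postprocess.sql: the `specimen` table, its sample rows, and the `postprocess()` helper are hypothetical, and SQLite's GLOB operator stands in for the PostgreSQL regexp conditions so that the example runs with the Python standard library alone.

```python
import sqlite3

# Hypothetical miniature of the staging table; only the column names
# long_deg and long_min are taken from the log above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimen (id INTEGER PRIMARY KEY, long_deg TEXT, long_min TEXT)"
)
conn.executemany(
    "INSERT INTO specimen (long_deg, long_min) VALUES (?, ?)",
    [
        ("99", "30"),      # clean row
        ("99", "Mexico"),  # long_min is not a number
        ("abc", None),     # long_deg is not a number, long_min is NULL
        (None, None),      # both NULL: no pattern matches, so the row stays
    ],
)

# One DELETE per condition, so a NULL in one column only affects that
# column's statement. GLOB '*[^0-9.]*' matches any value containing a
# character other than a digit or '.', as a stand-in for a regexp test.
FILTERS = [
    "DELETE FROM specimen WHERE long_deg GLOB '*[^0-9.]*'",
    "DELETE FROM specimen WHERE long_min GLOB '*[^0-9.]*'",
]


def postprocess(connection):
    """Run every filter and return the ids of the rows that remain."""
    for statement in FILTERS:
        connection.execute(statement)
    return [row[0] for row in connection.execute("SELECT id FROM specimen ORDER BY id")]


first = postprocess(conn)
second = postprocess(conn)  # rerun: there is nothing left to delete
print(first, second)
```

Because each statement only removes rows that match its own pattern, running the whole set again leaves the table unchanged, which is what lets a new filter be appended without rebuilding the table.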
