# Date Author Comment
10435 07/26/2013 12:17 PM Aaron Marcuse-Kubitza

bugfix: inputs/SpeciesLink/Specimen/map.csv: *scientificName: remapped to scientificName instead of taxonName to match the DwC term's name (this is the same dwc_terms_scientificName mismapping that was fixed in r10434)

10434 07/26/2013 11:56 AM Aaron Marcuse-Kubitza

bugfix: inputs/SpeciesLink/Specimen/map.csv: dwc_terms_scientificName: remapped to scientificName instead of taxonName to match that DwC term name, as well as the mappings of other *scientificName terms

10433 07/26/2013 11:06 AM Aaron Marcuse-Kubitza

inputs/SpeciesLink/Specimen/map.csv: marked dwc_geospatial_VerbatimLatitude,Longitude as exact duplicates of dwc_terms_*

10432 07/26/2013 10:52 AM Aaron Marcuse-Kubitza

inputs/SpeciesLink/Specimen/map.csv: remapped identical _alt-ed fields to DUPLICATE. this avoids the need to translate these to postprocessing derived columns for new-style import, and also speeds up column-based import because there are fewer automatic _alts to perform to resolve filter-less collisions.

10431 07/26/2013 10:06 AM Aaron Marcuse-Kubitza

bugfix: inputs/SpeciesLink/Specimen/map.csv: *CollectorNumber: moved these to the same _alt group as recordNumber, because they are actually duplicates

10430 07/26/2013 09:43 AM Aaron Marcuse-Kubitza

correction: inputs/SpeciesLink/Specimen/map.csv: FieldNumber: fixed incorrect comment that these fields are identical to recordNumber, when instead they have the same meaning but not the same values. instead, values are stored under either of the two terms. the previous conclusion had been based on an incorrect query, which used != instead of the NULL-sensitive IS NOT DISTINCT FROM.
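
The NULL pitfall described in this correction can be sketched in Python. This is only a model of SQL's three-valued comparison, not the query that was actually run: `sql_ne()` stands in for `!=`, and `sql_is_distinct_from()` stands in for `IS DISTINCT FROM` (the negation of `IS NOT DISTINCT FROM`). The sample rows are hypothetical (FieldNumber, recordNumber) pairs in which a value is stored under only one of the two terms.

```python
from typing import Optional


def sql_ne(a, b) -> Optional[bool]:
    """Model SQL `a != b`: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def sql_is_distinct_from(a, b) -> bool:
    """Model SQL `a IS DISTINCT FROM b`: NULL is compared like an ordinary value."""
    return a != b  # in Python, None != "x" is True and None != None is False


# Hypothetical (FieldNumber, recordNumber) rows, not taken from SpeciesLink.
rows = [("12", None), (None, "34"), ("56", "56")]

# A WHERE clause keeps a row only when its condition is true, so NULL results are dropped.
differing_ne = [r for r in rows if sql_ne(*r) is True]
differing_distinct = [r for r in rows if sql_is_distinct_from(*r)]
```

With these rows, the `!=` model reports no differing rows at all, which is how a query could wrongly suggest that two fields are identical, while the NULL-safe model reports the two rows where only one term is populated.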

10429 07/25/2013 08:14 PM Aaron Marcuse-Kubitza

planning/timeline/timeline.2013.xls: Adding derived columns: extended to overlap with all subtasks

10428 07/25/2013 08:12 PM Aaron Marcuse-Kubitza

planning/timeline/timeline.2013.xls: Geoscrubbing: split into separate re-run and automated pipeline tasks

10427 07/25/2013 08:09 PM Aaron Marcuse-Kubitza

planning/timeline/timeline.2013.xls: moved Data provider validations before Adding derived columns because ensuring that the source data is in the database is more important than the derived data, which can always be added later

10426 07/25/2013 08:00 PM Aaron Marcuse-Kubitza

planning/timeline/timeline.2013.xls: Data provider validations: added dot in July because some amount of datasource-level validation happens when mapping issues are discovered during the refactoring

10425 07/25/2013 07:34 PM Aaron Marcuse-Kubitza

bugfix: inputs/*/*/map.csv for specimen tables: remapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself rather than to any parent event (specimens don't have a parent event)

10424 07/25/2013 07:34 PM Aaron Marcuse-Kubitza

inputs/*/*/map.csv for IndividualObservation tables: also mapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself in addition to any parent event which it may be a part of

10423 07/25/2013 06:27 PM Aaron Marcuse-Kubitza

bugfix: inputs/XAL/Specimen/, NY/Ecatalog_all/: *JulianDay: remapped to dayOfYear instead of day (the day of the month)
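
To make the distinction behind this remapping concrete, here is a small Python illustration of day of the month versus day of the year for a single date. The date is arbitrary and is not taken from any of the datasources.

```python
from datetime import date, timedelta

d = date(2013, 7, 25)

day_of_month = d.day                 # what a `day` column holds
day_of_year = d.timetuple().tm_yday  # what a `dayOfYear` column holds

# A dayOfYear value only identifies a date together with its year.
rebuilt = date(2013, 1, 1) + timedelta(days=day_of_year - 1)
```

Storing a day-of-year value such as 206 in a day-of-month column would produce an impossible date, which is why the two terms need separate mappings.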

10422 07/25/2013 05:08 PM Aaron Marcuse-Kubitza

inputs/SpeciesLink/Specimen/map.csv: remapped *dayOfYear-related terms to UNUSED

10421 07/25/2013 04:53 PM Aaron Marcuse-Kubitza

bugfix: inputs/SpeciesLink/Specimen/map.csv: remapped conceptual_darwin_2003_1_0_JulianDay, dwc_dwcore_DayOfYear to dayOfYear instead of day (the day of the month)

10420 07/25/2013 04:43 PM Aaron Marcuse-Kubitza

mappings/VegCore.htm: regenerated from wiki. added dayOfYear (=julianDay), which is different from startDayOfYear/endDayOfYear.

10419 07/25/2013 01:59 PM Aaron Marcuse-Kubitza

inputs/CTFS/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource

10418 07/25/2013 01:50 PM Aaron Marcuse-Kubitza

inputs/CTFS/StemObservation/: translated collisions (missing filters) to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns

10417 07/25/2013 10:57 AM Aaron Marcuse-Kubitza

planning/timeline/timeline.2013.xls: rebalanced tasks across the remaining months, taking into account priority changes made in the conference call (e.g. that we should not be handling people's individual data requests (Brad, wiki.vegpath.org/2013-07-25_conference_call#Decisions-made))

10416 07/25/2013 10:50 AM Aaron Marcuse-Kubitza

planning/timeline/timeline.2013.xls: updated with additional tasks added in conference call: translate source-specific derived columns to plain SQL, flatten the datasources, automated geoscrubbing pipeline

10415 07/25/2013 08:43 AM Aaron Marcuse-Kubitza

planning/goals/BIEN_3_derived_data_products_NormalizedDB_only.docx: removed BIEN species-level phylogeny, which Brad says is out of scope for the BIEN DB

10414 07/25/2013 08:24 AM Aaron Marcuse-Kubitza

removed planning/workflow/bien3_architecture.odp because the current version is now in bien3_architecture.pptx

10413 07/25/2013 08:13 AM Aaron Marcuse-Kubitza

added planning/workflow/validation/TNRS_results.ppt symlink to inputs/test_taxonomic_names/_scrub/TNRS_results.ppt

10412 07/25/2013 08:10 AM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: highlighted the sample row and related rows

10411 07/25/2013 08:04 AM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/TNRS_results.xls: moved arrows to TNRS_results.ppt so they can be changed more easily

10410 07/25/2013 07:51 AM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: TNRS.tnrs: added diagram labels for the various names and steps

10409 07/25/2013 07:32 AM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/TNRS_results.xls: use "Poa annua var. eriolepis"->"Poaceae Poa annua L." as the synonym example instead of "Poa annua fo. lanuginosa"->"Poaceae Poa annua var. annua" because the input name is simpler and it's closer to the beginning of the list

10408 07/25/2013 07:20 AM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/run: exports/make(): tnrs.csv: include Name_matched instead of Genus_matched+Specific_epithet_matched because this also contains lower ranks, which are used in the TNRS synonymizing

10407 07/25/2013 07:06 AM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: added annotations explaining the import steps

10406 07/25/2013 06:36 AM Aaron Marcuse-Kubitza

added inputs/test_taxonomic_names/_scrub/TNRS_results.ppt, containing the *.png screenshots with tables labeled

10405 07/25/2013 06:35 AM Aaron Marcuse-Kubitza

added inputs/test_taxonomic_names/_scrub/*.png, screenshots of the TNRS_results.xls tabs (LibreOffice does not preserve the formatting when pasting a spreadsheet to a PowerPoint as a table, and the table editing options are limited)

10404 07/25/2013 06:31 AM Aaron Marcuse-Kubitza

added inputs/test_taxonomic_names/_scrub/TNRS_results.xls with formatted versions of the *.csv tables

10403 07/24/2013 05:15 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/run: exports/make(): subset the columns to include only the most important to demo how the data is represented

10402 07/24/2013 05:13 PM Aaron Marcuse-Kubitza

lib/sh/db.sh: mk_select(): support passing $cols as array instead of SQL string, which is easier to enter in a shell script (less quotes, \ , etc.)

10401 07/24/2013 05:12 PM Aaron Marcuse-Kubitza

lib/sh/db.sh: added cols2list()
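
The idea behind the two lib/sh/db.sh changes above (accepting the columns as an array and joining them into a SQL list) can be sketched in Python. This is not the shell implementation: the function names are reused only for readability, and the identifier quoting shown here is an assumption about what a safe column list should look like.

```python
def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def cols2list(cols) -> str:
    """Join a sequence of column names into a comma-separated SQL column list."""
    return ", ".join(quote_ident(c) for c in cols)


def mk_select(table: str, cols="*") -> str:
    """Build a SELECT statement; `cols` may be a SQL string or a sequence of names."""
    if not isinstance(cols, str):  # array form: no manual quoting or escaping needed
        cols = cols2list(cols)
    return f"SELECT {cols} FROM {quote_ident(table)}"


query = mk_select("tnrs", ["Name_matched", "Genus_matched"])
```

Passing a list keeps the quoting in one place, which is the convenience the commit message describes for shell scripts.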

10400 07/24/2013 05:10 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: added is_array()

10399 07/24/2013 04:38 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/_scrub/run: exports/make(): allow specifying an explicit columns list for each table using cols=... (initially set to all columns)

10398 07/24/2013 04:09 PM Aaron Marcuse-Kubitza

added inputs/test_taxonomic_names/_scrub/*.csv exports

10397 07/24/2013 04:09 PM Aaron Marcuse-Kubitza

added inputs/test_taxonomic_names/_scrub/run, which exports the test_scrub-populated tables to CSV

10396 07/24/2013 04:08 PM Aaron Marcuse-Kubitza

lib/sh/db_make.sh: added pg_export_table_to_dir(), pg_export_tables_to_dir(). unlike db.sh pg_export_table_to_dir_no_header(), these functions are make-aware and will not clobber an existing file.
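
A minimal Python sketch of the "make-aware" behavior described here, where an export is skipped if its output file already exists. The helper name `export_if_missing()` and the CSV rows are hypothetical; the real functions are shell functions that export PostgreSQL tables.

```python
import csv
import os
import tempfile


def export_if_missing(path, make_rows) -> bool:
    """Write rows to `path` as CSV unless the file already exists.

    Returns True if the file was created, False if it was left untouched.
    """
    if os.path.exists(path):
        return False  # make-style: an existing target is treated as up to date
    tmp = path + ".tmp"
    with open(tmp, "w", newline="") as f:
        csv.writer(f).writerows(make_rows())
    os.replace(tmp, path)  # move into place only after a complete write
    return True


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "tnrs.csv")
    first = export_if_missing(target, lambda: [["Name_matched"], ["Poa annua"]])
    second = export_if_missing(target, lambda: [["clobbered"]])
    with open(target, newline="") as f:
        contents = list(csv.reader(f))
```

The second call leaves the first export intact, so rerunning the export script does not overwrite files that were already produced.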

10395 07/24/2013 03:15 PM Aaron Marcuse-Kubitza

reran inputs/test_taxonomic_names/test_scrub, which generates the public.test_taxonomic_names sample schema

10394 07/24/2013 01:50 PM Aaron Marcuse-Kubitza

inputs/CTFS/Plot/map.csv: DescriptionOfSite: remapped to locationRemarks, not locality

10393 07/24/2013 01:38 PM Aaron Marcuse-Kubitza

inputs/CTFS/AggregateObservation/: translated multi-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns

10392 07/24/2013 01:24 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: geoscrub_input_new: updated for VegCore-renamed geoscrub_output column names

10391 07/24/2013 01:09 PM Aaron Marcuse-Kubitza

schemas/util.sql: added ?>= operator with is_more_complete_than() function

10390 07/24/2013 12:44 PM Aaron Marcuse-Kubitza

inputs/.geoscrub/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource

10389 07/24/2013 12:15 PM Aaron Marcuse-Kubitza

inputs/.geoscrub/geoscrub_output/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns

10388 07/24/2013 11:18 AM Aaron Marcuse-Kubitza

schemas/util.sql: SQL-language IMMUTABLE functions marked STRICT: removed STRICT to enable dynamic inlining, which speeds up the function by up to 7x. STRICT was not removed where the function was particularly complex and the STRICT optimization would likely be more significant than inlining.

10387 07/24/2013 11:07 AM Aaron Marcuse-Kubitza

bugfix: inputs/BRIT/specimen_flat/postprocess.sql: diameterBreastHeight_cm, height_m: use newly NULL-mapped versions of columns instead of the *_verbatim columns

10386 07/24/2013 11:04 AM Aaron Marcuse-Kubitza

inputs/BRIT/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource

10385 07/24/2013 10:49 AM Aaron Marcuse-Kubitza

inputs/BRIT/specimen_flat/: translated multi-column filters with _join() to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns

10384 07/24/2013 10:43 AM Aaron Marcuse-Kubitza

inputs/BRIT/specimen_flat/map.csv: Habitat_Summary: remapped to UNUSED

10383 07/24/2013 10:16 AM Aaron Marcuse-Kubitza

inputs/BRIT/specimen_flat/postprocess.sql: diameterBreastHeight_cm, height_m: updated runtimes

10382 07/24/2013 10:15 AM Aaron Marcuse-Kubitza

inputs/BRIT/specimen_flat/: DBH_*, Height_*: mapped NULL-equivalent values, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns

10381 07/24/2013 09:27 AM Aaron Marcuse-Kubitza

inputs/.../: translated multi-column filters with _avg() to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns

10380 07/24/2013 08:18 AM Aaron Marcuse-Kubitza

inputs/BRIT/specimen_flat/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns"

10379 07/20/2013 05:25 AM Aaron Marcuse-Kubitza

/README.TXT: Maintenance: added instructions for what to do if http://vegbiendev.nceas.ucsb.edu/phppgadmin/ goes down (sometimes displaying a Not found error)

10378 07/20/2013 05:21 AM Aaron Marcuse-Kubitza

schemas/util.sql: schema comment: added note that IMMUTABLE SQL-language functions should never be declared STRICT, because this prevents them from being inlined. inlining can create a significant speed improvement (7x+), by avoiding function calls and enabling additional constant folding.

10377 07/20/2013 05:09 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: documented total runtime (7.5 min on vegbiendev)

10376 07/20/2013 05:07 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: updated runtimes for map_nulls() inlining, which created a speed improvement of 7x for the numeric columns and 2.5x for the text columns (292563.362->41929.772 ms and 83640.424->35690.797 ms, respectively). note that the map_nulls__coord__*() calls could be optimized further by combining the successive map_nulls() calls into one, with the hstores merged.
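
The speedup factors can be recomputed directly from the runtimes quoted in this entry. The following Python arithmetic uses only those four numbers.

```python
# Runtimes in milliseconds, as quoted in the entry above.
numeric_before, numeric_after = 292563.362, 41929.772
text_before, text_after = 83640.424, 35690.797

numeric_speedup = numeric_before / numeric_after
text_speedup = text_before / text_after

print(f"numeric columns: {numeric_speedup:.2f}x")
print(f"text columns:    {text_speedup:.2f}x")
```

The numeric ratio comes out very close to 7x. The text ratio comes out between 2.3x and 2.4x, so the 2.5x figure in the entry appears to be a generous rounding.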

10375 07/20/2013 04:37 AM Aaron Marcuse-Kubitza

schemas/util.sql: map_nulls(): documented that inputs/REMIB/Specimen/postprocess.sql > country also shows that inlining is now happening properly. note that the speed improvement due to inlining is smaller, percentage-wise, when the values util._map() is run on are long strings instead of the short strings used in the initial profiling. this is because a greater % of the time is spent in system functions such as hstore>text, which are not affected by the inlining because they are run either way.

10374 07/20/2013 04:18 AM Aaron Marcuse-Kubitza

schemas/util.sql: map_nulls(): use new nulls_map(). proper inlining (i.e. same runtime before and after change) has been verified with the following profiling query:
SELECT util.map_nulls(array[1, 2, 3]::text[], v) FROM unnest(array_fill(1, array[100000])) f (v)
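
For readers unfamiliar with null-mapping, here is a rough Python model of what the profiling query exercises. It assumes that map_nulls() turns any value whose text form appears in the list of NULL-equivalents into NULL and passes other values through unchanged; that reading is inferred from the function names and the query above, and the Python functions are stand-ins rather than translations of the SQL.

```python
def nulls_map(nulls):
    """Build a lookup table that sends each NULL-equivalent string to None."""
    return {n: None for n in nulls}


def map_nulls(nulls, value):
    """Return None if `value` is one of the NULL-equivalents, else `value` unchanged."""
    if value is None:
        return None
    # dict.get() returns the stored None when the key exists, else the default.
    return nulls_map(nulls).get(str(value), value)


# Scaled-down analogue of the profiling query: every input value is 1.
mapped = [map_nulls(["1", "2", "3"], v) for v in [1] * 5]
kept = map_nulls(["1", "2", "3"], 4)
```

In this model every value in `mapped` becomes None, while a value outside the list, such as 4, is kept as is.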

10373 07/20/2013 04:05 AM Aaron Marcuse-Kubitza

schemas/util.sql: added nulls_map(), for use with _map()

10372 07/20/2013 03:39 AM Aaron Marcuse-Kubitza

lib/runscripts/table.run: postprocess(): added remake action that calls trim_table()

10371 07/20/2013 03:37 AM Aaron Marcuse-Kubitza

lib/runscripts/table.run: added trim_table(), which calls util.trim(regclass, regclass)

10370 07/20/2013 03:23 AM Aaron Marcuse-Kubitza

lib/runscripts/table.run: map_table(): added remake action that calls reset_col_names()

10369 07/20/2013 03:21 AM Aaron Marcuse-Kubitza

lib/runscripts/table.run: added reset_col_names(), which calls util.reset_col_names()

10368 07/20/2013 03:19 AM Aaron Marcuse-Kubitza

bugfix: lib/runscripts/table.run: map_table(): moved $map_table to global var so it can be used by other functions

10367 07/20/2013 03:09 AM Aaron Marcuse-Kubitza

bugfix: lib/runscripts/table.run: postprocess(): don't propagate $remake to remake_VegBIEN_mappings(), since this will cause map.csv to be remade, which is not related to the postprocessing.

10366 07/20/2013 03:08 AM Aaron Marcuse-Kubitza

lib/runscripts/table.run: map_table(): util.set_col_names_with_metadata(): removed unnecessary cast to regclass, which is performed implicitly. this used to be needed when the polymorphic util.rename_cols() was used instead.

10365 07/20/2013 02:57 AM Aaron Marcuse-Kubitza

schemas/util.sql: added trim(), which trims a table to include only original columns, as defined by a map table

10364 07/20/2013 02:53 AM Aaron Marcuse-Kubitza

schemas/util.sql: added derived_cols(), which gets table_'s derived columns (all the columns not in the names table)
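
The relationship between derived_cols() and trim() in the two entries above can be sketched with plain Python collections. The column names below are hypothetical, and the real functions operate on PostgreSQL tables and a map table rather than on lists and dicts.

```python
def derived_cols(table_cols, original_cols):
    """Columns present in the table but absent from the list of original names."""
    originals = set(original_cols)
    return [c for c in table_cols if c not in originals]


def trim(row, original_cols):
    """Keep only the original columns of a row, dropping derived ones."""
    originals = set(original_cols)
    return {c: v for c, v in row.items() if c in originals}


# Hypothetical staging-table row after postprocessing added one derived column.
original = ["lat_deg", "lat_min", "lat_sec"]
row = {"lat_deg": "34", "lat_min": "25", "lat_sec": "12", "latitude_deg": 34.42}

extra = derived_cols(list(row), original)
trimmed = trim(row, original)
```

Trimming in this sense returns the table to its pre-postprocessing shape, which fits the remake action that calls trim_table().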

10363 07/20/2013 02:29 AM Aaron Marcuse-Kubitza

schemas/util.sql: added eval2set()

10362 07/20/2013 02:14 AM Aaron Marcuse-Kubitza

schemas/util.sql: added drop_column()

10361 07/20/2013 01:27 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: map_nulls__*(): turned off STRICT to allow dynamic inlining, which speeds up the mk_derived_col() statements by 5x (342799.823 ms -> 71533.252 ms (6 min -> 1 min) for latitude_sec)

10360 07/19/2013 07:23 PM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: runtimes: updated for vegbiendev, before dynamic inlining. the times are about twice as fast as on starscream, so vegbiendev is faster at whatever is the limiting speed factor (probably not CPU, based on other benchmarks).

10359 07/19/2013 07:05 PM Aaron Marcuse-Kubitza

schemas/util.sql: map_nulls(): documented that due to dynamic inlining, this is just as fast as util._map() which it wraps. dynamic inlining now brings altogether a 40x speed improvement to map_nulls() (4000 ms -> 100 ms), and would likely bring a comparable improvement for other functions that are run repeatedly and call other user-defined functions.

10358 07/19/2013 06:35 PM Aaron Marcuse-Kubitza

bugfix: schemas/util.sql: map_nulls(): updated to use hstore(text[], anyelement), which has replaced hstore(anyarray, anyelement)

10357 07/19/2013 06:30 PM Aaron Marcuse-Kubitza

schemas/util.sql: removed hstore(anyarray, anyelement), which did not support dynamic inlining, to avoid confusion over which hstore() function to use. use new hstore(text[], anyelement) instead (with explicit cast on the keys array if needed).

10356 07/19/2013 06:23 PM Aaron Marcuse-Kubitza

schemas/util.sql: added hstore(text[], anyelement), which dynamically inlines properly, unlike hstore(anyarray, anyelement). this can be selected by explicitly casting the keys array to text[], which now provides a 6x speed improvement (380 ms -> 60 ms) for map_nulls().

10355 07/19/2013 05:31 PM Aaron Marcuse-Kubitza

schemas/util.sql: fix_array(): turned off STRICT to allow dynamic inlining, which speeds up util.map_nulls() by 3x (1500 ms -> 500 ms)

10354 07/19/2013 05:15 PM Aaron Marcuse-Kubitza

schemas/util.sql: array_length(anyarray), array_length(anyarray, dimension integer): turned off STRICT to allow dynamic inlining, which speeds up util.map_nulls(). this requires adding a `CASE WHEN $1 IS NULL THEN NULL` statement to array_length(anyarray, dimension integer) to replace the functionality provided by STRICT.
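
The trade described here, replacing STRICT with an explicit NULL check, can be modeled in Python. The `strict` decorator below imitates PostgreSQL's STRICT behavior (return NULL without running the body when any argument is NULL); the explicit version mirrors the `CASE WHEN $1 IS NULL THEN NULL` guard. Arrays are modeled as nested lists, and all names are illustrative.

```python
import functools


def strict(fn):
    """Imitate STRICT: return None without calling `fn` if any argument is None."""
    @functools.wraps(fn)
    def wrapper(*args):
        if any(a is None for a in args):
            return None
        return fn(*args)
    return wrapper


def _length(arr, dimension):
    """Length of a rectangular nested list along a 1-based dimension."""
    level = arr
    for _ in range(dimension - 1):
        level = level[0]
    return len(level)


@strict
def array_length_strict(arr, dimension):
    return _length(arr, dimension)


def array_length_explicit(arr, dimension):
    # Mirrors `CASE WHEN $1 IS NULL THEN NULL`: only the first argument is checked.
    if arr is None:
        return None
    return _length(arr, dimension)


grid = [[1, 2, 3], [4, 5, 6]]
```

Both versions give the same answers for a NULL array, but the explicit one has no wrapper around the body, which is loosely analogous to what lets the SQL function be inlined.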

10353 07/19/2013 04:41 PM Aaron Marcuse-Kubitza

schemas/util.sql: map_nulls(): turned off STRICT to allow dynamic inlining, which causes a 2x speed improvement. (see r10352 for an explanation of dynamic inlining.) note that turning off STRICT disables NULL-skipping (avoiding running a function when any of its params is NULL), so this should only be done when the NULL-skipping optimization is needed less than dynamic inlining....

10352 07/19/2013 04:23 PM Aaron Marcuse-Kubitza

schemas/util.sql: inlinable IMMUTABLE functions: avoid using config params (e.g. `SET search_path TO util`) because these prevent dynamic inlining (i.e. inlining of a function call with variable instead of constant arguments, by substituting the arguments into the function's body). dynamic inlining can speed up function evaluation significantly, because a (slow) call to a user-defined SQL function is avoided.

10351 07/19/2013 04:15 PM Aaron Marcuse-Kubitza

schemas/vegbien.my.sql: updated for new bin/repl text mode matching, which also affects non-regexps. this causes the replacement of a few more occurrences of PostgreSQL-only one-word typenames with their MySQL equivalents.

10350 07/19/2013 02:26 PM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: runtimes: documented the machine the times are from

10349 07/19/2013 01:52 PM Aaron Marcuse-Kubitza

inputs/REMIB/: switched to new-style import, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "run the following for each datasource"

10348 07/19/2013 11:40 AM Aaron Marcuse-Kubitza

bugfix: bin/repl: text mode: repurpose this to match SQL identifiers, for use by inputs/input.Makefile %/postprocess.sql. %/postprocess.sql is the only place currently using this mode, so this will not affect other scripts.

10347 07/19/2013 10:51 AM Aaron Marcuse-Kubitza

bugfix: inputs/input.Makefile: %/postprocess.sql: need to run bin/repl in text mode (text=1) so that values to match are treated as literal strings rather than regular expressions. this difference is important for column names with spaces or special characters.
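
Why literal matching matters for column names with special characters can be shown with Python's `re` module. This does not reproduce bin/repl; the column name `DBH (cm)` and the helper names are hypothetical.

```python
import re


def repl_regex(text: str, pattern: str, replacement: str) -> str:
    """Replace matches of `pattern`, interpreting it as a regular expression."""
    return re.sub(pattern, lambda m: replacement, text)


def repl_literal(text: str, literal: str, replacement: str) -> str:
    """Replace occurrences of `literal`, treating every character as plain text."""
    return re.sub(re.escape(literal), lambda m: replacement, text)


sql = 'SELECT "DBH (cm)" FROM t'

# As a regex, the parentheses form a group, so the pattern looks for "DBH cm".
as_regex = repl_regex(sql, "DBH (cm)", "dbh")
as_literal = repl_literal(sql, "DBH (cm)", "dbh")
```

Here the regex form silently matches nothing and leaves the statement unchanged, while the literal form performs the intended rename.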

10346 07/19/2013 10:24 AM Aaron Marcuse-Kubitza

bugfix: inputs/Madidi/LocationObservation/map.csv: resolved Notes, Notes 2 -> locationRemarks collision by _alt()ing them together. note that _alt() is fine because only one of these is ever populated.
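
A short Python sketch of the behavior this entry relies on, assuming that _alt() returns the first of its inputs that is not NULL. The function `alt()` and the sample rows are illustrative only.

```python
def alt(*values):
    """Return the first value that is not None, or None if all are missing."""
    for v in values:
        if v is not None:
            return v
    return None


# Hypothetical (Notes, Notes 2) pairs where at most one field is populated.
rows = [("steep slope", None), (None, "near river"), (None, None)]
location_remarks = [alt(notes, notes2) for notes, notes2 in rows]
```

Because only one of the two fields is ever populated, no information is lost by taking the first non-NULL value. If both could be populated, the second value would be dropped.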

10345 07/19/2013 09:54 AM Aaron Marcuse-Kubitza

bugfix: schemas/util.sql: set_col_names(): need to generate error if destination column already exists (rather than suppressing it with try_create()), because this indicates a collision

10344 07/19/2013 09:30 AM Aaron Marcuse-Kubitza

bugfix: inputs/Madidi/IndividualObservation/map.csv: removed derived column FieldFamilyFullName#originalFamily, which should not be in the map table because it can contain only columns that are initially in the table before running postprocess.sql

10343 07/19/2013 09:23 AM Aaron Marcuse-Kubitza

schemas/util.sql: map table: added unique constraint on the to column as well, because the destination names also need to be distinct in order to be a valid set of column names
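
The reason both columns of the map table need to be unique can be illustrated with a small Python check. The function name `check_col_map()` is hypothetical, and the colliding example reuses the Notes, Notes 2 -> locationRemarks case mentioned in r10346.

```python
from collections import Counter


def check_col_map(pairs):
    """Raise ValueError if any source or destination column name is repeated.

    `pairs` is a list of (from_name, to_name) tuples.
    """
    columns = (("from", [f for f, _ in pairs]), ("to", [t for _, t in pairs]))
    for label, names in columns:
        dups = sorted(n for n, count in Counter(names).items() if count > 1)
        if dups:
            raise ValueError(f"duplicate '{label}' names: {', '.join(dups)}")


colliding = [("Notes", "locationRemarks"), ("Notes 2", "locationRemarks")]
try:
    check_col_map(colliding)
    collision_error = None
except ValueError as e:
    collision_error = str(e)
```

Two source columns renamed to the same destination cannot both exist in one table, so surfacing the collision as an error is more useful than suppressing it.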

10342 07/19/2013 09:14 AM Aaron Marcuse-Kubitza

schemas/util.sql: map table: changed pkey to a unique constraint so pgAdmin would sort the entries in table order (matching the order they are in the staging table) instead of alphabetized by the pkey

10341 07/19/2013 08:56 AM Aaron Marcuse-Kubitza

bugfix: inputs/REMIB/Specimen/map.csv: state: changed output column name to stateProvince_verbatim to match the renaming in postprocess.sql

10340 07/19/2013 08:40 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed out-of-date rerun time, which applied to doing all the deletes in the same statement (however, the current rerun time is approximately the same). note that index scans are not actually used (as the previous comment incorrectly stated) because the conditions for this filter are prefix-less regexps.

10339 07/19/2013 08:32 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns". null-mapping filters now use wrappers around new util.map_nulls(). note that the verbatim columns input to the filters need to be renamed to avoid name collisions with their filtered columns, which must be VegCore terms for new-style import.

10338 07/19/2013 07:53 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: also filter out non-numbers for long_sec, lat_min, lat_sec

10337 07/19/2013 07:18 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: remove rows where long_min is not a number

10336 07/19/2013 07:15 AM Aaron Marcuse-Kubitza

inputs/REMIB/Specimen/postprocess.sql: change E'' to regular '' to avoid the need to double \ (instead ' would be doubled). E'' used to be necessary in previous versions of PostgreSQL to avoid a warning about escape string syntax.