planning/timeline/timeline.2013.xls: Adding derived columns: extended to overlap with all subtasks
planning/timeline/timeline.2013.xls: Geoscrubbing: split into separate re-run and automated pipeline tasks
planning/timeline/timeline.2013.xls: moved Data provider validations before Adding derived columns because ensuring that the source data is in the database is more important than the derived data, which can always be added later
planning/timeline/timeline.2013.xls: Data provider validations: added dot in July because some amount of datasource-level validation happens when mappings issues are discovered during the refactoring
bugfix: inputs/*/*/map.csv for specimen tables: remapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself rather than to any parent event (specimens don't have a parent event)
inputs/*/*/map.csv for IndividualObservation tables: also mapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself in addition to any parent event which it may be a part of
bugfix: inputs/XAL/Specimen/, NY/Ecatalog_all/: *JulianDay: remapped to dayOfYear instead of day (the day of the month)
inputs/SpeciesLink/Specimen/map.csv: remapped *dayOfYear-related terms to UNUSED
bugfix: inputs/SpeciesLink/Specimen/map.csv: remapped conceptual_darwin_2003_1_0_JulianDay, dwc_dwcore_DayOfYear to dayOfYear instead of day (the day of the month)
mappings/VegCore.htm: regenerated from wiki. added dayOfYear (=julianDay), which is different from startDayOfYear/endDayOfYear.
inputs/CTFS/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource
inputs/CTFS/StemObservation/: translated collisions (missing filters) to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
planning/timeline/timeline.2013.xls: rebalanced tasks across the remaining months, taking into account priority changes made in the conference call (e.g. that we should not be handling people's individual data requests (Brad, wiki.vegpath.org/2013-07-25_conference_call#Decisions-made))
planning/timeline/timeline.2013.xls: updated with additional tasks added in conference call: translate source-specific derived columns to plain SQL, flatten the datasources, automated geoscrubbing pipeline
planning/goals/BIEN_3_derived_data_products_NormalizedDB_only.docx: removed BIEN species-level phylogeny, which Brad says is out of scope for the BIEN DB
removed planning/workflow/bien3_architecture.odp because the current version is now in bien3_architecture.pptx
added planning/workflow/validation/TNRS_results.ppt symlink to inputs/test_taxonomic_names/_scrub/TNRS_results.ppt
inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: highlighted the sample row and related rows
inputs/test_taxonomic_names/_scrub/TNRS_results.xls: moved arrows to TNRS_results.ppt so they can be changed more easily
inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: TNRS.tnrs: added diagram labels for the various names and steps
inputs/test_taxonomic_names/_scrub/TNRS_results.xls: use "Poa annua var. eriolepis"->"Poaceae Poa annua L." as the synonym example instead of "Poa annua fo. lanuginosa"->"Poaceae Poa annua var. annua" because the input name is simpler and it's closer to the beginning of the list
inputs/test_taxonomic_names/_scrub/run: exports/make(): tnrs.csv: include Name_matched instead of Genus_matched+Specific_epithet_matched because this also contains lower ranks, which are used in the TNRS synonymizing
inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: added annotations explaining the import steps
added inputs/test_taxonomic_names/_scrub/TNRS_results.ppt, containing the *.png screenshots with tables labeled
added inputs/test_taxonomic_names/_scrub/*.png, screenshots of the TNRS_results.xls tabs (LibreOffice does not preserve the formatting when pasting a spreadsheet to a PowerPoint as a table, and the table editing options are limited)
added inputs/test_taxonomic_names/_scrub/TNRS_results.xls with formatted versions of the *.csv tables
inputs/test_taxonomic_names/_scrub/run: exports/make(): subset the columns to include only the most important to demo how the data is represented
lib/sh/db.sh: mk_select(): support passing $cols as array instead of SQL string, which is easier to enter in a shell script (less quotes, \ , etc.)
lib/sh/db.sh: added cols2list()
lib/sh/util.sh: added is_array()
inputs/test_taxonomic_names/_scrub/run: exports/make(): allow specifying an explicit columns list for each table using cols=... (initially set to all columns)
added inputs/test_taxonomic_names/_scrub/*.csv exports
added inputs/test_taxonomic_names/_scrub/run, which exports the test_scrub-populated tables to CSV
lib/sh/db_make.sh: added pg_export_table_to_dir(), pg_export_tables_to_dir(). unlike db.sh pg_export_table_to_dir_no_header(), these functions are make-aware and will not clobber an existing file.
reran inputs/test_taxonomic_names/test_scrub, which generates the public.test_taxonomic_names sample schema
inputs/CTFS/Plot/map.csv: DescriptionOfSite: remapped to locationRemarks, not locality
inputs/CTFS/AggregateObservation/: translated multi-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
schemas/vegbien.sql: geoscrub_input_new: updated for VegCore-renamed geoscrub_output column names
schemas/util.sql: added ?>= operator with is_more_complete_than() function
inputs/.geoscrub/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource
inputs/.geoscrub/geoscrub_output/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
schemas/util.sql: SQL-language IMMUTABLE functions marked STRICT: removed STRICT to enable dynamic inlining, which speeds up the function up to 7x. STRICT was not removed where the function was particularly complex and the STRICT optimization would likely be more significant than inlining.
bugfix: inputs/BRIT/specimen_flat/postprocess.sql: diameterBreastHeight_cm, height_m: use newly NULL-mapped versions of columns instead of the *_verbatim columns
inputs/BRIT/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource
inputs/BRIT/specimen_flat/: translated multi-column filters with _join() to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
inputs/BRIT/specimen_flat/map.csv: Habitat_Summary: remapped to UNUSED
inputs/BRIT/specimen_flat/postprocess.sql: diameterBreastHeight_cm, height_m: updated runtimes
inputs/BRIT/specimen_flat/: DBH_*, Height_*: mapped NULL-equivalent values, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
inputs/.../: translated multi-column filters with _avg() to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
inputs/BRIT/specimen_flat/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns"
/README.TXT: Maintenance: added instructions for what to do if http://vegbiendev.nceas.ucsb.edu/phppgadmin/ goes down (sometimes displaying a Not found error)
schemas/util.sql: schema comment: added note that IMMUTABLE SQL-language functions should never be declared STRICT, because this prevents them from being inlined. inlining can create a significant speed improvement (7x+), by avoiding function calls and enabling additional constant folding.
inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: documented total runtime (7.5 min on vegbiendev)
inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: updated runtimes for map_nulls() inlining, which created a speed improvement of 7x for the numeric columns and 2.5x for the text columns (292563.362->41929.772 ms and 83640.424->35690.797 ms, respectively). note that the map_nulls__coord__*() calls could be optimized further by combining the successive map_nulls() calls into one, with the hstores merged.
schemas/util.sql: map_nulls(): documented that inputs/REMIB/Specimen/postprocess.sql > country also shows that inlining is now happening properly. note that the speed improvement due to inlining is not as much, %wise, when the values util._map() is run on are long strings instead of the short strings used in the initial profiling. this is because a greater % of the time is spent in system functions such as hstore>text, which are not affected by the inlining because they are run either way.
schemas/util.sql: map_nulls(): use new nulls_map(). proper inlining (i.e. same runtime before and after change) has been verified with the following profiling query:SELECT util.map_nulls(array[1, 2, 3]::text[], v) FROM unnest(array_fill(1, array100000)) f (v)
schemas/util.sql: added nulls_map(), for use with _map()
lib/runscripts/table.run: postprocess(): added remake action that calls trim_table()
lib/runscripts/table.run: added trim_table(), which calls util.trim(regclass, regclass)
lib/runscripts/table.run: map_table(): added remake action that calls reset_col_names()
lib/runscripts/table.run: added reset_col_names(), which calls util.reset_col_names()
bugfix: lib/runscripts/table.run: map_table(): moved $map_table to global var so it can be used by other functions
bugfix: lib/runscripts/table.run: postprocess(): don't propagate $remake to remake_VegBIEN_mappings(), since this will cause map.csv to be remade, which is not related to the postprocessing.
lib/runscripts/table.run: map_table(): util.set_col_names_with_metadata(): removed unnecessary cast to regclass, which is performed implicitly. this used to be needed when the polymorphic util.rename_cols() was used instead.
schemas/util.sql: added trim(), which trims a table to include only original columns, as defined by a map table
schemas/util.sql: added derived_cols(), which gets table_'s derived columns (all the columns not in the names table)
schemas/util.sql: added eval2set()
schemas/util.sql: added drop_column()
inputs/REMIB/Specimen/postprocess.sql: map_nulls__*(): turned off STRICT to allow dynamic inlining, which speeds up the mk_derived_col() statements by 5x (342799.823 ms -> 71533.252 ms (6 min -> 1 min) for latitude_sec)
inputs/REMIB/Specimen/postprocess.sql: runtimes: updated for vegbiendev, before dynamic inlining. the times are about twice as fast as on starscream, so vegbiendev is faster at whatever is the limiting speed factor (probably not CPU, based on other benchmarks).
schemas/util.sql: map_nulls(): documented that due to dynamic inlining, this is just as fast as util._map() which it wraps. dynamic inlining now brings altogether a 40x speed improvement to map_nulls() (4000 ms -> 100 ms), and would likely bring a comparable improvement for other functions that are run repeatedly and call other user-defined functions.
bugfix: schemas/util.sql: map_nulls(): updated to use hstore(text[], anyelement), which has replaced hstore(anyarray, anyelement)
schemas/util.sql: removed hstore(anyarray, anyelement), which did not support dynamic inlining, to avoid confusion over which hstore() function to use. use new hstore(text[], anyelement) instead (with explicit cast on the keys array if needed).
schemas/util.sql: added hstore(text[], anyelement), which dynamically inlines properly, unlike hstore(anyarray, anyelement). this can be selected by explicitly casting the keys array to text[], which now provides a 6x speed improvement (380 ms -> 60 ms) for map_nulls().
schemas/util.sql: fix_array(): turned off STRICT to allow dynamic inlining, which speeds up util.map_nulls() by 3x (1500 ms -> 500 ms)
schemas/util.sql: array_length(anyarray), array_length(anyarray, dimension integer): turned off STRICT to allow dynamic inlining, which speeds up util.map_nulls(). this requires adding a `CASE WHEN $1 IS NULL THEN NULL` statement to array_length(anyarray, dimension integer) to replace the functionality provided by STRICT.
schemas/util.sql: map_nulls(): turned off STRICT to allow dynamic inlining, which causes a 2x speed improvement1. (see r10352 for an explanation of dynamic inlining.) note that turning off STRICT disables NULL-skipping (avoiding running a function when all its params are NULL), so it should only be used when the NULL-skipping optimization is needed less than dynamic inlining....
schemas/util.sql: inlinable IMMUTABLE functions: avoid using config params (e.g. `SET search_path TO util`) because these prevent dynamic inlining (i.e. inlining of a function call with variable instead of constant arguments, by substituting the arguments into the function's body). dynamic inlining can speed up function evaluation significantly, because a (slow) call to a user-defined SQL function is avoided.
schemas/vegbien.my.sql: updated for new bin/repl text mode matching, which also affects non-regexps. this causes the replacement of a few more occurrences of PostgreSQL-only one-word typenames with their MySQL equivalents.
inputs/REMIB/Specimen/postprocess.sql: runtimes: documented the machine the times are from
inputs/REMIB/: switched to new-style import, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "run the following for each datasource"
bugfix: bin/repl: text mode: repurpose this to match SQL identifiers, for use by inputs/input.Makefile %/postprocess.sql. %/postprocess.sql is the only place currently using this mode, so this will not affect other scripts.
bugfix: inputs/input.Makefile: %/postprocess.sql: need to run bin/repl in text mode (text=1) so that values to match are treated as literal strings rather than regular expressions. this difference is important for column names with spaces or special characters.
bugfix: inputs/Madidi/LocationObservation/map.csv: resolved Notes, Notes 2 -> locationRemarks collision by _alt()ing them together. note that _alt() is fine because only one of these is ever populated.
bugfix: schemas/util.sql: set_col_names(): need to generate error if destination column already exists (rather than suppressing it with try_create()), because this indicates a collision
bugfix: inputs/Madidi/IndividualObservation/map.csv: removed derived column FieldFamilyFullName#originalFamily, which should not be in the map table because it can contain only columns that are initially in the table before running postprocess.sql
schemas/util.sql: map table: added unique constraint on the to column as well, because the destination names also need to be distinct in order to be a valid set of column names
schemas/util.sql: map table: changed pkey to a unique constraint so pgAdmin would sort the entries in table order (matching the order they are in the staging table) instead of alphabetized by the pkey
bugfix: inputs/REMIB/Specimen/map.csv: state: changed output column name to stateProvince_verbatim to match the renaming in postprocess.sql
inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed out-of-date rerun time, which applied to doing all the deletes in the same statement (however, the current rerun time is approximately the same). note that index scans are not actually used (as the previous comment incorrectly stated) because the conditions for this filter are prefix-less regexps.
inputs/REMIB/Specimen/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns". null-mapping filters now use wrappers around new util.map_nulls(). note that the verbatim columns input to the filters need to be renamed to avoid name collisions with their filtered columns, which must be VegCore terms for new-style import.
inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: also filter out non-numbers for long_sec, lat_min, lat_sec
inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: remove rows where long_min is not a number
inputs/REMIB/Specimen/postprocess.sql: change E'' to regular '' to avoid the need to double \ (instead ' would be doubled). E'' used to be necessary in previous versions of PostgreSQL to avoid a warning about escape string syntax.
inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed unnecessary () around `DELETE FROM :table WHERE long_deg ...`
inputs/REMIB/Specimen/postprocess.sql: removed coll_year, country, long_deg indexes because the frameshift filter conditions on these columns do not use index scans (because their regexp patterns do not contain a fixed prefix). eventually, some regexp patterns may be able to be modified to use prefixes.
bugfix: inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: can't OR together conditions to determine rows to delete, because if any condition is NULL instead of true/false, this will NULL out the entire WHERE condition and prevent any other true conditions from causing a deletion. the best way to fix this is to use a separate DELETE statement for each condition, so that NULLs only impact that particular condition's DELETE. unlike using a modified, NULL-insensitive OR, which would prevent the use of index scans, this allows indexes to be used for conditions that support them.
inputs/REMIB/Specimen/postprocess.sql: removed duplicate CREATE INDEX for the acronym column
bugfix: inputs/REMIB/Specimen/postprocess.sql: switched back to the input column names, since the renaming to *_verbatim is part of a later step
inputs/REMIB/Specimen/create.sql: moved filtering out of frameshifted rows to postprocess.sql, where it can happen in an idempotent DELETE. this allows filters to remove additional rows to easily be added on top of the existing filters, without needing to remake Specimen (which takes a long time, because of the many stage I derived columns that get added). the logical inversion inherent in the DELETE condition has been factored through rather than wrapped in NOT (...), because removal of frameshifted rows is more accurately specified as the detection of specific patterns that indicate frameshifting rather than the validation of all fields.