bin/my2pg: replace MySQL ` quotes with " quotes to support exports that were generated without ANSI_QUOTES mode. (this replacement only applies to schema exports, not data.) ANSI_QUOTES is only available with mysqldump --compatible modes that also include NO_TABLE_OPTIONS, which omits important table options such as comments. in particular, these comments are part of schemas/VegCore/VegCore.ERD.mwb but were not being included in VegCore.my.sql.
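A minimal sketch of the quote replacement described in the entry above, written in Python for illustration only (it is not bin/my2pg's actual code). The function name `backticks_to_ansi` is hypothetical. It assumes schema-only input, so no backticks occur inside string literals, and identifiers that contain no embedded backticks or double quotes.

```python
import re

# Matches a MySQL backtick-quoted identifier with no embedded backticks.
_BACKTICK_IDENT = re.compile(r"`([^`]*)`")


def backticks_to_ansi(sql: str) -> str:
    """Rewrite `ident` as "ident" (hypothetical helper, schema text only)."""
    return _BACKTICK_IDENT.sub(r'"\1"', sql)


converted = backticks_to_ansi("CREATE TABLE `taxon_string` (`id` int NOT NULL);")
print(converted)
```

Applying this to data exports would be unsafe, because a backtick inside a string value would also be rewritten, which is consistent with the entry limiting the replacement to schema exports.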
schemas/VegCore/VegCore.ERD.mwb: taxon_string: removed parsed_taxon_assertion field, since there may be more than one parsing (TNRS result) for a given taxon_string. the parsing relationship can better be represented by adding a parsed_taxon_assertion whose taxon_assertion.string points to the parsed taxon_string. getting the parsed_taxon_assertion for a taxon_string now requires joining on parsed_taxon_assertion using a backwards instead of forwards fkey, and filtering the corresponding assertions to include only the ones for TNRS (of the desired TNRS version). documented that taxon_assertion.string was previously the concatenated matched name, but is now the TNRS input name. the concatenated matched name is still in parsed_taxon_assertion.matched_taxon_concept->:taxon_name.unique_name.
schemas/VegCore/VegCore.my.sql: regenerated from .mwb schema, which apparently reverses the order of the fkeys (possibly a Linux MySQL bug?)
inputs/SpeciesLink/Specimen/map.csv: remapped Darwin Core synonyms to DUPLICATE. this avoids the need to translate these to postprocessing derived columns for new-style import, and also speeds up column-based import because there are fewer automatic _alts to perform to resolve filter-less collisions. the svn diff was verified by replacing DUPLICATE#of:dwc_terms<term>#... with <term>, removing the comment, and checking that this removes the diff (except where VegCore has renamed a DwC term).
bugfix: inputs/SpeciesLink/Specimen/map.csv: *scientificName: remapped to scientificName instead of taxonName to match the DwC term's name (this is the same dwc_terms_scientificName mismapping that was fixed in r10434)
bugfix: inputs/SpeciesLink/Specimen/map.csv: dwc_terms_scientificName: remapped to scientificName instead of taxonName to match that DwC term name, as well as the mappings of other *scientificName terms
inputs/SpeciesLink/Specimen/map.csv: marked dwc_geospatial_VerbatimLatitude,Longitude as exact duplicates of dwc_terms_*
inputs/SpeciesLink/Specimen/map.csv: remapped identical _alt-ed fields to DUPLICATE. this avoids the need to translate these to postprocessing derived columns for new-style import, and also speeds up column-based import because there are fewer automatic _alts to perform to resolve filter-less collisions.
bugfix: inputs/SpeciesLink/Specimen/map.csv: *CollectorNumber: moved these to the same _alt group as recordNumber, because they are actually duplicates
correction: inputs/SpeciesLink/Specimen/map.csv: FieldNumber: fixed incorrect comment that these fields are identical to recordNumber. they have the same meaning but not the same values: each value is stored under one or the other of the two terms. the previous conclusion had been based on an incorrect query, which used != instead of the NULL-sensitive IS NOT DISTINCT FROM.
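The NULL pitfall mentioned in the entry above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL, so SQLite's null-safe `IS` operator stands in for `IS NOT DISTINCT FROM`; the table, column names, and sample values are hypothetical.

```python
import sqlite3

# Each value is stored under one term or the other, never compared to a NULL-free row.
rows = [("12", "12"), ("34", None), (None, "56")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimen (record_number TEXT, field_number TEXT)")
conn.executemany("INSERT INTO specimen VALUES (?, ?)", rows)

# != yields NULL (treated as false) whenever either side is NULL,
# so rows with a missing value are silently skipped.
naive_mismatches = conn.execute(
    "SELECT count(*) FROM specimen WHERE record_number != field_number"
).fetchone()[0]

# NOT (a IS b) is SQLite's null-safe test for "the two values differ".
null_safe_mismatches = conn.execute(
    "SELECT count(*) FROM specimen WHERE NOT (record_number IS field_number)"
).fetchone()[0]

print(naive_mismatches, null_safe_mismatches)
conn.close()
```

With the plain `!=` comparison the two columns look identical (no mismatches are reported), while the null-safe comparison finds the rows where only one of the two terms is populated.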
planning/timeline/timeline.2013.xls: Adding derived columns: extended to overlap with all subtasks
planning/timeline/timeline.2013.xls: Geoscrubbing: split into separate re-run and automated pipeline tasks
planning/timeline/timeline.2013.xls: moved Data provider validations before Adding derived columns because ensuring that the source data is in the database is more important than the derived data, which can always be added later
planning/timeline/timeline.2013.xls: Data provider validations: added dot in July because some amount of datasource-level validation happens when mapping issues are discovered during the refactoring
bugfix: inputs/*/*/map.csv for specimen tables: remapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself rather than to any parent event (specimens don't have a parent event)
inputs/*/*/map.csv for IndividualObservation tables: also mapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself in addition to any parent event which it may be a part of
bugfix: inputs/XAL/Specimen/, NY/Ecatalog_all/: *JulianDay: remapped to dayOfYear instead of day (the day of the month)
inputs/SpeciesLink/Specimen/map.csv: remapped *dayOfYear-related terms to UNUSED
bugfix: inputs/SpeciesLink/Specimen/map.csv: remapped conceptual_darwin_2003_1_0_JulianDay, dwc_dwcore_DayOfYear to dayOfYear instead of day (the day of the month)
mappings/VegCore.htm: regenerated from wiki. added dayOfYear (=julianDay), which is different from startDayOfYear/endDayOfYear.
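The distinction behind the dayOfYear entries above (day of the year versus day of the month) can be shown with a short sketch; the date used is an arbitrary example, not a value from any datasource.

```python
from datetime import date

collected = date(2013, 7, 25)  # arbitrary example date

day_of_month = collected.day                  # what the "day" term holds
day_of_year = collected.timetuple().tm_yday   # what dayOfYear / JulianDay holds

print(day_of_month, day_of_year)
```

Mapping a JulianDay-style column to day would therefore store values up to 366 in a field meant to hold 1 to 31.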
inputs/CTFS/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource
inputs/CTFS/StemObservation/: translated collisions (missing filters) to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
planning/timeline/timeline.2013.xls: rebalanced tasks across the remaining months, taking into account priority changes made in the conference call (e.g. that we should not be handling people's individual data requests (Brad, wiki.vegpath.org/2013-07-25_conference_call#Decisions-made))
planning/timeline/timeline.2013.xls: updated with additional tasks added in conference call: translate source-specific derived columns to plain SQL, flatten the datasources, automated geoscrubbing pipeline
planning/goals/BIEN_3_derived_data_products_NormalizedDB_only.docx: removed BIEN species-level phylogeny, which Brad says is out of scope for the BIEN DB
removed planning/workflow/bien3_architecture.odp because the current version is now in bien3_architecture.pptx
added planning/workflow/validation/TNRS_results.ppt symlink to inputs/test_taxonomic_names/_scrub/TNRS_results.ppt
inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: highlighted the sample row and related rows
inputs/test_taxonomic_names/_scrub/TNRS_results.xls: moved arrows to TNRS_results.ppt so they can be changed more easily
inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: TNRS.tnrs: added diagram labels for the various names and steps
inputs/test_taxonomic_names/_scrub/TNRS_results.xls: use "Poa annua var. eriolepis"->"Poaceae Poa annua L." as the synonym example instead of "Poa annua fo. lanuginosa"->"Poaceae Poa annua var. annua" because the input name is simpler and it's closer to the beginning of the list
inputs/test_taxonomic_names/_scrub/run: exports/make(): tnrs.csv: include Name_matched instead of Genus_matched+Specific_epithet_matched because this also contains lower ranks, which are used in the TNRS synonymizing
inputs/test_taxonomic_names/_scrub/TNRS_results.ppt: added annotations explaining the import steps
added inputs/test_taxonomic_names/_scrub/TNRS_results.ppt, containing the *.png screenshots with tables labeled
added inputs/test_taxonomic_names/_scrub/*.png, screenshots of the TNRS_results.xls tabs (LibreOffice does not preserve the formatting when pasting a spreadsheet to a PowerPoint as a table, and the table editing options are limited)
added inputs/test_taxonomic_names/_scrub/TNRS_results.xls with formatted versions of the *.csv tables
inputs/test_taxonomic_names/_scrub/run: exports/make(): subset the columns to include only the most important to demo how the data is represented
lib/sh/db.sh: mk_select(): support passing $cols as array instead of SQL string, which is easier to enter in a shell script (fewer quotes, \, etc.)
lib/sh/db.sh: added cols2list()
lib/sh/util.sh: added is_array()
inputs/test_taxonomic_names/_scrub/run: exports/make(): allow specifying an explicit columns list for each table using cols=... (initially set to all columns)
added inputs/test_taxonomic_names/_scrub/*.csv exports
added inputs/test_taxonomic_names/_scrub/run, which exports the test_scrub-populated tables to CSV
lib/sh/db_make.sh: added pg_export_table_to_dir(), pg_export_tables_to_dir(). unlike db.sh pg_export_table_to_dir_no_header(), these functions are make-aware and will not clobber an existing file.
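A rough sketch of the no-clobber behavior described in the entry above, in Python rather than the shell used by lib/sh/db_make.sh. The helper name `export_if_missing` and the file contents are hypothetical, and it assumes that "make-aware" here means skipping a target that already exists.

```python
import os
import tempfile


def export_if_missing(path, make_contents):
    """Write the file only if it does not exist yet; return True if written."""
    if os.path.exists(path):
        return False  # existing export is left untouched
    with open(path, "w") as f:
        f.write(make_contents())
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "tnrs.csv")
    first = export_if_missing(target, lambda: "a,b\n1,2\n")
    second = export_if_missing(target, lambda: "clobbered\n")
    with open(target) as f:
        contents = f.read()

print(first, second)
```

The second call is a no-op, so a hand-edited or previously generated export survives a rerun.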
reran inputs/test_taxonomic_names/test_scrub, which generates the public.test_taxonomic_names sample schema
inputs/CTFS/Plot/map.csv: DescriptionOfSite: remapped to locationRemarks, not locality
inputs/CTFS/AggregateObservation/: translated multi-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
schemas/vegbien.sql: geoscrub_input_new: updated for VegCore-renamed geoscrub_output column names
schemas/util.sql: added ?>= operator with is_more_complete_than() function
inputs/.geoscrub/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource
inputs/.geoscrub/geoscrub_output/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
schemas/util.sql: SQL-language IMMUTABLE functions marked STRICT: removed STRICT to enable dynamic inlining, which speeds up the function by up to 7x. STRICT was not removed where the function was particularly complex and the STRICT optimization would likely be more significant than inlining.
bugfix: inputs/BRIT/specimen_flat/postprocess.sql: diameterBreastHeight_cm, height_m: use newly NULL-mapped versions of columns instead of the *_verbatim columns
inputs/BRIT/: switched to new-style import, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource
inputs/BRIT/specimen_flat/: translated multi-column filters with _join() to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
inputs/BRIT/specimen_flat/map.csv: Habitat_Summary: remapped to UNUSED
inputs/BRIT/specimen_flat/postprocess.sql: diameterBreastHeight_cm, height_m: updated runtimes
inputs/BRIT/specimen_flat/: DBH_*, Height_*: mapped NULL-equivalent values, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
inputs/.../: translated multi-column filters with _avg() to postprocessing derived columns, using the steps at wiki.vegpath.org/Adding_new-style_import_to_a_datasource#Translating-filters-to-postprocessing-derived-columns
inputs/BRIT/specimen_flat/: translated single-column filters to postprocessing derived columns, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "translate single-column filters to postprocessing derived columns"
/README.TXT: Maintenance: added instructions for what to do if http://vegbiendev.nceas.ucsb.edu/phppgadmin/ goes down (sometimes displaying a Not found error)
schemas/util.sql: schema comment: added note that IMMUTABLE SQL-language functions should never be declared STRICT, because this prevents them from being inlined. inlining can create a significant speed improvement (7x+), by avoiding function calls and enabling additional constant folding.
inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: documented total runtime (7.5 min on vegbiendev)
inputs/REMIB/Specimen/postprocess.sql: map_nulls() derived cols: updated runtimes for map_nulls() inlining, which created a speed improvement of 7x for the numeric columns and about 2.3x for the text columns (292563.362->41929.772 ms and 83640.424->35690.797 ms, respectively). note that the map_nulls__coord__*() calls could be optimized further by combining the successive map_nulls() calls into one, with the hstores merged.
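The speedup factors can be recomputed directly from the before/after timings quoted in the entry above; this sketch only does that arithmetic, and the dictionary keys are illustrative labels.

```python
# (before, after) runtimes in milliseconds, as quoted in the entry.
timings_ms = {
    "numeric columns": (292563.362, 41929.772),
    "text columns": (83640.424, 35690.797),
}

# Speedup factor = runtime before inlining / runtime after inlining.
speedups = {name: before / after for name, (before, after) in timings_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```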
schemas/util.sql: map_nulls(): documented that inputs/REMIB/Specimen/postprocess.sql > country also shows that inlining is now happening properly. note that the speed improvement due to inlining is smaller, percentage-wise, when the values util._map() is run on are long strings instead of the short strings used in the initial profiling. this is because a greater percentage of the time is spent in system functions such as hstore>text, which are not affected by the inlining because they are run either way.
schemas/util.sql: map_nulls(): use new nulls_map(). proper inlining (i.e. same runtime before and after change) has been verified with the following profiling query: SELECT util.map_nulls(array[1, 2, 3]::text[], v) FROM unnest(array_fill(1, array[100000])) f (v)
schemas/util.sql: added nulls_map(), for use with _map()
lib/runscripts/table.run: postprocess(): added remake action that calls trim_table()
lib/runscripts/table.run: added trim_table(), which calls util.trim(regclass, regclass)
lib/runscripts/table.run: map_table(): added remake action that calls reset_col_names()
lib/runscripts/table.run: added reset_col_names(), which calls util.reset_col_names()
bugfix: lib/runscripts/table.run: map_table(): moved $map_table to global var so it can be used by other functions
bugfix: lib/runscripts/table.run: postprocess(): don't propagate $remake to remake_VegBIEN_mappings(), since this will cause map.csv to be remade, which is not related to the postprocessing.
lib/runscripts/table.run: map_table(): util.set_col_names_with_metadata(): removed unnecessary cast to regclass, which is performed implicitly. this used to be needed when the polymorphic util.rename_cols() was used instead.
schemas/util.sql: added trim(), which trims a table to include only original columns, as defined by a map table
schemas/util.sql: added derived_cols(), which gets table_'s derived columns (all the columns not in the names table)
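The relationship between derived_cols() and trim() described in these entries amounts to a set difference on column names. A minimal sketch on plain Python lists (the real functions operate on database tables; the column names below are hypothetical, and the names table is assumed to list exactly the original columns):

```python
def derived_cols(table_cols, original_names):
    """Columns present in the table but absent from the names (map) table."""
    original = set(original_names)
    return [c for c in table_cols if c not in original]


def trim(table_cols, original_names):
    """Drop the derived columns, keeping only the original ones, in table order."""
    derived = set(derived_cols(table_cols, original_names))
    return [c for c in table_cols if c not in derived]


table_cols = ["country_verbatim", "state_verbatim", "country", "state"]
original_names = ["country_verbatim", "state_verbatim"]

derived = derived_cols(table_cols, original_names)
trimmed = trim(table_cols, original_names)
print(derived, trimmed)
```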
schemas/util.sql: added eval2set()
schemas/util.sql: added drop_column()
inputs/REMIB/Specimen/postprocess.sql: map_nulls__*(): turned off STRICT to allow dynamic inlining, which speeds up the mk_derived_col() statements by 5x (342799.823 ms -> 71533.252 ms (6 min -> 1 min) for latitude_sec)
inputs/REMIB/Specimen/postprocess.sql: runtimes: updated for vegbiendev, before dynamic inlining. the times are about twice as fast as on starscream, so vegbiendev is faster at whatever is the limiting speed factor (probably not CPU, based on other benchmarks).
schemas/util.sql: map_nulls(): documented that due to dynamic inlining, this is just as fast as util._map() which it wraps. dynamic inlining now brings altogether a 40x speed improvement to map_nulls() (4000 ms -> 100 ms), and would likely bring a comparable improvement for other functions that are run repeatedly and call other user-defined functions.
bugfix: schemas/util.sql: map_nulls(): updated to use hstore(text[], anyelement), which has replaced hstore(anyarray, anyelement)
schemas/util.sql: removed hstore(anyarray, anyelement), which did not support dynamic inlining, to avoid confusion over which hstore() function to use. use new hstore(text[], anyelement) instead (with explicit cast on the keys array if needed).
schemas/util.sql: added hstore(text[], anyelement), which dynamically inlines properly, unlike hstore(anyarray, anyelement). this can be selected by explicitly casting the keys array to text[], which now provides a 6x speed improvement (380 ms -> 60 ms) for map_nulls().
schemas/util.sql: fix_array(): turned off STRICT to allow dynamic inlining, which speeds up util.map_nulls() by 3x (1500 ms -> 500 ms)
schemas/util.sql: array_length(anyarray), array_length(anyarray, dimension integer): turned off STRICT to allow dynamic inlining, which speeds up util.map_nulls(). this requires adding a `CASE WHEN $1 IS NULL THEN NULL` statement to array_length(anyarray, dimension integer) to replace the functionality provided by STRICT.
schemas/util.sql: map_nulls(): turned off STRICT to allow dynamic inlining, which causes a 2x speed improvement. (see r10352 for an explanation of dynamic inlining.) note that turning off STRICT disables NULL-skipping (avoiding running a function when all its params are NULL), so it should only be used when the NULL-skipping optimization is needed less than dynamic inlining.
schemas/util.sql: inlinable IMMUTABLE functions: avoid using config params (e.g. `SET search_path TO util`) because these prevent dynamic inlining (i.e. inlining of a function call with variable instead of constant arguments, by substituting the arguments into the function's body). dynamic inlining can speed up function evaluation significantly, because a (slow) call to a user-defined SQL function is avoided.
schemas/vegbien.my.sql: updated for new bin/repl text mode matching, which also affects non-regexps. this causes the replacement of a few more occurrences of PostgreSQL-only one-word typenames with their MySQL equivalents.
inputs/REMIB/Specimen/postprocess.sql: runtimes: documented the machine the times are from
inputs/REMIB/: switched to new-style import, using the steps at wiki.vegpath.org/Switching_to_new-style_import#stage-I-source-specific > "run the following for each datasource"
bugfix: bin/repl: text mode: repurpose this to match SQL identifiers, for use by inputs/input.Makefile %/postprocess.sql. %/postprocess.sql is the only place currently using this mode, so this will not affect other scripts.
bugfix: inputs/input.Makefile: %/postprocess.sql: need to run bin/repl in text mode (text=1) so that values to match are treated as literal strings rather than regular expressions. this difference is important for column names with spaces or special characters.
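Why literal matching matters for column names with spaces or special characters can be seen in a small sketch. It uses Python's re module instead of bin/repl, and the column name and SQL text are hypothetical.

```python
import re

column = "height (m)"  # hypothetical column name with regexp metacharacters
sql = 'SELECT "height (m)" FROM specimen_flat'

# Treated as a regexp, "(m)" is a capture group, so the pattern looks for
# the text "height m" and never matches the parenthesized name.
as_regex = re.sub(column, "height_m", sql)

# Escaping the name makes every character match literally.
as_literal = re.sub(re.escape(column), "height_m", sql)

print(as_regex)
print(as_literal)
```

The regexp version leaves the statement unchanged without reporting any error, which is the kind of silent miss that literal (text) mode avoids.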
bugfix: inputs/Madidi/LocationObservation/map.csv: resolved Notes, Notes 2 -> locationRemarks collision by _alt()ing them together. note that _alt() is fine because only one of these is ever populated.
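A sketch of why combining the two Notes columns is lossless when only one is ever populated. It assumes _alt() returns its first non-NULL argument (like SQL's COALESCE), which the log does not state explicitly; the `alt` helper and the sample rows are hypothetical.

```python
def alt(*values):
    """Return the first value that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


rows = [
    {"Notes": "steep slope", "Notes 2": None},
    {"Notes": None, "Notes 2": "near river"},
    {"Notes": None, "Notes 2": None},
]

location_remarks = [alt(r["Notes"], r["Notes 2"]) for r in rows]
print(location_remarks)
```

If both columns could be populated in the same row, the second value would be dropped, so this approach relies on the property noted in the entry.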
bugfix: schemas/util.sql: set_col_names(): need to generate error if destination column already exists (rather than suppressing it with try_create()), because this indicates a collision
bugfix: inputs/Madidi/IndividualObservation/map.csv: removed derived column FieldFamilyFullName#originalFamily, which should not be in the map table because it can contain only columns that are initially in the table before running postprocess.sql
schemas/util.sql: map table: added unique constraint on the to column as well, because the destination names also need to be distinct in order to be a valid set of column names
schemas/util.sql: map table: changed pkey to a unique constraint so pgAdmin would sort the entries in table order (matching the order they are in the staging table) instead of alphabetized by the pkey
bugfix: inputs/REMIB/Specimen/map.csv: state: changed output column name to stateProvince_verbatim to match the renaming in postprocess.sql
inputs/REMIB/Specimen/postprocess.sql: remove frameshifted rows: removed out-of-date rerun time, which applied to doing all the deletes in the same statement (however, the current rerun time is approximately the same). note that index scans are not actually used (as the previous comment incorrectly stated) because the conditions for this filter are prefix-less regexps.