bugfix: *.sql: public.source_by_shortname(): need to wrap it in a nested SELECT because Postgres incorrectly does not constant-fold (inline) it, leading to a slowdown when it is therefore run many times. this is done using the steps at wiki.vegpath.org/Postgres_queries#wrap-function-call-in-nested-SELECT .
inputs/input.Makefile: add!: verify/: also svn:ignore *.tsv, *.txt
moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
inputs/GBIF/raw_occurrence_record_plants/map.csv: row_num: remapped to plain *row_num, like the other datasources that have this field
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: rerun time: noted that this is only fast after manual vacuuming of the table (to remove the deleted rows from the index). autovacuum apparently does not run, although it should.
inputs/GBIF/raw_occurrence_record_plants/test.xml.ref: reran test, which added yearCollected/monthCollected/dayCollected
inputs/GBIF/raw_occurrence_record_plants/run: updated import() runtime (same), documented table cleanup runtime (1.5 h)
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: CREATE INDEX ... specimenHolderInstitutions: documented runtime (45 min)
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: documented runtime (3.5 min)
**/new_terms.csv, unmapped_terms.csv updated (using `make missing_mappings`)
copyright scrub: inputs/: removed data provider-owned schema and documentation files, which are not BIEN copyright and should not be part of what is submitted for open-sourcing. these files will remain accessible via the web interface (fs.vegpath.org), but will not be in the repository.
added inputs/GBIF/_src/0001000-131106143450413.zip.md5, GBIFPortalDB-2013-09-10.dump.gz.md5
inputs/GBIF/_src/GBIFPortalDB-2013-09-10.dump.gz.url: documented download time (5.5 h for an 18 GB file)
inputs/GBIF/_src/0001000-131106143450413.zip.url: documented download time (only 2 h for an 18 GB file)
added inputs/GBIF/_src/0001000-131106143450413.zip.url (DwC-A export), GBIFPortalDB-2013-09-10.dump.gz.url (raw data), portal_26_feb_2013.war.url (raw data portal)
inputs/GBIF/: added LOA files: _src/use_conditions/LetterOfAgreement_template.doc, BIEN LoA agreement annex.docx
fix: bin/map: put template: comment out the "Put template:" label so that the output is valid XML, and displays properly in a browser rather than showing a syntax error
bugfix: mappings/VegCore-VegBIEN.csv: nest all taxonoccurrences inside a stratum event, so that the parent locationevent is always fully populated before child locationevents point to it. (previously, a stub parent event was created when the child event was imported first, which blocked the fully-populated parent event from being inserted later on.) this uses auto-folding (for VegBank/CVS) and auto-forwarding (for other datasources) to prune empty stratum events for taxonoccurrences that don't have strata. (see wiki.vegpath.org/Auto-folding, wiki.vegpath.org/Auto-forwarding for more info about these normalization techniques.) note that the inserted row counts stay exactly the same for all datasources except VegBank (which was being fixed), indicating that this signficant change to the mappings did not change the semantics of the import of taxonoccurrences.
inputs/*/*/test.xml.ref: updated source.shortname for new datasource name, which now starts out with .new suffix
inputs/{.,}*/*.schema.sql: regenerated using the instructions in bin/my2pg. this primarily replaces timestamp with text/*timestamp*/ (to preserve indefinite dates).
bugfix: inputs/*/*/map.csv for specimen tables: remapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself rather than to any parent event (specimens don't have a parent event)
bugfix: inputs/*/*/map.csv (e.g. inputs/GBIF/raw_occurrence_record_plants/map.csv): remapped author to scientificNameAuthorship rather than authors, which it had gotten incorrectly automapped to. note that the VegCore term authors has now been renamed to data_authors to avoid ambiguity, but incorrect automappings resulting from it had not yet been fixed.
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: updated herbaria.ih column names for staging table column renaming
bugfix: inputs/GBIF/table.run: need to include lib/runscripts/mysql.table.run instead of table.run (table.run was accidentally substituted when inputs/.NCBI/table.run was copied to all new-style datasources
inputs/*/Source/VegBIEN.csv: regenerated for new-style import, which uses a symlink to mappings/VegCore-VegBIEN.csv instead of a custom mapping using the original column names
inputs/*/*/map.csv for CSV tables with a row_num column: added missing row_num entry, which is needed by the staging table column renaming to make the order of the map.csv columns match the order in the staging table
bugfix: inputs/*/Source/map.csv: added missing row_num entry, which is needed by the staging table column renaming to make the order of the map.csv columns match the order in the staging table. the staging table column renaming is now used by all Source tables.
inputs/*/: added table.run for use by the table subdirs in new-style import. datasources without table subdirs do not need this.
bugfix: inputs/input.Makefile: %/VegBIEN.csv: for new-style datasources, use a symlink to mappings/VegCore-VegBIEN.csv directly instead of prefiltering VegCore-VegBIEN.csv to include only the columns in map.csv. prefiltering used to be performed as part of mapping the map.csv VegCore output terms to VegBIEN using bin/join, but is no longer needed because the staging table columns are now VegCore terms. instead, the full VegCore-VegBIEN.csv is needed so that derived columns added in stage I or II validations are detected by bin/map (rather than just the original source columns in map.csv).
bugfix: inputs/*/Source/data.csv for new-style datasources: need to include a blank row (plus a blank header) so that the metadata values are imported at least once instead of zero times, now that there is an installed staging table that will be iterated over. the blank row did not used to be necessary, because db_xml.put_table() has a special case for metadata-only tables with no installed table, which avoids iterating over the table's rows.
inputs/*/Source/ for new-style datasources: use an actual staging table instead of a metadata-only table, so that metadata values can be stored in the staging table instead of the map.csv (as will be required by new-style import)
added inputs/GBIF/_archive/
removed inputs/GBIF/Specimen/, which has been replaced by the refresh in raw_occurrence_record_plants/
added inputs/GBIF/map.csv, used to regenerate inputs/GBIF/raw_occurrence_record_plants/map.csv when raw_occurrence_record_plants is resubset
inputs/GBIF/run: inherit from lib/runscripts/datasrc_dir.run, which uses import_order.txt to forward calls to the subdirs
added blank runscripts inputs/GBIF/Source/run, Specimen/run because they are in import_order.txt (used by lib/runscripts/datasrc_dir.run)
added inputs/GBIF/_src/.rsync_filter.upload,download to prevent old versions of GBIFPortalDB-*.dump.gz from being downloaded to the local machine, while keeping them on jupiter. this avoids the need to store these files in ~/Documents/BIEN/large_files/ with symlinks from inputs/GBIF/_src/ to exclude them from the sync.
added inputs/GBIF/raw_occurrence_record_plants/.rsync_ignore with filters that have previously needed to be manually added whenever `make inputs/upload` was run
added inputs/GBIF/_MySQL/.rsync_ignore with filters from /README.TXT > Maintenance > to synchronize vegbiendev, jupiter, and your local machine. these filters will now be used with bin/sync_upload in addition to the periodic backup commands.
mappings/VegCore-VegBIEN.csv: genus->taxonlabel.taxonomicname: use new _filter_genus() (see r9882)
added inputs/GBIF/_MySQL/GBIFPortalDB-2013-02-20.data.0.preamble.sql
mappings/VegCore-VegBIEN.csv: genus->taxonlabel.taxonomicname: filter out genera that contain numbers (using new _filter_genus()), which break TNRS and prevent it from matching any other parts of the name. later, these genera can instead be moved to the end of the name, where TNRS will correctly match them as Unmatched_terms.
added inputs/GBIF/raw_occurrence_record_plants/table.tsv.md5
inputs/GBIF/raw_occurrence_record_plants/test.xml.ref: regenerated. updated for new staging table input columns, which are now the same as the output columns.
bugfix: inputs/input.Makefile: %/VegBIEN.csv: use header from map.csv instead of the new columns, so that source.shortname is set to GBIF instead of VegCore
inputs/input.Makefile: %/VegBIEN.csv: when a runscript is available, instead map the output columns of map.csv to VegBIEN, because the columns have been renamed in the staging table
inputs/GBIF/raw_occurrence_record_plants/VegBIEN.csv: regenerated, which adds row_num input col
bugfix: inputs/GBIF/import_order.txt, run: updated raw_occurrence_record/ to raw_occurrence_record_plants/
inputs/GBIF/raw_occurrence_record_plants/run: import() runtime: specified that this does not include table.tsv.gz/make()
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: # duplicates: added revision #
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: documented that there are 4.5 million duplicates (59,998,354 rows before - 55,417,646 rows after = 4,580,708)
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: added rerun time (~0 thanks to index, so no problem doing the DELETE each time postprocess.sql is run)
*{.sh,run}: use simpler .rel() instead of `. "$(dirname "${BASH_SOURCE0}")"/...` for relative includes
bugfix: inputs/GBIF/_MySQL/MySQL_schema, MySQL_data: sed: put {} commands on their own line to work on Mac
bugfix: inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: updated column names to match the renamings in map.csv, which are now performed on the staging table itself
bugfix: inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: institution_code index: create it idempotently using create_if_not_exists() and an explicit index name, so that a duplicate index doesn't get added each time postprocess.sql is run
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: add util to the search_path so that postprocess.sql will also work when run by inputs/input.Makefile, which only puts the datasource (GBIF) in the search_path
inputs/GBIF/raw_occurrence_record_plants/run: added import() runtime (5 h)
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv.gz/make() runtime: noted that this excludes the upload time
inputs/GBIF/raw_occurrence_record_plants/run: added table.tsv.gz/upload() runtime (15 min)
added lib/runscripts/mysql.table.run (general to all MySQL datasources) and use it in inputs/GBIF/table.run
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): to view runtime when using `screen`: keys used to scroll: added Ctrl-B/Ctrl-F for page-at-a-time scrolling (there are a lot of pages of output for the import() target!)
bugfix: inputs/GBIF/table.run: table.tsv.gz/make(): don't run table.tsv.gz/upload in test mode, to avoid clobbering the backup of a full table.tsv with a partial, testing table.tsv
bugfix: inputs/GBIF/table.run: table.tsv.gz/upload(): don't use inplace mode because it leaves a newer mtime when aborted, causing rsync to think that the partial upload is actually newer than the source. note that rsync's --partial-dir mode is just as capable of resuming an aborted upload (it will just use a file in .rsync-tmp instead). inplace mode is primarily designed for fixed-offset files which don't change much between edits, but this is not true for exports (or the gzips of them), which will change the file offsets of most data if even one row or column is added or removed.
bugfix: inputs/GBIF/table.run: table.tsv.gz/make(): run table.tsv.gz/upload here instead of in table.tsv/make() because it should not run until table.tsv.gz is finished being made, which is not the case in table.tsv/make() because table.tsv.gz/make is run in the background
inputs/GBIF/table.run: table.tsv.gz/upload(): moved before table.tsv.gz/make() so it can be used by it
bugfix: inputs/GBIF/table.run: table.tsv.gz/upload(): need overwrite=1 because the mtime of an aborted inplace upload is newer
inputs/GBIF/table.run: table.tsv*/upload(): renamed to table.tsv.gz/upload() to upload only table.tsv.gz, not table.tsv, in order to save bandwidth
bugfix: inputs/GBIF/table.run: table.tsv*/upload(): need to run put in live mode (live=1)
inputs/GBIF/table.run: table.tsv/make(): run table.tsv*/upload when the file make is done so that the file is backed up to jupiter
inputs/GBIF/table.run: added table.tsv*/upload()
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): documented how to view the runtime when using `screen` (press Ctrl-A [ , use up-arrow, and then press Esc to leave copy mode)
inputs/GBIF/raw_occurrence_record_plants/run: herbaria_filter/make(): use new ih_herbarium table instead of the herbaria_filter.ih.csv_ file directly
inputs/GBIF/raw_occurrence_record_plants/run: added ih_herbarium/make(), which stores the IH herbaria
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: table/make(): also filter out rows with a non-plant family (as described at http://vegpath.org/wiki/2013-06-06_conference_call#GBIF-subsetting-fix-raw_occurrence_record-filter-formula), since some institutions have both animal and plant rows, even though they are in IH or in the 80% list. (note that NULL families are OK.)
*{.sh,run}: use mysql instead of mysql_ANSI because mysql is now an alias to mysql_ANSI (since ANSI mode still supports key MySQL features, like `` quotes)
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): documented that incremental output is provided right away with --quick (unbuffered), but takes awhile to become visible in Macfusion sshfs. this can be tested with `while true; do stat inputs/GBIF/raw_occurrence_record_plants/table.tsv; sleep 2; done` running concurrently with `./inputs/GBIF/raw_occurrence_record_plants/run table.tsv/make` on vegbiendev:/home/bien/svn .
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): use new raw_occurrence_record_plants view from table/make()
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: table/make(): added make of prerequisites
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: table/make(): don't reset $table to plant_fraction_for_herbaria_filter for commands that use $table
inputs/GBIF/raw_occurrence_record_plants/run: added table/make(), which makes the filter view
inputs/GBIF/raw_occurrence_record/: renamed to raw_occurrence_record_plants because it's actually only the plants in raw_occurrence_record, not all of raw_occurrence_record. also, this will allow us to create a separate raw_occurrence_record_plants view whose name matches the folder and does not collide with the raw_occurrence_record table.
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): added runtime, which is ~0 since it just needs to do CSV import and index scans
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): time the population of herbaria_filter
inputs/GBIF/raw_occurrence_record/run: plant_fraction/make(): updated runtime. added rows affected count to runtime so if the number of rows it's related to (in this case, institution_code) changes, the runtime can be expected to change accordingly.
bugfix: inputs/GBIF/raw_occurrence_record/run: plant_fraction/make(): plant_fraction column: COUNT counts non-NULL rather than true values (which counter-intuitively includes false, because it's non-NULL), so need to add NULLIF around the boolean expression to turn it into a NULL-or-not expression. see http://vegpath.org/wiki/2013-06-06_conference_call#GBIF-subsetting-fix-plant_fraction-SQL-bug .
inputs/GBIF/raw_occurrence_record/run: table.tsv.gz/make(): documented runtime (35 min)
inputs/GBIF/table.run: table.tsv/make(): remake table.tsv.gz/make() after table.tsv is made
inputs/GBIF/table.run: added table.tsv.gz/make()
lib/runscripts/subdir.run: auto-append -remake to all targets in remake mode
*{.sh,run}: make() calls: removed no longer applicable silent=1, which is now handled automatically by the log_level mechanism
inputs/GBIF/table.run: table.tsv.md5/make(): only use %-remake if remaking
bugfix: inputs/GBIF/raw_occurrence_record/run: plant_fraction_for_herbaria_filter/make(): need to make prerequisites first (plant_fraction/make)
bugfix: inputs/GBIF/table.run: table.tsv.md5/make(): use %-remake to ensure that the .md5 file is remade, regardless of the .md5 file's mtime relative to table.tsv. you would generally expect table.tsv's new mtime to be newer than the .md5 file's (thus triggering make to run), but if you e.g. ran svn up after making the table.tsv, this might not be the case.
inputs/GBIF/table.run: table.tsv/make(): use new set_large_table to prevent table.tsv from being deleted on error for full export runs (while still deleting it on error for sample subset runs)
bugfix: inputs/GBIF/_MySQL/GBIFPortalDB-2013-02-20.data.sql.run: ^.preamble.sql/make(): need to run set_make_vars even though the make vars are not used, because set_make_vars sets $_remake, which is needed by self_make
bugfix: *run: overriding targets: use new self_make to properly progagate the $remake flag to the overridden target, so that the target itself is not skipped
bugfix: inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): added check_target_exists so table.tsv would not be overwritten if it already existed
bugfix: inputs/GBIF/table.run: table.tsv/make(): force table.tsv.md5 to be remade (using remake=1) because the table.tsv contents will have changed
inputs/GBIF/raw_occurrence_record/run: plant_fraction/make(): documented runtime (1 hr)