moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
inputs/GBIF/raw_occurrence_record_plants/map.csv: row_num: remapped to plain *row_num, like the other datasources that have this field
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: rerun time: noted that this is only fast after manual vacuuming of the table (to remove the deleted rows from the index). autovacuum apparently does not run, although it should.
inputs/GBIF/raw_occurrence_record_plants/test.xml.ref: reran test, which added yearCollected/monthCollected/dayCollected
inputs/GBIF/raw_occurrence_record_plants/run: updated import() runtime (same), documented table cleanup runtime (1.5 h)
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: CREATE INDEX ... specimenHolderInstitutions: documented runtime (45 min)
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: documented runtime (3.5 min)
fix: bin/map: put template: comment out the "Put template:" label so that the output is valid XML, and displays properly in a browser rather than showing a syntax error
bugfix: mappings/VegCore-VegBIEN.csv: nest all taxonoccurrences inside a stratum event, so that the parent locationevent is always fully populated before child locationevents point to it. (previously, a stub parent event was created when the child event was imported first, which blocked the fully-populated parent event from being inserted later on.) this uses auto-folding (for VegBank/CVS) and auto-forwarding (for other datasources) to prune empty stratum events for taxonoccurrences that don't have strata. (see wiki.vegpath.org/Auto-folding, wiki.vegpath.org/Auto-forwarding for more info about these normalization techniques.) note that the inserted row counts stay exactly the same for all datasources except VegBank (which was being fixed), indicating that this signficant change to the mappings did not change the semantics of the import of taxonoccurrences.
inputs/*/*/test.xml.ref: updated source.shortname for new datasource name, which now starts out with .new suffix
bugfix: inputs/*/*/map.csv for specimen tables: remapped eventDate,day,month,year to *Collected, because a general date always applies to the observation itself rather than to any parent event (specimens don't have a parent event)
bugfix: inputs/*/*/map.csv (e.g. inputs/GBIF/raw_occurrence_record_plants/map.csv): remapped author to scientificNameAuthorship rather than authors, which it had gotten incorrectly automapped to. note that the VegCore term authors has now been renamed to data_authors to avoid ambiguity, but incorrect automappings resulting from it had not yet been fixed.
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: updated herbaria.ih column names for staging table column renaming
bugfix: inputs/input.Makefile: %/VegBIEN.csv: for new-style datasources, use a symlink to mappings/VegCore-VegBIEN.csv directly instead of prefiltering VegCore-VegBIEN.csv to include only the columns in map.csv. prefiltering used to be performed as part of mapping the map.csv VegCore output terms to VegBIEN using bin/join, but is no longer needed because the staging table columns are now VegCore terms. instead, the full VegCore-VegBIEN.csv is needed so that derived columns added in stage I or II validations are detected by bin/map (rather than just the original source columns in map.csv).
added inputs/GBIF/raw_occurrence_record_plants/.rsync_ignore with filters that have previously needed to be manually added whenever `make inputs/upload` was run
mappings/VegCore-VegBIEN.csv: genus->taxonlabel.taxonomicname: use new _filter_genus() (see r9882)
mappings/VegCore-VegBIEN.csv: genus->taxonlabel.taxonomicname: filter out genera that contain numbers (using new _filter_genus()), which break TNRS and prevent it from matching any other parts of the name. later, these genera can instead be moved to the end of the name, where TNRS will correctly match them as Unmatched_terms.
added inputs/GBIF/raw_occurrence_record_plants/table.tsv.md5
inputs/GBIF/raw_occurrence_record_plants/test.xml.ref: regenerated. updated for new staging table input columns, which are now the same as the output columns.
bugfix: inputs/input.Makefile: %/VegBIEN.csv: use header from map.csv instead of the new columns, so that source.shortname is set to GBIF instead of VegCore
inputs/input.Makefile: %/VegBIEN.csv: when a runscript is available, instead map the output columns of map.csv to VegBIEN, because the columns have been renamed in the staging table
inputs/GBIF/raw_occurrence_record_plants/VegBIEN.csv: regenerated, which adds row_num input col
inputs/GBIF/raw_occurrence_record_plants/run: import() runtime: specified that this does not include table.tsv.gz/make()
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: # duplicates: added revision #
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: documented that there are 4.5 million duplicates (59,998,354 rows before - 55,417,646 rows after = 4,580,708)
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: Remove institutions that we have direct data for: added rerun time (~0 thanks to index, so no problem doing the DELETE each time postprocess.sql is run)
bugfix: inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: updated column names to match the renamings in map.csv, which are now performed on the staging table itself
bugfix: inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: institution_code index: create it idempotently using create_if_not_exists() and an explicit index name, so that a duplicate index doesn't get added each time postprocess.sql is run
inputs/GBIF/raw_occurrence_record_plants/postprocess.sql: add util to the search_path so that postprocess.sql will also work when run by inputs/input.Makefile, which only puts the datasource (GBIF) in the search_path
inputs/GBIF/raw_occurrence_record_plants/run: added import() runtime (5 h)
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv.gz/make() runtime: noted that this excludes the upload time
inputs/GBIF/raw_occurrence_record_plants/run: added table.tsv.gz/upload() runtime (15 min)
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): to view runtime when using `screen`: keys used to scroll: added Ctrl-B/Ctrl-F for page-at-a-time scrolling (there are a lot of pages of output for the import() target!)
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): documented how to view the runtime when using `screen` (press Ctrl-A [ , use up-arrow, and then press Esc to leave copy mode)
inputs/GBIF/raw_occurrence_record_plants/run: herbaria_filter/make(): use new ih_herbarium table instead of the herbaria_filter.ih.csv_ file directly
inputs/GBIF/raw_occurrence_record_plants/run: added ih_herbarium/make(), which stores the IH herbaria
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: table/make(): also filter out rows with a non-plant family (as described at http://vegpath.org/wiki/2013-06-06_conference_call#GBIF-subsetting-fix-raw_occurrence_record-filter-formula), since some institutions have both animal and plant rows, even though they are in IH or in the 80% list. (note that NULL families are OK.)
*{.sh,run}: use mysql instead of mysql_ANSI because mysql is now an alias to mysql_ANSI (since ANSI mode still supports key MySQL features, like `` quotes)
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): documented that incremental output is provided right away with --quick (unbuffered), but takes awhile to become visible in Macfusion sshfs. this can be tested with `while true; do stat inputs/GBIF/raw_occurrence_record_plants/table.tsv; sleep 2; done` running concurrently with `./inputs/GBIF/raw_occurrence_record_plants/run table.tsv/make` on vegbiendev:/home/bien/svn .
inputs/GBIF/raw_occurrence_record_plants/run: table.tsv/make(): use new raw_occurrence_record_plants view from table/make()
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: table/make(): added make of prerequisites
bugfix: inputs/GBIF/raw_occurrence_record_plants/run: table/make(): don't reset $table to plant_fraction_for_herbaria_filter for commands that use $table
inputs/GBIF/raw_occurrence_record_plants/run: added table/make(), which makes the filter view
inputs/GBIF/raw_occurrence_record/: renamed to raw_occurrence_record_plants because it's actually only the plants in raw_occurrence_record, not all of raw_occurrence_record. also, this will allow us to create a separate raw_occurrence_record_plants view whose name matches the folder and does not collide with the raw_occurrence_record table.
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): added runtime, which is ~0 since it just needs to do CSV import and index scans
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): time the population of herbaria_filter
inputs/GBIF/raw_occurrence_record/run: plant_fraction/make(): updated runtime. added rows affected count to runtime so if the number of rows it's related to (in this case, institution_code) changes, the runtime can be expected to change accordingly.
bugfix: inputs/GBIF/raw_occurrence_record/run: plant_fraction/make(): plant_fraction column: COUNT counts non-NULL rather than true values (which counter-intuitively includes false, because it's non-NULL), so need to add NULLIF around the boolean expression to turn it into a NULL-or-not expression. see http://vegpath.org/wiki/2013-06-06_conference_call#GBIF-subsetting-fix-plant_fraction-SQL-bug .
inputs/GBIF/raw_occurrence_record/run: table.tsv.gz/make(): documented runtime (35 min)
bugfix: inputs/GBIF/raw_occurrence_record/run: plant_fraction_for_herbaria_filter/make(): need to make prerequisites first (plant_fraction/make)
bugfix: *run: overriding targets: use new self_make to properly progagate the $remake flag to the overridden target, so that the target itself is not skipped
bugfix: inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): added check_target_exists so table.tsv would not be overwritten if it already existed
inputs/GBIF/raw_occurrence_record/run: plant_fraction/make(): documented runtime (1 hr)
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): removed no longer needed explicit clear of $remake, which is now done by make.sh instead
inputs/GBIF/raw_occurrence_record/run: added herbaria_filter/seal()
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): changed "from IH" to "contains all of IH" because not all rows are now from IH
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): renamed acronym->institution_code to match the column name in raw_occurrence_record rather than in IH
inputs/GBIF/raw_occurrence_record/run: removed no longer used herbaria_filter.plant_fraction.csv_/make(). use plant_fraction_for_herbaria_filter view instead.
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): use the plant_fraction_for_herbaria_filter view directly instead of first exporting it to a CSV
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): if remaking, turn off remake mode after doing this target's rm operations, so that prerequisite targets are not also remade
added inputs/GBIF/raw_occurrence_record/postprocess.sql, which removes institutions that we have direct data for
inputs/GBIF/raw_occurrence_record/run: herbaria_filter/make(): skip table if already exists (unless remaking), like plant_fraction/make()
inputs/GBIF/raw_occurrence_record/run: herbaria_filter.plant_fraction.csv_/make(): use new plant_fraction_for_herbaria_filter view
inputs/GBIF/raw_occurrence_record/run: added plant_fraction_for_herbaria_filter/make(). note that for simplicity, plant_fraction_for_herbaria_filter is a view instead of a table.
inputs/GBIF/raw_occurrence_record/run: .table/(): renamed to */*() because a target named after a table refers to the table unless it has an explicit file extension
inputs/GBIF/raw_occurrence_record/run: plant_fraction.table/*(): renamed to plant_fraction/*() because a target named after a table refers to the table unless it has an explicit file extension
inputs/GBIF/raw_occurrence_record/run: added plant_fraction.table/seal(), which uses new mysql_seal_table()
inputs/GBIF/raw_occurrence_record/run: plant_fraction: added index on plant_fraction for fast extraction of herbaria by fraction threshold
inputs/GBIF/raw_occurrence_record/run: tables: set ENGINE to MyISAM and DEFAULT CHARSET to utf8 to match the other GBIF tables. (note that MyISAM is not the default, but is needed to avoid row sort order problems and other issues with InnoDB.)
inputs/GBIF/raw_occurrence_record/run: plant_fraction.table/make(): in remaking mode, drop the table first
inputs/GBIF/raw_occurrence_record/run: plant_fraction.table/make(): only create and populate the table if it doesn't already exist, to avoid clobbering existing data. the noclobber functionality uses new skip_table(), which is the table analog of require_not_exists().
inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): inline the PRIMARY KEY statement with its column
bugfix: inputs/GBIF/raw_occurrence_record/run: plant_fraction.table/make(): create the table once with "IF NOT EXISTS" and then populate it with INSERT SELECT, to avoid locking it while it's being repopulated. dropping and recreating the table with CREATE TABLE AS prevented phpMyAdmin from even reading the database's tables list, because it was unable to fetch a rowcount for plant_fraction.
inputs/GBIF/raw_occurrence_record/run: herbaria_filter.ih.csv_/make(): don't use any outer limit value, so that all the IH herbaria are always used. this also ensures that the first GBIF rows will be from an IH herbarium.
inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): herbaria_filter: don't explicitly set ENGINE or DEFAULT CHARSET, because these should be set to the database values instead so that collations, etc. match
inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): also include the exported plant_fraction herbaria
inputs/GBIF/raw_occurrence_record/run: added herbaria_filter.plant_fraction.csv_/make(), which exports the plant_fraction herbaria whose plant_fraction >= 0.8
inputs/GBIF/raw_occurrence_record/run: added plant_fraction.table/make(), which contains the plant fraction for each herbarium
bugfix: inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): need to use append=1 with mysql_import so the output table doesn't get re-truncated when additional parts are added
inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): specify the different parts used to create the table in an array
inputs/GBIF/raw_occurrence_record/run: renamed herbaria_filter.csv_ to herbaria_filter.ih.csv_ to allow for other tables that get combined into herbaria_filter
lib/sh/db.sh: mysql_import(): automatically ensure the table is empty (i.e. using truncate()), unless append=1 is specified. extra calls to truncate() now that this happens automatically have also been removed.
inputs/GBIF/raw_occurrence_record/run: dynamically generate herbaria_filter.csv_ from herbaria.ih in new target herbaria_filter.csv_/make()
inputs/GBIF/raw_occurrence_record/run: store the herbaria filter in a MySQL table loaded from a CSV instead of getting it from a hardcoded list of IN (...) values
inputs/GBIF/raw_occurrence_record/run: renamed herbaria.sql to herbaria.data.sql so it wouldn't be added to svn by `make inputs/GBIF/raw_occurrence_record/add` or `make inputs/add`
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): exclude deleted rows (i.e. where the deleted timestamp is non-NULL)
inputs/GBIF/raw_occurrence_record/header.csv: regenerated using ./run. since the table is reimported as a CSV, it uses bin/csv2db, which prepends an additional row_num column.
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): remove explicit cols list to include all cols. the file size of the generated table.tsv will increase by ~3x, but should remain reasonably-sized compared to our available disk space.
bugfix: inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): need \ line continuation after vars so they only apply to the command rather than being set as global vars
inputs/GBIF/raw_occurrence_record/run: moved table.tsv.md5/make() and invocation of it to inputs/GBIF/table.run because it's general to all tables (which would all use table.tsv for this datasource). use $target_filename in calling table.tsv.md5/make from table.tsv/make.
lib/runscripts/table.run: input_make(): renamed to table_make() to make it clear that the target names are relative to the table subdir itself, not the datasrc dir. it was previously called input_make because it used inputs/input.Makefile directly, but now will use any Makefile in the datasrc dir.
inputs/GBIF/raw_occurrence_record/run: table.tsv.md5/make(): don't add extra .md5 extension to $target_filename because it already has the extension as part of the target name (now that this command is run in its own make target rather than in table.tsv/make())
inputs/GBIF/raw_occurrence_record/run: inputs/GBIF/raw_occurrence_record/run: added check_target_exists so you know why make skipped the file (for other, non-silent targets, it would also avoid make's verbose output when the file exists)
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): moved making of table.tsv.md5 to separate function
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): also add md5 sum for table.tsv
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): added back filter kw args, which had gotten deleted in a commit without update (although actually, svn should not allow a commit without update, so the working copy may have gotten corrupted)
inputs/GBIF/raw_occurrence_record/run table.tsv/make() and functions used by it: added usage comments for cmd line usage, caller usage, and declaring function usage
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): cols: also include scientific_name, which is preferable as a TNRS input because it also contains lower ranks
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): cols: also include id, institution_code, collection_code, catalogue_number
inputs/GBIF/raw_occurrence_record/run: table.tsv/make(): added filter for institution_codes in herbaria.ih (in PostgreSQL)