/ - Changes - BIEN 3 - NCEAS Projects

root @ 9533

#	Date	Author	Comment
9533	05/23/2013 08:22 PM	Aaron Marcuse-Kubitza	added lib/sh/binsearch.sh
9532	05/23/2013 06:27 PM	Aaron Marcuse-Kubitza	bugfix: README.TXT: Full database import: screen: need to unset TMOUT and set version after running `screen` rather than before, so they take effect within the `screen` shell
9531	05/23/2013 06:25 PM	Aaron Marcuse-Kubitza	README.TXT: Full database import: after running `screen`: run `set -o ignoreeof` to prevent Ctrl+D from exiting `screen` to keep attached jobs
9530	05/23/2013 04:40 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: documented how to estimate total runtime. note that our tnrs_db wrapper in inputs/.TNRS/tnrs/tnrs.make uses inputs/.TNRS/tnrs/logs/tnrs.make.log.sql as the log file.
9529	05/23/2013 03:33 PM	Aaron Marcuse-Kubitza	inputs/.TNRS/schema.sql, data.sql: updated TNRS CSV columns to preserve Name_matched_accepted_family even though it isn't present in the current TNRS CSVs. this way, Name_matched_accepted_family can still be used for previously-scrubbed names, and family_matched can be added back to analytical_stem_view. (now that bin/tnrs_db uses an explicit columns list in COPY TO, the absence of a column in the CSV is no longer a problem.)
9528	05/23/2013 03:28 PM	Aaron Marcuse-Kubitza	README.TXT: updating TNRS CSV columns: use the entire "COPY tnrs ..." statement instead of just the body of it so that the explicit columns list is included. this way, the COPY statement will cause an error if the TNRS schema was changed but inputs/.TNRS/data.sql was not yet updated.
9527	05/23/2013 03:00 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed unused imports
9526	05/23/2013 02:55 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: cumulative_tnrs_profiler: use tnrs.tnrs_request()'s new cumulative_profiler param instead of doing the profiling manually. this also ensures that there isn't extra time between when the cumulative profiler starts/stops and when the per-request profiler starts/stops (because Profiler's new add_subprofiler() method is used).
9525	05/23/2013 02:53 PM	Aaron Marcuse-Kubitza	lib/tnrs.py: single_tnrs_request(): added support for a cumulative profiler using the cumulative_profiler kw param
9524	05/23/2013 02:53 PM	Aaron Marcuse-Kubitza	lib/profiling.py: Profiler: added add_subprofiler(), for use with cumulative profilers
9523	05/23/2013 02:48 PM	Aaron Marcuse-Kubitza	lib/profiling.py: Profiler: added add_time() and use it instead of `self.total +=`
9522	05/23/2013 02:38 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: tnrs_profiler: renamed to cumulative_tnrs_profiler to distinguish it from the tnrs_profiler used by tnrs.tnrs_request(), which just profiles the current request
9521	05/23/2013 02:36 PM	Aaron Marcuse-Kubitza	bugfix: bin/tnrs_db: cumulative profiler: use len(names) instead of this_ct (cur.rowcount) in case the actual # rows fetched differed from the rowcount
9520	05/23/2013 02:32 PM	Aaron Marcuse-Kubitza	lib/tnrs.py: repeated_tnrs_request(): renamed to tnrs_request() since this is the function that should usually be used, to ensure that debugging information is output in the case of an error. (the TNRS request must be made again to output this information.)
9519	05/23/2013 02:30 PM	Aaron Marcuse-Kubitza	lib/tnrs.py: tnrs_request(): renamed to single_tnrs_request() to distinguish it from repeated_tnrs_request()
9518	05/23/2013 02:25 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed no longer used $wait flag (which caused tnrs_db to wait max_pause for new rows to be added), because tnrs_db is now invoked automatically after each import by the import_scrub target (in inputs/input.Makefile) and does not need to run as a daemon. note that when scrub is invoked, it is possible that a previous datasource's import has already scrubbed the names for this import, because tnrs_db runs until all rows in tnrs_input_name are scrubbed....
9517	05/23/2013 02:14 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed no longer needed explicit population of the Time_submitted, which is now done automatically by the tnrs table. however, this requires starting the transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.
9516	05/23/2013 02:09 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: start transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. these may differ by several minutes if TNRS is slow. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request, removing the explicit setting of Time_submitted, and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.
9515	05/23/2013 02:05 PM	Aaron Marcuse-Kubitza	bugfix: bin/tnrs_db: wrap just the TNRS request and the storing of the response data in a function (undoing part of r9514), because the transaction start time for Time_submitted should not be until the TNRS request is actually made (it often takes several minutes to materialize the next set of input names on a full DB)
9514	05/23/2013 01:56 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: Iterate over unscrubbed verbatim taxonlabels: put loop body in a function (which returns whether or not the loop should continue), so that the loop body can easily be wrapped in a transaction using sql.with_savepoint()
9513	05/23/2013 01:19 PM	Aaron Marcuse-Kubitza	inputs/.TNRS/schema.sql: tnrs.Time_submitted: set default to now() (the timestamp of the start of the current transaction, http://www.postgresql.org/docs/9.1/static/functions-datetime.html) so that it would automatically be populated when rows are added. note that because the start of the current transaction instead of the exact time at insertion is used, all rows inserted in the same transaction (e.g. as part of the same batch) will have the same value for this, linking them together.
9512	05/23/2013 01:10 PM	Aaron Marcuse-Kubitza	inputs/.TNRS/schema.sql: tnrs_populate_derived_fields(): renamed to tnrs_populate_fields() so it can be used to populate other fields as well
9511	05/23/2013 01:07 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed no longer needed explicit appending of derived cols, and instead use append_csv()'s new support for importing CSVs whose columns are a subset of the full table
9510	05/23/2013 12:56 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: ColInsertFilters: use the simpler literal value option for the mk_value param
9509	05/23/2013 12:55 PM	Aaron Marcuse-Kubitza	lib/csvs.py: ColInsertFilter: support using a literal value instead of a function for the mk_value param, since this is the most common use case
9508	05/23/2013 12:43 PM	Aaron Marcuse-Kubitza	lib/sql_io.py: append_csv(): support importing CSVs whose columns are a subset of the full table and/or in a different order. when the header exactly matches the columns, the explicit column list will still be omitted as an optimization. this uses code from r4927.
9507	05/23/2013 12:15 PM	Aaron Marcuse-Kubitza	bugfix: lib/runscripts/util.run: need to include sh/make.sh for all runscripts that use make-style commands
9506	05/23/2013 12:12 PM	Aaron Marcuse-Kubitza	*{.sh,run}: use new top_make instead of `make --directory="$top_dir"`
9505	05/23/2013 12:11 PM	Aaron Marcuse-Kubitza	lib/sh/make.sh: added top_make()
9504	05/23/2013 11:54 AM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Postprocessing: populated entries for analytical DB for last 4 imports, and for backup, backup test for last import. note that the combined import time for the last import is 3.5 days, compared to 3 days for the column-based import portion.
9503	05/22/2013 11:47 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Postprocessing: added (empty) entries for analytical DB, backup, backup test
9502	05/21/2013 11:18 PM	Aaron Marcuse-Kubitza	inputs/GBIF/Specimen/postprocess.sql, inputs/REMIB/Specimen/postprocess.sql: updated for providers in r9459, which adds TEX
9501	05/21/2013 11:10 PM	Aaron Marcuse-Kubitza	inputs///postprocess.sql: Remove institutions that we have direct data for: query to obtain list: updated for current schema
9500	05/21/2013 10:49 PM	Aaron Marcuse-Kubitza	inputs/import.stats.xls: Updated import times. GBIF has been refreshed (with the range modeling column subset), and column-based import now takes 3 days for 88.4 million rows.
9499	05/21/2013 10:27 PM	Aaron Marcuse-Kubitza	README.TXT: Full database import: added warning to perform every single step listed, to avoid breaking column-based import
9498	05/21/2013 10:26 PM	Aaron Marcuse-Kubitza	README.TXT: Full database import: Publish the new import: added warning to be sure you have done every single verification step before proceeding. otherwise, a previous valid import could incorrectly be overwritten with a broken one.
9497	05/21/2013 09:07 PM	Aaron Marcuse-Kubitza	bugfix: README.TXT: Full database import: To run TNRS/remake analytical DB: need to run `export version=<version>` before the command which uses it rather than after
9496	05/21/2013 08:26 PM	Aaron Marcuse-Kubitza	added backups/*.md5
9495	05/21/2013 08:22 PM	Aaron Marcuse-Kubitza	added backups/TNRS.2013-5-21.backup.md5
9494	05/21/2013 07:42 PM	Aaron Marcuse-Kubitza	README.TXT: Datasource setup: For MySQL inputs: For .sql exports: added steps to grant privileges to the bien user. the privileges list excludes UPDATE, DELETE, ALTER, DROP to prevent bugs in the import scripts from accidentally deleting data.
9493	05/21/2013 07:37 PM	Aaron Marcuse-Kubitza	inputs/.TNRS/schema.sql, data.sql: updated for new TNRS CSV columns (see bug at https://pods.iplantcollaborative.org/jira/browse/TNRS-183). note that these columns may eventually change back (comment by Naim at https://pods.iplantcollaborative.org/jira/browse/TNRS-183#comment-34444).
9492	05/21/2013 07:33 PM	Aaron Marcuse-Kubitza	README.TXT: Full database import: added steps to check that TNRS ran successfully, and fix errors (due to column changes in the TNRS CSV) if it didn't
9491	05/21/2013 07:24 PM	Aaron Marcuse-Kubitza	inputs/test_taxonomic_names/test_scrub: use sh's -e (errexit) mode so errors in an invoked script cause the script to abort instead of burying the error in more output
9490	05/21/2013 07:19 PM	Aaron Marcuse-Kubitza	inputs/test_taxonomic_names/test_scrub: documented that `make schemas/"$public"/uninstall` removes the previous results (since it may be confusing why it's prompting the user to uninstall the schema that is an output of the program)
9489	05/21/2013 07:16 PM	Aaron Marcuse-Kubitza	inputs/test_taxonomic_names/test_scrub: don't need to run the import twice anymore because the accepted names are now included in the tnrs_input_name view that TNRS runs on
9488	05/21/2013 07:09 PM	Aaron Marcuse-Kubitza	inputs/test_taxonomic_names/test_scrub: updated for current TNRS schema
9487	05/21/2013 06:47 PM	Aaron Marcuse-Kubitza	bugfix: inputs/test_taxonomic_names/test_scrub: unset $n so it doesn't limit the # rows. it is set to 2 in the default test environment, so must be unset for n-sensitive programs that should be unlimited.
9486	05/21/2013 06:40 PM	Aaron Marcuse-Kubitza	inputs/test_taxonomic_names/test_scrub: updated for current TNRS schema
9485	05/21/2013 01:44 PM	Aaron Marcuse-Kubitza	inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): also include the exported plant_fraction herbaria
9484	05/21/2013 01:43 PM	Aaron Marcuse-Kubitza	inputs/GBIF/raw_occurrence_record/run: added herbaria_filter.plant_fraction.csv_/make(), which exports the plant_fraction herbaria whose plant_fraction >= 0.8
9483	05/21/2013 01:42 PM	Aaron Marcuse-Kubitza	inputs/GBIF/raw_occurrence_record/run: added plant_fraction.table/make(), which contains the plant fraction for each herbarium
9482	05/21/2013 01:37 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: added mk_drop()
9481	05/21/2013 01:00 PM	Aaron Marcuse-Kubitza	lib/sh/util.sh: to_file(): log $stdout so users can tell which file is being created by the command. for some reason, can't use `local redirs=(">$stdout")` because the redirections don't seem to be applied. can't yet use `log+ -2 echo_vars stdout` because log+ does not yet support negative adjustments (they cause PS4 to be emptied out before being re-prepended to).
9480	05/21/2013 12:54 PM	Aaron Marcuse-Kubitza	bugfix: lib/sh/util.sh: log+(): adjustment < 0: need to enclose -$1 in $(()) so it gets evaluated before being used as an array index
9479	05/21/2013 12:16 PM	Aaron Marcuse-Kubitza	lib/sh/local.sh: psql(): documented that --output is actually for query results, not echoed statements (and thus must be redirected back to fd 1 while fd 1 with the statements gets sent to the logging port)
9478	05/21/2013 12:14 PM	Aaron Marcuse-Kubitza	lib/sh/local.sh: psql(): documented why can't use fd 11
9477	05/21/2013 12:09 PM	Aaron Marcuse-Kubitza	lib/sh/local.sh: use @redirs instead of manual redirection to set up --output fd, so that the redirection will be echoed along with the command. for some reason, this requires switching to fd 13 instead of 11, because fd 11 gives a "/dev/fd/11: Bad file descriptor" error when 11 is set with exec right before the command instead of on the subshell the command is executed in. (13 was chosen rather than 12 because *2 is for errors, while 3 (or 3**) is for logging.)
9476	05/21/2013 04:18 AM	Aaron Marcuse-Kubitza	bugfix: lib/sh/db.sh: pg_export_table_to_dir_no_header(): inlined $(pg_header) so setting $cols wouldn't affect pg_export_table_no_header(), which uses it as a kw param
9475	05/20/2013 10:44 PM	Aaron Marcuse-Kubitza	bugfix: lib/sh/util.sh: to_file(): require_not_exists check: missing `test` in `if "$if_not_exists"`
9474	05/20/2013 10:39 PM	Aaron Marcuse-Kubitza	lib/sh/util.sh: command(): log the function call using echo_func to assist debugging. (use a higher log_level because it's internal.)
9473	05/20/2013 09:29 PM	Aaron Marcuse-Kubitza	lib/sh/util.sh: command(): support custom redirections, which will be echoed along with the command
9472	05/20/2013 08:48 PM	Aaron Marcuse-Kubitza	lib/sh/util.sh: to_file(): reworded confusing || conditional for require_not_exists into an if statement
9471	05/20/2013 08:21 PM	Aaron Marcuse-Kubitza	bugfix: inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): need to use append=1 with mysql_import so the output table doesn't get re-truncated when additional parts are added
9470	05/20/2013 07:28 PM	Aaron Marcuse-Kubitza	bugfix: lib/sh/db.sh: load new aliases before mk_select(), which uses mk_table_esc
9469	05/20/2013 07:27 PM	Aaron Marcuse-Kubitza	lib/runscripts/table.run: include make.sh so runscripts based on it can use make-related utils
9468	05/20/2013 06:52 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: added mk_select() and use it in mk_select_var
9467	05/20/2013 06:46 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: added limit() and use it instead of `${limit:+LIMIT $limit}`
9466	05/20/2013 06:44 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: added mysql_truncate() and use it instead of `mk_truncate|mysql_ANSI`
9465	05/20/2013 06:42 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: truncate(): renamed to mk_truncate() because it actually just creates a TRUNCATE statement, rather than also executing it
9464	05/20/2013 06:38 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: use_local/use_remote: unset $prefix after using it so it isn't unintentionally applied as a kw param for a later function
9463	05/20/2013 04:18 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: mk_select: renamed to mk_select_var since it actually sets a var in the local context rather than returning a query
9462	05/20/2013 03:40 PM	Aaron Marcuse-Kubitza	inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): specify the different parts used to create the table in an array
9461	05/20/2013 03:19 PM	Aaron Marcuse-Kubitza	inputs/GBIF/raw_occurrence_record/run: renamed herbaria_filter.csv_ to herbaria_filter.ih.csv_ to allow for other tables that get combined into herbaria_filter
9460	05/20/2013 03:13 PM	Aaron Marcuse-Kubitza	bugfix: lib/sh/db.sh: mk_select: ensure newline before LIMIT clause, in case caller provided custom query which did not have trailing newline
9459	05/17/2013 06:00 PM	Aaron Marcuse-Kubitza	bugfix: mappings/VegCore-VegBIEN.csv: place.geovalid: added missing /1 after _alt
9458	05/17/2013 05:55 PM	Aaron Marcuse-Kubitza	bugfix: lib/sql.py: parse_exception(): typed_name_re: added back matching of names without "", since these are used by some error messages (ones that contain () after the function name)
9457	05/17/2013 05:41 PM	Aaron Marcuse-Kubitza	bugfix: lib/sql.py: parse_exception(): typed_name_re: need to allow " within the matched name, since there are now "" around the entire identifier that was passed to Postgres, which may itself include " . always require "" around the matched name, to ensure that the whole name is matched by .+? e.g. when followed by () for a function call. the version of Postgres we currently use apparently no longer has error messages without the "", so we don't need a separate regexp for quoted and unquoted names.
9456	05/17/2013 03:43 PM	Aaron Marcuse-Kubitza	lib/sh/db.sh: mysql_import(): automatically ensure the table is empty (i.e. using truncate()), unless append=1 is specified. extra calls to truncate() now that this happens automatically have also been removed.
9455	05/17/2013 01:13 PM	Aaron Marcuse-Kubitza	bin/map: by_col: ensure verbosity is at least 2 in live mode (using new ints.set_min() instead of max() for clarity). documented that live column-based import MUST be run with verbosity 2+ (3 preferred) to provide debugging information for often-complex errors. without this, debugging is effectively impossible.
9454	05/17/2013 01:08 PM	Aaron Marcuse-Kubitza	added lib/ints.py with renamings of max()->set_min(), min()->set_max() for easier understandability of the set-ceiling/set-floor use cases of min()/max()
9453	05/17/2013 12:57 PM	Aaron Marcuse-Kubitza	bin/map: Set default verbosity: by_col: documented that showing all queries is primarily to assist debugging, not profiling
9452	05/17/2013 11:59 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: logging: named it `log++`
9451	05/17/2013 11:59 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: logging: verbosities: level 0: documented that log++ also suppresses external command output for full support of cron jobs
9450	05/17/2013 11:57 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: logging: documented `make` equivalents of the various verbosities, where available. (many of the verbosities, such as level 1, are sorely needed in make to avoid excessive output.) verbosities (and `make` equivalents): 0: just print errors. useful for cron jobs....
9449	05/17/2013 04:03 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: die_e(): benign errors: increase log_level so that a benign non-zero exit status will only be displayed at debug verbosities (2+) (it is confusing otherwise)
9448	05/17/2013 03:36 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: try(): always run the command with benign_error=1 so that any die_e() doesn't prematurely indicate that a particular exit status was an error
9447	05/17/2013 03:34 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: die_e(): support benign errors using $benign_error flag that should be logged as info messages instead of errors
9446	05/17/2013 03:30 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: die(): documented that msg can't use $() (because it would reset $?)
9445	05/17/2013 03:19 AM	Aaron Marcuse-Kubitza	inputs/bien_web/observation/VegBIEN.csv, unmapped_terms.csv: regenerated
9444	05/17/2013 03:01 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: command(): 2>&$err_fd: add to _redirs after echoing command so it isn't echoed at the end of every command (since this redirection is frequently applied)
9443	05/17/2013 02:55 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: sed: use case statement instead of test to determine flag letter, to easily allow matching multiple `uname` OSes or adding additional flag letters
9442	05/17/2013 02:46 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: die(): documented that its msg can use $?, because it has not yet been overridden by another command
9441	05/17/2013 02:45 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: die_e(): use die(), which performs the necessary save_e/rethrow. this requires using $? instead of $e for the exit status, because $e has not yet been set.
9440	05/17/2013 02:42 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: inlined log_e() into die_e() because that's the only place it's used
9439	05/17/2013 02:37 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: command(): print "command exited with error" message using new die_e() if command returns false. this requires removing manual die_e()/log_e() calls elsewhere.
9438	05/17/2013 02:34 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: command(): moved increase of indent inside () so that error-handling statements after () will use the outer log_level
9437	05/17/2013 02:31 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: added die_e(), which logs that a command exited with an error
9436	05/17/2013 02:18 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: command(): determine redirections before echoing the command so they can be logged along with the command, instead of as separate exec statements. (these had a higher log_level to avoid cluttering the output with `exec` lines, which usually suppressed the redirections completely.) inline the command__set_fds() nested func so the redirections are all in one place.
9435	05/17/2013 01:54 AM	Aaron Marcuse-Kubitza	lib/sh/util.sh: use simpler `if can_log; then indent; fi` instead of `can_log && indent || true`. however, the `&& indent || true` syntax is still required in aliases such as echo_func which need to allow prefixing the command with a wrapper command or kw param assignments.
9434	05/16/2013 09:28 PM	Aaron Marcuse-Kubitza	inputs/GBIF/raw_occurrence_record/run: dynamically generate herbaria_filter.csv_ from herbaria.ih in new target herbaria_filter.csv_/make()
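
Note on r9454 and r9455: the set_min()/set_max() helpers are described above only as renamings of max()/min(), so the following is a minimal sketch of what they might look like, assuming they are thin wrappers around the builtins. The argument names and their order are assumptions; the actual lib/ints.py is not shown in this log.

```python
# Hypothetical sketch of the helpers described in r9454. The real signatures
# are not shown in this log, so parameter names and order are assumed.

def set_min(value, minimum):
    """Raise value to at least minimum (a floor); equivalent to max()."""
    return max(value, minimum)


def set_max(value, maximum):
    """Cap value at maximum (a ceiling); equivalent to min()."""
    return min(value, maximum)


# Usage in the spirit of r9455: ensure verbosity is at least 2 in live mode.
verbosity = set_min(1, 2)
print(verbosity)
```

The point of the renaming is readability at the call site: set_min(verbosity, 2) states the intent ("verbosity must be at least 2") directly, whereas max(verbosity, 2) expresses the same floor through a name that suggests the opposite.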
