# Date Author Comment
9581 05/24/2013 12:22 PM Aaron Marcuse-Kubitza

removed inputs/GBIF/_MySQL/MySQL.data.sql*, since we are using the much faster exported TSVs instead (see raw_occurrence_record/table.tsv). this also avoids confusion between GBIFPortalDB-2013-02-20.data.sql* and MySQL.data.sql* when loading data into MySQL.

9580 05/24/2013 12:18 PM Aaron Marcuse-Kubitza

bugfix: inputs/GBIF/_MySQL/MySQL.data.sql.run: moved to GBIFPortalDB-2013-02-20.data.sql.run since it's actually the raw input file, not the ANSI export of it, that needs to be imported

9579 05/24/2013 12:16 PM Aaron Marcuse-Kubitza

lib/sh/resume_import.sh: get_pkey_at_pos(): changed $quote to ` to work with inputs/GBIF/_MySQL/GBIFPortalDB-2013-02-20.data.sql

9578 05/24/2013 11:50 AM Aaron Marcuse-Kubitza

lib/sh/db.sh: mysql(): added $log_queries flag, which can be turned off to avoid using --verbose. this is useful when running bulk INSERT statements.

9577 05/24/2013 11:35 AM Aaron Marcuse-Kubitza

lib/sh/local.sh: added mysql_local()

9576 05/24/2013 11:24 AM Aaron Marcuse-Kubitza

lib/sh/local.sh: added mysql_root()

9575 05/24/2013 11:24 AM Aaron Marcuse-Kubitza

lib/sh/local.sh: added $root_user, $root_password

9574 05/24/2013 11:22 AM Aaron Marcuse-Kubitza

lib/sh/db.sh: added use_root alias (similar to use_local/use_remote)

9573 05/24/2013 11:21 AM Aaron Marcuse-Kubitza

added inputs/GBIF/_MySQL/GBIFPortalDB-2013-02-20.schema.z.clean_up.sql, which removes duplicated and unnecessary indexes in raw_occurrence_record

9572 05/24/2013 11:20 AM Aaron Marcuse-Kubitza

added inputs/GBIF/_MySQL/GBIFPortalDB-2013-02-20.schema.0.preamble.sql

9571 05/24/2013 11:02 AM Aaron Marcuse-Kubitza

bugfix: lib/sh/resume_import.sh: sql_preamble(): also stop at first "-- Table structure for table" line (when using a full dumpfile rather than a data-only subset)

9570 05/24/2013 10:58 AM Aaron Marcuse-Kubitza

lib/sh/resume_import.sh: resume_import(): run connection preamble (first few lines of dumpfile) before continuing with main file at offset, so that connection settings are reapplied

9569 05/24/2013 06:45 AM Aaron Marcuse-Kubitza

lib/sh/resume_import.sh: is_pkey_imported__int(): use echo_stdout so the user can see the result of the > function in each iteration

9568 05/24/2013 06:42 AM Aaron Marcuse-Kubitza

added lib/sh/resume_import.sh and use it in inputs/GBIF/_MySQL/MySQL.data.sql.run

9567 05/24/2013 06:32 AM Aaron Marcuse-Kubitza

inputs/GBIF/_MySQL/MySQL.data.sql.run: is_pkey_imported__int(): made pkey name configurable in $pkey_name

9566 05/24/2013 05:32 AM Aaron Marcuse-Kubitza

inputs/GBIF/_MySQL/MySQL.data.sql.run: import_resume_pos() run time: removed seconds because the precision is likely only to the nearest half-minute

9565 05/24/2013 05:31 AM Aaron Marcuse-Kubitza

inputs/GBIF/_MySQL/MySQL.data.sql.run: documented that import_resume_pos() takes 6 min to run, with 37 iterations

9564 05/24/2013 05:20 AM Aaron Marcuse-Kubitza

added inputs/GBIF/_MySQL/MySQL.data.sql.run, with helper functions for resuming the import to MySQL from where it left off. this is very useful if the import is interrupted for any reason, because otherwise, the entire import would have to be run again from the start, taking 40-50 hours. import_resume_pos() uses new binsearch() to find where in the file the import left off, based on which pkeys have already been imported. (GBIF pkeys are unfortunately not in any order in the input file, nor are they in insertion order in the imported table, because MySQL instead clusters the table by the pkey. this necessitates a much more complex solution to resuming a partial import.)
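
Note on r9564: the resume search can be sketched roughly as follows. This is a Python illustration of the idea only, not the shell code in lib/sh/binsearch.sh. It assumes rows are imported in file order, so "is the pkey at this position already imported?" is true for every position before the interruption point and false after it, even though the pkeys themselves are unordered. The list indices stand in for byte offsets in the dumpfile, and the names binsearch(), is_imported and file_pkeys are hypothetical.

```python
def binsearch(lo, hi, is_imported):
    """Return (pos, iters): the first pos in [lo, hi) where is_imported(pos)
    is False (or hi if every position is imported), and the iteration count."""
    iters = 0
    while lo < hi:
        mid = (lo + hi) // 2
        iters += 1
        if is_imported(mid):
            lo = mid + 1   # everything up to and including mid is already imported
        else:
            hi = mid       # the resume point is at mid or earlier
    return lo, iters


# pkeys in file order (deliberately unsorted); the first 5 rows were imported
file_pkeys = [907, 12, 455, 3, 88, 610, 41, 279]
imported = set(file_pkeys[:5])

pos, iters = binsearch(0, len(file_pkeys), lambda i: file_pkeys[i] in imported)
print(pos, iters)
```

Each probe costs one lookup against the imported table, so the number of lookups grows with the logarithm of the search range rather than with the file size.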

9563 05/24/2013 05:14 AM Aaron Marcuse-Kubitza

lib/sh/binsearch.sh: binsearch(): also echo_vars the iter_num, to track how close binsearch is to finding the value (it will always take the same # iters, log2(max - min) )

9562 05/24/2013 05:11 AM Aaron Marcuse-Kubitza

lib/sh/binsearch.sh: binsearch(): also echo_vars the min/max so these can be used as shortcut inputs if binsearch is run again

9561 05/24/2013 04:58 AM Aaron Marcuse-Kubitza

bugfix: lib/sh/util.sh: caching: cache_key for function inputs: need to use `declare -p kw_param` instead of "$kw_param" because declare accepts a param name, not value

9560 05/24/2013 03:40 AM Aaron Marcuse-Kubitza

lib/sh/binsearch.sh: binsearch(): doc comment: fixed typo in "truncates"

9559 05/24/2013 03:17 AM Aaron Marcuse-Kubitza

bugfix: lib/sh/util.sh: func_override(): need to match shortest _* suffix instead of longest in case the function being overridden itself contained _

9558 05/24/2013 01:51 AM Aaron Marcuse-Kubitza

bugfix: lib/sh/util.sh: file_size: Linux: need % in %s

9557 05/24/2013 01:43 AM Aaron Marcuse-Kubitza

lib/sh/db.sh: mysql(): added $data_only flag which enables --skip-column-names and $output_data

9556 05/24/2013 01:41 AM Aaron Marcuse-Kubitza

bugfix: lib/sh/util.sh: file_size: need to use --format instead of -f on Linux

9555 05/24/2013 01:22 AM Aaron Marcuse-Kubitza

added lib/runscripts/table_dir.run and use it in table.run

9554 05/24/2013 01:20 AM Aaron Marcuse-Kubitza

inputs/GBIF/raw_occurrence_record/run: herbaria_filter.ih.csv_/make(): don't use any outer limit value, so that all the IH herbaria are always used. this also ensures that the first GBIF rows will be from an IH herbarium.

9553 05/24/2013 01:17 AM Aaron Marcuse-Kubitza

inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): herbaria_filter: don't explicitly set ENGINE or DEFAULT CHARSET, because these should be set to the database values instead so that collations, etc. match

9552 05/24/2013 12:50 AM Aaron Marcuse-Kubitza

lib/sh/util.sh: filesystem: added file_size alias

9551 05/24/2013 12:34 AM Aaron Marcuse-Kubitza

lib/sh/util.sh: exceptions: added signals-related functions ignore_sig(), piped_cmd() and helper sig_e()

9550 05/23/2013 11:40 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: $sed_cmd: don't use `command`, which causes sed calls (which are usually internal) to always be logged. instead, use echo_run wherever sed needs to be logged.

9549 05/23/2013 11:38 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: echo_run(): added trailing-space alias to alias-expand next word, which is a command

9548 05/23/2013 11:31 PM Aaron Marcuse-Kubitza

lib/sh/binsearch.sh: binsearch(): echo $i at log_level 1 so it's displayed by default, as a progress indicator

9547 05/23/2013 11:30 PM Aaron Marcuse-Kubitza

lib/sh/binsearch.sh: binsearch(): echo $i at log_level 1 so it's displayed by default, as a progress indicator

9546 05/23/2013 11:29 PM Aaron Marcuse-Kubitza

lib/sh/binsearch.sh: binsearch(): echo the command being run using new echo_run()

9545 05/23/2013 11:25 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: log+: set PS4 from $log_level instead of relative to its previous value. this allows PS4 to work properly at negative log_levels, in spite of the inability to store a "negative" value in a prefix string.

9544 05/23/2013 11:23 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: added float_set_min()

9543 05/23/2013 11:22 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: log+(): log_level: set it using simpler $(()), since log_level will never be fractional (although verbosity can be). log_level may of course be fractional in invoked scripts, but that does not affect util.sh.

9542 05/23/2013 10:44 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: log++: also track a numeric log_level var, which follows the PS4 prefix

9541 05/23/2013 10:35 PM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql: MatchedTaxon: matchedFamily: use Accepted_family when the Name_matched_accepted_family is not provided, as it's omitted by the current TNRS CSV schema

9540 05/23/2013 09:54 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: log+(): PS4: split if statement onto multiple lines for clarity

9539 05/23/2013 09:44 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: added back echo_run(), usable for internal commands where command() would be used for external commands

9538 05/23/2013 09:33 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: added int2bool()

9537 05/23/2013 09:25 PM Aaron Marcuse-Kubitza

*{.sh,run}: use new `|| ignore` instead of ignore_e/end_try

9536 05/23/2013 09:25 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: added ignore(), which uses ||-syntax

9535 05/23/2013 09:13 PM Aaron Marcuse-Kubitza

lib/sh/util.sh: ignore(): renamed to ignore_e() so ignore() can be used for a simpler, ||-based command

9534 05/23/2013 09:09 PM Aaron Marcuse-Kubitza

bugfix: lib/sh/util.sh: catch(): need && between test and e=0 so e=0 is only run if $e was equal to the desired value

9533 05/23/2013 08:22 PM Aaron Marcuse-Kubitza

added lib/sh/binsearch.sh

9532 05/23/2013 06:27 PM Aaron Marcuse-Kubitza

bugfix: README.TXT: Full database import: screen: need to unset TMOUT, version after running `screen` rather than before so they take effect within the `screen` shell

9531 05/23/2013 06:25 PM Aaron Marcuse-Kubitza

README.TXT: Full database import: after running `screen`: run `set -o ignoreeof` to prevent Ctrl+D from exiting `screen` to keep attached jobs

9530 05/23/2013 04:40 PM Aaron Marcuse-Kubitza

bin/tnrs_db: documented how to estimate total runtime. note that our tnrs_db wrapper in inputs/.TNRS/tnrs/tnrs.make uses inputs/.TNRS/tnrs/logs/tnrs.make.log.sql as the log file.

9529 05/23/2013 03:33 PM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql, data.sql: updated TNRS CSV columns to preserve Name_matched_accepted_family even though it isn't present in the current TNRS CSVs. this way, Name_matched_accepted_family can still be used for previously-scrubbed names, and family_matched can be added back to analytical_stem_view. (now that bin/tnrs_db uses an explicit columns list in COPY TO, the absence of a column in the CSV is no longer a problem.)

9528 05/23/2013 03:28 PM Aaron Marcuse-Kubitza

README.TXT: updating TNRS CSV columns: use the entire "COPY tnrs ..." statement instead of just the body of it so that the explicit columns list is included. this way, the COPY statement will cause an error if the TNRS schema was changed but inputs/.TNRS/data.sql was not yet updated.

9527 05/23/2013 03:00 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed unused imports

9526 05/23/2013 02:55 PM Aaron Marcuse-Kubitza

bin/tnrs_db: cumulative_tnrs_profiler: use tnrs.tnrs_request()'s new cumulative_profiler param instead of doing the profiling manually. this also ensures that there isn't extra time between when the cumulative profiler starts/stops and when the per-request profiler starts/stops (because Profiler's new add_subprofiler() method is used).
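
Note on r9523-r9526: a minimal sketch of how a subprofiler might avoid counting time between requests. The method names add_time() and add_subprofiler() come from the log entries; everything else (the constructor, start()/stop(), the injectable clock) is an assumption made for illustration and may differ from lib/profiling.py.

```python
import time


class Profiler:
    """Accumulates elapsed time and forwards it to any attached subprofilers."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock          # injectable so the example is deterministic
        self.total = 0.0
        self.subprofilers = []
        self._started_at = None

    def add_time(self, seconds):
        # Single place where totals change, so subprofilers see the exact same value
        self.total += seconds
        for sub in self.subprofilers:
            sub.add_time(seconds)

    def add_subprofiler(self, profiler):
        self.subprofilers.append(profiler)

    def start(self):
        self._started_at = self.clock()

    def stop(self):
        self.add_time(self.clock() - self._started_at)
        self._started_at = None


# Fake clock: request 1 runs from t=0 to t=2, request 2 from t=10 to t=13
ticks = iter([0.0, 2.0, 10.0, 13.0])
cumulative = Profiler()
for _ in range(2):
    request = Profiler(clock=lambda: next(ticks))
    request.add_subprofiler(cumulative)
    request.start()
    request.stop()

print(cumulative.total)
```

Because the cumulative profiler only receives time through add_time(), the 8 seconds between the two requests never reach its total.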

9525 05/23/2013 02:53 PM Aaron Marcuse-Kubitza

lib/tnrs.py: single_tnrs_request(): added support for a cumulative profiler using the cumulative_profiler kw param

9524 05/23/2013 02:53 PM Aaron Marcuse-Kubitza

lib/profiling.py: Profiler: added add_subprofiler(), for use with cumulative profilers

9523 05/23/2013 02:48 PM Aaron Marcuse-Kubitza

lib/profiling.py: Profiler: added add_time() and use it instead of `self.total +=`

9522 05/23/2013 02:38 PM Aaron Marcuse-Kubitza

bin/tnrs_db: tnrs_profiler: renamed to cumulative_tnrs_profiler to distinguish it from the tnrs_profiler used by tnrs.tnrs_request(), which just profiles the current request

9521 05/23/2013 02:36 PM Aaron Marcuse-Kubitza

bugfix: bin/tnrs_db: cumulative profiler: use len(names) instead of this_ct (cur.rowcount) in case the actual # rows fetched differed from the rowcount

9520 05/23/2013 02:32 PM Aaron Marcuse-Kubitza

lib/tnrs.py: repeated_tnrs_request(): renamed to tnrs_request() since this is the function that should usually be used, to ensure that debugging information is output in the case of an error. (the TNRS request must be made again to output this information.)

9519 05/23/2013 02:30 PM Aaron Marcuse-Kubitza

lib/tnrs.py: tnrs_request(): renamed to single_tnrs_request() to distinguish it from repeated_tnrs_request()

9518 05/23/2013 02:25 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed no longer used $wait flag (which caused tnrs_db to wait max_pause for new rows to be added), because tnrs_db is now invoked automatically after each import by the import_scrub target (in inputs/input.Makefile) and does not need to run as a daemon. note that when scrub is invoked, it is possible that a previous datasource's import has already scrubbed the names for this import, because tnrs_db runs until all rows in tnrs_input_name are scrubbed....

9517 05/23/2013 02:14 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed no longer needed explicit population of the Time_submitted, which is now done automatically by the tnrs table. however, this requires starting the transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.

9516 05/23/2013 02:09 PM Aaron Marcuse-Kubitza

bin/tnrs_db: start transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. these may differ by several minutes if TNRS is slow. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request, removing the explicit setting of Time_submitted, and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.

9515 05/23/2013 02:05 PM Aaron Marcuse-Kubitza

bugfix: bin/tnrs_db: wrap just the TNRS request and the storing of the response data in a function (undoing part of r9514), because the transaction start time for Time_submitted should not be until the TNRS request is actually made (it often takes several minutes to materialize the next set of input names on a full DB)

9514 05/23/2013 01:56 PM Aaron Marcuse-Kubitza

bin/tnrs_db: Iterate over unscrubbed verbatim taxonlabels: put loop body in a function (which returns whether or not the loop should continue), so that the loop body can easily be wrapped in a transaction using sql.with_savepoint()
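
Note on r9514: the pattern of a loop body that returns "keep going?" and is wrapped in a savepoint could look roughly like this. The sketch uses Python's built-in sqlite3 module rather than the project's PostgreSQL connection, and this with_savepoint() is a hypothetical stand-in whose signature is assumed, not the real sql.with_savepoint().

```python
import sqlite3


def with_savepoint(conn, func, name="batch"):
    """Run func() inside a savepoint; roll back to it if func() raises."""
    conn.execute("SAVEPOINT " + name)
    try:
        result = func()
    except Exception:
        conn.execute("ROLLBACK TO " + name)
        conn.execute("RELEASE " + name)
        raise
    conn.execute("RELEASE " + name)
    return result


def process_batch(conn, pending):
    """Loop body: handle one item and return whether the loop should continue."""
    if not pending:
        return False
    name = pending.pop(0)
    conn.execute("INSERT INTO scrubbed (name) VALUES (?)", (name,))
    return True


# isolation_level=None leaves transaction control to the explicit SAVEPOINTs
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE scrubbed (name TEXT)")
pending = ["Poa annua", "Quercus alba"]

while with_savepoint(conn, lambda: process_batch(conn, pending)):
    pass

row_count = conn.execute("SELECT count(*) FROM scrubbed").fetchone()[0]
print(row_count)
```

Returning the continue flag from the body keeps the loop condition outside the transaction wrapper, so the wrapper does not need to know anything about the loop.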

9513 05/23/2013 01:19 PM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql: tnrs.Time_submitted: set default to now() (the timestamp of the start of the current transaction, http://www.postgresql.org/docs/9.1/static/functions-datetime.html) so that it would automatically be populated when rows are added. note that because the start of the current transaction instead of the exact time at insertion is used, all rows inserted in the same transaction (e.g. as part of the same batch) will have the same value for this, linking them together.

9512 05/23/2013 01:10 PM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql: tnrs_populate_derived_fields(): renamed to tnrs_populate_fields() so it can be used to populate other fields as well

9511 05/23/2013 01:07 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed no longer needed explicit appending of derived cols, and instead use append_csv()'s new support for importing CSVs whose columns are a subset of the full table

9510 05/23/2013 12:56 PM Aaron Marcuse-Kubitza

bin/tnrs_db: ColInsertFilters: use the simpler literal value option for the mk_value param

9509 05/23/2013 12:55 PM Aaron Marcuse-Kubitza

lib/csvs.py: ColInsertFilter: support using a literal value instead of a function for the mk_value param, since this is the most common use case
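
Note on r9509: accepting either a literal or a function for mk_value can be done with a callable() check. The class below is a simplified, hypothetical version for illustration; the constructor arguments and the real ColInsertFilter in lib/csvs.py may differ.

```python
import itertools


class ColInsertFilter:
    """Wraps an iterable of rows and inserts one extra column into each row."""

    def __init__(self, rows, mk_value, index=0):
        self.rows = iter(rows)
        # A literal is wrapped in a function so both cases share one code path
        self.mk_value = mk_value if callable(mk_value) else (lambda: mk_value)
        self.index = index

    def __iter__(self):
        return self

    def __next__(self):
        row = list(next(self.rows))   # copy so the caller's row is not mutated
        row.insert(self.index, self.mk_value())
        return row


rows = [["Poa annua"], ["Quercus alba"]]

# Literal value: the same constant is inserted into every row
literal = list(ColInsertFilter(rows, "batch-1"))

# Function value: called once per row
counter = itertools.count(1)
computed = list(ColInsertFilter(rows, lambda: next(counter)))

print(literal)
print(computed)
```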

9508 05/23/2013 12:43 PM Aaron Marcuse-Kubitza

lib/sql_io.py: append_csv(): support importing CSVs whose columns are a subset of the full table and/or in a different order. when the header exactly matches the columns, the explicit column list will still be omitted as an optimization. this uses code from r4927.
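
Note on r9508: the column-list decision might be sketched as below. The helper copy_statement() and its arguments are hypothetical and only build the SQL text; the real append_csv() in lib/sql_io.py is not shown in this log. The statement follows PostgreSQL's `COPY table (columns) FROM STDIN WITH (...)` form.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def copy_statement(table, table_cols, csv_header):
    """Build a COPY statement for a CSV whose header may be a subset of,
    or in a different order than, the table's columns."""
    unknown = [c for c in csv_header if c not in table_cols]
    if unknown:
        raise ValueError("CSV columns not in table: " + ", ".join(unknown))

    stmt = "COPY " + quote_ident(table)
    if list(csv_header) != list(table_cols):
        # Only spell out the columns when the header differs from the table
        stmt += " (" + ", ".join(quote_ident(c) for c in csv_header) + ")"
    return stmt + " FROM STDIN WITH (FORMAT csv, HEADER)"


table_cols = ["a", "b", "c"]
exact = copy_statement("tnrs", table_cols, ["a", "b", "c"])
subset = copy_statement("tnrs", table_cols, ["c", "a"])
print(exact)
print(subset)
```

Rejecting unknown header columns up front gives a clearer error than letting the database report it partway through the load.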

9507 05/23/2013 12:15 PM Aaron Marcuse-Kubitza

bugfix: lib/runscripts/util.run: need to include sh/make.sh for all runscripts that use make-style commands

9506 05/23/2013 12:12 PM Aaron Marcuse-Kubitza

*{.sh,run}: use new top_make instead of `make --directory="$top_dir"`

9505 05/23/2013 12:11 PM Aaron Marcuse-Kubitza

lib/sh/make.sh: added top_make()

9504 05/23/2013 11:54 AM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Postprocessing: populated entries for analytical DB for last 4 imports, and for backup, backup test for last import. note that the combined import time for the last import is 3.5 days, compared to 3 days for the column-based import portion.

9503 05/22/2013 11:47 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Postprocessing: added (empty) entries for analytical DB, backup, backup test

9502 05/21/2013 11:18 PM Aaron Marcuse-Kubitza

inputs/GBIF/Specimen/postprocess.sql, inputs/REMIB/Specimen/postprocess.sql: updated for providers in r9459, which adds TEX

9501 05/21/2013 11:10 PM Aaron Marcuse-Kubitza

inputs/*/*/postprocess.sql: Remove institutions that we have direct data for: query to obtain list: updated for current schema

9500 05/21/2013 10:49 PM Aaron Marcuse-Kubitza

inputs/import.stats.xls: Updated import times. GBIF has been refreshed (with the range modeling column subset), and column-based import now takes 3 days for 88.4 million rows.

9499 05/21/2013 10:27 PM Aaron Marcuse-Kubitza

README.TXT: Full database import: added warning to perform every single step listed, to avoid breaking column-based import

9498 05/21/2013 10:26 PM Aaron Marcuse-Kubitza

README.TXT: Full database import: Publish the new import: added warning to be sure you have done every single verification step before proceeding. otherwise, a previous valid import could incorrectly be overwritten with a broken one.

9497 05/21/2013 09:07 PM Aaron Marcuse-Kubitza

bugfix: README.TXT: Full database import: To run TNRS/remake analytical DB: need to run `export version=<version>` before the command which uses it rather than after

9496 05/21/2013 08:26 PM Aaron Marcuse-Kubitza

added backups/*.md5

9495 05/21/2013 08:22 PM Aaron Marcuse-Kubitza

added backups/TNRS.2013-5-21.backup.md5

9494 05/21/2013 07:42 PM Aaron Marcuse-Kubitza

README.TXT: Datasource setup: For MySQL inputs: For .sql exports: added steps to grant privileges to the bien user. the privileges list excludes UPDATE, DELETE, ALTER, DROP to prevent bugs in the import scripts from accidentally deleting data.

9493 05/21/2013 07:37 PM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql, data.sql: updated for new TNRS CSV columns (see bug at https://pods.iplantcollaborative.org/jira/browse/TNRS-183). note that these columns may eventually change back (comment by Naim at https://pods.iplantcollaborative.org/jira/browse/TNRS-183#comment-34444).

9492 05/21/2013 07:33 PM Aaron Marcuse-Kubitza

README.TXT: Full database import: added steps to check that TNRS ran successfully, and fix errors (due to column changes in the TNRS CSV) if it didn't

9491 05/21/2013 07:24 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/test_scrub: use sh's -e (errexit) mode so errors in an invoked script cause the script to abort instead of burying the error in more output

9490 05/21/2013 07:19 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/test_scrub: documented that `make schemas/"$public"/uninstall` removes the previous results (since it may be confusing why it's prompting the user to uninstall the schema that is an output of the program)

9489 05/21/2013 07:16 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/test_scrub: don't need to run the import twice anymore because the accepted names are now included in the tnrs_input_name view that TNRS runs on

9488 05/21/2013 07:09 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/test_scrub: updated for current TNRS schema

9487 05/21/2013 06:47 PM Aaron Marcuse-Kubitza

bugfix: inputs/test_taxonomic_names/test_scrub: unset $n so it doesn't limit the # rows. it is set to 2 in the default test environment, so must be unset for n-sensitive programs that should be unlimited.

9486 05/21/2013 06:40 PM Aaron Marcuse-Kubitza

inputs/test_taxonomic_names/test_scrub: updated for current TNRS schema

9485 05/21/2013 01:44 PM Aaron Marcuse-Kubitza

inputs/GBIF/raw_occurrence_record/run: herbaria_filter.table/make(): also include the exported plant_fraction herbaria

9484 05/21/2013 01:43 PM Aaron Marcuse-Kubitza

inputs/GBIF/raw_occurrence_record/run: added herbaria_filter.plant_fraction.csv_/make(), which exports the plant_fraction herbaria whose plant_fraction >= 0.8

9483 05/21/2013 01:42 PM Aaron Marcuse-Kubitza

inputs/GBIF/raw_occurrence_record/run: added plant_fraction.table/make(), which contains the plant fraction for each herbarium

9482 05/21/2013 01:37 PM Aaron Marcuse-Kubitza

lib/sh/db.sh: added mk_drop()