  • svn:executable: *

# Date Author Comment
13850 06/25/2014 03:33 PM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql: tnrs: renamed to tnrs_match to distinguish it from other TNRS-related tables

11970 01/20/2014 11:33 AM Aaron Marcuse-Kubitza

moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (e.g. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).

10742 08/26/2013 08:45 PM Aaron Marcuse-Kubitza

bin/tnrs_db: add entry to new batch table

9999 06/22/2013 12:23 AM Aaron Marcuse-Kubitza

bin/tnrs_db: documented total runtime (10 days)

9998 06/21/2013 11:58 PM Aaron Marcuse-Kubitza

bin/tnrs_db: documented current runtime (162 ms/name)
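
As a rough consistency check, the per-name runtime from r9998 and the total runtime from r9999 can be related by simple arithmetic. The sketch below assumes both figures describe the same full run; the resulting name count (roughly 5.3 million) is derived from those two figures and is not stated anywhere in the log. The function names are hypothetical.

```python
# Relate the documented per-name runtime to the documented total runtime.
# Assumption: both figures describe the same full scrubbing run.
MS_PER_NAME = 162   # from r9998
TOTAL_DAYS = 10     # from r9999

SECONDS_PER_DAY = 24 * 60 * 60


def implied_name_count(total_days, ms_per_name):
    """Number of names that would fill total_days at ms_per_name each."""
    total_ms = total_days * SECONDS_PER_DAY * 1000
    return total_ms / ms_per_name


def estimated_days(n_names, ms_per_name):
    """Total runtime in days for n_names at ms_per_name each."""
    return n_names * ms_per_name / 1000 / SECONDS_PER_DAY


n = implied_name_count(TOTAL_DAYS, MS_PER_NAME)
print(f"implied names: {n:,.0f}")
```

The same `estimated_days()` helper can be reused with a different name count or per-name rate to project a later run.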

9530 05/23/2013 04:40 PM Aaron Marcuse-Kubitza

bin/tnrs_db: documented how to estimate total runtime. note that our tnrs_db wrapper in inputs/.TNRS/tnrs/tnrs.make uses inputs/.TNRS/tnrs/logs/tnrs.make.log.sql as the log file.

9527 05/23/2013 03:00 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed unused imports

9526 05/23/2013 02:55 PM Aaron Marcuse-Kubitza

bin/tnrs_db: cumulative_tnrs_profiler: use tnrs.tnrs_request()'s new cumulative_profiler param instead of doing the profiling manually. this also ensures that there isn't extra time between when the cumulative profiler starts/stops and when the per-request profiler starts/stops (because Profiler's new add_subprofiler() method is used).

9522 05/23/2013 02:38 PM Aaron Marcuse-Kubitza

bin/tnrs_db: tnrs_profiler: renamed to cumulative_tnrs_profiler to distinguish it from the tnrs_profiler used by tnrs.tnrs_request(), which just profiles the current request

9521 05/23/2013 02:36 PM Aaron Marcuse-Kubitza

bugfix: bin/tnrs_db: cumulative profiler: use len(names) instead of this_ct (cur.rowcount) in case the actual # rows fetched differed from the rowcount

9520 05/23/2013 02:32 PM Aaron Marcuse-Kubitza

lib/tnrs.py: repeated_tnrs_request(): renamed to tnrs_request() since this is the function that should usually be used, to ensure that debugging information is output in the case of an error. (the TNRS request must be made again to output this information.)

9518 05/23/2013 02:25 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed no longer used $wait flag (which caused tnrs_db to wait max_pause for new rows to be added), because tnrs_db is now invoked automatically after each import by the import_scrub target (in inputs/input.Makefile) and does not need to run as a daemon. note that when scrub is invoked, it is possible that a previous datasource's import has already scrubbed the names for this import, because tnrs_db runs until all rows in tnrs_input_name are scrubbed....

9517 05/23/2013 02:14 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed no longer needed explicit population of the Time_submitted, which is now done automatically by the tnrs table. however, this requires starting the transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.

9516 05/23/2013 02:09 PM Aaron Marcuse-Kubitza

bin/tnrs_db: start transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. these may differ by several minutes if TNRS is slow. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request, removing the explicit setting of Time_submitted, and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.

9515 05/23/2013 02:05 PM Aaron Marcuse-Kubitza

bugfix: bin/tnrs_db: wrap just the TNRS request and the storing of the response data in a function (undoing part of r9514), because the transaction start time for Time_submitted should not be until the TNRS request is actually made (it often takes several minutes to materialize the next set of input names on a full DB)

9514 05/23/2013 01:56 PM Aaron Marcuse-Kubitza

bin/tnrs_db: Iterate over unscrubbed verbatim taxonlabels: put loop body in a function (which returns whether or not the loop should continue), so that the loop body can easily be wrapped in a transaction using sql.with_savepoint()

9511 05/23/2013 01:07 PM Aaron Marcuse-Kubitza

bin/tnrs_db: removed no longer needed explicit appending of derived cols, and instead use append_csv()'s new support for importing CSVs whose columns are a subset of the full table

9510 05/23/2013 12:56 PM Aaron Marcuse-Kubitza

bin/tnrs_db: ColInsertFilters: use the simpler literal value option for the mk_value param

7293 01/18/2013 07:38 AM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql: tnrs: Added Max_score column for use in filtering out names that will be rejected by taxondetermination's constraints

7291 01/18/2013 07:14 AM Aaron Marcuse-Kubitza

tnrs_db: Support multiple appended columns in the tnrs table

7133 01/09/2013 09:57 AM Aaron Marcuse-Kubitza

inputs/.TNRS/schema.sql: tnrs: Added Accepted_scientific_name field which will contain the joined-together accepted name that gets re-parsed by TNRS

5837 10/30/2012 03:40 AM Aaron Marcuse-Kubitza

tnrs_db: Fetching names to scrub: Omit sql.select() fields param because it will be filled in with its default value

5787 10/25/2012 03:50 PM Aaron Marcuse-Kubitza

tnrs_db: Making TNRS request: Fixed bug where the else block needed to be removed now that there is no except block

5761 10/24/2012 06:31 PM Aaron Marcuse-Kubitza

tnrs_db: Removed tnrs.InvalidResponse exception handler that retries the query because the current query does not track which names have been submitted to but not processed by TNRS, so the error would continue to happen repeatedly

5737 10/23/2012 09:34 AM Aaron Marcuse-Kubitza

inputs/.TNRS/tnrs/: Added Time_submitted column at beginning and populate it in tnrs_db with the time the batch TNRS request was submitted

5669 10/19/2012 03:34 PM Aaron Marcuse-Kubitza

tnrs_db: Use new tnrs_input_name view to avoid hardcoding changing schema information

5661 10/18/2012 04:55 PM Aaron Marcuse-Kubitza

tnrs_db: Updated with schema changes

5643 10/18/2012 01:05 PM Aaron Marcuse-Kubitza

tnrs_db: Fixed bug where internal identifyingtaxonomicname duplicates needed to be removed, as well as duplicates with existing Name_submitted values, to avoid violating the TNRS.tnrs pkey constraint when the scrubbed names are later inserted. Note that the taxonlabel_0_unique_identifying_name unique index is not sufficient to prevent internal duplicates, because it includes the creator_id (and thus allows multiple instances of the same name defined by different creators).
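
The two kinds of duplicates described in r5643 can be illustrated with a minimal sketch. This is not the actual tnrs_db query (which works in SQL); `names_to_submit()` and the sample names are hypothetical and only show why both checks are needed before inserting into a table keyed on the submitted name.

```python
def names_to_submit(candidate_names, already_submitted):
    """Drop names already stored and repeats within the candidate list.

    Both checks are needed: a name repeated inside one batch would violate
    a primary key on the submitted name just as a previously stored one would.
    """
    seen = set(already_submitted)   # duplicates with existing rows
    result = []
    for name in candidate_names:
        if name in seen:
            continue
        seen.add(name)              # also guards against internal duplicates
        result.append(name)
    return result


batch = names_to_submit(
    ["Quercus alba", "Poa annua", "Quercus alba", "Acer rubrum"],
    already_submitted={"Poa annua"},
)
print(batch)
```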

5640 10/18/2012 12:36 PM Aaron Marcuse-Kubitza

tnrs_db: Updated with schema changes

5590 10/17/2012 11:00 AM Aaron Marcuse-Kubitza

sql_io.py: append_csv(): Take a reader and header rather than a stream_info and stream to allow callers to use the simpler csvs.reader_and_header() function. This also allows callers to pass in a wrapped CSV reader for filtering, etc.

5589 10/17/2012 10:44 AM Aaron Marcuse-Kubitza

csv2db, tnrs_db: Removed ProgressInputStream wrapper around input stream, which is no longer needed (and causes overlapping output) now that sql_io.append_csv() prints # rows read

5467 10/12/2012 12:29 PM Aaron Marcuse-Kubitza

schemas/vegbien.sql: taxonconcept: Renamed canon_concept_id to matched_concept_id, because this is actually the closest-match taxonconcept in the match hierarchy (datasource concept -> parsed concept -> matched concept -> accepted concept) rather than the accepted synonym, which goes in accepted_concept_id

5413 10/10/2012 10:01 AM Aaron Marcuse-Kubitza

schemas/vegbien.sql: taxonconcept: Renamed canon_taxonconcept_id to canon_concept_id to shorten the name, which is used often

5324 10/09/2012 07:42 PM Aaron Marcuse-Kubitza

tnrs_db: Moved "Processing # taxonconcepts" log message to before waiting or exiting if no taxonconcepts left, so that it would be printed right after the query is run and say that no taxonconcepts were found

5323 10/09/2012 07:39 PM Aaron Marcuse-Kubitza

tnrs_db: Updated comments and log messages for schema changes

5322 10/09/2012 07:33 PM Aaron Marcuse-Kubitza

tnrs_db: Updated query for schema changes

5214 10/03/2012 01:11 PM Aaron Marcuse-Kubitza

tnrs_db: Made wait option default to off to facilitate running tnrs_db by itself, rather than as part of an import

5213 10/03/2012 01:08 PM Aaron Marcuse-Kubitza

tnrs_db: Added wait option to have tnrs_db exit as soon as no more names are available. This is useful for running tnrs_db when there is no concurrent import running, and therefore no need to wait for new data.

5212 10/03/2012 01:00 PM Aaron Marcuse-Kubitza

tnrs_db: Fixed the time of the "Waited" message so that the total_pause (containing the next wait) would be incremented after the message was displayed. Split the "Waited" and "Waiting" messages into two separate messages.

5159 10/01/2012 10:44 PM Aaron Marcuse-Kubitza

tnrs_db: Updated query for new three-level taxonpath hierarchy, where the concatenated name is now stored in identifyingtaxonomicname instead of taxonomicnamewithauthor

5156 10/01/2012 10:12 PM Aaron Marcuse-Kubitza

tnrs_db: Adjusted pause, max_pause so the daemon waits longer before exiting, because after the initial TNRS run, most names have already been scrubbed and new names may not be added until the end of the import (in the case of a very large new datasource)

5153 10/01/2012 09:15 PM Aaron Marcuse-Kubitza

tnrs_db: pause: Increased to 30 min because if no new names are available in TNRS.tnrs, there is no need to check every minute for new names (which clutters up the log file output). The pause feature is designed to allow tnrs_db to run in parallel with the import process, and process new names as they are made available, which only happens once for each partition of each datasource.

5152 10/01/2012 09:11 PM Aaron Marcuse-Kubitza

tnrs_db: Fixed bug where the new filtering out of already-scrubbed names caused names to be skipped, because the loop would both advance by the number of rows found and those rows would no longer be returned by the query, causing only every other set of rows to be processed
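
The skipping behavior fixed in r5152 can be reproduced with a small in-memory simulation. This is a sketch under assumptions, not the tnrs_db code: `process_all()` and its `advance_offset` flag are hypothetical, and a list comprehension stands in for the query that filters out already-scrubbed names.

```python
def process_all(names, chunk_size, advance_offset):
    """Process names in chunks, re-querying the unscrubbed set each time.

    With advance_offset=True the loop both filters out processed rows and
    moves the offset forward, which skips every other chunk.
    """
    scrubbed = set()
    batches = []
    offset = 0
    while True:
        # Stand-in for the query that excludes already-scrubbed names.
        unscrubbed = [n for n in names if n not in scrubbed]
        chunk = unscrubbed[offset:offset + chunk_size]
        if not chunk:
            break
        batches.append(chunk)
        scrubbed.update(chunk)
        if advance_offset:
            offset += len(chunk)
    return batches


names = [f"name{i}" for i in range(8)]
buggy = process_all(names, 2, advance_offset=True)
fixed = process_all(names, 2, advance_offset=False)
print(len(buggy), "chunks with the offset advanced;", len(fixed), "without")
```

With the offset advanced, only `name0`, `name1`, `name4` and `name5` are processed; keeping the offset at zero lets the filter alone move the loop forward.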

5126 09/28/2012 02:17 PM Aaron Marcuse-Kubitza

tnrs_db: Exclude taxonomic names which have already been scrubbed, by using a filter-out LEFT JOIN on TNRS.tnrs
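
A filter-out LEFT JOIN of the kind mentioned in r5126 keeps only the rows that have no match in the joined table. The sketch below uses an in-memory SQLite database so it can run on its own; the table and column names (`input_name`, `scrubbed`, `name_submitted`) are hypothetical stand-ins, not the actual VegBIEN or TNRS.tnrs schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the input names and the scrubbed results.
    CREATE TABLE input_name (name TEXT PRIMARY KEY);
    CREATE TABLE scrubbed (name_submitted TEXT PRIMARY KEY);
    INSERT INTO input_name VALUES ('Acer rubrum'), ('Poa annua'), ('Quercus alba');
    INSERT INTO scrubbed VALUES ('Poa annua');
""")

# Rows with no match get NULLs for the joined table's columns,
# so the IS NULL condition keeps only the names not yet scrubbed.
unscrubbed = [row[0] for row in conn.execute("""
    SELECT input_name.name
    FROM input_name
    LEFT JOIN scrubbed ON scrubbed.name_submitted = input_name.name
    WHERE scrubbed.name_submitted IS NULL
    ORDER BY input_name.name
""")]
print(unscrubbed)
```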

5124 09/28/2012 01:27 PM Aaron Marcuse-Kubitza

tnrs_db: tnrs_profiler: Use iter_text='name' for consistency with tnrs.tnrs_request()'s own profiler's iter_text

5123 09/28/2012 01:25 PM Aaron Marcuse-Kubitza

tnrs_db: Print cumulative profiling information after every TNRS request, rather than just at the end

5121 09/28/2012 01:15 PM Aaron Marcuse-Kubitza

TNRS-related programs: Use "names" instead of "taxons" for variable names because what's being submitted are actually verbatim taxonomic names, not official references to specific taxa

5101 09/28/2012 09:51 AM Aaron Marcuse-Kubitza

tnrs_db: Moved lower max_taxons limit to tnrs.py because it's really required to avoid crashing the TNRS server and should apply to all callers

5100 09/28/2012 09:35 AM Aaron Marcuse-Kubitza

tnrs_db: Print log message with # of taxonpaths being sent to TNRS

5099 09/28/2012 09:30 AM Aaron Marcuse-Kubitza

tnrs_db: Fixed bug where InvalidResponse was missing module name

5098 09/28/2012 09:29 AM Aaron Marcuse-Kubitza

tnrs_db: Profile the TNRS requests. This involves using a finally block to ensure that the profiling stats are printed even if the program exits with an error.
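
The finally-block pattern described in r5098 can be sketched as follows. `CumulativeProfiler` and `run()` are hypothetical and are not the project's Profiler class; the point is only that the summary is reported even when a request raises.

```python
import time


class CumulativeProfiler:
    """Hypothetical accumulator of elapsed time and item counts."""

    def __init__(self):
        self.total_s = 0.0
        self.items = 0

    def record(self, elapsed_s, n_items):
        self.total_s += elapsed_s
        self.items += n_items

    def summary(self):
        if self.items == 0:
            return "no names processed"
        ms_per_name = self.total_s * 1000 / self.items
        return f"{self.total_s:.3f} s for {self.items} names ({ms_per_name:.1f} ms/name)"


def run(batches, request, report=print):
    """Call request(batch) for each batch; always report the profile."""
    profiler = CumulativeProfiler()
    try:
        for batch in batches:
            start = time.perf_counter()
            request(batch)
            profiler.record(time.perf_counter() - start, len(batch))
    finally:
        # Runs on normal completion and when request() raises.
        report(profiler.summary())
    return profiler


profile = run([["Acer rubrum"], ["Poa annua", "Quercus alba"]], request=lambda b: None)
```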

5097 09/28/2012 09:13 AM Aaron Marcuse-Kubitza

tnrs_db: Reduced the chunk size to avoid slowing down the TNRS server

5095 09/28/2012 09:01 AM Aaron Marcuse-Kubitza

tnrs_db: Added log messages for Making TNRS request and Storing TNRS response data so that if the TNRS daemon pauses, it's obvious which step it's waiting on

5093 09/28/2012 08:43 AM Aaron Marcuse-Kubitza

tnrs_db: If tnrs.repeated_tnrs_request() still throws InvalidResponse, skip the current set in case its data caused the error. Note that it will still be tried again the next time tnrs_db is run.

5089 09/28/2012 08:17 AM Aaron Marcuse-Kubitza

tnrs_client, tnrs_db: Use new tnrs.repeated_tnrs_request()

5079 09/27/2012 11:25 AM Aaron Marcuse-Kubitza

Added tnrs_db to scrub the taxonpaths in VegBIEN using TNRS