/trunk/bin/tnrs_db - Changes - BIEN 3 - NCEAS Projects

root/trunk/bin/tnrs_db @ 12330

svn:executable: *

#	Date	Author	Comment
11970	01/20/2014 11:33 AM	Aaron Marcuse-Kubitza	moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (eg. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
10742	08/26/2013 08:45 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: add entry to new batch table
9999	06/22/2013 12:23 AM	Aaron Marcuse-Kubitza	bin/tnrs_db: documented total runtime (10 days)
9998	06/21/2013 11:58 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: documented current runtime (162 ms/name)
9530	05/23/2013 04:40 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: documented how to estimate total runtime. note that our tnrs_db wrapper in inputs/.TNRS/tnrs/tnrs.make uses inputs/.TNRS/tnrs/logs/tnrs.make.log.sql as the log file.
9527	05/23/2013 03:00 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed unused imports
9526	05/23/2013 02:55 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: cumulative_tnrs_profiler: use tnrs.tnrs_request()'s new cumulative_profiler param instead of doing the profiling manually. this also ensures that there isn't extra time between when the cumulative profiler starts/stops and when the per-request profiler starts/stops (because Profiler's new add_subprofiler() method is used).
9522	05/23/2013 02:38 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: tnrs_profiler: renamed to cumulative_tnrs_profiler to distinguish it from the tnrs_profiler used by tnrs.tnrs_request(), which just profiles the current request
9521	05/23/2013 02:36 PM	Aaron Marcuse-Kubitza	bugfix: bin/tnrs_db: cumulative profiler: use len(names) instead of this_ct (cur.rowcount) in case the actual # rows fetched differed from the rowcount
9520	05/23/2013 02:32 PM	Aaron Marcuse-Kubitza	lib/tnrs.py: repeated_tnrs_request(): renamed to tnrs_request() since this is the function that should usually be used, to ensure that debugging information is output in the case of an error. (the TNRS request must be made again to output this information.)
9518	05/23/2013 02:25 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed no longer used $wait flag (which caused tnrs_db to wait max_pause for new rows to be added), because tnrs_db is now invoked automatically after each import by the import_scrub target (in inputs/input.Makefile) and does not need to run as a daemon. note that when scrub is invoked, it is possible that a previous datasource's import has already scrubbed the names for this import, because tnrs_db runs until all rows in tnrs_input_name are scrubbed....
9517	05/23/2013 02:14 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed no longer needed explicit population of the Time_submitted, which is now done automatically by the tnrs table. however, this requires starting the transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.
9516	05/23/2013 02:09 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: start transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. these may differ by several minutes if TNRS is slow. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request, removing the explicit setting of Time_submitted, and checking that the Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.
9515	05/23/2013 02:05 PM	Aaron Marcuse-Kubitza	bugfix: bin/tnrs_db: wrap just the TNRS request and the storing of the response data in a function (undoing part of r9514), because the transaction start time for Time_submitted should not be until the TNRS request is actually made (it often takes several minutes to materialize the next set of input names on a full DB)
9514	05/23/2013 01:56 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: Iterate over unscrubbed verbatim taxonlabels: put loop body in a function (which returns whether or not the loop should continue), so that the loop body can easily be wrapped in a transaction using sql.with_savepoint()
9511	05/23/2013 01:07 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: removed no longer needed explicit appending of derived cols, and instead use append_csv()'s new support for importing CSVs whose columns are a subset of the full table
9510	05/23/2013 12:56 PM	Aaron Marcuse-Kubitza	bin/tnrs_db: ColInsertFilters: use the simpler literal value option for the mk_value param
7293	01/18/2013 07:38 AM	Aaron Marcuse-Kubitza	inputs/.TNRS/schema.sql: tnrs: Added Max_score column for use in filtering out names that will be rejected by taxondetermination's constraints
7291	01/18/2013 07:14 AM	Aaron Marcuse-Kubitza	tnrs_db: Support multiple appended columns in the tnrs table
7133	01/09/2013 09:57 AM	Aaron Marcuse-Kubitza	inputs/.TNRS/schema.sql: tnrs: Added Accepted_scientific_name field which will contain the joined-together accepted name that gets re-parsed by TNRS
5837	10/30/2012 03:40 AM	Aaron Marcuse-Kubitza	tnrs_db: Fetching names to scrub: Omit sql.select() fields param because it will be filled in with its default value
5787	10/25/2012 03:50 PM	Aaron Marcuse-Kubitza	tnrs_db: Making TNRS request: Fixed bug where needed to remove else block now that there is no except block
5761	10/24/2012 06:31 PM	Aaron Marcuse-Kubitza	tnrs_db: Removed tnrs.InvalidResponse exception handler that retries the query because the current query does not track which names have been submitted to but not processed by TNRS, so the error would continue to happen repeatedly
5737	10/23/2012 09:34 AM	Aaron Marcuse-Kubitza	inputs/.TNRS/tnrs/: Added Time_submitted column at beginning and populate it in tnrs_db with the time the batch TNRS request was submitted
5669	10/19/2012 03:34 PM	Aaron Marcuse-Kubitza	tnrs_db: Use new tnrs_input_name view to avoid hardcoding changing schema information
5661	10/18/2012 04:55 PM	Aaron Marcuse-Kubitza	tnrs_db: Updated with schema changes
5643	10/18/2012 01:05 PM	Aaron Marcuse-Kubitza	tnrs_db: Fixed bug where needed to remove internal identifyingtaxonomicname duplicates as well as duplicates with existing Name_submitted values, to avoid violating the TNRS.tnrs pkey constraint when the scrubbed names are later inserted. Note that the taxonlabel_0_unique_identifying_name unique index is not sufficient to prevent internal duplicates, because it includes the creator_id (and thus allows multiple instances of the same name defined by different creators).
5640	10/18/2012 12:36 PM	Aaron Marcuse-Kubitza	tnrs_db: Updated with schema changes
5590	10/17/2012 11:00 AM	Aaron Marcuse-Kubitza	sql_io.py: append_csv(): Take a reader and header rather than a stream_info and stream to allow callers to use the simpler csvs.reader_and_header() function. This also allows callers to pass in a wrapped CSV reader for filtering, etc.
5589	10/17/2012 10:44 AM	Aaron Marcuse-Kubitza	csv2db, tnrs_db: Removed ProgressInputStream wrapper around input stream, which is no longer needed (and causes overlapping output) now that sql_io.append_csv() prints # rows read
5467	10/12/2012 12:29 PM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: taxonconcept: Renamed canon_concept_id to matched_concept_id, because this is actually the closest-match taxonconcept in the match hierarchy (datasource concept -> parsed concept -> matched concept -> accepted concept) rather than the accepted synonym, which goes in accepted_concept_id
5413	10/10/2012 10:01 AM	Aaron Marcuse-Kubitza	schemas/vegbien.sql: taxonconcept: Renamed canon_taxonconcept_id to canon_concept_id to shorten the name, which is used often
5324	10/09/2012 07:42 PM	Aaron Marcuse-Kubitza	tnrs_db: Moved "Processing # taxonconcepts" log message to before waiting or exiting if no taxonconcepts left, so that it would be printed right after the query is run and say that no taxonconcepts were found
5323	10/09/2012 07:39 PM	Aaron Marcuse-Kubitza	tnrs_db: Updated comments and log messages for schema changes
5322	10/09/2012 07:33 PM	Aaron Marcuse-Kubitza	tnrs_db: Updated query for schema changes
5214	10/03/2012 01:11 PM	Aaron Marcuse-Kubitza	tnrs_db: Made wait option default to off to facilitate running tnrs_db by itself, rather than as part of an import
5213	10/03/2012 01:08 PM	Aaron Marcuse-Kubitza	tnrs_db: Added wait option to have tnrs_db exit as soon as no more names are available. This is useful for running tnrs_db when there is no concurrent import running, and therefore no need to wait for new data.
5212	10/03/2012 01:00 PM	Aaron Marcuse-Kubitza	tnrs_db: Fixed the time of the "Waited" message so that the total_pause (containing the next wait) would be incremented after the message was displayed. Split the "Waited" and "Waiting" messages into two separate messages.
5159	10/01/2012 10:44 PM	Aaron Marcuse-Kubitza	tnrs_db: Updated query for new three-level taxonpath hierarchy, where the concatenated name is now stored in identifyingtaxonomicname instead of taxonomicnamewithauthor
5156	10/01/2012 10:12 PM	Aaron Marcuse-Kubitza	tnrs_db: Adjusted pause, max_pause so the daemon waits longer before exiting, because after the initial TNRS run, most names have already been scrubbed and new names may not be added until the end of the import (in the case of a very large new datasource)
5153	10/01/2012 09:15 PM	Aaron Marcuse-Kubitza	tnrs_db: pause: Increased to 30 min because if no new names are available in TNRS.tnrs, there is no need to check every minute for new names (which clutters up the log file output). The pause feature is designed to allow tnrs_db to run in parallel with the import process, and process new names as they are made available, which only happens once for each partition of each datasource.
5152	10/01/2012 09:11 PM	Aaron Marcuse-Kubitza	tnrs_db: Fixed bug where the new filtering out of already-scrubbed names caused names to be skipped, because the loop would both advance by the number of rows found and those rows would no longer be returned by the query, causing only every other set of rows to be processed
5126	09/28/2012 02:17 PM	Aaron Marcuse-Kubitza	tnrs_db: Exclude taxonomic names which have already been scrubbed, by using a filter-out LEFT JOIN on TNRS.tnrs
5124	09/28/2012 01:27 PM	Aaron Marcuse-Kubitza	tnrs_db: tnrs_profiler: Use iter_text='name' for consistency with tnrs.tnrs_request()'s own profiler's iter_text
5123	09/28/2012 01:25 PM	Aaron Marcuse-Kubitza	tnrs_db: Print cumulative profiling information after every TNRS request, rather than just at the end
5121	09/28/2012 01:15 PM	Aaron Marcuse-Kubitza	TNRS-related programs: Use "names" instead of "taxons" for variable names because what's being submitted are actually verbatim taxonomic names, not official references to specific taxa
5101	09/28/2012 09:51 AM	Aaron Marcuse-Kubitza	tnrs_db: Moved lower max_taxons limit to tnrs.py because it's really required to avoid crashing the TNRS server and should apply to all callers
5100	09/28/2012 09:35 AM	Aaron Marcuse-Kubitza	tnrs_db: Print log message with # of taxonpaths being sent to TNRS
5099	09/28/2012 09:30 AM	Aaron Marcuse-Kubitza	tnrs_db: Fixed bug where InvalidResponse was missing module name
5098	09/28/2012 09:29 AM	Aaron Marcuse-Kubitza	tnrs_db: Profile the TNRS requests. This involves using a finally block to ensure that the profiling stats are printed even if the program exits with an error.
5097	09/28/2012 09:13 AM	Aaron Marcuse-Kubitza	tnrs_db: Reduced the chunk size to avoid slowing down the TNRS server
5095	09/28/2012 09:01 AM	Aaron Marcuse-Kubitza	tnrs_db: Added log messages for Making TNRS request and Storing TNRS response data so that if the TNRS daemon pauses, it's obvious which step it's waiting on
5093	09/28/2012 08:43 AM	Aaron Marcuse-Kubitza	tnrs_db: If tnrs.repeated_tnrs_request() still throws InvalidResponse, skip the current set in case its data caused the error. Note that it will still be tried again the next time tnrs_db is run.
5089	09/28/2012 08:17 AM	Aaron Marcuse-Kubitza	tnrs_client, tnrs_db: Use new tnrs.repeated_tnrs_request()
5079	09/27/2012 11:25 AM	Aaron Marcuse-Kubitza	Added tnrs_db to scrub the taxonpaths in VegBIEN using TNRS
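
The batching behavior described in r5126, r5152 and r5643 can be sketched as follows. This is a minimal illustration, not the actual tnrs_db code: it uses Python's built-in sqlite3 rather than the project's database layer, the table and column names (input_name, scrubbed, name_submitted) are hypothetical, and resolve() is a stand-in for the real TNRS request. It shows a filter-out LEFT JOIN that returns only names with no scrubbed row, DISTINCT to drop internal duplicates before they can violate the primary key, and a loop that always re-runs the same query instead of advancing an offset, since already-scrubbed rows drop out of the result on their own.

```python
import sqlite3


def fetch_unscrubbed(conn, limit):
    """Return up to `limit` distinct names that have no row in `scrubbed`."""
    rows = conn.execute(
        """
        SELECT DISTINCT i.name
        FROM input_name AS i
        LEFT JOIN scrubbed AS s ON s.name_submitted = i.name
        WHERE s.name_submitted IS NULL  -- keep only names without a match
        ORDER BY i.name
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return [row[0] for row in rows]


def scrub_all(conn, batch_size, resolve):
    """Scrub every pending name in batches; return the batches processed."""
    batches = []
    while True:
        # No offset is kept: scrubbed names are filtered out by the query,
        # so advancing an offset as well would skip every other batch.
        names = fetch_unscrubbed(conn, batch_size)
        if not names:
            break
        with conn:  # one transaction per batch
            conn.executemany(
                "INSERT INTO scrubbed (name_submitted, result) VALUES (?, ?)",
                [(name, resolve(name)) for name in names],
            )
        batches.append(names)
    return batches


def demo():
    """Build a small in-memory database and scrub it in batches of two."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE input_name (name TEXT)")
    conn.execute(
        "CREATE TABLE scrubbed (name_submitted TEXT PRIMARY KEY, result TEXT)"
    )
    conn.executemany(
        "INSERT INTO input_name VALUES (?)",
        [
            ("Quercus alba",),
            ("Acer rubrum",),
            ("Quercus alba",),  # internal duplicate, removed by DISTINCT
            ("Pinus strobus",),
            ("Abies alba",),
            ("Betula nigra",),
        ],
    )
    return scrub_all(conn, 2, str.upper)  # str.upper stands in for TNRS
```

Calling demo() processes the five distinct names in three batches. Because the query itself excludes finished names, rerunning scrub_all() after an interruption resumes with whatever is still pending.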
