moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (e.g. git-svn). IMPORTANT: do NOT do an `svn up`. instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
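(a minimal example, assuming the standard layout now sits at the repository root: from the top of your working copy, run `svn switch ^/trunk` — the ^/ syntax, available since svn 1.6, abbreviates the repository root URL)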
bin/tnrs_db: add entry to new batch table
bin/tnrs_db: documented total runtime (10 days)
bin/tnrs_db: documented current runtime (162 ms/name)
bin/tnrs_db: documented how to estimate total runtime. note that our tnrs_db wrapper in inputs/.TNRS/tnrs/tnrs.make uses inputs/.TNRS/tnrs/logs/tnrs.make.log.sql as the log file.
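(for example, the two figures above are consistent with each other: at 162 ms/name, 10 days ≈ 864,000 s of runtime corresponds to roughly 864,000 / 0.162 ≈ 5.3 million input names, so total runtime ≈ name count × per-name time)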
bin/tnrs_db: removed unused imports
bin/tnrs_db: cumulative_tnrs_profiler: use tnrs.tnrs_request()'s new cumulative_profiler param instead of doing the profiling manually. this also ensures that there isn't extra time between when the cumulative profiler starts/stops and when the per-request profiler starts/stops (because Profiler's new add_subprofiler() method is used).
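A toy sketch of the subprofiler idea (this Profiler class is modeled on the method names in the log message, not on the project's actual profiling module):

    import time

    class Profiler:
        """toy profiler: accumulates elapsed seconds and an item count"""
        def __init__(self): self.total = 0.0; self.count = 0
        def start(self): self.start_time = time.time()
        def stop(self, count=0):
            self.total += time.time() - self.start_time; self.count += count
        def add_subprofiler(self, sub):
            # fold the child's stats in directly, so the parent accumulates
            # exactly the time the child measured, with no gap on either side
            self.total += sub.total; self.count += sub.count

    cumulative_tnrs_profiler = Profiler()
    for names in [['Poa annua'], ['Quercus alba']]: # stand-in request batches
        tnrs_profiler = Profiler() # per-request profiler
        tnrs_profiler.start()
        time.sleep(0.01) # stand-in for the TNRS request itself
        tnrs_profiler.stop(count=len(names))
        cumulative_tnrs_profiler.add_subprofiler(tnrs_profiler)
    print(cumulative_tnrs_profiler.total, cumulative_tnrs_profiler.count)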
bin/tnrs_db: tnrs_profiler: renamed to cumulative_tnrs_profiler to distinguish it from the tnrs_profiler used by tnrs.tnrs_request(), which just profiles the current request
bugfix: bin/tnrs_db: cumulative profiler: use len(names) instead of this_ct (cur.rowcount) in case the actual # rows fetched differed from the rowcount
lib/tnrs.py: repeated_tnrs_request(): renamed to tnrs_request() since this is the function that should usually be used, to ensure that debugging information is output in the case of an error. (the TNRS request must be made again to output this information.)
bin/tnrs_db: removed no longer used $wait flag (which caused tnrs_db to wait max_pause for new rows to be added), because tnrs_db is now invoked automatically after each import by the import_scrub target (in inputs/input.Makefile) and does not need to run as a daemon. note that when scrub is invoked, it is possible that a previous datasource's import has already scrubbed the names for this import, because tnrs_db runs until all rows in tnrs_input_name are scrubbed....
bin/tnrs_db: removed no longer needed explicit population of Time_submitted, which is now done automatically by the tnrs table. however, this requires starting the transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request and checking that Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.
bin/tnrs_db: start transaction before submitting data, so Time_submitted is correctly set to the submission time rather than the insertion time. these may differ by several minutes if TNRS is slow. the setting of the correct time can be tested by inserting `time.sleep(n_sec)` after the TNRS request, removing the explicit setting of Time_submitted, and checking that Time_submitted is close to the time tnrs_db was run instead of n_sec seconds later.
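A minimal demonstration of why this works (assumes psycopg2 and a reachable DB; the key fact is that in PostgreSQL, now() returns the transaction start time, so a DEFAULT now() column records the submission time as long as the transaction is opened first):

    import time
    import psycopg2 # assumed driver; the real code uses the project's sql module

    db = psycopg2.connect('dbname=vegbien') # hypothetical DSN
    cur = db.cursor()
    cur.execute('SELECT now()') # psycopg2 implicitly BEGINs; now() is frozen here
    txn_start, = cur.fetchone()
    time.sleep(5) # stand-in for a slow TNRS request
    cur.execute('SELECT now()') # unchanged: still the transaction start time
    assert cur.fetchone()[0] == txn_start
    db.rollback()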
bugfix: bin/tnrs_db: wrap just the TNRS request and the storing of the response data in a function (undoing part of r9514), because the transaction start time for Time_submitted should not be until the TNRS request is actually made (it often takes several minutes to materialize the next set of input names on a full DB)
bin/tnrs_db: Iterate over unscrubbed verbatim taxonlabels: put loop body in a function (which returns whether or not the loop should continue), so that the loop body can easily be wrapped in a transaction using sql.with_savepoint()
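A sketch of the resulting loop shape (sql.with_savepoint()'s signature and the helper names here are assumptions for illustration):

    def loop_body():
        '''returns whether the loop should continue'''
        names = fetch_unscrubbed_names(db) # hypothetical helper
        if not names: return False # nothing left to scrub
        store_response(db, tnrs.tnrs_request(names)) # hypothetical helper
        return True

    # each batch runs inside its own savepoint, so a failure rolls back just
    # that batch instead of aborting the enclosing transaction
    while sql.with_savepoint(db, loop_body): pass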
bin/tnrs_db: removed no longer needed explicit appending of derived cols, and instead use append_csv()'s new support for importing CSVs whose columns are a subset of the full table
bin/tnrs_db: ColInsertFilters: use the simpler literal value option for the mk_value param
inputs/.TNRS/schema.sql: tnrs: Added Max_score column for use in filtering out names that will be rejected by taxondetermination's constraints
tnrs_db: Support multiple appended columns in the tnrs table
inputs/.TNRS/schema.sql: tnrs: Added Accepted_scientific_name field which will contain the joined-together accepted name that gets re-parsed by TNRS
tnrs_db: Fetching names to scrub: Omit sql.select() fields param because it will be filled in with its default value
tnrs_db: Making TNRS request: Fixed bug where the else block needed to be removed now that there is no longer an except block
tnrs_db: Removed tnrs.InvalidResponse exception handler that retries the query because the current query does not track which names have been submitted to but not processed by TNRS, so the error would continue to happen repeatedly
inputs/.TNRS/tnrs/: Added Time_submitted column at the beginning of the table and populated it in tnrs_db with the time the batch TNRS request was submitted
tnrs_db: Use new tnrs_input_name view to avoid hardcoding changing schema information
tnrs_db: Updated with schema changes
tnrs_db: Fixed bug where internal identifyingtaxonomicname duplicates needed to be removed as well as duplicates of existing Name_submitted values, to avoid violating the TNRS.tnrs pkey constraint when the scrubbed names are later inserted. Note that the taxonlabel_0_unique_identifying_name unique index is not sufficient to prevent internal duplicates, because it includes the creator_id (and thus allows multiple instances of the same name defined by different creators).
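A toy sketch of the in-batch half of the fix (the query already excludes existing Name_submitted values; this removes the remaining internal duplicates):

    names = ['Poa annua L.', 'Poa annua L.', 'Quercus alba L.']
    unique_names = list(dict.fromkeys(names)) # order-preserving dedup
    assert unique_names == ['Poa annua L.', 'Quercus alba L.']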
sql_io.py: append_csv(): Take a reader and header rather than a stream_info and stream to allow callers to use the simpler csvs.reader_and_header() function. This also allows callers to pass in a wrapped CSV reader for filtering, etc.
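A sketch of the new call shape (the parameter lists here are inferred from the log message, not sql_io's actual signature):

    import csv, io

    stream = io.StringIO('Name_submitted,Name_matched\nPoa annua,Poa annua\n')
    reader = csv.reader(stream)
    header = next(reader) # ~ what csvs.reader_and_header() returns
    reader = (row for row in reader if row[0]) # callers may wrap the reader
    sql_io.append_csv(db, table, reader, header) # hypothetical call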
csv2db, tnrs_db: Removed ProgressInputStream wrapper around input stream, which is no longer needed (and causes overlapping output) now that sql_io.append_csv() prints # rows read
schemas/vegbien.sql: taxonconcept: Renamed canon_concept_id to matched_concept_id, because this is actually the closest-match taxonconcept in the match hierarchy (datasource concept -> parsed concept -> matched concept -> accepted concept) rather than the accepted synonym, which goes in accepted_concept_id
schemas/vegbien.sql: taxonconcept: Renamed canon_taxonconcept_id to canon_concept_id to shorten the name, which is used often
tnrs_db: Moved "Processing # taxonconcepts" log message to before waiting or exiting if no taxonconcepts left, so that it would be printed right after the query is run and say that no taxonconcepts were found
tnrs_db: Updated comments and log messages for schema changes
tnrs_db: Updated query for schema changes
tnrs_db: Made wait option default to off to facilitate running tnrs_db by itself, rather than as part of an import
tnrs_db: Added wait option; when disabled, tnrs_db exits as soon as no more names are available instead of waiting. This is useful for running tnrs_db when there is no concurrent import running, and therefore no need to wait for new data.
tnrs_db: Fixed the timing of the "Waited" message so that the total_pause (containing the next wait) would be incremented after the message was displayed. Split the "Waited" and "Waiting" messages into two separate messages.
tnrs_db: Updated query for new three-level taxonpath hierarchy, where the concatenated name is now stored in identifyingtaxonomicname instead of taxonomicnamewithauthor
tnrs_db: Adjusted pause, max_pause so the daemon waits longer before exiting, because after the initial TNRS run, most names have already been scrubbed and new names may not be added until the end of the import (in the case of a very large new datasource)
tnrs_db: pause: Increased to 30 min because if no new names are available in TNRS.tnrs, there is no need to check every minute for new names (which clutters up the log file output). The pause feature is designed to allow tnrs_db to run in parallel with the import process, and process new names as they are made available, which only happens once for each partition of each datasource.
tnrs_db: Fixed bug where the new filtering out of already-scrubbed names caused names to be skipped, because the loop would both advance its offset by the number of rows found and those rows would no longer be returned by the query, causing only every other set of rows to be processed
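A runnable toy reproduction of the bug (all names invented): once the query itself filters out scrubbed rows, also advancing an offset skips every other chunk.

    rows, scrubbed = list(range(10)), set()
    def query(limit, offset):
        unscrubbed = [r for r in rows if r not in scrubbed]
        return unscrubbed[offset:offset + limit]

    offset = 0
    while True:
        chunk = query(limit=2, offset=offset)
        if not chunk: break
        scrubbed.update(chunk) # these rows vanish from later query results
        offset += len(chunk) # BUG: combined with the filter, skips chunks
        # fix: leave offset at 0; the filter alone advances the result set
    print(sorted(scrubbed)) # [0, 1, 4, 5, 8, 9] -- every other chunk missed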
tnrs_db: Exclude taxonomic names which have already been scrubbed, by using a filter-out LEFT JOIN on TNRS.tnrs
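The pattern, sketched with illustrative identifiers (the real query's tables changed across the schema revisions described above):

    # keep only input names with no corresponding row in TNRS.tnrs yet
    query = '''
    SELECT src.name
    FROM input_names src
    LEFT JOIN "TNRS".tnrs scrubbed ON scrubbed."Name_submitted" = src.name
    WHERE scrubbed."Name_submitted" IS NULL -- unmatched => not yet scrubbed
    '''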
tnrs_db: tnrs_profiler: Use iter_text='name' for consistency with tnrs.tnrs_request()'s own profiler's iter_text
tnrs_db: Print cumulative profiling information after every TNRS request, rather than just at the end
TNRS-related programs: Use "names" instead of "taxons" for variable names because what's being submitted are actually verbatim taxonomic names, not official references to specific taxa
tnrs_db: Moved lower max_taxons limit to tnrs.py because it's really required to avoid crashing the TNRS server and should apply to all callers
tnrs_db: Print log message with # of taxonpaths being sent to TNRS
tnrs_db: Fixed bug where InvalidResponse was missing its module name (tnrs.InvalidResponse)
tnrs_db: Profile the TNRS requests. This involves using a finally block to ensure that the profiling stats are printed even if the program exits with an error.
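A sketch of the finally-block pattern (the profiling details are invented for illustration; the real code uses the project's profiling module):

    import time

    start, n_names = time.time(), 0
    try:
        for names in batches: # hypothetical iterable of name chunks
            submit_to_tnrs(names) # hypothetical; may raise
            n_names += len(names)
    finally:
        # runs even if submit_to_tnrs() raises, so the stats always print
        elapsed = time.time() - start
        print('%d names in %.1f s' % (n_names, elapsed))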
tnrs_db: Reduced the chunk size to avoid slowing down the TNRS server
tnrs_db: Added log messages for Making TNRS request and Storing TNRS response data so that if the TNRS daemon pauses, it's obvious which step it's waiting on
tnrs_db: If tnrs.repeated_tnrs_request() still throws InvalidResponse, skip the current set in case its data caused the error. Note that it will still be tried again the next time tnrs_db is run.
tnrs_client, tnrs_db: Use new tnrs.repeated_tnrs_request()
Added tnrs_db to scrub the taxonpaths in VegBIEN using TNRS