lib/tnrs.py: max_names: raised back up to 500 now that a workaround for the Internal Server Errors is in place (https://github.com/iPlantCollaborativeOpenSource/TNRS/issues/7)
fix: lib/tnrs.py: max_names: lowered to 50 because the dev TNRS server is now always crashing with an Internal Server Error when scrubbing 500 names at a time (https://github.com/iPlantCollaborativeOpenSource/TNRS/issues/7)
fix: lib/tnrs.py: Constrain by Source: turned it on so that the download settings reflect what TNRS actually used, for as long as this setting remains broken in the web app
fix: lib/tnrs.py: max_names: reduced back to 500 because even 5000 crashes the dev TNRS server
lib/tnrs.py: max_names: reduced to 5000 because 100,000 causes an internal server error
lib/tnrs.py: switched to downloading all matches per name, as is needed to implement #917. note that this will break the parts of the schema that use the tnrs table, until Brad's match-picking algorithm can be implemented, but this tradeoff is necessary to be able to begin scrubbing sooner (Martha; wiki.vegpath.org/2014-05-29_conference_call#TNRS)
lib/tnrs.py: max_names: increased to 100000 because the dev server can handle more names (no simultaneous users), as decided in the conference call (wiki.vegpath.org/2014-05-29_conference_call#TNRS)
lib/tnrs.py: commented out the value of max_names that is not active, for clarity
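Taken together, the max_names changes above might leave lib/tnrs.py looking roughly like this (a hypothetical sketch; the exact comments and which values are kept commented out are assumptions):

```python
# Maximum number of names to scrub per TNRS request. The dev server (no
# simultaneous users) can handle more than the live server, but large batches
# have repeatedly triggered Internal Server Errors
# (https://github.com/iPlantCollaborativeOpenSource/TNRS/issues/7).
max_names = 500
#max_names = 5000 # crashes the dev TNRS server
#max_names = 100000 # causes an Internal Server Error
```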
lib/tnrs.py: sources: updated to match the list/sort order in issue #917
lib/tnrs.py: use the TNRS dev server (with private URL in tnrs.url) instead of the live server, since it contains data sources that we need
lib/tnrs.py: configure the server separately from the base URL
lib/tnrs.py: retrieval_request_template: taxonomic_constraint, source_sorting: documented their meaning and why they need to be on/off
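Per this entry and the related entries further down, the two settings might be documented like the following sketch (the real template is a GWT request string; this dict form and the field names are illustrative assumptions):

```python
# Hypothetical sketch of the two documented retrieval settings:
retrieval_request_settings = dict(
    # ON: match the family before the genus, so that a name is not matched
    # to a homonym in the wrong family
    taxonomic_constraint='true',
    # OFF: when on, any match from the first source is always returned
    # before results from other sources, regardless of match score
    # (Brad confirms that this should be off)
    source_sorting='false',
)
```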
moved everything into /trunk/ to create the standard svn layout, for use with tools that require this (e.g. git-svn). IMPORTANT: do NOT do an `svn up`. Instead, re-use your working copy's existing files with `svn switch` (http://svnbook.red-bean.com/en/1.6/svn.ref.svn.c.switch.html).
lib/tnrs.py: HTTP requests: rewrapped lines
lib/tnrs.py: updated HTTP requests to match current web app
bugfix: lib/tnrs.py: download_request_template: changed dirty to true (to match the current web app), which is apparently needed to apply the source_sorting setting to the downloaded TSV in addition to the GUI results
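The dirty flag change might look like this sketch (the field layout is an assumption):

```python
# Hypothetical sketch: dirty=true makes the web app regenerate the
# downloaded TSV with the current settings (including source_sorting),
# instead of reusing the results that were generated for the GUI
download_request_settings = dict(
    dirty='true', # was 'false'; needed to apply source_sorting to the TSV
)
```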
lib/tnrs.py: retrieval_request_template: turned source_sorting back off, because it causes any match from the first source to always be used, even if it has a lower match score than the match from the other source. (Brad confirms that this should be off.) I think we had this on originally to ensure that only Tropicos results were used when available, rather than USDA results even when they were a better match. * note that due to a bug in the web app, this change will not actually be effective, because the source_sorting option is only applied to the GUI results, not to the downloaded TSV. *
lib/tnrs.py: submission_request_template: include GCC in addition to Tropicos, because it provides more synonyms than Tropicos for Asteraceae, and the accepted names still match the Tropicos backbone (https://projects.nceas.ucsb.edu/nceas/projects/bien/wiki/2013-06-13_conference_call#include-GCC-when-running-TNRS)
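The name-source setting evolved as described in this entry and the Tropicos-only entry further down; a sketch (the parameter name and comma-separated format are assumptions):

```python
# Hypothetical sketch of the name-source setting over time:
#sources = 'tropicos' # Tropicos only: GCC covers only Asteraceae, and
#                     # USDA's taxonomy is of lower quality
sources = 'tropicos,gcc' # GCC adds Asteraceae synonyms, and its accepted
                         # names still match the Tropicos backbone
```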
lib/tnrs.py: single_tnrs_request(): added support for a cumulative profiler using the cumulative_profiler kw param
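A minimal sketch of how the cumulative_profiler keyword might be threaded through; the Profiler class and do_tnrs_http_request() are assumptions standing in for the project's actual profiling utility and HTTP workflow:

```python
import time

class Profiler:
    '''minimal stand-in for the project's profiling utility (an assumption)'''
    def __init__(self):
        self.sec = 0.0
        self.iter_ct = 0
    def start(self):
        self._start = time.time()
    def stop(self, iter_ct=1):
        self.sec += time.time() - self._start
        self.iter_ct += iter_ct

def do_tnrs_http_request(names):
    '''placeholder for the actual submit/poll/retrieve HTTP workflow'''
    return names

def single_tnrs_request(names, cumulative_profiler=None):
    # reuse the caller's profiler if given, so that timings accumulate
    # across many batched requests
    profiler = cumulative_profiler or Profiler()
    profiler.start()
    try:
        return do_tnrs_http_request(names)
    finally:
        profiler.stop(iter_ct=len(names))
```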
lib/tnrs.py: repeated_tnrs_request(): renamed to tnrs_request() since this is the function that should usually be used, to ensure that debugging information is output in the case of an error. (the TNRS request must be made again to output this information.)
lib/tnrs.py: tnrs_request(): renamed to single_tnrs_request() to distinguish it from repeated_tnrs_request()
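After the renames, the retry wrapper might look roughly like this sketch (the single_tnrs_request() stub and exact exception handling are assumptions; InvalidResponse is the custom exception added in the entries further down, and the urllib.error spelling is the modern one, where the original module would have used urllib2):

```python
import sys
from urllib.error import HTTPError

class InvalidResponse(Exception): pass # see the entries further down

def single_tnrs_request(names, debug=False, **kw_args):
    '''placeholder for the single-attempt request'''
    return names

def tnrs_request(names, debug=False, **kw_args):
    '''retries once with debug on, so that protocol information is output
    when a request fails; this is the function callers should usually use'''
    try:
        return single_tnrs_request(names, debug=debug, **kw_args)
    except (HTTPError, InvalidResponse) as e:
        print(str(e), file=sys.stderr) # print the suppressed exception
        # retry just once, to avoid cluttering the log with the verbose
        # debug info of multiple failed requests
        return single_tnrs_request(names, debug=True, **kw_args)
```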
tnrs.py: retrieval_request_template: Turn on taxonomic_constraint (to match family before genus) and source_sorting (to always return any result from the first source before returning results from any other sources, regardless of match %)
tnrs.py: submission_request_template: Use just Tropicos as the name source, as Brad says "GCC is for only one family (Asteraceae)" and USDA's "taxonomy is of lower quality and sometimes conflicts with Tropicos"
tnrs.py: encode_map: Added hidden minus sign, which TNRS removes
tnrs.py: encode_map: Added × (times), which TNRS replaces with x
tnrs.py: encode_map: Added " and ', which TNRS removes when at the beginning or end
tnrs.py: encode_map: Documented why each character needs to be encoded
tnrs.py: encode_map: Removed '&', which is actually not a special character for TNRS (although ';' is)
tnrs.py: encode_map: Added '_', which TNRS replaces with space
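The characters called out in the entries above suggest an encode_map along these lines (the placeholder bytes and the exact codepoint of the "hidden minus sign" are assumptions):

```python
# Hypothetical sketch: map each TNRS-special character to an otherwise
# unused control character, so the original name survives the round trip.
# Comments document why each character needs to be encoded:
encode_map = [
    ('\u00ad', '\x01'), # hidden (soft) minus sign: TNRS removes it
    ('\u00d7', '\x02'), # x (times): TNRS replaces it with "x"
    ('"',      '\x03'), # TNRS removes it at the beginning or end of a name
    ("'",      '\x04'), # TNRS removes it at the beginning or end of a name
    (';',      '\x05'), # special character for TNRS
    ('_',      '\x06'), # TNRS replaces it with a space
    # note: '&' is NOT special for TNRS, so it is not encoded
]
```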
tnrs.py: repeated_tnrs_request(): Also retry the request in debug mode if an HTTPError is thrown, so that debugging info can also be obtained if there is a bug in the TNRS client
tnrs.py: encode(): Also prepend special padding string to empty and whitespace-only strings because these names are otherwise ignored by TNRS (no response row)
tnrs.py: tnrs_request(): Rewrapped lines (became >80 chars after adding profiling)
tnrs.py: tnrs_request(): Use new encode() and TnrsOutputStream to escape TNRS-invalid characters
tnrs.py: Added encode(), decode(), decode_for_tsv(), and TnrsOutputStream to handle escaping TNRS-invalid characters
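A minimal sketch of what encode()/decode() might do, including the padding for empty and whitespace-only names mentioned above; the padding string, the abridged encode_map, and the stream details are assumptions (the real TnrsOutputStream decodes the downloaded TSV on the fly):

```python
encode_map = [('\u00d7', '\x02'), ('_', '\x06')] # abridged; see the fuller sketch above
padding = ' !pad ' # hypothetical; must pass through TNRS unchanged

def encode(name):
    # pad names that TNRS would otherwise ignore (empty and whitespace-only
    # names get no response row), then escape the TNRS-special characters
    if name.strip() == '':
        name = padding + name
    for char, escape in encode_map:
        name = name.replace(char, escape)
    return name

def decode(name):
    '''reverse of encode(); TnrsOutputStream would apply the TSV variant of
    this to the downloaded results on the fly (not shown)'''
    for char, escape in encode_map:
        name = name.replace(escape, char)
    if name.startswith(padding):
        name = name[len(padding):]
    return name
```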
tnrs.py: gwt_encode(): Escape special characters in the string instead of removing them, so that TNRS receives the original name rather than a modified version. This will help make the submitted names match up with the returned Name_submitted.
tnrs.py: tnrs_request(): Added comment that names containing only whitespace characters are ignored by TNRS and do not receive a response row. Our tnrs_db and reimport pipeline handle the necessary re-matching by just creating taxonpaths for each Name_submitted, and then letting the data import process on the following import attach to the prepopulated taxonpaths.
tnrs.py: max_pause: Changed to 30 min because TNRS sometimes freezes for ~10 min. The freezing usually happens while the data is being uploaded rather than when it's being retrieved, so that the max_pause would not apply, but to be on the safe side, requests should not time out unnecessarily.
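The role of max_pause might look like this polling sketch (wait_for_results() and everything other than the max_pause value are assumptions):

```python
import time

max_pause = 30*60 # seconds; TNRS sometimes freezes for ~10 min (usually
                  # during upload), so don't time out unnecessarily

def wait_for_results(check_done, pause=1.0):
    '''hypothetical polling loop: wait until check_done() returns True,
    giving up after max_pause seconds'''
    waited = 0.0
    while not check_done():
        if waited >= max_pause:
            raise TimeoutError('TNRS request timed out')
        time.sleep(pause)
        waited += pause
        pause = min(pause*2, 60) # exponential backoff, capped at 1 min
```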
TNRS-related programs: Use "names" instead of "taxons" for variable names because what's being submitted are actually verbatim taxonomic names, not official references to specific taxa
tnrs.py: tnrs_request(): Profile the TNRS request
tnrs.py: tnrs_request(): Fixed bug where initial_headers needed to be copied instead of just assigned to headers, because initial_headers is a global constant and should not be changed when the Cookie header is added
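This is the classic shared-mutable-constant bug; a sketch (make_headers() and the Content-Type value are assumptions):

```python
initial_headers = { # module-level constant; must not be mutated
    'Content-Type': 'text/x-gwt-rpc; charset=utf-8', # assumed GWT-RPC type
}

def make_headers(cookie=None):
    # BUG (before the fix): `headers = initial_headers` aliased the global
    # constant, so adding the Cookie header modified it for all requests
    headers = initial_headers.copy() # the fix: copy instead of assign
    if cookie is not None:
        headers['Cookie'] = cookie
    return headers
```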
tnrs.py: repeated_tnrs_request(): Just retry the request once with debug turned on, to avoid cluttering the log output with the verbose debug info of multiple failed requests if the error is not resolved on retry
tnrs.py: tnrs_request(): repeated_tnrs_request(): Print all suppressed exceptions to stderr
tnrs.py: tnrs_request(): parse_response(): Include both the response headers and the response body in the InvalidResponse message
tnrs_db: Moved the lower max_taxons limit to tnrs.py because it is required to avoid crashing the TNRS server and should therefore apply to all callers
tnrs.py: repeated_tnrs_request(): When retrying after an invalid response, output protocol info for debugging
tnrs.py: Added repeated_tnrs_request() to retry a TNRS request which returned an invalid response
tnrs.py: parse_response(): Raise custom InvalidResponse exception instead of SystemExit, so callers can catch the exception and respond to it
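A sketch of the custom exception and of how parse_response() might raise it with the full protocol context (the validation check and message formatting are assumptions; '//OK' is the usual GWT-RPC success prefix, but the real logic may differ):

```python
class InvalidResponse(Exception):
    '''raised instead of SystemExit, so that callers such as
    repeated_tnrs_request() can catch it and retry'''

def parse_response(name, headers, response_body, expected_prefix='//OK'):
    if not response_body.startswith(expected_prefix):
        # include both the response headers and the body, so that the
        # protocol exchange can be debugged from the message alone
        raise InvalidResponse('Invalid response for %r:\n%s\n\n%s'
            % (name, headers, response_body))
    return response_body[len(expected_prefix):]
```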
tnrs.py: tnrs_request(): Return the CSV stream directly instead of reading it into a string
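Returning the stream lets callers consume the TSV incrementally instead of buffering the whole download; a usage sketch (the scrub() caller and the stub are assumptions):

```python
import csv, io

def tnrs_request(names):
    '''stub standing in for the real request, which now returns the CSV
    stream (a file-like object) directly instead of reading it into a str'''
    return io.StringIO('Name_submitted\tName_matched\n')

def scrub(names):
    # hypothetical caller: iterate over result rows as they are read
    for row in csv.reader(tnrs_request(names), delimiter='\t'):
        yield row
```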
tnrs.py: tnrs_request(): Moved CSV-download-specific functionality from do_request() to the Download section
tnrs.py: tnrs_request(): Return the response instead of printing it to stdout
Added tnrs.py