2014-08-14 conference calls¶
Martha's notes¶
In brief, here's the status of things based on this morning's call with Aaron and the afternoon discussion of the TNRS problem.
Taxon names (in the analytical tables and the TNRS table)
- Although all BIEN species names have been run through TNRS (with TPL), the names in the analytical tables still have problems because the workflow does not yet include a workaround/fix for of the bug in TNRS nor does it yet use the TPL to trump Tropicos synonyms.
- We do have a less onerous plan for addressing this. Aaron will select the problem names (names for which the matched name with the best score isn't selected as the best match) from the TNRS database and to rescrub only those names (here's what's different) and to parse the JSON output file instead of the loading the CSV output file into a db and then trying to implement Jerry Lu's complex algorithm to sort through the output. Both the best match name and the correct synonym can easily be parsed from the JSON file. (additional details will follow in a separate email).
- This is now Aaron's top priority to be trumped only by the performance work.
Dates (in the analytical tables and VegBIEN)
- On Monday, I said the dates problems had been fixed. That proved incorrect.
- In the analytical tables, the SALVIAS and HVAA datasets still have some records in which dates have problems.
- Fixing the dates is on Aaron's to do list. SALVIAS is a higher priority than HVAA, which is a small dataset from Chile.
Performance
- Next Monday or Tuesday, Nick will migrate Aaron's BIEN work environment to faster machines.
- This is the absolute highest priority and will trump the Taxon name work early next week.
upcoming¶
- the next conference call is next week at the usual time (Th. 9am PT/9am Tucson/12pm ET)
availability¶
- please add your availability for summer 2014 to the *spreadsheet*:
Loading Google Spreadsheet...
info¶
TNRS match-picking bug¶
- after discussion with Nicole, we discovered that there is actually a much faster workaround: you can just use the JSON which is output by the web app when clicking the "+n more" link next to a submitted name. unlike the GWT returned by the initial results, the "+n more" results are parseable JSON, with the same column names as in the CSV (but crucially, the correct selected match).
- this is a much better alternative to implementing Jerry Lu's match-picking algorithm from the README or attempting to translate the entire match-picking PHP code to run on the TNRS output
- note that the actual bug will remain unfixed, because Naim was not able to find where the problem is caused
TPL synonyms bug¶
- this is fixed automatically by Jerry Lu's match-picking algorithm, which makes it unnecessary to code Brad's TPL synonym formula manually. this has been confirmed by Nicole.
- most likely, Brad only reported the TPL synonyms bug because this was being done incorrectly due to the match-picking bug. (if the selected match is wrong, the selected synonym will of course be wrong as well.)
stuck progress bar bug¶
- sometimes, when scrubbing names, the web app gets stuck on the parsing progress bar for several minutes and then returns "0"
- Andy Edmonds' explanation for why this happens:
I believe its a miss-behaving cron job. The cron cleans up old TNRS jobs, and I think its cleaning up a little more than necessary.
decisions¶
TNRS match-picking bug¶
- working around the match-picking bug is in fact critical, because it causes some names to use the wrong match (Martha)
- rather than re-scrubbing all the names, we will locate and re-scrub just the names that are affected by the bug
TPL synonyms bug¶
- we don't need to do any manual coding for Brad's TPL synonym formula because this is handled automatically by Jerry Lu's match-picking algorithm (after a small fix to the dev TNRS)
missing dates¶
- a lower priority than the TNRS match-picking bug, but work on this if blocked on that (Martha)
to do for Martha¶
get Nicole to provide the workaround for the TNRS match-picking bugafter discussion, we actually discovered a much better workaround than the original one, when it became clear that it's possible to get JSON results out of the web app
to do for Aaron¶
TNRS match-picking bug¶
write query to find which names are affected by it
this would normally require implementing the match-picking algorithm itself in order to find out which selected matches are wrong. however, this is precisely what we want to avoid because it is time-consuming to implement. therefore, in the interest of time we will instead implement a simplified version of the match-picking algorithm, which may also flag some correct matches as incorrect (ie. have false positives).
300,000 of 3 million1 names (10%) affected:
names list (13 MB):SELECT COUNT(*) FROM ( SELECT DISTINCT ON ("*Name_submitted") * FROM "TNRS".taxon_match ORDER BY "*Name_submitted", "*Name_score" DESC, "*Overall_score" DESC ) s WHERE "*Selected" = 'false' ; -- runtime: 30 min ("1215021 ms") @r14492 @vegbiendev -- 297255
$ ssh aaronmk@vegbiendev.nceas.ucsb.edu <<'eof_ssh' psql --quiet -U bien -d vegbien <<'eof_psql' COPY ( SELECT "*Name_submitted" FROM ( SELECT DISTINCT ON ("*Name_submitted") * FROM "TNRS".taxon_match ORDER BY "*Name_submitted", "*Name_score" DESC, "*Overall_score" DESC ) s WHERE "*Selected" = 'false' -- runtime: 11 min ("18:16:08" - "18:04:53") ) TO STDOUT CSV /*HEADER*/; eof_psql eof_ssh
- implement JSON workaround
- copy
TNRS.taxon_match_input
toTNRS.taxon_match_input_json
, but use the column names used by the JSON output format- the JSON columns can be retrieved by running:
bin/tnrs_client ''
- the JSON columns can be retrieved by running:
- copy
TNRS.taxon_match_input__copy_to
toTNRS.taxon_match_input_json__copy_to
, but use the column names fromTNRS.taxon_match_input_json
- change
bin/tnrs_db
to useuse_tnrs_export
=False
andtnrs_data
=taxon_match_input_json__copy_to
- copy
- remove the affected names:
DELETE FROM "TNRS".taxon_match WHERE "*Name_submitted" IN ( SELECT "*Name_submitted" FROM ( SELECT DISTINCT ON ("*Name_submitted") * FROM "TNRS".taxon_match ORDER BY "*Name_submitted", "*Name_score" DESC, "*Overall_score" DESC ) s WHERE "*Selected" = 'false' );
- re-scrub the affected names
1 2983244
SELECT COUNT(DISTINCT "*Name_submitted") FROM "TNRS".taxon_match
TPL synonyms bug¶
- run test cases to verify that this is automatically fixed by fixing the TNRS match-picking bug
missing dates¶
implement HVAA fix- troubleshoot SALVIAS bug