Project

General

Profile

2014-08-14 conference calls

Martha's notes

In brief, here's the status of things based on this morning's call with Aaron and the afternoon discussion of the TNRS problem.

Taxon names (in the analytical tables and the TNRS table)

  • Although all BIEN species names have been run through TNRS (with TPL), the names in the analytical tables still have problems because the workflow does not yet include a workaround/fix for of the bug in TNRS nor does it yet use the TPL to trump Tropicos synonyms.
  • We do have a less onerous plan for addressing this. Aaron will select the problem names (names for which the matched name with the best score isn't selected as the best match) from the TNRS database and to rescrub only those names (here's what's different) and to parse the JSON output file instead of the loading the CSV output file into a db and then trying to implement Jerry Lu's complex algorithm to sort through the output. Both the best match name and the correct synonym can easily be parsed from the JSON file. (additional details will follow in a separate email).
  • This is now Aaron's top priority to be trumped only by the performance work.

Dates (in the analytical tables and VegBIEN)

  • On Monday, I said the dates problems had been fixed. That proved incorrect.
  • In the analytical tables, the SALVIAS and HVAA datasets still have some records in which dates have problems.
  • Fixing the dates is on Aaron's to do list. SALVIAS is a higher priority than HVAA, which is a small dataset from Chile.

Performance

  • Next Monday or Tuesday, Nick will migrate Aaron's BIEN work environment to faster machines.
  • This is the absolute highest priority and will trump the Taxon name work early next week.

upcoming

  • the next conference call is next week at the usual time (Th. 9am PT/9am Tucson/12pm ET)

availability

  • please add your availability for summer 2014 to the *spreadsheet*:

Loading Google Spreadsheet...

info

TNRS match-picking bug

  • after discussion with Nicole, we discovered that there is actually a much faster workaround: you can just use the JSON which is output by the web app when clicking the "+n more" link next to a submitted name. unlike the GWT returned by the initial results, the "+n more" results are parseable JSON, with the same column names as in the CSV (but crucially, the correct selected match).
  • note that the actual bug will remain unfixed, because Naim was not able to find where the problem is caused

TPL synonyms bug

  • this is fixed automatically by Jerry Lu's match-picking algorithm, which makes it unnecessary to code Brad's TPL synonym formula manually. this has been confirmed by Nicole.
  • most likely, Brad only reported the TPL synonyms bug because this was being done incorrectly due to the match-picking bug. (if the selected match is wrong, the selected synonym will of course be wrong as well.)

stuck progress bar bug

  • sometimes, when scrubbing names, the web app gets stuck on the parsing progress bar for several minutes and then returns "0"
  • Andy Edmonds' explanation for why this happens:

I believe its a miss-behaving cron job. The cron cleans up old TNRS jobs, and I think its cleaning up a little more than necessary.

decisions

TNRS match-picking bug

  • working around the match-picking bug is in fact critical, because it causes some names to use the wrong match (Martha)
  • rather than re-scrubbing all the names, we will locate and re-scrub just the names that are affected by the bug

TPL synonyms bug

missing dates

  • a lower priority than the TNRS match-picking bug, but work on this if blocked on that (Martha)

to do for Martha

  • get Nicole to provide the workaround for the TNRS match-picking bug after discussion, we actually discovered a much better workaround than the original one, when it became clear that it's possible to get JSON results out of the web app

to do for Aaron

TNRS match-picking bug

  1. write query to find which names are affected by it
    this would normally require implementing the match-picking algorithm itself in order to find out which selected matches are wrong. however, this is precisely what we want to avoid because it is time-consuming to implement. therefore, in the interest of time we will instead implement a simplified version of the match-picking algorithm, which may also flag some correct matches as incorrect (ie. have false positives).
    300,000 of 3 million1 names (10%) affected:
    SELECT COUNT(*) FROM
    (
        SELECT
        DISTINCT ON ("*Name_submitted")
        *
        FROM "TNRS".taxon_match
        ORDER BY "*Name_submitted", "*Name_score" DESC, "*Overall_score" DESC
    ) s
    WHERE "*Selected" = 'false'
    ; -- runtime: 30 min ("1215021 ms") @r14492 @vegbiendev
    -- 297255
    
    names list (13 MB):
    $ ssh aaronmk@vegbiendev.nceas.ucsb.edu <<'eof_ssh'
    psql --quiet -U bien -d vegbien <<'eof_psql'
    COPY (
    SELECT "*Name_submitted" FROM
    (
        SELECT
        DISTINCT ON ("*Name_submitted")
        *
        FROM "TNRS".taxon_match
        ORDER BY "*Name_submitted", "*Name_score" DESC, "*Overall_score" DESC
    ) s
    WHERE "*Selected" = 'false'
    -- runtime: 11 min ("18:16:08" - "18:04:53")
    ) TO STDOUT CSV /*HEADER*/;
    eof_psql
    eof_ssh
    
  2. implement JSON workaround
    1. copy TNRS.taxon_match_input to TNRS.taxon_match_input_json, but use the column names used by the JSON output format
      • the JSON columns can be retrieved by running:
        bin/tnrs_client ''
        
    2. copy TNRS.taxon_match_input__copy_to to TNRS.taxon_match_input_json__copy_to, but use the column names from TNRS.taxon_match_input_json
    3. change bin/tnrs_db to use use_tnrs_export = False and tnrs_data = taxon_match_input_json__copy_to
  3. remove the affected names:
    DELETE FROM "TNRS".taxon_match WHERE "*Name_submitted" IN (
        SELECT "*Name_submitted" 
        FROM
        (
            SELECT
            DISTINCT ON ("*Name_submitted")
            *
            FROM "TNRS".taxon_match
            ORDER BY "*Name_submitted", "*Name_score" DESC, "*Overall_score" DESC
        ) s
        WHERE "*Selected" = 'false'
    );
    
  4. re-scrub the affected names

1 2983244

SELECT COUNT(DISTINCT "*Name_submitted") FROM "TNRS".taxon_match

TPL synonyms bug

missing dates

  1. implement HVAA fix
  2. troubleshoot SALVIAS bug