Project

General

Profile

Actions

Bug #950

closed

fix view_full_occurrence_individual_view rows with is_geovalid NULL

Added by Aaron Marcuse-Kubitza about 10 years ago. Updated about 10 years ago.

Status:
Resolved
Priority:
High
Start date:
09/05/2014
Due date:
% Done:

100%

Estimated time:

Description

test case

eg. this happens for CVS rows:

datasource country state_province county latitude longitude is_geovalid
CVS United States TENNESSEE Sevier 35.654350967 -83.444906936 <NULL>
SET enable_seqscan = off;
SET enable_mergejoin = off;
SELECT * FROM view_full_occurrence_individual_view WHERE datasource = 'CVS' LIMIT 1;

and FIA rows:

datasource country state_province county latitude longitude is_geovalid
FIA United States Alabama Covington 31.39 -86.36 <NULL>
FIA United States Alabama Escambia 31.17 -86.72 <NULL>
SET enable_seqscan = off;
SET enable_mergejoin = off;
SELECT
DISTINCT ON (country, state_province, county)
*
FROM (SELECT * FROM view_full_occurrence_individual_view WHERE datasource = 'FIA' LIMIT 1000) s
;

info

since is_geovalid is NULL, and state_province is falling back to the unscrubbed value, this indicates it is unable to find a geoscrub.geoscrub_output row to join to

however, there is a matching row in the geoscrub DB's result table:

decimallatitude decimallongitude country stateprovince county countrystd stateprovincestd countystd latlonvalidity countryvalidity stateprovincevalidity countyvalidity
35.654350967 -83.444906936 United States TENNESSEE Sevier United States Tennessee Sevier 1 3 3 3
SELECT * FROM geoscrub WHERE
    country = 'United States'
AND    stateprovince = 'TENNESSEE'
AND    county = 'Sevier'
AND    decimallatitude = 35.654350967
AND    decimallongitude = -83.444906936
LIMIT 1
;

this suggests that the problem is in the transfer from the geoscrub DB to vegbien

grep -E '^35\.654350967,-83\.444906936,United States,TENNESSEE,Sevier.*$' inputs/.geoscrub/geoscrub_output/geoscrub.csv # returns no rows

because this row is not present in the extract, the problem is in the export from the geoscrub DB

then determine if decimal truncation is causing the problem:

grep -E '^35\.65.*,-83\.44.*,United States,TENNESSEE,Sevier.*$' inputs/.geoscrub/geoscrub_output/geoscrub.csv # returns no rows

but this doesn't appear to be the case

trying just the first coordinate:

grep -E '^35\.65.*$' inputs/.geoscrub/geoscrub_output/geoscrub.csv # returns many rows
grep -E ^35.654350967 inputs/.geoscrub/geoscrub_output/geoscrub.csv # returns 1 row:
---
35.654350967,-83.444906936,United States,Tennessee,Sevier,United States,Tennessee,Sevier,1,3,3,3
---

the returned row has the wrong value for stateprovince (the scrubbed value instead of the input value), so the problem is in the export of the placenames, not the coordinates

regenerating the extract:

ssh -t vegbiendev.nceas.ucsb.edu exec sudo -u aaronmk -i
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_
grep -E ^35.654350967 inputs/.geoscrub/geoscrub_output/geoscrub.csv # returns 1 row:
---
35.654350967,-83.444906936,United States,TENNESSEE,Sevier,United States,Tennessee,Sevier,1,3,3,3
---

the returned row is now correct, so there was likely some problem in the exporting of the geoscrub DB results

the next step will be to figure out if the other unmatched rows are also fixed by the reload


this appears to be fixed by the extra_float_digits fix for the 1st bug of #955. CVS, which has numerous rows with the maximum # of floating-point digits, now has is_geovalid populated for these rows.

Actions

Also available in: Atom PDF