Bug #955 (closed): fix geovalidation duplicated rows
Description
error
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien
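aside: the index wraps each column in array[...] so that rows whose key columns are NULL also count as duplicates; a plain unique index treats NULLs as distinct. a minimal sketch of the difference, using a hypothetical scratch table t:
-- hypothetical scratch table, for illustration only
CREATE TEMP TABLE t (county text);
INSERT INTO t VALUES (NULL), (NULL);
CREATE UNIQUE INDEX t__plain ON t (county); -- succeeds: NULL keys never conflict
CREATE UNIQUE INDEX t__arrayed ON t ((array[county])); -- fails: {NULL} = {NULL}, so the key is duplicated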
fix
try-create the unique constraint on each table that the geovalidation data passes through:
vegbien.geoscrub.geoscrub_output:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output ( (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) );
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
geoscrub.public.geoscrub:
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub ( (array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county]) );
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
vegbien.public.geoscrub_input:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input ( (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) );
Query returned successfully with no result in 39009 ms.
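instead of try-creating the index, the duplicated keys can also be located directly; a sketch against vegbien.geoscrub.geoscrub_output (GROUP BY, like the arrayed index, treats NULL key values as equal):
SELECT count(*), "decimalLatitude", "decimalLongitude", country, "stateProvince", county
FROM geoscrub.geoscrub_output
GROUP BY "decimalLatitude", "decimalLongitude", country, "stateProvince", county
HAVING count(*) > 1;
-- note: two float8 values that merely print the same still form separate groups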
the duplication occurs between vegbien.public.geoscrub_input and geoscrub.public.geoscrub, so it happens in the flat-file export/reimport of the table
check the flat file for duplicates:
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
there are duplicated lines, so check if these are duplicated rows as well:
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row
try re-exporting geoscrub_input.csv:
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
this produces 2 identical rows:
country | stateProvince | county | decimalLatitude | decimalLongitude |
United States | TEXAS | WILSON | 29.1 | 98.1666666666667 |
United States | TEXAS | WILSON | 29.1 | 98.1666666666667 |
this suggests that the problem is with the export of decimals
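as a sanity check that this is possible, the two stored values revealed by the extra_float_digits query below print identically with the default settings (on the Postgres versions in use here, float8 output is normally limited to 15 significant digits):
SELECT '29.1000000000000014'::float8 AS a,
       '29.0999999999999979'::float8 AS b,
       '29.1000000000000014'::float8 = '29.0999999999999979'::float8 AS equal;
-- a    | b    | equal
-- 29.1 | 29.1 | f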
there is a Postgres setting extra_float_digits which controls this:
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
this shows the difference:
country | stateProvince | county | decimalLatitude | decimalLongitude |
United States | TEXAS | WILSON | 29.1000000000000014 | 98.1666666666666714 |
United States | TEXAS | WILSON | 29.0999999999999979 | 98.1666666666666714 |
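the setting can be made global either in postgresql.conf (which also covers exports from the geoscrub DB) or per-database; a sketch (which mechanism was actually used isn't recorded here, though the restart further below suggests postgresql.conf):
-- in postgresql.conf (takes effect on reload/restart):
--   extra_float_digits = 3
-- or, per-database:
ALTER DATABASE vegbien SET extra_float_digits = 3;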
after setting extra_float_digits globally, recheck the flat file:
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
there are no more duplicate rows, so the problem appears to be solved
2nd bug
after rerunning the geoscrubbing process, however, the error still occurs. again, try-create the unique constraint on each table that the geovalidation data passes through:
vegbien.public.geoscrub_input:
now has a permanent unique constraint for this
geoscrub.public.geoscrub:
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub ( (array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county]) );
Query returned successfully with no result in 42299 ms.
vegbien.geoscrub.geoscrub_output:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output ( (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) );
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
try restarting Postgres and reexporting, to ensure the extra_float_digits config param is in effect:
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
check the flat file:
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
none of these are complete rows, so there are no duplicate rows in the flat file
look further at the row that was reported as duplicated:
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
(ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
1 row retrieved.
this only returns one row, so the duplication is not likely here. check vegbien.geoscrub.geoscrub_output:
SELECT * FROM geoscrub.geoscrub_output
WHERE
(ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;
2 rows retrieved.
*row_num | decimalLatitude | decimalLongitude | country | stateProvince | county |
965351 | 43 | -1 | ES | <NULL> | <NULL> |
965376 | 43 | -1 | ES | <NULL> | <NULL> |
this returns multiple rows, so the problem may be in the flat file
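the "*row_num"s of every duplicate group (used below to map to CSV line #s) can also be pulled in one query; a sketch:
SELECT array_agg("*row_num") AS row_nums, "decimalLatitude", "decimalLongitude", country, "stateProvince", county
FROM geoscrub.geoscrub_output
GROUP BY "decimalLatitude", "decimalLongitude", country, "stateProvince", county
HAVING count(*) > 1;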
check the flat file for rows like the duplicated one:
grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1
there is no strange rounding going on, so the problem may be in the import process
examine the duplicated rows in the flat file by row_num:
map row #s to line #s for use with bin/row:
grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1
line 965379 is most likely row 965351 (the 1st duplicated row from above)
check for where the 2nd duplicated row would be likely to be:
n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]
use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):
42.9609985351560013 | -0.958999991416930042 | ES | Na |
but this doesn't have coordinates (43, -1) from above, so this seems to be a different row
examine the corresponding intervening rows in the DB:
SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;
the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV, so it appears that grep is actually providing the line # of the 2nd occurrence. since there is only one occurrence in the flat file, this means that some other row must be getting modified during import to equal the 1st.
2nd bug's cause
the presence of Na as the stateProvince suggests that this might also be present as NA, which is a recently-added NULL-equivalent string (see lib/sql_io.py > null_strs). to fix this, we would need to remove NA as a general-purpose NULL-equivalent string. however, NA is required to load the Cyrille_traits data, so instead it needs to be possible to disable this just for the import of the geoscrubbing data and/or enable it just for the Cyrille_traits data.
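for reference, the effect in SQL terms (sketched here with NULLIF; the real mapping is done by lib/sql_io.py's null_strs handling during import): turning an 'NA' stateProvince into NULL makes the row collide with an existing NULL-keyed row under the arrayed unique index:
SELECT NULLIF('NA'::text, 'NA') AS "stateProvince",                     -- 'NA' becomes NULL
       array[NULLIF('NA'::text, 'NA')] = array[NULL::text] AS collides; -- true: {NULL} = {NULL}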