Bug #955
fix geovalidation duplicated rows
closed, 100% done
Description
error¶
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien
fix¶
try-create the unique constraint on each table that the geovalidation data passes through:
vegbien.geoscrub.geoscrub_output:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
geoscrub.public.geoscrub:
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
vegbien.public.geoscrub_input:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
Query returned successfully with no result in 39009 ms.
the duplication occurs between vegbien.public.geoscrub_input and geoscrub.public.geoscrub, so it happens in the flat-file export/reimport of the table
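for context, such a flat-file export boils down to a CSV COPY of the table. the sketch below is illustrative only (the project's actual export script, exports/geoscrub_input.csv.run, is not shown in this log); the relevant point is that the float columns are written using the server's float8 text output, which is where the output precision discussed below comes in:
-- sketch only, not the project's export script:
COPY (SELECT country, "stateProvince", county, "decimalLatitude", "decimalLongitude"
      FROM geoscrub_input)
TO STDOUT WITH (FORMAT csv);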
check the flat file for duplicates:
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
there are duplicated lines, so check if these are duplicated rows as well:
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row
try re-exporting geoscrub_input.csv:
$ rm=1 exports/geoscrub_input.csv.run
$ <exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
this produces 2 identical rows:
country | stateProvince | county | decimalLatitude | decimalLongitude |
United States | TEXAS | WILSON | 29.1 | 98.1666666666667 |
United States | TEXAS | WILSON | 29.1 | 98.1666666666667 |
this suggests that the problem is with the export of decimals
there is a Postgres setting extra_float_digits which controls this:
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
this shows the difference:
country | stateProvince | county | decimalLatitude | decimalLongitude |
United States | TEXAS | WILSON | 29.1000000000000014 | 98.1666666666666714 |
United States | TEXAS | WILSON | 29.0999999999999979 | 98.1666666666666714 |
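illustration (not from the log): the two stored values above are distinct float8 values, but with extra_float_digits at its default of 0 both render as the same text, which is why sort|uniq saw duplicate lines in the flat file even though the DB rows differ:
-- illustrative query using the two values shown above; at the default output
-- precision both print as '29.1', even though they are different float8 values
SELECT 29.1000000000000014::float8 = 29.0999999999999979::float8 AS same_value,
       29.1000000000000014::float8::text AS default_output_1,
       29.0999999999999979::float8::text AS default_output_2;
-- same_value | default_output_1 | default_output_2
-- f          | 29.1             | 29.1
the log doesn't record how the setting was made global (the postgres_restart in the 2nd-bug section suggests postgresql.conf was edited); for reference, a sketch of the usual options, not taken from the log:
-- option 1: in postgresql.conf (requires a reload/restart):
--     extra_float_digits = 3
-- option 2: as a per-database default (applies to new connections, no restart needed):
ALTER DATABASE vegbien  SET extra_float_digits = 3;
ALTER DATABASE geoscrub SET extra_float_digits = 3;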
after setting extra_float_digits globally, recheck the flat file:
$ rm=1 exports/geoscrub_input.csv.run
$ <exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
there are no more duplicate rows, so the problem appears to be solved
2nd bug¶
after rerunning the geoscrubbing process, however, the error still occurs. again, try-create the unique constraint on each table that the geovalidation data passes through:
vegbien.public.geoscrub_input:
now has a permanent unique constraint for this
geoscrub.public.geoscrub:
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
Query returned successfully with no result in 42299 ms.
vegbien.geoscrub.geoscrub_output:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
try restarting Postgres and reexporting, to ensure the extra_float_digits config param is in effect:
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
check the flat file:
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
none of these are complete rows, so there are no duplicate rows in the flat file
look further at the row that was reported as duplicated:
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
(ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
1 row retrieved.
this only returns one row, so the duplication is not likely here. check vegbien.geoscrub.geoscrub_output:
SELECT * FROM geoscrub.geoscrub_output
WHERE
(ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;
2 rows retrieved.
*row_num | decimalLatitude | decimalLongitude | country | stateProvince | county |
965351 | 43 | -1 | ES | <NULL> | <NULL> |
965376 | 43 | -1 | ES | <NULL> | <NULL> |
this returns multiple rows, so the problem may be in the flat file
check the flat file for rows like the duplicated one:
grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1
there is no strange rounding going on, so the problem may be in the import process
examine the duplicated rows in the flat file by row_num:
map row #s to line #s for use with bin/row:
grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1
line 965379 is most likely row 965351 (the 1st duplicated row from above)
check for where the 2nd duplicated row would be likely to be:
n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]
use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):
42.9609985351560013 | -0.958999991416930042 | ES | Na |
but this doesn't have coordinates (43, -1) from above, so this seems to be a different row
examine the corresponding intervening rows in the DB:
SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;
the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV, so grep is actually reporting the line # of the 2nd occurrence. since the flat file contains only one occurrence of this line, some other row must be getting modified during import so that it becomes equal to the 1st.
2nd bug's cause¶
the presence of Na as the stateProvince suggests that this might also be present as NA, which is a recently-added NULL-equivalent string (see lib/sql_io.py > null_strs). to fix this, we would need to remove NA as a general-purpose NULL-equivalent string. however, NA is required to load the Cyrille_traits data, so it instead needs to be possible to disable this just for the import of the geoscrubbing data and/or enable it just for the Cyrille_traits data.
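to confirm the hypothesis, one could count how many rows in the source of that import carry such a value; this is a hypothetical check, not a query from the log:
-- in geoscrub DB (hypothetical check): stateprovince values that an 'NA'-style
-- NULL-equivalent list would silently discard on import into geoscrub_output
SELECT stateprovince, count(*)
FROM geoscrub
WHERE stateprovince IN ('NA', 'Na', 'na')
GROUP BY stateprovince;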