Bug #955: fix geovalidation duplicated rows

Added by Aaron Marcuse-Kubitza about 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: High
% Done: 100%

Description

error

$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)

psql:/dev/fd/40:26: ERROR:  could not create unique index "geoscrub_output__unique" 
Time: 9300.448 ms

real    0m49.819s
user    0m0.114s
sys 0m0.101s
# # # # # # # # # # # # #  to see error details, prepend `vb=2` to the command 
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien

fix

try-create the unique index on each table that the geovalidation data passes through (on the ARRAY[] wrappers, see the note after this list):

  1. vegbien.geoscrub.geoscrub_output
    CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
    (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
    );
    
    ERROR:  could not create unique index "geoscrub_output__unique" 
    DETAIL:  Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
    
  2. geoscrub.public.geoscrub:
    -- in geoscrub DB:
    CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
    (array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
    );
    
    ERROR:  could not create unique index "geoscrub_output__unique" 
    DETAIL:  Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
    
  3. vegbien.public.geoscrub_input:
    CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
    (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
    );
    
    Query returned successfully with no result in 39009 ms.
    

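the ARRAY[] wrappers are presumably there because a plain unique index treats NULLs as distinct (so rows differing only in NULL fields would never collide), whereas array comparison treats NULL elements as equal. a minimal demonstration (illustrative, not from the log):

SELECT ARRAY[NULL::text] = ARRAY[NULL::text] AS arrays_equal; -- true, even though NULL = NULL is unknown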
the duplication appears between vegbien.public.geoscrub_input (no duplicates) and geoscrub.public.geoscrub (duplicates), so it must be introduced by the flat-file export/reimport of the table
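the duplicated keys can also be listed directly, without try-creating the index each time (a sketch against geoscrub_input; GROUP BY, like the array trick, treats NULLs as equal):

SELECT "decimalLatitude", "decimalLongitude", country, "stateProvince", county, count(*)
FROM geoscrub_input
GROUP BY 1, 2, 3, 4, 5
HAVING count(*) > 1;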

check the flat file for duplicates:

$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv

$ <exports/geoscrub_input.csv sort -u|wc -l
3673402

there are duplicated lines, so check if these are duplicated rows as well:

<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667

only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row

try re-exporting geoscrub_input.csv:

$ rm=1 exports/geoscrub_input.csv.run

<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667

the row is still duplicated, so the duplication must be occurring in the flat-file export process

determine whether this row is represented differently in the DB than as plain text:

SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');

this produces 2 identical rows:

country stateProvince county decimalLatitude decimalLongitude
United States TEXAS WILSON 29.1 98.1666666666667
United States TEXAS WILSON 29.1 98.1666666666667

this suggests that the problem is with the export of decimals: two distinct float8 values can render identically at the default output precision, so they reimport as the same value
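an illustration (not from the log; the 5e-15 nudge is just an increment below the display precision):

SET extra_float_digits = 0; -- the old default; Postgres 12+ defaults to 1 (shortest-exact output)
SELECT 29.1::float8 AS a,
       29.1::float8 + 5e-15 AS b,
       29.1::float8 = 29.1::float8 + 5e-15 AS equal;
-- a and b both display as 29.1, but equal is false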

there is a Postgres setting extra_float_digits which controls this:

SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');

this shows the difference:

country stateProvince county decimalLatitude decimalLongitude
United States TEXAS WILSON 29.1000000000000014 98.1666666666666714
United States TEXAS WILSON 29.0999999999999979 98.1666666666666714
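one way to apply the setting globally (a sketch; the log doesn't record the exact method used):

-- per-database, takes effect for new sessions:
ALTER DATABASE vegbien SET extra_float_digits = 3;
-- or cluster-wide: set extra_float_digits = 3 in postgresql.conf and reload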

after setting extra_float_digits globally, recheck the flat-file:

$ rm=1 exports/geoscrub_input.csv.run

<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro

there are no more duplicate rows, so the problem appears to be solved

2nd bug

after rerunning the geoscrubbing process, however, the error still occurs. again, try-create the unique index on each table that the geovalidation data passes through:

  1. vegbien.public.geoscrub_input:
now has a permanent unique constraint for this
  2. geoscrub.public.geoscrub:
    -- in geoscrub DB:
    CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
    (array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
    );
    
    Query returned successfully with no result in 42299 ms.
    
  3. vegbien.geoscrub.geoscrub_output
    CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
    (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
    );
    
    ERROR:  could not create unique index "geoscrub_output__unique" 
    DETAIL:  Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
    

try restarting Postgres and reexporting, to ensure the extra_float_digits config param is in effect:

make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &

check the flat file:

<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
 Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
 Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0

none of these are complete rows, so there are no duplicate rows in the flat file

look further at the row that was reported as duplicated (rounding the coordinates, since the stored floats may differ from the displayed key in low-order digits):

-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
    (ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;

1 row retrieved.

this only returns one row, so the duplication is not likely here. check vegbien.geoscrub.geoscrub_output:

SELECT * FROM geoscrub.geoscrub_output
WHERE
    (ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;

2 rows retrieved.
*row_num decimalLatitude decimalLongitude country stateProvince county
965351 43 -1 ES <NULL> <NULL>
965376 43 -1 ES <NULL> <NULL>

this returns multiple rows, so the problem may be in the flat file

check the flat file for rows like the duplicated one (the pattern also allows float renderings just under 43 and -1):

grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1

there is no strange rounding going on, so the problem may be in the import process

examine the duplicated rows in the flat file by row_num:

map row #s to line #s for use with bin/row:

grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1

line 965379 is most likely row 965351 (the 1st duplicated row from above; line #s run ahead of row #s because of the header line and embedded newlines in earlier multiline rows)

check where the 2nd duplicated row would most likely be:

n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]

use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):

42.9609985351560013 -0.958999991416930042 ES Na

but this doesn't have the coordinates (43, -1) from above, so it seems to be a different row

examine the corresponding intervening rows in the DB:

SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;

the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV, so grep is actually reporting the line # of the 2nd occurrence. since there is only one such line in the flat file, some other row must be getting modified on import to equal the 1st.

2nd bug's cause

the presence of Na as the stateProvince suggests that it might also occur as NA, which is a recently-added NULL-equivalent string (see lib/sql_io.py > null_strs). to fix this, we would need to remove NA as a general-purpose NULL-equivalent string. however, NA is required to load the Cyrille_traits data, so it instead needs to be possible to disable it just for the import of the geoscrubbing data and/or enable it just for the Cyrille_traits data. the collision itself is sketched below.
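a minimal sketch of the collision (values illustrative; nullif() stands in for the null_strs mapping applied on import):

WITH imported("stateProvince") AS (
    VALUES (nullif('NA', 'NA')), -- a row whose stateProvince was the literal string NA
           (NULL)                -- a row that was already NULL
)
SELECT "stateProvince", count(*)
FROM imported
GROUP BY "stateProvince"
HAVING count(*) > 1; -- both rows land in the same NULL group, i.e. a duplicate key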
