Bug #955
Updated by Aaron Marcuse-Kubitza about 10 years ago
h3. error
<pre><code class="SQL">
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
</code></pre>
h3. fix
try creating the unique index on each table that the geovalidation data passes through:
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @vegbien.public.geoscrub_input@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
Query returned successfully with no result in 39009 ms.
</code></pre>
the unique index succeeds on @vegbien.public.geoscrub_input@ but fails on @geoscrub.public.geoscrub@, so the duplication is introduced between these tables, i.e. in the flat-file export/reimport of the table
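the duplicated key can also be located directly with a grouping query (a sketch, not part of the original log; @GROUP BY@ treats NULLs as equal, matching the array-index semantics):
<pre><code class="SQL">
-- in geoscrub DB: list keys that occur more than once
SELECT decimallatitude, decimallongitude, country, stateprovince, county, count(*)
FROM geoscrub
GROUP BY 1, 2, 3, 4, 5
HAVING count(*) > 1;
</code></pre>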
check the flat file for duplicates:
<pre>
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
</pre>
the line counts differ by 10, so there are duplicated lines; check whether these are complete duplicated rows as well:
<pre>
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row
try re-exporting @geoscrub_input.csv@:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
<pre><code class="SQL">
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this produces 2 identical rows:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
the rows appear identical at the default float output precision, which suggests that the problem is with the export of decimals
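the underlying mechanism (a sketch, not part of the original log, using the full-precision values found below): two distinct @float8@ values can print identically at the default output precision:
<pre><code class="SQL">
-- two different doubles whose default text form is the same
SELECT '29.1000000000000014'::float8 = '29.0999999999999979'::float8 AS same_value,
       '29.1000000000000014'::float8::text AS text_1,
       '29.0999999999999979'::float8::text AS text_2;
-- same_value is false, yet text_1 and text_2 are both '29.1'
-- at the default output precision (extra_float_digits = 0)
</code></pre>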
there is a Postgres setting "@extra_float_digits@":http://www.postgresql.org/docs/9.3/static/runtime-config-client.html#GUC-EXTRA-FLOAT-DIGITS which controls this:
<pre><code class="SQL">
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this shows the difference:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1000000000000014|98.1666666666666714|
|United States|TEXAS|WILSON|29.0999999999999979|98.1666666666666714|
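one way to make this the default for all sessions (a sketch, assuming superuser access; the exact mechanism used isn't recorded here) is a per-database setting, since Postgres 9.3 has no @ALTER SYSTEM@:
<pre><code class="SQL">
-- applies to new connections to these DBs:
ALTER DATABASE vegbien SET extra_float_digits = 3;
ALTER DATABASE geoscrub SET extra_float_digits = 3;
-- or cluster-wide, in postgresql.conf: extra_float_digits = 3
</code></pre>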
after setting @extra_float_digits@ globally, recheck the flat file:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
</pre>
there are no more duplicate rows, so the problem appears to be solved
h3. 2nd bug
after rerunning the geoscrubbing process, however, the error still occurs. again, try creating the unique index on each table that the geovalidation data passes through:
# @vegbien.public.geoscrub_input@:
now has a permanent unique constraint for this
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
Query returned successfully with no result in 42299 ms.
</code></pre>
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
</code></pre>
try restarting Postgres and reexporting, to ensure the @extra_float_digits@ config param is in effect:
<pre>
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
</pre>
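to confirm the new default is actually in effect (a sketch, not part of the original log), check it from a fresh session:
<pre><code class="SQL">
-- in a new psql session after the restart; expect 3:
SHOW extra_float_digits;
</code></pre>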
check the flat file:
<pre>
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
</pre>
none of these are complete rows, so there are no duplicate rows in the flat file
look further at the row that was reported as duplicated:
<pre><code class="SQL">
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
(ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
1 row retrieved.
</code></pre>
this only returns one row, so the duplication is not likely here. check @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE
(ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;
2 rows retrieved.
</code></pre>
|*row_num|decimalLatitude|decimalLongitude|country|stateProvince|county|
|965351|43|-1|ES|<NULL>|<NULL>|
|965376|43|-1|ES|<NULL>|<NULL>|
this returns 2 rows while the source @geoscrub@ table has only 1, so the duplication is introduced on the way into @vegbien.geoscrub.geoscrub_output@; the problem may be in the flat file
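to rule out a recurrence of the 1st bug, the two rows can also be compared at full float precision (a sketch using the @"*row_num"@ values from the table above):
<pre><code class="SQL">
SET extra_float_digits = 3;
SELECT "*row_num", "decimalLatitude", "decimalLongitude"
FROM geoscrub.geoscrub_output
WHERE "*row_num" IN (965351, 965376);
-- identical values at full precision would mean these are true duplicates,
-- not another case of floats that merely print the same
</code></pre>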
check the flat file for rows like the duplicated one:
<pre>
grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
there is no strange rounding going on, so the problem may be in the import process
examine the duplicated rows in the flat file by @row_num@:
map row #s to line #s for use with @bin/row@:
<pre>
grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
line 965379 is most likely row 965351 (the 1st duplicated row from above)
check where the 2nd duplicated row would likely be:
<pre>
n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]
</pre>
use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):
|42.9609985351560013|-0.958999991416930042|ES|Na||
but this doesn't have coordinates (43, -1) from above, so this seems to be a different row
examine the corresponding intervening rows in the DB:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;
</code></pre>
the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV, so grep's line # is actually that of the _2nd_ occurrence (row 965376). since there is only one occurrence in the flat file, some other row must be getting modified during import to equal the 1st.
the presence of @Na@ as the @stateProvince@ suggests that this value might also be present as @NA@, which is a recently-added NULL-equivalent string (see @lib/sql_io.py@ > @null_strs@): a row with @stateProvince@ = @NA@ would have it replaced with NULL on import, making the row collide with a genuine all-NULL row. to fix this, @NA@ would need to be removed as a general-purpose NULL-equivalent string. however, @NA@ is required to load the @Cyrille_traits@ data, so this instead needs to be disableable just for the import of the geoscrubbing data and/or enableable just for the @Cyrille_traits@ data.
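this hypothesis could be verified directly in the source DB (a sketch, not part of the original log; the case of the stored string is an assumption):
<pre><code class="SQL">
-- in geoscrub DB: a row whose stateprovince is the literal string 'NA'/'Na'
-- would be NULLed on import, colliding with the genuine (43, -1, ES, NULL, NULL) row
SELECT * FROM geoscrub
WHERE country = 'ES'
AND upper(stateprovince) = 'NA'
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
</code></pre>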