Bug #955
Updated by Aaron Marcuse-Kubitza over 9 years ago
h3. error
<pre><code class="SQL">
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien
</code></pre>
h3. fix
try-create the unique constraint on each table that the geovalidation data passes through:
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @vegbien.public.geoscrub_input@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
Query returned successfully with no result in 39009 ms.
</code></pre>
the duplication appears between @vegbien.public.geoscrub_input@ (no duplicates) and @geoscrub.public.geoscrub@ (duplicates), so it must be introduced in the flat-file export/reimport of the table
check the flat file for duplicates:
<pre>
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
</pre>
there are duplicated lines, so check if these are duplicated rows as well:
<pre>
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row
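line-based tools like @sort|uniq@ can only approximate this check, because quoted CSV fields may contain embedded newlines (as seen above). a minimal parser-aware duplicate check, sketched in Python (the function name and filename are illustrative):

```python
import csv
from collections import Counter

def duplicate_rows(path):
    """Return the fully-parsed CSV rows that occur more than once.

    Unlike `sort | uniq --repeated` on raw lines, csv.reader joins
    multiline quoted fields back into single rows before comparing."""
    with open(path, newline='') as f:
        counts = Counter(tuple(row) for row in csv.reader(f))
    return {row: n for row, n in counts.items() if n > 1}
```

on a file like the above, the multiline fragments would no longer be reported, leaving only true duplicate rows.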
try re-exporting @geoscrub_input.csv@:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
<pre><code class="SQL">
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this produces 2 identical rows:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
the rows appear identical at the default output precision, which suggests that the problem is with how floating-point values are formatted on export
there is a Postgres setting "@extra_float_digits@":http://www.postgresql.org/docs/9.3/static/runtime-config-client.html#GUC-EXTRA-FLOAT-DIGITS which controls this:
<pre><code class="SQL">
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this shows the difference:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1000000000000014|98.1666666666666714|
|United States|TEXAS|WILSON|29.0999999999999979|98.1666666666666714|
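the collapse can be reproduced outside Postgres: two adjacent IEEE 754 doubles print identically at 15 significant digits (float8 output at @extra_float_digits = 0@) but differ at 17 (@extra_float_digits = 3@). a minimal sketch, requiring Python ≥ 3.9 for @math.nextafter@:

```python
import math

a = 29.1                    # the double nearest to 29.1
b = math.nextafter(a, 0.0)  # the adjacent, slightly smaller double

assert a != b  # genuinely distinct values in the DB...

# ...but identical when formatted to 15 significant digits,
# which is why the exported rows looked like exact duplicates:
assert f"{a:.15g}" == f"{b:.15g}" == "29.1"

# 17 significant digits round-trip every double unambiguously:
assert f"{a:.17g}" != f"{b:.17g}"
```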
after setting @extra_float_digits@ globally, recheck the flat-file:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
</pre>
there are no more duplicate rows, so the problem appears to be solved
h3. 2nd bug
after rerunning the geoscrubbing process, however, the error still occurs. again, try-create the unique constraint on each table that the geovalidation data passes through:
# @vegbien.public.geoscrub_input@:
now has a permanent unique constraint for this
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
Query returned successfully with no result in 42299 ms.
</code></pre>
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
</code></pre>
try restarting Postgres and re-exporting, to ensure the @extra_float_digits@ config param is in effect:
<pre>
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
</pre>
check the flat file:
<pre>
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
</pre>
none of these are complete rows, so there are no duplicate rows in the flat file
look further at the row that was reported as duplicated:
<pre><code class="SQL">
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
(ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
1 row retrieved.
</code></pre>
this returns only one row, so the duplication is unlikely to originate here. check @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE
(ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;
2 rows retrieved.
</code></pre>
|*row_num|decimalLatitude|decimalLongitude|country|stateProvince|county|
|965351|43|-1|ES|<NULL>|<NULL>|
|965376|43|-1|ES|<NULL>|<NULL>|
this returns multiple rows, so the problem may be in the flat file
check the flat file for rows like the duplicated one:
<pre>
grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
there is no strange rounding going on, so the problem may be in the import process
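the grep pattern above is meant to catch both the exact values (43, -1) and values just under them that could have been rounded up; it can be sanity-checked in isolation (sample lines taken from the output above, plus a nearby non-match):

```python
import re

# same pattern as the grep above: 43 or 42.9..., then -1 or -0.9..., then ES,,,
pat = re.compile(r'4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,')

lines = [
    '42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1',
    '43,-1,ES,,,Spain,,,1,2,-1,-1',
    '43.5,-1,ES,,,Spain,,,1,2,-1,-1',  # nearby but should not match
]
assert [bool(pat.search(l)) for l in lines] == [True, True, False]
```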
examine the duplicated rows in the flat file by @row_num@:
map row #s to line #s for use with @bin/row@:
<pre>
grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
line 965379 is most likely row 965351 (the 1st duplicated row from above)
check where the 2nd duplicated row would most likely be:
<pre>
n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]
</pre>
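for reference, a stand-in for @bin/row@'s apparent behavior (an assumption: it prints lines starting at a given 0-based offset, with the count taken from @$n@; the real tool's interface may differ):

```python
def rows(path, start, n=1):
    """Return n lines of `path` starting at 0-based line `start`
    (the CSV header is line 0, which is what the `- 1` above adjusts for)."""
    out = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= start + n:
                break
            if i >= start:
                out.append(line.rstrip('\n'))
    return out
```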
use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):
|42.9609985351560013|-0.958999991416930042|ES|Na||
but this doesn't have the coordinates (43, -1) from above, so this seems to be a different row
examine the corresponding intervening rows in the DB:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;
</code></pre>
the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV. this indicates that grep is actually reporting the line # of the _2nd_ occurrence. since there is only one occurrence in the flat file, some row must be getting modified on import to become equal to the 1st.
h3. 2nd bug's cause
the presence of @Na@ as the @stateProvince@ suggests that this might also be present as @NA@, which is a recently-added NULL-equivalent string (see @lib/sql_io.py@ > @null_strs@). to fix this, we would need to remove @NA@ as a general-purpose NULL-equivalent string. however, @NA@ is required to load the @Cyrille_traits@ data, so this instead needs to be something that can be disabled just for the import of the geoscrubbing data and/or enabled just for the @Cyrille_traits@ data.
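the per-dataset override could be sketched as follows; note that @DEFAULT_NULL_STRS@, @parse_value@, and the @extra_null_strs@ parameter are hypothetical names for illustration, not the actual @lib/sql_io.py@ API:

```python
# hypothetical sketch; the real lib/sql_io.py interface may differ
DEFAULT_NULL_STRS = {'', r'\N', 'NULL'}  # always NULL-equivalent

def parse_value(value, extra_null_strs=()):
    """Map NULL-equivalent strings to None.

    'NA' is no longer globally NULL-equivalent; datasets that need it
    (e.g. Cyrille_traits) opt in via extra_null_strs."""
    if value in DEFAULT_NULL_STRS or value in extra_null_strs:
        return None
    return value

# geoscrubbing import: parse_value('NA') keeps 'NA' as a real value
# Cyrille_traits import: parse_value(v, extra_null_strs={'NA'})
```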