Bug #955
Updated by Aaron Marcuse-Kubitza about 10 years ago
h3. error
<pre><code class="SQL">
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
</code></pre>
h3. fix
try creating the unique index on each table that the geovalidation data passes through:
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @vegbien.public.geoscrub_input@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
Query returned successfully with no result in 39009 ms.
</code></pre>
the unique index succeeds on @vegbien.public.geoscrub_input@ but fails on @geoscrub.public.geoscrub@, so the duplication is introduced between these tables, i.e. in the flat-file export/reimport of the table
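the duplicated key can also be located directly with a grouping query (a sketch, not part of the original log; @GROUP BY@ treats NULLs as equal, matching the array-index semantics):
<pre><code class="SQL">
-- in geoscrub DB: list keys that occur more than once
SELECT decimallatitude, decimallongitude, country, stateprovince, county, count(*)
FROM geoscrub
GROUP BY 1, 2, 3, 4, 5
HAVING count(*) > 1;
</code></pre>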
check the flat file for duplicates:
<pre>
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
</pre>
the line counts differ by 10, so there are duplicated lines; check whether these are complete duplicated rows as well:
<pre>
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row
try re-exporting @geoscrub_input.csv@:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
<pre><code class="SQL">
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this produces 2 identical rows:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
the rows appear identical at the default float output precision, which suggests that the problem is with the export of decimals
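the underlying mechanism (a sketch, not part of the original log, using the full-precision values found below): two distinct @float8@ values can print identically at the default output precision:
<pre><code class="SQL">
-- two different doubles whose default text form is the same
SELECT '29.1000000000000014'::float8 = '29.0999999999999979'::float8 AS same_value,
       '29.1000000000000014'::float8::text AS text_1,
       '29.0999999999999979'::float8::text AS text_2;
-- same_value is false, yet text_1 and text_2 are both '29.1'
-- at the default output precision (extra_float_digits = 0)
</code></pre>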
there is a Postgres setting "@extra_float_digits@":http://www.postgresql.org/docs/9.3/static/runtime-config-client.html#GUC-EXTRA-FLOAT-DIGITS which controls this:
<pre><code class="SQL">
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this shows the difference:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1000000000000014|98.1666666666666714|
|United States|TEXAS|WILSON|29.0999999999999979|98.1666666666666714|
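one way to make this the default for all sessions (a sketch, assuming superuser access; the exact mechanism used isn't recorded here) is a per-database setting, since Postgres 9.3 has no @ALTER SYSTEM@:
<pre><code class="SQL">
-- applies to new connections to these DBs:
ALTER DATABASE vegbien SET extra_float_digits = 3;
ALTER DATABASE geoscrub SET extra_float_digits = 3;
-- or cluster-wide, in postgresql.conf: extra_float_digits = 3
</code></pre>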
after setting @extra_float_digits@ globally, recheck the flat file:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
</pre>
there are no more duplicate rows, so the problem appears to be solved
h3. 2nd bug
after rerunning the geoscrubbing process, however, the error still occurs. again, try creating the unique index on each table that the geovalidation data passes through:
# @vegbien.public.geoscrub_input@:
now has a permanent unique constraint for this
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
Query returned successfully with no result in 42299 ms.
</code></pre>
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
</code></pre>
try restarting Postgres and reexporting, to ensure the @extra_float_digits@ config param is in effect:
<pre>
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
</pre>
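to confirm the new default is actually in effect (a sketch, not part of the original log), check it from a fresh session:
<pre><code class="SQL">
-- in a new psql session after the restart; expect 3:
SHOW extra_float_digits;
</code></pre>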
check the flat file:
<pre>
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
</pre>
none of these are complete rows, so there are no duplicate rows in the flat file
look further at the row that was reported as duplicated:
<pre><code class="SQL">
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
(ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
1 row retrieved.
</code></pre>
this only returns one row, so the duplication is not likely here. check @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE
(ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;
2 rows retrieved.
</code></pre>
|*row_num|decimalLatitude|decimalLongitude|country|stateProvince|county|
|965351|43|-1|ES|<NULL>|<NULL>|
|965376|43|-1|ES|<NULL>|<NULL>|
this returns 2 rows while the source @geoscrub@ table has only 1, so the duplication is introduced on the way into @vegbien.geoscrub.geoscrub_output@; the problem may be in the flat file
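to rule out a recurrence of the 1st bug, the two rows can also be compared at full float precision (a sketch using the @"*row_num"@ values from the table above):
<pre><code class="SQL">
SET extra_float_digits = 3;
SELECT "*row_num", "decimalLatitude", "decimalLongitude"
FROM geoscrub.geoscrub_output
WHERE "*row_num" IN (965351, 965376);
-- identical values at full precision would mean these are true duplicates,
-- not another case of floats that merely print the same
</code></pre>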
check the flat file for rows like the duplicated one:
<pre>
grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
there is no strange rounding going on, so the problem may be in the import process
examine the duplicated rows in the flat file by @row_num@:
map row #s to line #s for use with @bin/row@:
<pre>
grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
line 965379 is most likely row 965351 (the 1st duplicated row from above)
check where the 2nd duplicated row would likely be:
<pre>
n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]
</pre>
use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):
|42.9609985351560013|-0.958999991416930042|ES|Na||
but this doesn't have coordinates (43, -1) from above, so this seems to be a different row
examine the corresponding intervening rows in the DB:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;
</code></pre>
the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV, so grep's line # is actually that of the _2nd_ occurrence (row 965376). since there is only one occurrence in the flat file, some other row must be getting modified during import to equal the 1st.
the presence of @Na@ as the @stateProvince@ suggests that this value might also be present as @NA@, which is a recently-added NULL-equivalent string (see @lib/sql_io.py@ > @null_strs@): a row with @stateProvince@ = @NA@ would have it replaced with NULL on import, making the row collide with a genuine all-NULL row. to fix this, @NA@ would need to be removed as a general-purpose NULL-equivalent string. however, @NA@ is required to load the @Cyrille_traits@ data, so this instead needs to be disableable just for the import of the geoscrubbing data and/or enableable just for the @Cyrille_traits@ data.
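this hypothesis could be verified directly in the source DB (a sketch, not part of the original log; the case of the stored string is an assumption):
<pre><code class="SQL">
-- in geoscrub DB: a row whose stateprovince is the literal string 'NA'/'Na'
-- would be NULLed on import, colliding with the genuine (43, -1, ES, NULL, NULL) row
SELECT * FROM geoscrub
WHERE country = 'ES'
AND upper(stateprovince) = 'NA'
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
</code></pre>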