Bug #955
Updated by Aaron Marcuse-Kubitza over 9 years ago
h3. error
<pre><code class="SQL">
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien
</code></pre>
h3. fix
try-create the unique constraint on each table that the geovalidation data passes through:
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @vegbien.public.geoscrub_input@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
Query returned successfully with no result in 39009 ms.
</code></pre>
the duplication appears between @vegbien.public.geoscrub_input@ (no duplicates) and @geoscrub.public.geoscrub@ (duplicates), so it must be introduced in the flat-file export/reimport of the table
check the flat file for duplicates:
<pre>
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
</pre>
there are duplicated lines, so check if these are duplicated rows as well:
<pre>
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row
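line-based tools like @sort|uniq@ can only approximate this check, because quoted CSV fields may contain embedded newlines (as seen above). a minimal parser-aware duplicate check, sketched in Python (the function name and filename are illustrative):

```python
import csv
from collections import Counter

def duplicate_rows(path):
    """Return the fully-parsed CSV rows that occur more than once.

    Unlike `sort | uniq --repeated` on raw lines, csv.reader joins
    multiline quoted fields back into single rows before comparing."""
    with open(path, newline='') as f:
        counts = Counter(tuple(row) for row in csv.reader(f))
    return {row: n for row, n in counts.items() if n > 1}
```

on a file like the above, the multiline fragments would no longer be reported, leaving only true duplicate rows.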
try re-exporting @geoscrub_input.csv@:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
<pre><code class="SQL">
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this produces 2 identical rows:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
the rows appear identical at the default output precision, which suggests that the problem is with how floating-point values are formatted on export
there is a Postgres setting "@extra_float_digits@":http://www.postgresql.org/docs/9.3/static/runtime-config-client.html#GUC-EXTRA-FLOAT-DIGITS which controls this:
<pre><code class="SQL">
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this shows the difference:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1000000000000014|98.1666666666666714|
|United States|TEXAS|WILSON|29.0999999999999979|98.1666666666666714|
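the collapse can be reproduced outside Postgres: two adjacent IEEE 754 doubles print identically at 15 significant digits (float8 output at @extra_float_digits = 0@) but differ at 17 (@extra_float_digits = 3@). a minimal sketch, requiring Python ≥ 3.9 for @math.nextafter@:

```python
import math

a = 29.1                    # the double nearest to 29.1
b = math.nextafter(a, 0.0)  # the adjacent, slightly smaller double

assert a != b  # genuinely distinct values in the DB...

# ...but identical when formatted to 15 significant digits,
# which is why the exported rows looked like exact duplicates:
assert f"{a:.15g}" == f"{b:.15g}" == "29.1"

# 17 significant digits round-trip every double unambiguously:
assert f"{a:.17g}" != f"{b:.17g}"
```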
after setting @extra_float_digits@ globally, recheck the flat-file:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
</pre>
there are no more duplicate rows, so the problem appears to be solved
h3. 2nd bug
after rerunning the geoscrubbing process, however, the error still occurs. again, try-create the unique constraint on each table that the geovalidation data passes through:
# @vegbien.public.geoscrub_input@:
now has a permanent unique constraint for this
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
Query returned successfully with no result in 42299 ms.
</code></pre>
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
</code></pre>
try restarting Postgres and re-exporting, to ensure the @extra_float_digits@ config param is in effect:
<pre>
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
</pre>
check the flat file:
<pre>
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
</pre>
none of these are complete rows, so there are no duplicate rows in the flat file
look further at the row that was reported as duplicated:
<pre><code class="SQL">
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE
(ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;
1 row retrieved.
</code></pre>
this returns only one row, so the duplication is unlikely to originate here. check @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE
(ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;
2 rows retrieved.
</code></pre>
|*row_num|decimalLatitude|decimalLongitude|country|stateProvince|county|
|965351|43|-1|ES|<NULL>|<NULL>|
|965376|43|-1|ES|<NULL>|<NULL>|
this returns multiple rows, so the problem may be in the flat file
check the flat file for rows like the duplicated one:
<pre>
grep -E '4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1
43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
there is no strange rounding going on, so the problem may be in the import process
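the grep pattern above is meant to catch both the exact values (43, -1) and values just under them that could have been rounded up; it can be sanity-checked in isolation (sample lines taken from the output above, plus a nearby non-match):

```python
import re

# same pattern as the grep above: 43 or 42.9..., then -1 or -0.9..., then ES,,,
pat = re.compile(r'4(3|2\.9[0-9]*),-(1|0\.9[0-9]*),ES,,,')

lines = [
    '42.9699999999999989,-0.910000000000000031,ES,,,Spain,,,1,2,-1,-1',
    '43,-1,ES,,,Spain,,,1,2,-1,-1',
    '43.5,-1,ES,,,Spain,,,1,2,-1,-1',  # nearby but should not match
]
assert [bool(pat.search(l)) for l in lines] == [True, True, False]
```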
examine the duplicated rows in the flat file by @row_num@:
map row #s to line #s for use with @bin/row@:
<pre>
grep -En '43,-1,ES,,,' inputs/.geoscrub/geoscrub_output/geoscrub.csv
965379:43,-1,ES,,,Spain,,,1,2,-1,-1
</pre>
line 965379 is most likely row 965351 (the 1st duplicated row from above)
check where the 2nd duplicated row would most likely be:
<pre>
n=$(((965376 - 965351)*2)) bin/row $((965379 - 1)) <inputs/.geoscrub/geoscrub_output/geoscrub.csv # *2: account for potential embedded \n; `- 1`: header is row 0, not 1
43,-1,ES,,,Spain,,,1,2,-1,-1
[...]
</pre>
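for reference, a stand-in for @bin/row@'s apparent behavior (an assumption: it prints lines starting at a given 0-based offset, with the count taken from @$n@; the real tool's interface may differ):

```python
def rows(path, start, n=1):
    """Return n lines of `path` starting at 0-based line `start`
    (the CSV header is line 0, which is what the `- 1` above adjusts for)."""
    out = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= start + n:
                break
            if i >= start:
                out.append(line.rstrip('\n'))
    return out
```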
use a spreadsheet editor to get the row 25 rows after the 1st one (row 26):
|42.9609985351560013|-0.958999991416930042|ES|Na||
but this doesn't have the coordinates (43, -1) from above, so this seems to be a different row
examine the corresponding intervening rows in the DB:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE "*row_num" BETWEEN 965351 AND 965376+1
;
</code></pre>
the row after the 2nd occurrence corresponds to the 1st row after the seemingly-1st occurrence in the CSV. this indicates that grep is actually reporting the line # of the _2nd_ occurrence. since there is only one occurrence in the flat file, some row must be getting modified on import to become equal to the 1st.
h3. 2nd bug's cause
the presence of @Na@ as the @stateProvince@ suggests that this might also be present as @NA@, which is a recently-added NULL-equivalent string (see @lib/sql_io.py@ > @null_strs@). to fix this, we would need to remove @NA@ as a general-purpose NULL-equivalent string. however, @NA@ is required to load the @Cyrille_traits@ data, so this instead needs to be something that can be disabled just for the import of the geoscrubbing data and/or enabled just for the @Cyrille_traits@ data.
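the per-dataset override could be sketched as follows; note that @DEFAULT_NULL_STRS@, @parse_value@, and the @extra_null_strs@ parameter are hypothetical names for illustration, not the actual @lib/sql_io.py@ API:

```python
# hypothetical sketch; the real lib/sql_io.py interface may differ
DEFAULT_NULL_STRS = {'', r'\N', 'NULL'}  # always NULL-equivalent

def parse_value(value, extra_null_strs=()):
    """Map NULL-equivalent strings to None.

    'NA' is no longer globally NULL-equivalent; datasets that need it
    (e.g. Cyrille_traits) opt in via extra_null_strs."""
    if value in DEFAULT_NULL_STRS or value in extra_null_strs:
        return None
    return value

# geoscrubbing import: parse_value('NA') keeps 'NA' as a real value
# Cyrille_traits import: parse_value(v, extra_null_strs={'NA'})
```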