Bug #955
Updated by Aaron Marcuse-Kubitza about 10 years ago
h3. error
<pre><code class="SQL">
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:
SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR: could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real 0m49.819s
user 0m0.114s
sys 0m0.101s
# # # # # # # # # # # # # to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien
</code></pre>
h3. fix
try creating the unique index on each table that the geovalidation data passes through, to locate where the duplication is introduced:
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);
ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @vegbien.public.geoscrub_input@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);
Query returned successfully with no result in 39009 ms.
</code></pre>
the index succeeds on @vegbien.public.geoscrub_input@ but fails on @geoscrub.public.geoscrub@, so the duplication is introduced in the flat-file export/reimport of the table between those two points
check the flat file for duplicates:
<pre>
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
</pre>
the counts differ by 10, so there are duplicated lines; check whether these are duplicated rows as well, or just repeated fragments of multiline rows:
<pre>
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
only 1 of these is a complete row (the others are the opening lines of multiline rows, since @sort@/@uniq@ compare physical lines rather than CSV rows), so there is exactly 1 duplicate row
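because @sort@/@uniq@ operate on physical lines, quoted multiline CSV rows can show up as spurious "repeats"; a CSV-aware check counts true duplicate rows directly. a minimal sketch in Python (the inline data is a fabricated stand-in for @geoscrub_input.csv@, not the real export):

```python
import csv
import io
from collections import Counter

# stand-in data mirroring this ticket: two *different* multiline rows that share
# the same opening physical line, plus one genuinely duplicated row
data = (
    'Argentina,Neuquen,"Los Lagos\ndept.",-40.6,-71.5\n'
    'Argentina,Neuquen,"Los Lagos\npark",-40.7,-71.6\n'
    'United States,TEXAS,WILSON,29.1,98.1666666666667\n'
    'United States,TEXAS,WILSON,29.1,98.1666666666667\n'
)

# what `sort|uniq --repeated` sees: repeated *physical lines*, including the
# shared first line of the two distinct multiline rows
repeated_lines = [l for l, n in Counter(data.splitlines()).items() if n > 1]

# a CSV parser reassembles multiline rows, so only the true duplicate remains
row_counts = Counter(tuple(row) for row in csv.reader(io.StringIO(data)))
duplicate_rows = [row for row, n in row_counts.items() if n > 1]

print(repeated_lines)   # 2 entries: 1 multiline-row fragment + 1 real duplicate
print(duplicate_rows)   # 1 entry: the real duplicate row
```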
try re-exporting @geoscrub_input.csv@:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>
the row is still duplicated, so the duplication must be occurring in the flat-file export process
determine whether this row is represented differently in the DB than as plain text:
<pre><code class="SQL">
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this produces 2 identical rows:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
the rows display identically even though the unique index treats them as distinct, which suggests the problem is the output precision used when exporting the decimal (float) columns
there is a Postgres setting "@extra_float_digits@":http://www.postgresql.org/docs/9.3/static/runtime-config-client.html#GUC-EXTRA-FLOAT-DIGITS which controls this:
<pre><code class="SQL">
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>
this shows the difference:
|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1000000000000014|98.1666666666666714|
|United States|TEXAS|WILSON|29.0999999999999979|98.1666666666666714|
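what's happening: the two rows hold adjacent but distinct IEEE double values that both render as @29.1@ at Postgres's default 15 significant digits; only at 17 digits (what @extra_float_digits = 3@ yields for @float8@) are they distinguishable. the same effect can be sketched in Python (the exact values are illustrative):

```python
import math

a = 29.1                    # nearest double to 29.1: 29.10000000000000142...
b = math.nextafter(a, 0.0)  # the adjacent double just below: 29.09999999999999787...

assert a != b                # two distinct stored values...
print(f"{a:.15g} {b:.15g}")  # 29.1 29.1  (they collide at 15 significant digits)
print(f"{a:.17g} {b:.17g}")  # 29.100000000000001 29.099999999999998
```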
after setting @extra_float_digits@ globally, re-export and recheck the flat file:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
</pre>
the only remaining repeated lines are fragments of multiline rows, not complete rows, so there are no more duplicate rows and the problem is solved!