Bug #955
Updated by Aaron Marcuse-Kubitza about 10 years ago
h3. error

<pre><code class="SQL">
$ rm=1 inputs/.geoscrub/run &
SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)$$); -- runtime: 2 min ("106693.873") @starscream
psql:/dev/fd/40:26: NOTICE:  SET /*LOCAL*/ search_path TO geoscrub, util;
CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
)
psql:/dev/fd/40:26: ERROR:  could not create unique index "geoscrub_output__unique"
Time: 9300.448 ms
real	0m49.819s
user	0m0.114s
sys	0m0.101s
# # # # # # # # # # # # #
to see error details, prepend `vb=2` to the command
make[1]: *** [geoscrub_output/postprocess] Error 1
make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
make: *** [geoscrub_output/install] Error 2
make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub'
________________________________________________________________________________
$ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien
</code></pre>

h3. fix

try-create the unique constraint on each table that the geovalidation data passes through (on the array-wrapping of the index columns, see the note after this list):

# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);

ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);

ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated.
</code></pre>
# @vegbien.public.geoscrub_input@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);

Query returned successfully with no result in 39009 ms.
</code></pre>
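note on the array-wrapping in the index definitions above: a unique index on the bare columns would not catch rows that differ only in having NULL @stateProvince@/@county@ values, because plain NULLs never compare equal, whereas array comparison treats NULL elements as equal (this rationale is inferred, not stated above). a minimal sketch of the difference:
<pre><code class="SQL">
-- plain NULL comparison yields NULL rather than true, so a unique index on the
-- bare columns would treat rows with NULL fields as distinct
SELECT NULL::text = NULL::text;                -- NULL
-- single-element arrays compare element-by-element with NULLs treated as equal,
-- so array-wrapped keys containing NULLs do collide in the unique index
SELECT ARRAY[NULL::text] = ARRAY[NULL::text];  -- true
</code></pre>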
the duplication occurs between @vegbien.public.geoscrub_input@ and @geoscrub.public.geoscrub@, so it happens in the flat-file export/reimport of the table

check the flat file for duplicates:
<pre>
$ wc -l exports/geoscrub_input.csv
3673412 exports/geoscrub_input.csv
$ <exports/geoscrub_input.csv sort -u|wc -l
3673402
</pre>

there are duplicated lines, so check if these are duplicated rows as well:
<pre>
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>

only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row

try re-exporting @geoscrub_input.csv@:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
United States,TEXAS,WILSON,29.1,98.1666666666667
</pre>

the row is still duplicated, so the duplication must be occurring in the flat-file export process

determine whether this row is represented differently in the DB than as plain text:
<pre><code class="SQL">
SELECT * FROM geoscrub_input
WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>

this produces 2 identical rows:

|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1|98.1666666666667|
|United States|TEXAS|WILSON|29.1|98.1666666666667|

this suggests that the problem is with the export of decimals

there is a Postgres setting "@extra_float_digits@":http://www.postgresql.org/docs/9.3/static/runtime-config-client.html#GUC-EXTRA-FLOAT-DIGITS which controls this:
<pre><code class="SQL">
SET extra_float_digits = 3; -- the maximum
SELECT * FROM geoscrub_input
WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON');
</code></pre>

this shows the difference:

|country|stateProvince|county|decimalLatitude|decimalLongitude|
|United States|TEXAS|WILSON|29.1000000000000014|98.1666666666666714|
|United States|TEXAS|WILSON|29.0999999999999979|98.1666666666666714|
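one way to make the setting global, so that flat-file exports also emit full float precision (a sketch, not necessarily the exact mechanism used here; @ALTER SYSTEM@ requires PostgreSQL 9.4+, and on 9.3 the equivalent is adding @extra_float_digits = 3@ to @postgresql.conf@ and reloading/restarting):
<pre><code class="SQL">
-- assumes PostgreSQL 9.4+; on 9.3, edit postgresql.conf instead
ALTER SYSTEM SET extra_float_digits = 3;
SELECT pg_reload_conf(); -- new sessions pick up the new default
SHOW extra_float_digits; -- verify
</code></pre>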
after setting @extra_float_digits@ globally, recheck the flat file:
<pre>
$ rm=1 exports/geoscrub_input.csv.run
<exports/geoscrub_input.csv sort|uniq --repeated
Argentina,Neuquen,"Los Lagos
Mexico,Michoacán,"Múgica
Mexico,Michoacán,"Tacámbaro
</pre>

there are no more duplicate rows, so the problem appears to be solved

after rerunning the geoscrubbing process, however, the error still occurs. again, try-create the unique constraint on each table that the geovalidation data passes through:

# @vegbien.public.geoscrub_input@: now has a permanent unique constraint for this
# @geoscrub.public.geoscrub@:
<pre><code class="SQL">
-- in geoscrub DB:
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub (
(array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county])
);

Query returned successfully with no result in 42299 ms.
</code></pre>
# @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output (
(array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county])
);

ERROR: could not create unique index "geoscrub_output__unique"
DETAIL: Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({43}, {-1}, {ES}, {NULL}, {NULL}) is duplicated.
</code></pre>

try restarting Postgres and reexporting, to ensure the @extra_float_digits@ config param is in effect:
<pre>
make postgres_restart
rm=1 inputs/.geoscrub/geoscrub_output/geoscrub.csv.run export_ &
</pre>

check the flat file:
<pre>
<inputs/.geoscrub/geoscrub_output/geoscrub.csv sort|uniq --repeated
19.0288889999999995,-102.055278000000001,Mexico,Michoacán,"Múgica
Departamento",Argentina,Neuquén,,1,3,3,0
Múgica Municipio",Mexico,Michoacán,,1,3,3,0
Tacámbaro Municipio",Mexico,Michoacán,,1,3,3,0
</pre>

none of these are complete rows, so there are no duplicate rows in the flat file

look further at the row that was reported as duplicated:
<pre><code class="SQL">
-- in geoscrub DB:
SELECT * FROM geoscrub
WHERE (ARRAY[country], ARRAY[stateprovince], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round(decimallatitude::numeric, 4) = 43
AND round(decimallongitude::numeric, 4) = -1
;

1 row retrieved.
</code></pre>

this only returns one row, so the duplication is not likely here. check @vegbien.geoscrub.geoscrub_output@:
<pre><code class="SQL">
SELECT * FROM geoscrub.geoscrub_output
WHERE (ARRAY[country], ARRAY["stateProvince"], ARRAY[county]) = (ARRAY['ES'], ARRAY[NULL], ARRAY[NULL])
AND round("decimalLatitude"::numeric, 4) = 43
AND round("decimalLongitude"::numeric, 4) = -1
;

2 rows retrieved.
</code></pre>

this does return multiple rows as expected; since the source @geoscrub@ table has only one such row, the problem must be in the flat file
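for reference, a grouping query along these lines (hypothetical; it reuses the column names and array-wrapping of the index definition above) lists every key that would violate the unique index, rather than probing one reported key at a time:
<pre><code class="SQL">
-- hypothetical helper: enumerate all duplicated keys in vegbien.geoscrub.geoscrub_output,
-- using the same array-wrapped expressions as the unique index
SELECT (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]),
       (array["stateProvince"]), (array[county]), count(*) AS copies
FROM geoscrub.geoscrub_output
GROUP BY 1, 2, 3, 4, 5
HAVING count(*) > 1
ORDER BY copies DESC;
</code></pre>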