Project

General

Profile

Bug #955

Updated by Aaron Marcuse-Kubitza about 10 years ago

h3. error 

 <pre><code class="SQL"> 
 $ rm=1 inputs/.geoscrub/run & 
 SELECT create_if_not_exists($$CREATE UNIQUE INDEX $$||:table_str||%'__unique'||$$ ON $$||:table_str||$$ ( 
 (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) 
 )$$); -- runtime: 2 min ("106693.873") @starscream 
 psql:/dev/fd/40:26: NOTICE: 
 SET /*LOCAL*/ search_path TO geoscrub, util; 
 CREATE UNIQUE INDEX "geoscrub_output__unique" ON "geoscrub_output" ( 
 (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) 
 ) 

 psql:/dev/fd/40:26: ERROR:    could not create unique index "geoscrub_output__unique" 
 Time: 9300.448 ms 

 real      0m49.819s 
 user      0m0.114s 
 sys 0m0.101s 
 # # # # # # # # # # # # #    to see error details, prepend `vb=2` to the command  
 make[1]: *** [geoscrub_output/postprocess] Error 1 
 make[1]: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub' 
 make: *** [geoscrub_output/install] Error 2 
 make: Leaving directory `/home/aaronmk/bien/inputs/.geoscrub' 
 ________________________________________________________________________________ 
 $ =1 /bin/bash[2] Thu 21:52:20 aaronmk@vegbiendev:~/bien 
 </code></pre> 

 h3. fix 

 try-create the unique constraint on each table that the geovalidation data passes through: 

 # @vegbien.geoscrub.geoscrub_output@ 
 <pre><code class="SQL"> 
 CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub.geoscrub_output ( 
 (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) 
 ); 

 ERROR:    could not create unique index "geoscrub_output__unique" 
 DETAIL:    Key ((ARRAY["decimalLatitude"]), (ARRAY["decimalLongitude"]), (ARRAY[country]), (ARRAY["stateProvince"]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated. 
 </code></pre> 
 # @geoscrub.public.geoscrub@: 
 <pre><code class="SQL"> 
 CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub ( 
 (array["decimallatitude"]), (array["decimallongitude"]), (array[country]), (array["stateprovince"]), (array[county]) 
 ); 

 ERROR:    could not create unique index "geoscrub_output__unique" 
 DETAIL:    Key ((ARRAY[decimallatitude]), (ARRAY[decimallongitude]), (ARRAY[country]), (ARRAY[stateprovince]), (ARRAY[county]))=({29.1}, {98.1666666666667}, {"United States"}, {TEXAS}, {WILSON}) is duplicated. 
 </code></pre> 
 # @vegbien.public.geoscrub_input@: 
 <pre><code class="SQL"> 
 CREATE UNIQUE INDEX "geoscrub_output__unique" ON geoscrub_input ( 
 (array["decimalLatitude"]), (array["decimalLongitude"]), (array[country]), (array["stateProvince"]), (array[county]) 
 ); 

 Query returned successfully with no result in 39009 ms. 
 </code></pre> 

 the duplication occurs between @vegbien.public.geoscrub_input@ and @geoscrub.public.geoscrub@, so it happens in the flat-file export/reimport of the table 

 check the flat file for duplicates: 

 <pre> 
 $ wc -l exports/geoscrub_input.csv 
 3673412 exports/geoscrub_input.csv 

 $ <exports/geoscrub_input.csv sort -u|wc -l 
 3673402 
 </pre> 

 there are duplicated lines, so check if these are duplicated rows as well: 

 <pre> 
 <exports/geoscrub_input.csv sort|uniq --repeated 
 Argentina,Neuquen,"Los Lagos 
 Mexico,Michoacán,"Múgica 
 Mexico,Michoacán,"Tacámbaro 
 United States,TEXAS,WILSON,29.1,98.1666666666667 
 </pre> 

 only 1 of these is a complete row (the others are beginnings of multiline rows), so there is 1 duplicate row 

 try re-exporting @geoscrub_input.csv@: 

 <pre> 
 $ rm=1 exports/geoscrub_input.csv.run 

 <exports/geoscrub_input.csv sort|uniq --repeated 
 Argentina,Neuquen,"Los Lagos 
 Mexico,Michoacán,"Múgica 
 Mexico,Michoacán,"Tacámbaro 
 United States,TEXAS,WILSON,29.1,98.1666666666667 
 </pre> 

 the row is still duplicated, so the duplication must be occurring in the flat-file export process 

 determine whether this row is represented differently in the DB than as plain text: 

 <pre><code class="SQL"> 
 SELECT * FROM geoscrub_input WHERE (country, "stateProvince", county) = ('United States', 'TEXAS', 'WILSON'); 
 </code></pre> 

 this produces 2 identical rows: 

 |country|stateProvince|county|decimalLatitude|decimalLongitude| 
 |United States|TEXAS|WILSON|29.1|98.1666666666667| 
 |United States|TEXAS|WILSON|29.1|98.1666666666667| 

 this suggests that the problem is with the export of decimals

Back