E-mails from Jim:

-----

As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.

Here is the essential schema info:

    decimallatitude         double precision
    decimallongitude        double precision
    country                 text
    stateprovince           text
    county                  text
    countrystd              text
    stateprovincestd        text
    countystd               text
    latlonvalidity          integer
    countryvalidity         integer
    stateprovincevalidity   integer
    countyvalidity          integer

The first 6 columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double check to make sure nothing screwy happened during the dump to CSV.

The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.

-----

Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to level of county/parish (i.e., 2nd order administrative divisions), although I know the scrubbing can still be improved especially at that lower level. Hence my "provisional" qualifier.

To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks.
Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a Postgis database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)

I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.

My validity codes are defined as follows, with the general rule that bigger numbers are better:

For latlonvalidity:
    -1: Latitude and/or longitude is null
     0: Coordinate is not a valid geographic location
     1: Coordinate is a valid geographic location

For countryvalidity/stateprovincevalidity/countyvalidity:
    -1: Name is null at this or some higher level
     0: Complete name provided, but couldn't be scrubbed to GADM
     1: Point is >5km from putative GADM polygon
     2: Point is <=5km from putative GADM polygon, but still outside it
     3: Point is in (or on border of) putative GADM polygon

Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the integer validity code at a lower level can never be larger than that of higher levels.

You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:

    latlonvalidity     count
    -1                     0
     0                  4981
     1               1702989

...
and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):

    countryvalidity          count
    -1                      222078
     0                        6521
     1                       49137
     2                       23444
     3                     1401809

    stateprovincevalidity    count
    -1                      298969
     0                       19282
     1                      107985
     2                       34634
     3                     1242119

    countyvalidity           count
    -1                     1429935
     0                       61266
     1                       24842
     2                       12477
     3                      174469

-----
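
Note: the hierarchy rule from the second e-mail (a lower-level validity code can never be larger than a higher-level one) and the per-level tabulations restricted to latlonvalidity = 1 could be double-checked after importing the CSV with something like the Python sketch below. This is only an illustration, not Jim's procedure, which ran inside the database. It assumes the CSV has a header row using the column names from the schema above. The function names (read_rows, hierarchy_ok, tabulate) and the sample rows are made up for the example.

```python
import csv
from collections import Counter

# Column names are taken from the schema in the first e-mail,
# ordered from the highest administrative level to the lowest.
LEVELS = ("countryvalidity", "stateprovincevalidity", "countyvalidity")


def read_rows(path):
    """Read the CSV dump as dicts. ASSUMPTION: the file has a header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def hierarchy_ok(row):
    """True if no lower-level score is larger than the level above it."""
    scores = [int(row[name]) for name in LEVELS]
    return all(lower <= higher for higher, lower in zip(scores, scores[1:]))


def tabulate(rows, column, valid_coords_only=True):
    """Count rows per score in `column`, optionally keeping only
    rows where latlonvalidity is 1 (as in the tabulations above)."""
    counts = Counter()
    for row in rows:
        if valid_coords_only and int(row["latlonvalidity"]) != 1:
            continue
        counts[int(row[column])] += 1
    return dict(sorted(counts.items()))


# Hypothetical rows for illustration only; these are not real geoscrub output.
SAMPLE = [
    {"latlonvalidity": "1", "countryvalidity": "3",
     "stateprovincevalidity": "3", "countyvalidity": "-1"},
    {"latlonvalidity": "1", "countryvalidity": "3",
     "stateprovincevalidity": "2", "countyvalidity": "1"},
    {"latlonvalidity": "0", "countryvalidity": "0",
     "stateprovincevalidity": "0", "countyvalidity": "-1"},
    # Deliberately breaks the rule: stateprovince (2) > country (1).
    {"latlonvalidity": "1", "countryvalidity": "1",
     "stateprovincevalidity": "2", "countyvalidity": "0"},
]

if __name__ == "__main__":
    violations = [r for r in SAMPLE if not hierarchy_ok(r)]
    print("rows violating the hierarchy rule:", len(violations))
    for level in LEVELS:
        print(level, tabulate(SAMPLE, level))
```

On the real data, the rows would come from read_rows() pointed at the CSV dump. Any row for which hierarchy_ok() returns False would suggest that something went wrong during the dump or the import.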