6236
aaronmk
E-mails from Jim:
-----
As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.
Here is the essential schema info:
decimallatitude        double precision
decimallongitude       double precision
country                text
stateprovince          text
county                 text
countrystd             text
stateprovincestd       text
countystd              text
latlonvalidity         integer
countryvalidity        integer
stateprovincevalidity  integer
countyvalidity         integer
The first 6 columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double check to make sure nothing screwy happened during the dump to CSV.
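
One way to run that double check after import is to compare the projection against the original input as multisets, so that row order does not matter and duplicate rows are still counted. The following is only a minimal sketch in Python, assuming both tables have been exported to CSV with a header row. The helper name projection_matches and the sample rows are hypothetical, and differences in NULL or floating-point formatting between the two dumps would need to be normalized first.

```python
import csv
import io
from collections import Counter


def projection_matches(input_csv, output_csv):
    """Return True if projecting the output onto the input's columns
    reproduces the input rows exactly (compared as a multiset)."""
    input_reader = csv.DictReader(input_csv)
    columns = input_reader.fieldnames  # project onto whatever the input has
    input_rows = Counter(tuple(row[c] for c in columns) for row in input_reader)

    output_reader = csv.DictReader(output_csv)
    output_rows = Counter(tuple(row[c] for c in columns) for row in output_reader)
    return input_rows == output_rows


# Made-up sample rows, for illustration only.
sample_input = (
    "decimallatitude,decimallongitude,country\n"
    "10.5,-84.0,Costa Rica\n"
)
sample_output = (
    "decimallatitude,decimallongitude,country,countrystd,countryvalidity\n"
    "10.5,-84.0,Costa Rica,Costa Rica,3\n"
)
print(projection_matches(io.StringIO(sample_input), io.StringIO(sample_output)))
```

Comparing multisets rather than sets means a row that was duplicated or dropped during the dump shows up as a mismatch.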
The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.
-----
Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to the level of county/parish (i.e., 2nd order administrative divisions), although I know the scrubbing can still be improved, especially at that lower level. Hence my "provisional" qualifier.
To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks. Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a PostGIS database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)
I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.
My validity codes are defined as follows, with the general rule that bigger numbers are better:
For latlonvalidity:
-1: Latitude and/or longitude is null
0: Coordinate is not a valid geographic location
1: Coordinate is a valid geographic location
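
To make the three latlonvalidity codes concrete, here is a minimal Python sketch. The e-mail does not say how "valid geographic location" is tested, so the range check below (latitude in [-90, 90], longitude in [-180, 180]) is purely an assumption, and the function name latlon_validity is hypothetical.

```python
def latlon_validity(lat, lon):
    """Score a coordinate pair with the latlonvalidity codes listed above.

    ASSUMPTION: a "valid geographic location" is taken to mean a latitude
    in [-90, 90] and a longitude in [-180, 180]. The actual pipeline may
    apply a different or stricter test.
    """
    if lat is None or lon is None:
        return -1  # latitude and/or longitude is null
    if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
        return 1   # valid under the assumed range check
    return 0       # not a valid geographic location
```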
For countryvalidity/stateprovincevalidity/countyvalidity:
-1: Name is null at this or some higher level
0: Complete name provided, but couldn't be scrubbed to GADM
1: Point is >5km from putative GADM polygon
2: Point is <=5km from putative GADM polygon, but still outside it
3: Point is in (or on border of) putative GADM polygon
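
The five codes for an administrative level can be read as a simple decision ladder. The Python sketch below illustrates that reading only and is not the actual pipeline, which runs inside the database. The function name admin_validity and its parameters are hypothetical, the distance is assumed to be computed elsewhere, and a distance of 0 is assumed to mean the point is in or on the border of the polygon. It also assumes the coordinate itself is valid.

```python
def admin_validity(names, scrubbed, distance_km):
    """Score one administrative level with the codes listed above.

    names:       asserted names from the country level down to the level
                 being scored (hypothetical parameter)
    scrubbed:    True if the complete name was matched to GADM
    distance_km: distance from the point to the putative GADM polygon,
                 assumed to be 0 when the point is inside or on the border
    """
    if any(name is None for name in names):
        return -1  # name is null at this or some higher level
    if not scrubbed:
        return 0   # complete name, but no GADM match
    if distance_km > 5:
        return 1   # more than 5 km away
    if distance_km > 0:
        return 2   # within 5 km, but still outside
    return 3       # in, or on the border of, the polygon
```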
Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the integer validity code at a lower level can never be larger than that of higher levels.
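
That last rule is easy to test on imported data. Here is a minimal Python sketch, assuming each record is available as a dict keyed by the column names from the schema; the function names are hypothetical.

```python
def respects_hierarchy(row):
    """True if the validity code never increases going down the hierarchy."""
    return (row["countyvalidity"]
            <= row["stateprovincevalidity"]
            <= row["countryvalidity"])


def hierarchy_violations(rows):
    """Return the rows whose lower-level code exceeds a higher-level one."""
    return [row for row in rows if not respects_hierarchy(row)]
```

An empty result from hierarchy_violations is what the rule above predicts; any row it returns would point to a problem in the import or the scoring.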
You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:
latlonvalidity    count
            -1        0
             0     4981
             1  1702989
... and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):
countryvalidity    count
             -1   222078
              0     6521
              1    49137
              2    23444
              3  1401809
stateprovincevalidity    count
                   -1   298969
                    0    19282
                    1   107985
                    2    34634
                    3  1242119
countyvalidity    count
            -1  1429935
             0    61266
             1    24842
             2    12477
             3   174469
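
For anyone regenerating these tabulations from the attached data, the following Python sketch shows one way to do it. It assumes the records have been loaded as dicts with the four validity columns already converted to integers; the function name tabulate is hypothetical.

```python
from collections import Counter


def tabulate(rows):
    """Count validity codes per column.

    latlonvalidity is tabulated over all rows; the three administrative
    levels are tabulated only over rows where latlonvalidity is 1.
    """
    rows = list(rows)
    tables = {"latlonvalidity": Counter(r["latlonvalidity"] for r in rows)}
    valid = [r for r in rows if r["latlonvalidity"] == 1]
    for column in ("countryvalidity", "stateprovincevalidity", "countyvalidity"):
        tables[column] = Counter(r[column] for r in valid)
    return tables
```

As a consistency check, the counts in each of the three administrative-level tables above sum to 1702989, the number of records with latlonvalidity 1, and the latlonvalidity counts sum to the full 1707970 records.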
-----