E-mails from Jim:
-----
As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.

Here is the essential schema info:

decimallatitude        double precision
decimallongitude       double precision
country                text
stateprovince          text
county                 text
countrystd             text
stateprovincestd       text
countystd              text
latlonvalidity         integer
countryvalidity        integer
stateprovincevalidity  integer
countyvalidity         integer

The first 6 columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double-check to make sure nothing screwy happened during the dump to CSV.
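
The double check described above could be sketched as follows. This is only an illustration, under the assumptions that both tables are available as CSV text with a header row and that rows are compared as a multiset of string tuples; the function names are hypothetical, and the list of input columns is passed in by the caller so it can match whatever the real geoscrub_input table contains.

```python
import csv
import io
from collections import Counter


def project_rows(csv_text, columns):
    """Return a multiset of rows restricted to the given columns.

    Values are compared as the strings found in the CSV, so numeric
    formatting differences (e.g. "1.0" vs "1") would count as mismatches.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(tuple(row[c] for c in columns) for row in reader)


def projection_matches(output_csv, input_csv, columns):
    """True if projecting the output onto `columns` reproduces the input rows."""
    return project_rows(output_csv, columns) == project_rows(input_csv, columns)
```

Comparing as a multiset rather than a set means a duplicated or dropped row is also reported as a mismatch.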

The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.
-----
Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to the level of county/parish (i.e., 2nd order administrative divisions), although I know the scrubbing can still be improved, especially at that lower level. Hence my "provisional" qualifier.

To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks. Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a PostGIS database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)

I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.

My validity codes are defined as follows, with the general rule that bigger numbers are better:

For latlonvalidity:
-1: Latitude and/or longitude is null
0: Coordinate is not a valid geographic location
1: Coordinate is a valid geographic location
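
The latlonvalidity codes above can be illustrated with a small sketch. The email does not say what makes a coordinate a "valid geographic location", so the range test used here (latitude within [-90, 90], longitude within [-180, 180]) is an assumption, and the function name is hypothetical.

```python
def latlon_validity(lat, lon):
    """Return -1, 0, or 1 following the latlonvalidity codes.

    ASSUMPTION: "valid" is taken to mean the usual degree ranges;
    the actual pipeline may apply a different test.
    """
    if lat is None or lon is None:
        return -1  # latitude and/or longitude is null
    if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
        return 1   # coordinate is a valid geographic location
    return 0       # coordinate is not a valid geographic location
```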

For countryvalidity/stateprovincevalidity/countyvalidity:
-1: Name is null at this or some higher level
0: Complete name provided, but couldn't be scrubbed to GADM
1: Point is >5km from putative GADM polygon
2: Point is <=5km from putative GADM polygon, but still outside it
3: Point is in (or on border of) putative GADM polygon
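
As a rough sketch of how the five codes above partition the cases, assuming the scrubbing result and the point-to-polygon distance have already been computed elsewhere (for instance in PostGIS). The function name and its three arguments are hypothetical and are not taken from the actual pipeline.

```python
def admin_validity(names, scrubbed, distance_km):
    """Return a validity code from -1 to 3 for one administrative level.

    names       -- asserted names from country down to this level
                   (None marks a null name)
    scrubbed    -- True if the full name was matched to a GADM polygon
    distance_km -- distance from the point to that polygon, with 0.0
                   meaning inside or on the border (ignored if not scrubbed)
    """
    if any(name is None for name in names):
        return -1  # name is null at this or some higher level
    if not scrubbed:
        return 0   # complete name, but no GADM match
    if distance_km > 5.0:
        return 1   # more than 5 km from the polygon
    if distance_km > 0.0:
        return 2   # within 5 km, but still outside
    return 3       # in, or on the border of, the polygon
```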

Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the integer validity code at a lower level can never be larger than that of higher levels.
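
That ordering rule lends itself to a simple sanity check on imported rows. The sketch below is only an illustration; the function name is hypothetical.

```python
def is_consistent(countryvalidity, stateprovincevalidity, countyvalidity):
    """True if no lower-level code exceeds the code at a higher level."""
    return countryvalidity >= stateprovincevalidity >= countyvalidity
```

For example, the codes (3, 3, 2) and (3, -1, -1) satisfy the rule, while (1, 3, 3) would indicate a problem.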

You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:

latlonvalidity    count
            -1        0
             0     4981
             1  1702989

... and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):

countryvalidity    count
             -1   222078
              0     6521
              1    49137
              2    23444
              3  1401809

stateprovincevalidity    count
                   -1   298969
                    0    19282
                    1   107985
                    2    34634
                    3  1242119

countyvalidity    count
            -1  1429935
             0    61266
             1    24842
             2    12477
             3   174469
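
Tabulations like the ones above could be regenerated from the CSV with something along these lines. This is a sketch, assuming the file has a header row using the schema's column names and that every validity column holds an integer; the function name and the valid_coords_only flag are hypothetical.

```python
import csv
import io
from collections import Counter


def tabulate(csv_text, column, valid_coords_only=False):
    """Count rows per validity code in `column`.

    If valid_coords_only is set, rows whose latlonvalidity is not 1
    are skipped, matching the per-level tabulations.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if valid_coords_only and int(row["latlonvalidity"]) != 1:
            continue
        counts[int(row[column])] += 1
    return counts
```

As a cross-check on the figures themselves, each of the three per-level tabulations sums to 1702989, the number of records with latlonvalidity 1, and the latlonvalidity counts sum to the full 1707970 records.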
-----