E-mails from Jim:
-----
As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.

Here is the essential schema info:

decimallatitude        double precision
decimallongitude       double precision
country                text
stateprovince          text
county                 text
countrystd             text
stateprovincestd       text
countystd              text
latlonvalidity         integer
countryvalidity        integer
stateprovincevalidity  integer
countyvalidity         integer

The first 6 columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double-check that nothing screwy happened during the dump to CSV.
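
For example, a minimal import-and-check sketch in psql (the output table name here is my choice, and whether the dump includes a CSV header row is an assumption; the column list comes from the schema above):

    CREATE TABLE geoscrub_output (
        decimallatitude       double precision,
        decimallongitude      double precision,
        country               text,
        stateprovince         text,
        county                text,
        countrystd            text,
        stateprovincestd      text,
        countystd             text,
        latlonvalidity        integer,
        countryvalidity       integer,
        stateprovincevalidity integer,
        countyvalidity        integer
    );

    -- client-side load in psql; drop "header" if the dump has no header row
    \copy geoscrub_output from '/tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv' with csv header

    -- projection check: should return zero rows (run again with the two
    -- table names swapped to check both directions)
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
    FROM geoscrub_output
    EXCEPT ALL
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
    FROM geoscrub_input;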

The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.
-----
Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to the level of county/parish (i.e., 2nd-order administrative divisions), although I know the scrubbing can still be improved, especially at that lower level. Hence my "provisional" qualifier.

To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks. Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a PostGIS database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)
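
The real queries live in the scripts (see the repository link in the next email); purely as an illustration of the kind of proximity test described, a PostGIS check could look like this (gadm2, name_0, and geom are placeholder names, not necessarily the actual schema):

    -- Assumes GADM polygons in gadm2(name_0 text, geom geometry), SRID 4326,
    -- with a spatial index: CREATE INDEX gadm2_geom_gist ON gadm2 USING gist (geom);
    SELECT o.decimallongitude, o.decimallatitude, o.countrystd,
           ST_Covers(g.geom,
               ST_SetSRID(ST_MakePoint(o.decimallongitude, o.decimallatitude), 4326))
               AS inside_or_on_border,
           ST_DWithin(g.geom::geography,
               ST_SetSRID(ST_MakePoint(o.decimallongitude, o.decimallatitude), 4326)::geography,
               5000) AS within_5km  -- geography distances are in meters
    FROM geoscrub_output o
    JOIN gadm2 g ON g.name_0 = o.countrystd
    WHERE o.latlonvalidity = 1;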

I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.

My validity codes are defined as follows, with the general rule that bigger numbers are better:

For latlonvalidity:
-1: Latitude and/or longitude is null
 0: Coordinate is not a valid geographic location
 1: Coordinate is a valid geographic location
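
(Spelled out in SQL, my reading of this rule would be roughly the following; the actual test in the pipeline may differ:)

    SELECT CASE
             WHEN decimallatitude IS NULL OR decimallongitude IS NULL THEN -1
             WHEN decimallatitude BETWEEN -90 AND 90
              AND decimallongitude BETWEEN -180 AND 180 THEN 1
             ELSE 0
           END AS latlonvalidity
    FROM geoscrub_input;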

For countryvalidity/stateprovincevalidity/countyvalidity:
-1: Name is null at this or some higher level
 0: Complete name provided, but couldn't be scrubbed to GADM
 1: Point is >5 km from putative GADM polygon
 2: Point is <=5 km from putative GADM polygon, but still outside it
 3: Point is in (or on border of) putative GADM polygon

Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the validity code at a lower level can never exceed the code at any higher level.
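
As a sketch, the scoring and that clamping rule might be encoded like this (the scores table and the country_dist_m column, a point-to-polygon distance in meters, are placeholders, not the actual scripts):

    -- Illustrative scoring for the country level:
    SELECT CASE
             WHEN country IS NULL THEN -1        -- no name at this level
             WHEN countrystd IS NULL THEN 0      -- name given, but no GADM match
             WHEN country_dist_m > 5000 THEN 1   -- more than 5 km away
             WHEN country_dist_m > 0 THEN 2      -- within 5 km, but still outside
             ELSE 3                              -- inside or on the border
           END AS countryvalidity
    FROM scores;

    -- Clamp lower levels so they never exceed the levels above; this also
    -- propagates -1 (name null at a higher level) downward:
    UPDATE scores
    SET stateprovincevalidity = LEAST(stateprovincevalidity, countryvalidity),
        countyvalidity = LEAST(countyvalidity, stateprovincevalidity,
                               countryvalidity);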

You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:

latlonvalidity    count
            -1        0
             0     4981
             1  1702989

... and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):

countryvalidity    count
             -1   222078
              0     6521
              1    49137
              2    23444
              3  1401809

stateprovincevalidity    count
                   -1   298969
                    0    19282
                    1   107985
                    2    34634
                    3  1242119

countyvalidity    count
            -1  1429935
             0    61266
             1    24842
             2    12477
             3   174469
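
(Each of these tabulations is just a grouped count; against the imported table it would go along these lines:)

    SELECT countryvalidity, count(*) AS count
    FROM geoscrub_output
    WHERE latlonvalidity = 1
    GROUP BY countryvalidity
    ORDER BY countryvalidity;
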
-----

E-mail from Jim on 2013-01-16:
-----
Back in Nov 2012 I created a local Git repo to manage my geovalidation code as I was developing it and messing around. Technically I'm pretty sure there is a way to graft it into the BIEN SVN repo you've been using, but I don't have time to deal with that now, and anyway there may be advantages to keeping this separate (much as the TNRS development is independent). I'm happy to give you access so you can clone the Git repo yourself if you'd like, but in any event, I just created a 'BIEN Geo' subproject of the main BIEN project in Redmine, and exposed my repo through the browser therein. It's currently set as a private project, but you should be able to see this when logged in:

https://projects.nceas.ucsb.edu/nceas/projects/biengeo/repository

So now you should at least be able to view and download the scripts, and peruse the history. (If you do want to be able to clone the repo, send me a public key and I'll send you further instructions.)

I just spent some time improving comments in the various files, as well as writing up a README.txt file that I hope will get you started. Let me know if (when) you have questions. As I mention in the README, the scripts are probably not 100% runnable end-to-end without some intervention here and there. Pretty close, but not quite. That was certainly my goal, but given that this was all done in a flurry of a few days with continual changes, it's not quite there.

Also, the shell scripts do have the specific wget commands to grab the GADM2 and geonames.org data that need to be imported into the database to geoscrub/geovalidate-enable it. But if you want (or for that matter if you have trouble downloading anything), I can put the files I got back in November somewhere on vegbiendev. In fact, although the GADM2 data should be unchanged (it's versioned at 2.0), the geonames.org downloads come from their daily database dumps IIRC, so what you'd get now might not be the same as what I got. There are probably few, if any, changes in the high-level admin units we're using, but who knows.
-----