6236 | aaronmk

E-mails from Jim:

-----

As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.

Here is the essential schema info:

decimallatitude        double precision
decimallongitude       double precision
country                text
stateprovince          text
county                 text
countrystd             text
stateprovincestd       text
countystd              text
latlonvalidity         integer
countryvalidity        integer
stateprovincevalidity  integer
countyvalidity         integer
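
A minimal sketch of re-creating and loading this on the receiving end in psql; the table name geoscrub_output and the presence of a header row in the dump are assumptions, not confirmed in the e-mail:

    -- Table definition taken verbatim from the schema info above.
    CREATE TABLE geoscrub_output (
        decimallatitude       double precision,
        decimallongitude      double precision,
        country               text,
        stateprovince         text,
        county                text,
        countrystd            text,
        stateprovincestd      text,
        countystd             text,
        latlonvalidity        integer,
        countryvalidity       integer,
        stateprovincevalidity integer,
        countyvalidity        integer
    );

    -- Client-side copy into the new table; drop "header" if the dump has no header row.
    \copy geoscrub_output from '/tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv' csv header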

The first 6 columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double-check that nothing screwy happened during the dump to CSV.
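
One way to run that round-trip check, as a sketch. The e-mail says the first 6 columns, though only five precede the *std columns in the schema above, so adjust the column list to match geoscrub_input's actual definition; note also that EXCEPT ignores duplicate-row multiplicity:

    -- Rows in the re-imported output whose input columns are missing from
    -- geoscrub_input; swap the two SELECTs to check the other direction.
    -- Both directions should return zero rows if the round trip was lossless.
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
      FROM geoscrub_output
    EXCEPT
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
      FROM geoscrub_input;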

The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.

-----

Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to the level of county/parish (i.e., 2nd-order administrative divisions), although I know the scrubbing can still be improved, especially at that lower level. Hence my "provisional" qualifier.

To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks. Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a PostGIS database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)

I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.

My validity codes are defined as follows, with the general rule that bigger numbers are better:

For latlonvalidity:
-1: Latitude and/or longitude is null
 0: Coordinate is not a valid geographic location
 1: Coordinate is a valid geographic location

For countryvalidity/stateprovincevalidity/countyvalidity:
-1: Name is null at this or some higher level
 0: Complete name provided, but couldn't be scrubbed to GADM
 1: Point is >5 km from the putative GADM polygon
 2: Point is <=5 km from the putative GADM polygon, but still outside it
 3: Point is in (or on the border of) the putative GADM polygon
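
Jim's actual implementation lives in the Git repo described in the 2013-01-16 e-mail below; purely as a minimal PostGIS sketch, the 1/2/3 distance scoring for a single point against its scrubbed polygon could be expressed like this. The gadm2(name_0, geom) table, SRID 4326, and the geography casts are all assumptions:

    -- Score one lat/lon against its putative country polygon.
    SELECT CASE
             WHEN ST_Intersects(g.geom, p.pt)                          THEN 3  -- inside or on border
             WHEN ST_DWithin(g.geom::geography, p.pt::geography, 5000) THEN 2  -- outside but <=5 km away
             ELSE                                                           1  -- >5 km away
           END AS countryvalidity
    FROM gadm2 g,
         (SELECT ST_SetSRID(ST_MakePoint(-119.84, 34.41), 4326) AS pt) p  -- lon, lat
    WHERE g.name_0 = 'United States';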

Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the integer validity code at a lower level can never be larger than that of higher levels.
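
That invariant is easy to sanity-check against the imported table; a sketch, using the table name assumed above:

    -- Should return 0: lower-level validity never exceeds higher-level validity.
    SELECT count(*)
    FROM geoscrub_output
    WHERE countyvalidity        > stateprovincevalidity
       OR stateprovincevalidity > countryvalidity;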

You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:

latlonvalidity    count
-1                    0
 0                 4981
 1              1702989
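
For reference, a sketch of the kind of query that generates these tabulations, again using the assumed table name:

    -- Tabulate latlonvalidity; for the per-level tables below, group on the
    -- other *validity columns and add WHERE latlonvalidity = 1.
    SELECT latlonvalidity, count(*)
    FROM geoscrub_output
    GROUP BY latlonvalidity
    ORDER BY latlonvalidity;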

... and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):

countryvalidity    count
-1                222078
 0                  6521
 1                 49137
 2                 23444
 3               1401809

stateprovincevalidity    count
-1                      298969
 0                       19282
 1                      107985
 2                       34634
 3                     1242119

countyvalidity    count
-1               1429935
 0                 61266
 1                 24842
 2                 12477
 3                174469

-----

7253 | aaronmk

e-mail from Jim on 2013-01-16:

-----

Back in Nov 2012 I created a local Git repo to manage my geovalidation code as I was developing it and messing around. Technically I'm pretty sure there is a way to graft it into the BIEN SVN repo you've been using, but I don't have time to deal with that now, and anyway there may be advantages to keeping this separate (much as the TNRS development is independent). I'm happy to give you access so you can clone the Git repo yourself if you'd like, but in any event, I just created a 'BIEN Geo' subproject of the main BIEN project in Redmine, and exposed my repo through the browser therein. It's currently set as a private project, but you should be able to see this when logged in:

https://projects.nceas.ucsb.edu/nceas/projects/biengeo/repository

So now you should at least be able to view and download the scripts, and peruse the history. (If you do want to be able to clone the repo, send me a public key and I'll send you further instructions.)

I just spent some time improving comments in the various files, as well as writing up a README.txt file that I hope will get you started. Let me know if (when) you have questions. As I mention in the README, the scripts are probably not 100% runnable end-to-end without some intervention here and there. Pretty close, but not quite. That was certainly my goal, but given that this was all done in a flurry of a few days with continual changes, it's not quite there.

Also, the shell scripts do have the specific wget commands to grab the GADM2 and geonames.org data that need to be imported into the database to geoscrub/geovalidate-enable it. But if you want (or for that matter if you have trouble downloading anything), I can put the files I got back in November somewhere on vegbiendev. In fact, although the GADM2 data should be unchanged (it's versioned at 2.0), the geonames.org downloads come from their daily database dumps IIRC, so what you'd get now might not be the same as what I got. There probably aren't many, if any, changes in the high-level admin units we're using, but who knows.

-----