E-mails from Jim:
-----
As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.
Here is the essential schema info:

 decimallatitude double precision
 decimallongitude double precision
 country text
 stateprovince text
 county text
 countrystd text
 stateprovincestd text
 countystd text
 latlonvalidity integer
 countryvalidity integer
 stateprovincevalidity integer
 countyvalidity integer

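A minimal sketch of loading this dump on the receiving end (not Jim's actual import step; it assumes the CSV columns appear in the order listed above, and HEADER should be added to the \copy options if the file turns out to have a header row):

    CREATE TABLE geoscrub_output (
        decimallatitude       double precision,
        decimallongitude      double precision,
        country               text,
        stateprovince         text,
        county                text,
        countrystd            text,
        stateprovincestd      text,
        countystd             text,
        latlonvalidity        integer,
        countryvalidity       integer,
        stateprovincevalidity integer,
        countyvalidity        integer
    );

    -- psql meta-command; the path is the one given in the email
    \copy geoscrub_output FROM '/tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv' WITH (FORMAT csv)
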
The first five columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double-check to make sure nothing screwy happened during the dump to CSV.
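
A quick way to perform that check, assuming geoscrub_input is still available in the same database (note that EXCEPT compares sets, so exact duplicate rows would need a stricter count-based comparison):

    -- output rows whose input columns match no input row; expect zero rows
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
    FROM geoscrub_output
    EXCEPT
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
    FROM geoscrub_input;

    -- and the reverse direction; also expect zero rows
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
    FROM geoscrub_input
    EXCEPT
    SELECT decimallatitude, decimallongitude, country, stateprovince, county
    FROM geoscrub_output;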
The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.
-----
Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to the level of county/parish (i.e., 2nd-order administrative divisions), although I know the scrubbing can still be improved, especially at that lower level. Hence my "provisional" qualifier.
To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks. Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a PostGIS database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)
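
The scripts themselves aren't reproduced here, but the name-translation step might look roughly like the following; the gadm2 and geonames_alternate_names tables and their columns are illustrative assumptions, not Jim's actual schema:

    -- Try an exact (case-insensitive) match against GADM country names,
    -- then fall back to a geonames-derived alternate-names mapping;
    -- countrystd stays null when neither yields a match.
    SELECT i.country,
           COALESCE(g.name_0, alt.gadm_name) AS countrystd
    FROM geoscrub_input i
    LEFT JOIN (SELECT DISTINCT name_0 FROM gadm2) g
           ON lower(g.name_0) = lower(i.country)
    LEFT JOIN geonames_alternate_names alt
           ON lower(alt.asserted_name) = lower(i.country);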
I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.
My validity codes are defined as follows, with the general rule that bigger numbers are better:
For latlonvalidity:
-1: Latitude and/or longitude is null
 0: Coordinate is not a valid geographic location
 1: Coordinate is a valid geographic location
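
A sketch of how these codes could be computed, assuming "valid geographic location" simply means the coordinate falls within the conventional lat/lon ranges (Jim's exact test may differ):

    SELECT decimallatitude, decimallongitude,
           CASE
             WHEN decimallatitude IS NULL OR decimallongitude IS NULL THEN -1
             WHEN decimallatitude BETWEEN -90 AND 90
              AND decimallongitude BETWEEN -180 AND 180 THEN 1
             ELSE 0  -- non-null but not a valid geographic location
           END AS latlonvalidity
    FROM geoscrub_input;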
For countryvalidity/stateprovincevalidity/countyvalidity:
-1: Name is null at this or some higher level
 0: Complete name provided, but couldn't be scrubbed to GADM
 1: Point is >5km from putative GADM polygon
 2: Point is <=5km from putative GADM polygon, but still outside it
 3: Point is in (or on border of) putative GADM polygon
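
The 1/2/3 cases map naturally onto PostGIS containment and distance tests. As an illustration only (the gadm2 table, its geom column, and the join on the scrubbed name are assumptions), country-level scoring could look like:

    SELECT o.decimallatitude, o.decimallongitude, o.countrystd,
           CASE
             WHEN ST_Intersects(g.geom, o.pt) THEN 3  -- in, or on the border of, the polygon
             WHEN ST_DWithin(g.geom::geography,
                             o.pt::geography, 5000) THEN 2  -- outside, but within 5 km
             ELSE 1                                    -- more than 5 km away
           END AS countryvalidity
    FROM (SELECT *,
                 ST_SetSRID(ST_MakePoint(decimallongitude, decimallatitude),
                            4326) AS pt
          FROM geoscrub_output) o
    JOIN gadm2 g ON g.name_0 = o.countrystd;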
Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the integer validity code at a lower level can never be larger than that of higher levels.
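
That invariant is easy to spot-check once the data are loaded; this query (table name assumed) should return zero rows:

    SELECT *
    FROM geoscrub_output
    WHERE stateprovincevalidity > countryvalidity
       OR countyvalidity > stateprovincevalidity;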
You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:
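
This tabulation, and the per-level ones further below, reduce to simple GROUP BY queries over the imported table, along the lines of:

    SELECT latlonvalidity, count(*) AS count
    FROM geoscrub_output
    GROUP BY latlonvalidity
    ORDER BY latlonvalidity;

    -- per-level variant, restricted to valid coordinates
    SELECT countryvalidity, count(*) AS count
    FROM geoscrub_output
    WHERE latlonvalidity = 1
    GROUP BY countryvalidity
    ORDER BY countryvalidity;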
 latlonvalidity   count
             -1       0
              0    4981
              1 1702989

... and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):
 countryvalidity   count
              -1  222078
               0    6521
               1   49137
               2   23444
               3 1401809

 stateprovincevalidity   count
                     -1  298969
                      0   19282
                      1  107985
                      2   34634
                      3 1242119

 countyvalidity   count
             -1 1429935
              0   61266
              1   24842
              2   12477
              3  174469
-----
E-mail from Jim on 2013-01-16:
-----
Back in Nov 2012 I created a local Git repo to manage my geovalidation code as I was developing it and messing around. Technically I'm pretty sure there is a way to graft it into the BIEN SVN repo you've been using, but I don't have time to deal with that now, and anyway there may be advantages to keeping this separate (much as the TNRS development is independent). I'm happy to give you access so you can clone the Git repo yourself if you'd like, but in any event, I just created a 'BIEN Geo' subproject of the main BIEN project in Redmine and exposed my repo through the browser therein. It's currently set as a private project, but you should be able to see this when logged in:
https://projects.nceas.ucsb.edu/nceas/projects/biengeo/repository
So now you should at least be able to view and download the scripts, and peruse the history. (If you do want to be able to clone the repo, send me a public key and I'll send you further instructions.)
I just spent some time improving comments in the various files, as well as writing up a README.txt file that I hope will get you started. Let me know if (when) you have questions. As I mention in the README, the scripts are probably not 100% runnable end-to-end without some intervention here and there. Pretty close, but not quite. That was certainly my goal, but given that this was all done in a flurry of a few days with continual changes, it's not quite there.
Also, the shell scripts do have the specific wget commands to grab the GADM2 and geonames.org data that need to be imported into the database to geoscrub/geovalidate-enable it. But if you want (or for that matter if you have trouble downloading anything), I can put the files I got back in November somewhere on vegbiendev. In fact, although the GADM2 data should be unchanged (it's versioned at 2.0), the geonames.org downloads come from their daily database dumps IIRC, so what you'd get now might not be the same as what I got. Probably unlikely to be many, if any, changes in the high-level admin units we're using, but who knows.
-----