Project

General

Profile

1
see also:
2
http://wiki.vegpath.org/2012-11-09_conference_call#Geovalidation
3

    
4
mailto:administrator@johndonoghue.net?John_Donoghue.2012-4-3:17:05.Re:+[Bien-db]+VegBIEN+validation+utilities
5
-----
6
my geographic coordinate validation routines involved passing data between BIEN2 (MySQL) to ArcGIS (for spatial joins) to PostGIS (for spatial overlay via geo-aware SQL scripts, and back to BIEN2. It wasn't a pretty system but it was the only way to work around the limitations we had for doing true spatial work in MySQL. If BIEN3 was implemented in PostgreSQL (with PostGIS extensions), I think we could implement the whole workflow in a series of SQL scripts.
7
-----
8

    
9
e-mail from Jim on 2012-11-16:
10
-----
11
As a quick but hopefully sufficient way of transferring the geoscrub results back to you, I dumped my geoscrub output table out to CSV and stuck it on vegbiendev at /tmp/public.2012-11-04-07-34-10.r5984.geoscrub_output.csv.
12

    
13
Here is the essential schema info:
14

    
15
 decimallatitude double precision
16
 decimallongitude double precision
17
 country text
18
 stateprovince text
19
 county text
20
 countrystd text
21
 stateprovincestd text
22
 countystd text
23
 latlonvalidity integer
24
 countryvalidity integer
25
 stateprovincevalidity integer
26
 countyvalidity integer
27

    
28
The first 6 columns are identical to what you provided me as input, and if you do a projection over them, you should be able to recover the geoscrub_input table exactly. I confirmed this is the case in my database, but after importing on your end you should double check to make sure nothing screwy happened during the dump to CSV.
29

    
30
The added countrystd, stateprovincestd, and countystd columns contain the corresponding GADM place names in cases where the scrubbing procedure yielded a match to GADM. And the four *validity columns contain scores as described in my email to the bien-db list a few minutes ago.
31
-----
32

    
33
e-mail from Jim on 2012-11-16:
34
-----
35
Attached is a tabulation of provisional geo validity scores I generated for the full set of 1707970 geoscrub_input records Aaron provided me a couple of weeks ago (from schema public.2012-11-04-07-34-10.r5984). This goes all the way down to level of county/parish (i.e., 2nd order administrative divisions), although I know the scrubbing can still be improved especially at that lower level. Hence my "provisional" qualifier.
36

    
37
To produce these scores, I first passed the data through a geoscrubbing pipeline that attempts to translate asserted names into GADM (http://gadm.org) names with the help of geonames.org data, some custom mappings, and a few other tricks. Then I pushed them through a geovalidation pipeline that assesses the proximity of asserted lat/lon coordinates to their putative administrative areas in cases where scrubbing was successful. All operations happen in a Postgis database, and the full procedure ran for me in ~2 hours on a virtual server similar to vegbiendev. (This doesn't include the time it takes to set things up by importing GADM and geonames data and building appropriate indexes, but that's a one-time cost anyway.)
38

    
39
I still need to do a detailed writeup of the geoscrubbing and geovalidation procedures, but I think you have all the context you need for this email.
40

    
41
My validity codes are defined as follows, with the general rule that bigger numbers are better:
42

    
43
For latlonvalidity:
44
-1: Latitude and/or longitude is null
45
 0: Coordinate is not a valid geographic location
46
 1: Coordinate is a valid geographic location
47

    
48
For countryvalidity/stateprovincevalidity/countyvalidity:
49
-1: Name is null at this or some higher level
50
 0: Complete name provided, but couldn't be scrubbed to GADM
51
 1: Point is >5km from putative GADM polygon
52
 2: Point is <=5km from putative GADM polygon, but still outside it
53
 3: Point is in (or on border of) putative GADM polygon
54

    
55
Importantly, note that validity at each administrative level below country is constrained by the validity at higher levels. For example, if a stateprovince name is given but the country name is null, then the stateprovincevalidity score is -1. And of course, if a point doesn't fall within the scrubbed country, it certainly can't fall within the scrubbed stateprovince. To put it another way, the integer validity code at a lower level can never be larger than that of higher levels.
56

    
57
You could generate this yourself from the attached data, but for convenience, here is a tabulation of the lat/lon validity scores by themselves:
58

    
59
 latlonvalidity   count
60
             -1       0
61
              0    4981
62
              1 1702989
63

    
64
... and here are separate tabulations of the scores for each administrative level, in each case considering only locations with a valid coordinate (i.e., where latlonvalidity is 1):
65

    
66
 countryvalidity   count
67
              -1  222078
68
               0    6521
69
               1   49137
70
               2   23444
71
               3 1401809
72

    
73
 stateprovincevalidity   count
74
                    -1  298969
75
                     0   19282
76
                     1  107985
77
                     2   34634
78
                     3 1242119
79

    
80
 countyvalidity   count
81
             -1 1429935
82
              0   61266
83
              1   24842
84
              2   12477
85
              3  174469
86
-----
87

    
88
e-mail from Jim on 2013-1-16:
89
-----
90
Back in Nov 2012 I created a local Git repo to manage my geovalidation
91
code as I was developing it and messing around. Technically I'm pretty
92
sure there is a way to graft it into the BIEN SVN repo you've been
93
using, but I don't have time to deal with that now, and anyway there may
94
be advantages to keeping this separate (much as the TNRS development is
95
independent). I'm happy to give you access so you can clone the Git repo
96
yourself if you'd like, but in any event, I just created a 'BIEN Geo'
97
subproject of the main BIEN project in Redmine, and exposed my repo
98
through the browser therein. It's currently set as a private project,
99
but you should be able to see this when logged in:
100

    
101
https://projects.nceas.ucsb.edu/nceas/projects/biengeo/repository
102

    
103
So now you should at least be able to view and download the scripts, and
104
peruse the history. (If you do want to be able to clone the repo, send
105
me a public key and I'll send you further instructions.)
106

    
107
I just spent some time improving comments in the various files, as well
108
as writing up a README.txt file that I hope will get you started. Let me
109
know if (when) you have questions. As I mention in the README, the
110
scripts are probably not 100% runnable end-to-end without some
111
intervention here and there. Pretty close, but not quite. That was
112
certainly my goal, but given that this was all done in a flurry of a few
113
days with continual changes, it's not quite there.
114

    
115
Also, the shell scripts do have the specific wget commands to grab GADM2
116
and geonames.org data that need to be imported into the database to
117
geoscrub/geovalidate-enable it. But if you want (of for that matter if
118
you have trouble with downloading anything), I can put the files I got
119
back in November somewhere on vegbiendev. In fact, although the GADM2
120
data should be unchanged (it's versioned at 2.0), the geonames.org
121
downloads come from their daily database dumps IIRC, so what you'd get
122
now might not be the same as what I got. Probably unlikely to be many,
123
if any, changes in the high-level admin units we're using, but who
124
knows.
125
-----
126

    
127
e-mail from Jim on 2013-1-18:
128
-----
129
Regarding calculation of precision estimates for counties, I just resuscitated something I had started working up (with Bob's inspiration) back on the final Friday afternoon of the Nov 2012 BIEN WG meeting at NCEAS. Using GADM2 administrative polygons data (http://www.gadm.org), this calculation works by first reprojecting a given county into whatever UTM zone contains the county centroid, then (just as Bob suggested in his email) computing the distance from the centroid to the most distance point on the county border.
130

    
131
Have a look at the attached CSV of results for US counties, and mull over whether this seems useful. Maybe even sanity-check against your favorite county or two! For those who care, the SQL statement is pasted below my sig, along with a 'utmzone' function that I borrowed from the PostGIS wiki.
132

    
133
Note that I did the whole UTM reprojection bit because the GADM2 data are in a geographic projection, for which planar distance calculations aren't in meaningful units and aren't consistent across latitudes. Although postgis provides some functions that know how to work directly on the WGS 84 ellipsoid and don't require a projection, the exact functions we need for this aren't available.
134

    
135
Some misc things that were in my head as I did this:
136

    
137
1. This precision measure is conservative insofar as it basically describes a circle of radius equal to the precision value, and most counties will only actually occupy a fraction of the circle depending on their shape -- think of long skinny counties, or those which have an offshore island under their jurisdiction. Totally reasonable! But this _is_ just one conception of what "precision" means for county data.
138

    
139
2. In GADM2, counties are the lowest admin level for the US, making things easy. But other countries have additional levels below county/parish, meaning that we'd need to add a step that first unions those smaller polygons into a single polygon for the county/parish.
140

    
141
3. I was a little concerned about the, um, precision of these UTM-based precision estimates. Although UTM is convenient insofar as a local zone can be applied when looking at a given smallish area (like a county), distance is not perfectly preserved in this projection, which suffers from distortion as you approach the edges of a given zone. However, I just did some testing of the UTM distortion using points at roughly opposite east-west ends of Santa Barbara County. SB makes for a decent test case because it's a fairly large county that lies across two UTM zones. Distance between the test points I used is 100833 meters in UTM, and 100782 meters using the more accurate ellipsoid calculation. So I'm pretty confident we could say that our county precision estimates are themselves precise to 1 km (and probably usually much better than that), which I assume is fine?
142

    
143
As an aside, I was curious to see Franklin County, VA, showing up as having one of the worst precision values. Turns out GADM2 counts Franklin County as including the independent city of Franklin located a few hundred km east (http://en.wikipedia.org/wiki/Franklin,_Virginia). What can I say, anything we do is only as good as our inputs.
144

    
145
Cheers,
146
Jim
147

    
148

    
149
-- generate precision estimates for all US counties
150
-- (requires utmzone function, defined below)
151
SELECT name_1 AS state, name_2 AS county,
152
       ROUND(ST_Length(
153
         ST_Longestline(
154
           ST_Transform(geom, utmzone(ST_Centroid(geom))),
155
           ST_Centroid(ST_Transform(geom, utmzone(ST_Centroid(geom))))
156
         )
157
       )::numeric/1000, 3) AS error_km
158
  FROM gadm2
159
  WHERE name_0='United States'
160
  ORDER BY state, county;
161

    
162
-- function taken from postgis users wiki
163
CREATE OR REPLACE FUNCTION utmzone(geometry)
164
   RETURNS integer AS
165
 $BODY$
166
 DECLARE
167
     geomgeog geometry;
168
     zone int;
169
     pref int;
170

    
171
 BEGIN
172
     geomgeog:= ST_Transform($1,4326);
173

    
174
     IF (ST_Y(geomgeog))>0 THEN
175
        pref:=32600;
176
     ELSE
177
        pref:=32700;
178
     END IF;
179

    
180
     zone:=floor((ST_X(geomgeog)+180)/6)+1;
181

    
182
     RETURN zone+pref;
183
 END;
184
 $BODY$ LANGUAGE 'plpgsql' IMMUTABLE
185
   COST 100;
186
-----
187

    
188
e-mail from Jim on 2013-1-18:
189
-----
190
See attached, same as before but now with centroid coordinates (after
191
back-transforming from the local UTM zone to WGS84), along with their
192
calculated precision in km. Note that these centroids may differ from
193
what other authorities claim, because of differences in the method
194
and/or differences in the input polygons.
195

    
196
Revised CSV attached, revised SQL statement below.
197

    
198
Cheers,
199
Jim
200

    
201
SELECT name_1 AS state, name_2 AS county,
202
       ST_Y(ST_Transform(ST_Centroid(geom), 4326)) AS latitude,
203
       ST_X(ST_Transform(ST_Centroid(geom), 4326)) AS longitude,
204
       ROUND(
205
         ST_Length(ST_Longestline(geom, ST_Centroid(geom))
206
       )::numeric/1000, 3) AS error_km
207
  FROM (
208
    SELECT name_0, name_1, name_2,
209
           ST_Transform(geom, utmzone(ST_Centroid(geom))) AS geom
210
      FROM gadm2
211
      WHERE name_0='United States') AS gadm2_utm
212
  ORDER BY state, county
213
-----
(1-1/2)