BIEN geovalidation notes
========================

***** obtain source code:
svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/
additional, in-progress files are at
sftp://vegbiendev.nceas.ucsb.edu/home/psarando/src/bien/derived/biengeo/

***** install dependencies:
The only dependencies for running these scripts are PostgreSQL 9.1, PostGIS 2.0,
and unzip.

For PostGIS installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:
sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
  * Note additional install instructions for postgis-2.0 on Ubuntu 13 found here:
  * http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
  * Currently, the release name in the ubuntugis-unstable sources list
    entry needs to be changed to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts
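
Before running any of the scripts, it may help to confirm that the
command-line tools are actually on the PATH. The sketch below is only an
illustration: check_deps is a hypothetical helper, not part of biengeo,
and it checks only that the named executables exist, not their versions.

```shell
# Hypothetical helper (not part of biengeo): print each command that is
# missing from PATH and return non-zero if any are missing.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_deps psql unzip
```

Once a PostGIS-enabled DB exists, running SELECT postgis_version(); in
psql reports the installed PostGIS version.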


***** initialize the DB:
cd <svn_biengeo_root>
1. setup.sh
   - creates postgis DB and all base tables

***** update geoscrub validation data:
runtime: ~40 minutes
cd <svn_biengeo_root>
2. update_validation_data.sh
   - runs the following scripts in order to load validation data:
     * update_gadm_data.sh
       runtime: ~15 minutes (not including download time)
       - loads GADM2 data into a new (or re-created) gadm2 table
     * update_geonames_data.sh
       runtime: ~25 minutes (not including download time)
       - loads geonames.org data and adds some custom mapping logic
     * geonames-to-gadm.*.sql
       runtime: ~1 minute
       - contains SQL statements that build linkages between geonames.org
         names and GADM2 names
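
If the geonames-to-gadm.*.sql files ever need to be re-applied by hand,
something like the following sketch could do it. This is not how
update_validation_data.sh is known to invoke them; run_sql_files is a
hypothetical helper, and the DB name "geoscrub" is an assumption.
ON_ERROR_STOP makes psql exit non-zero at the first SQL error instead of
carrying on.

```shell
# Hypothetical helper (not part of biengeo): apply SQL files in argument
# order and stop at the first failure.
PSQL="${PSQL:-psql}"
DB="${DB:-geoscrub}"   # assumed database name

run_sql_files() {
    local f
    for f in "$@"; do
        echo "applying $f"
        "$PSQL" -v ON_ERROR_STOP=1 -d "$DB" -f "$f" || return 1
    done
}

# Example (the shell expands the glob in sorted order):
#   run_sql_files geonames-to-gadm.*.sql
```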

***** geoscrub new data:
WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h
cd <svn_biengeo_root>
3. geoscrub.sh
   - runs the following scripts in order to load and scrub vegbien input data:
     * load-geoscrub-input.sh
       - dumps geoscrub_input from vegbien and loads it into the geoscrub db
     * geonames.sql
       - contains SQL statements that scrub asserted names and (to the
         extent possible) map them to GADM2
     * geovalidate.sql
       runtime: 5.5 h
       - contains (postgis-extended) SQL statements that score the validity
         of GADM2-scrubbed names against given point coordinates
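
Since geoscrub.sh deletes previous results, it may be worth dumping the
existing geoscrub table first. A minimal sketch, assuming the table lives
in a DB named "geoscrub"; backup_geoscrub is a hypothetical helper, not
part of biengeo, and pg_dump will fail if the table does not exist yet.

```shell
# Hypothetical helper (not part of biengeo): save the current geoscrub
# table to a file before a new run overwrites it.
PG_DUMP="${PG_DUMP:-pg_dump}"
DB="${DB:-geoscrub}"   # assumed database name

backup_geoscrub() {
    # Default file name carries a timestamp so earlier dumps are kept.
    local out="${1:-geoscrub_$(date +%Y%m%d_%H%M%S).dump}"
    # -Fc: custom-format archive (restore with pg_restore); -t: one table
    "$PG_DUMP" -Fc -t geoscrub -f "$out" "$DB" && echo "saved $out"
}

# Example: backup_geoscrub && bash geoscrub.sh
```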

[Also see comments embedded in specific scripts in this directory.]

The bash and SQL statements contained in the files as ordered above
should be applied to carry out geographic name scrubbing and
geovalidation on a given corpus of BIEN location records.

That said, this was done under a tight deadline in order to produce a
geovalidated BIEN3 corpus in advance of the Nov 2013 working group
meeting, and much of it was actually executed piecemeal, in an iterative
and interactive fashion, within a bash shell and psql session. I
therefore can't guarantee that the code in its current state could be
run end-to-end without intervention. It's close, but probably not
bulletproof.
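
Given that, one cautious way to drive the three top-level scripts is to
run them one at a time, stop at the first failure, and restart from the
failed step once it is fixed. The sketch below only illustrates that
idea; run_from and the RUNNER variable are hypothetical and not part of
biengeo.

```shell
# Hypothetical runner (not part of biengeo): run the listed steps in
# order, beginning at the step named by the first argument, and stop at
# the first failure.
run_from() {
    local start="$1" started=0 step
    shift
    for step in "$@"; do
        if [ "$step" = "$start" ]; then started=1; fi
        [ "$started" -eq 1 ] || continue
        echo "running $step"
        if ! "${RUNNER:-bash}" "$step"; then
            echo "failed at $step; fix it and rerun from $step" >&2
            return 1
        fi
    done
    # Report a start step that matched none of the listed steps.
    [ "$started" -eq 1 ] || { echo "unknown step: $start" >&2; return 2; }
}

# Example, resuming at the second step:
#   run_from update_validation_data.sh \
#       setup.sh update_validation_data.sh geoscrub.sh
```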

The resulting 'geoscrub' table contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.
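
A quick look at the result might go something like this. The DB name
"geoscrub" is an assumption and inspect_geoscrub is a hypothetical
helper; no column names are assumed, so the sketch only describes the
table and counts its rows.

```shell
# Hypothetical helper (not part of biengeo): show the geoscrub table
# definition and its row count.
PSQL="${PSQL:-psql}"
DB="${DB:-geoscrub}"   # assumed database name

inspect_geoscrub() {
    "$PSQL" -d "$DB" -c '\d geoscrub' &&
    "$PSQL" -d "$DB" -c 'SELECT count(*) FROM geoscrub;'
}

# Example: inspect_geoscrub
```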

Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it might be worth trying to refactor once we're happy
  with the particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)