BIEN geovalidation notes
========================

***** obtain source code:
svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/
additional, in-progress files are at
sftp://vegbiendev.nceas.ucsb.edu/home/psarando/src/bien/derived/biengeo/

***** install dependencies:
The only dependencies for running these scripts are PostgreSQL 9.1, PostGIS 2.0,
and unzip.

For PostGIS installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:
sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
* Note: additional install instructions for postgis-2.0 on Ubuntu 13.04 are here:
* http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
* Currently, the ubuntugis-unstable sources list entry needs to be changed to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts
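
After installing, a quick check that the command-line tools are on the PATH
can be sketched as follows. The function name missing_commands is hypothetical
(it is not part of biengeo), and the check assumes the PostgreSQL client is
installed as psql. It does not verify the PostgreSQL or PostGIS versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each given command that is NOT found on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Assumption: the packages above provide the psql and unzip commands.
missing=$(missing_commands psql unzip)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing:" $missing >&2
else
    echo "all required commands found"
fi
```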

***** initialize the DB:
cd <svn_biengeo_root>
1. setup.sh
   - creates postgis DB and all base tables

***** update geoscrub validation data:
cd <svn_biengeo_root>
2. update_validation_data.sh
   - runs the following scripts in order to load validation data:
   * update_gadm_data.sh
     - loads GADM2 data into a new (or re-created) gadm2 table
   * update_geonames_data.sh
     - loads geonames.org data and adds some custom mapping logic
   * geonames-to-gadm.*.sql
     - contains SQL statements that build linkages between geonames.org
       names and GADM2 names

***** geoscrub new data:
WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h
cd <svn_biengeo_root>
3. geoscrub.sh
   - runs the following scripts in order to load and scrub vegbien input data:
   * load-geoscrub-input.sh
     - dumps geoscrub_input from vegbien and loads it into the geoscrub db
   * geonames.sql
     - contains SQL statements that scrub asserted names and (to the
       extent possible) map them to GADM2
   * geovalidate.sql
     runtime: 5.5 h
     - contains (postgis-extended) SQL statements that score the validity
       of GADM2-scrubbed names against given point coordinates

[Also see comments embedded in specific scripts in this directory.]

The bash and SQL statements contained in the files as ordered above
should be applied to carry out geographic name scrubbing and
geovalidation on a given corpus of BIEN location records.

That said, given the tight deadline under which this was done in order
to produce a geovalidated BIEN3 corpus in advance of the Nov 2013
working group meeting, and the corresponding manner in which much of
this was actually executed piecemeal in an iterative and interactive
fashion within a bash shell and psql session, I can't guarantee that the
code in its current state could be run end-to-end without intervention.
It's close, but probably not bulletproof.
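
Given that caveat, one cautious way to drive the three numbered scripts is a
small wrapper that stops at the first failure, so that problems can be fixed by
hand before continuing. This is only a sketch, not part of biengeo:
run_pipeline and DRY_RUN are hypothetical names, and it assumes the scripts are
executable and are run from <svn_biengeo_root>. By default it only prints what
it would run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the numbered top-level scripts in documented order.
# DRY_RUN defaults to 1, which only prints the commands. Set DRY_RUN=0 to
# actually run them (note: geoscrub.sh deletes previous geoscrubbing results).
run_pipeline() {
    local script
    for script in setup.sh update_validation_data.sh geoscrub.sh; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would run: ./$script"
        else
            echo "running: ./$script"
            # Stop at the first failing script so it can be inspected by hand.
            "./$script" || { echo "failed: $script" >&2; return 1; }
        fi
    done
}

run_pipeline
```

To run for real, from <svn_biengeo_root>: DRY_RUN=0 run_pipeline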

The resulting 'geoscrub' table is what contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.

Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it might be worth trying to refactor once we're happy
  with the particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)