BIEN geovalidation notes
========================
    
***** obtain source code:
svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/
additional, in-progress files are at
sftp://vegbiendev.nceas.ucsb.edu/home/psarando/src/bien/derived/biengeo/
    
***** install dependencies:
The only dependencies for running these scripts are PostgreSQL 9.1, postgis 2.0,
and unzip.

For postgis installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:
sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
* Note additional install instructions for postgis-2.0 on Ubuntu 13 found here:
* http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
* Currently, the ubuntugis-unstable sources list needs to change to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts
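
A quick pre-flight check can confirm that the command-line tools are on the
PATH before any of the scripts are run. The sketch below is not part of the
biengeo sources; check_deps is a hypothetical helper, and it only tests that
a command exists, not that its version matches the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of biengeo): report which of the given
# commands are missing from PATH.
# Prints one "missing: <name>" line per absent command and returns
# non-zero if at least one is missing.
check_deps() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}
```

Possible usage, assuming psql is the PostgreSQL client used by the scripts:
check_deps psql unzip || exit 1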
    
***** initialize the DB:
cd <svn_biengeo_root>
1. setup.sh
   - creates postgis DB and all base tables
    
***** update geoscrub validation data:
runtime: ~40 minutes
cd <svn_biengeo_root>
2. update_validation_data.sh
   - runs the following scripts in order to load validation data:
   * update_gadm_data.sh
     runtime: ~15 minutes (not including download time)
     - loads GADM2 data into a new (or re-created) gadm2 table
   * update_geonames_data.sh
     runtime: ~25 minutes (not including download time)
     - loads geonames.org data and adds some custom mapping logic
   * geonames-to-gadm.*.sql
     runtime: ~1 minute
     - contains SQL statements that build linkages between geonames.org
       names and GADM2 names
    
***** geoscrub new data:
WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h
cd <svn_biengeo_root>
3. geoscrub.sh
   - runs the following scripts in order to load and scrub vegbien input data:
   * load-geoscrub-input.sh
     - dumps geoscrub_input from vegbien and loads it into the geoscrub db
   * geonames.sql
     - contains SQL statements that scrub asserted names and (to the
       extent possible) map them to GADM2
   * geovalidate.sql
     runtime: 5.5 h
     - contains (postgis-extended) SQL statements that score the validity
       of GADM2-scrubbed names against given point coordinates
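
Because geoscrub.sh deletes any previous results, it may be worth dumping
the existing results table first. The sketch below is not part of the
biengeo sources. backup_geoscrub is a hypothetical helper, and it assumes
that both the database and the results table are named 'geoscrub' and that
pg_dump can connect with default credentials. Setting DRY_RUN=1 only prints
the command.

```shell
#!/bin/sh
# Hypothetical helper (not part of biengeo): dump the 'geoscrub' table to
# a file before re-running geoscrub.sh.
# Assumptions: database and table are both named 'geoscrub'.
#   $1 = database name (default: geoscrub)
#   $2 = output file   (default: geoscrub_backup.sql)
backup_geoscrub() {
    db="${1:-geoscrub}"
    out="${2:-geoscrub_backup.sql}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be run.
        echo "pg_dump --table=geoscrub --file=$out $db"
    else
        pg_dump --table=geoscrub --file="$out" "$db"
    fi
}
```

Possible usage:
backup_geoscrub geoscrub geoscrub_before_rerun.sql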
    
[Also see comments embedded in specific scripts in this directory.]
    
The bash and SQL statements contained in the files as ordered above
should be applied to carry out geographic name scrubbing and
geovalidation on a given corpus of BIEN location records.
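
The ordering above can be sketched as a small wrapper that runs each
top-level script in turn and stops at the first failure. This is not part
of the biengeo sources. run_steps is a hypothetical helper, and it assumes
that the scripts are executable and return a non-zero exit status on error.

```shell
#!/bin/sh
# Hypothetical helper (not part of biengeo): run each named script from
# the given directory, in order, and stop at the first one that exits
# with a non-zero status.
#   $1     = directory containing the scripts
#   $2 ... = script names, in the order they should run
run_steps() {
    root="$1"
    shift
    for step in "$@"; do
        echo "running: $step"
        # The subshell keeps the caller's working directory unchanged.
        if ! (cd "$root" && "./$step"); then
            echo "failed: $step" >&2
            return 1
        fi
    done
    echo "all steps completed"
}
```

Possible usage (remember that geoscrub.sh deletes previous results):
run_steps <svn_biengeo_root> setup.sh update_validation_data.sh geoscrub.sh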
    
That said, given the tight deadline under which this was done in order
to produce a geovalidated BIEN3 corpus in advance of the Nov 2013
working group meeting, and the corresponding manner in which much of
this was actually executed piecemeal in an iterative and interactive
fashion within a bash shell and psql session, I can't guarantee that the
code in its current state could be run end-to-end without intervention.
It's close, but probably not bulletproof.
    
The resulting 'geoscrub' table is what contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.
    
Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it might be worth trying to refactor once we're happy
  with the particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)