BIEN geovalidation notes
========================

***** obtain source code:
svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/
    
***** install dependencies:
The only dependencies for running these scripts are PostgreSQL 9.1, postgis 2.0,
and unzip.

For postgis installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:
sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
* Note: additional install instructions for postgis-2.0 on Ubuntu 13.04 are here:
* http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
* Currently, the ubuntugis-unstable sources list needs to change to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts
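A quick way to confirm that the command-line tools are present after installing
is sketched below. missing_commands is a hypothetical helper, not part of the
biengeo scripts, and it only checks that commands are on the PATH. It does not
verify the PostgreSQL or postgis versions.

```shell
# Hypothetical helper (not part of biengeo): print the name of each
# requested command that cannot be found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example use (psql and unzip are the tools named in these notes):
#   missing_commands psql unzip
# No output means both commands were found.
```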
    
    
***** Notes on running the shell scripts:
Running any script with the -? or --help option will print a usage message with
all available options for that script, and then exit.

The following scripts accept database connection options, similar to psql.
For example: setup.sh -d dbname -h hostname -U username
The defaults passed to psql commands, if no options are given to the shell
scripts, are 'geoscrub' for dbname and 'bien' for username. The -h option is
not passed to psql commands by default.
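To make the defaults concrete, here is a rough sketch of how they could map onto
psql arguments. The function name psql_args and its positional parameters are
assumptions for illustration only; the real scripts parse -d, -h and -U options
and may build their psql command lines differently.

```shell
# Hypothetical sketch: print the psql connection arguments implied by the
# documented defaults. Arguments: [dbname] [username] [hostname].
psql_args() {
    dbname=${1:-geoscrub}     # documented default dbname
    username=${2:-bien}       # documented default username
    hostname=${3:-}           # no default: -h is omitted unless given
    if [ -n "$hostname" ]; then
        printf '%s\n' "-d $dbname -U $username -h $hostname"
    else
        printf '%s\n' "-d $dbname -U $username"
    fi
}

# Example use:
#   psql_args                  # defaults only
#   psql_args mydb me dbhost   # all three values supplied
```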
    
The update_validation_data.sh script will only download fresh validation data
if the directories for GADM data (~/gadm_v2_shp by default) and geonames.org
data (~/geonames by default) do not already exist or do not contain that data.

The geoscrub scripts will only download fresh input data if the directory for
the input data (~/geoscrub_input by default) does not already exist or does not
contain the geoscrub-corpus.csv input file.
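The download rule above amounts to a simple existence test. The sketch below
shows one way to express it; needs_download is a hypothetical helper, and the
actual scripts may check for their data differently (for example, the file
names expected in the GADM and geonames directories are not stated here).

```shell
# Hypothetical helper: succeed (exit status 0) when a download would be
# needed, i.e. the directory is missing or lacks the expected file.
needs_download() {
    data_dir=$1
    expected_file=$2
    [ ! -d "$data_dir" ] || [ ! -e "$data_dir/$expected_file" ]
}

# Example with the documented default input location:
#   if needs_download "$HOME/geoscrub_input" geoscrub-corpus.csv; then
#       echo "input data not found; a fresh download would be triggered"
#   fi
```

To force a fresh download under this rule, remove or rename the existing data
directory before running the script.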
    
***** initialize the DB:
cd <svn_biengeo_root>
1. setup.sh
   - creates postgis DB and all base tables
    
***** update geoscrub validation data:
runtime: ~40 minutes
cd <svn_biengeo_root>
2. update_validation_data.sh [--gadm-data=gadm_dir] [--geonames-data=geonames_dir]
   - runs the following scripts in order to load validation data:
   * update_gadm_data.sh
     runtime: ~15 minutes (not including download time)
     - loads GADM2 data into a new (or re-created) gadm2 table
   * update_geonames_data.sh
     runtime: ~25 minutes (not including download time)
     - loads geonames.org data and adds some custom mapping logic
   * geonames-to-gadm.*.sql
     runtime: ~1 minute
     - contains SQL statements that build linkages between geonames.org
       names and GADM2 names
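The "runs the following scripts in order" pattern can be sketched as below,
assuming that a failing step should stop the run. That assumption is mine; the
actual wrapper script may handle errors differently. run_in_order is a
hypothetical helper, not part of biengeo, and each step here is a command name
without arguments.

```shell
# Hypothetical sketch: run each step in the order given and stop at the
# first one that fails.
run_in_order() {
    for step in "$@"; do
        echo "running: $step"
        "$step" || { echo "failed: $step" >&2; return 1; }
    done
}

# Example use with placeholder commands:
#   run_in_order true true     # both steps run
#   run_in_order true false    # stops after the failing second step
```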
    
***** geoscrub new data:
WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h
cd <svn_biengeo_root>
3. geoscrub.sh [--geoscrub-input=input_dir] [--output-file=geoscrub-output.csv]
   - runs the following scripts in order to load and scrub vegbien input data:
   * load-geoscrub-input.sh
     - dumps geoscrub_input from vegbien and loads it into the geoscrub db
   * geonames.sql
     - contains SQL statements that scrub asserted names and (to the
       extent possible) map them to GADM2
   * geovalidate.sql
     runtime: 5.5 h
     - contains (postgis-extended) SQL statements that score the validity
       of GADM2-scrubbed names against given point coordinates
   - If the --output-file (or -o) option is given, then the final geoscrub table
     will be dumped to the specified output file in CSV format.
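If the table needs to be exported by hand after a run, something like the
following should work. This is only a sketch: it uses psql's \copy
meta-command with the default dbname and username from these notes, and the
HEADER option is my assumption about the desired CSV layout. It is not
necessarily how geoscrub.sh produces its output file. dump_geoscrub_csv is a
hypothetical helper name.

```shell
# Hypothetical helper: dump the geoscrub table to a CSV file using psql's
# client-side \copy, so the file is written on the machine running psql.
dump_geoscrub_csv() {
    output_file=$1
    psql -d geoscrub -U bien -c "\\copy geoscrub TO '$output_file' CSV HEADER"
}

# Example use:
#   dump_geoscrub_csv geoscrub-output.csv
```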
    
[Also see comments embedded in specific scripts in this directory.]
    
The bash and SQL statements contained in the files listed above should be
applied in that order to carry out geographic name scrubbing and
geovalidation on a given corpus of BIEN location records.
    
That said, this was done under a tight deadline in order to produce a
geovalidated BIEN3 corpus in advance of the Nov 2013 working group
meeting, and much of it was actually executed piecemeal, in an iterative
and interactive fashion within a bash shell and psql session. I therefore
can't guarantee that the code in its current state could be run
end-to-end without intervention. It's close, but probably not bulletproof.
    
The resulting 'geoscrub' table contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.
    
Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it might be worth trying to refactor once we're happy
  with the particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)