BIEN geovalidation notes
========================

***** obtain source code:
svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/

***** install dependencies:
The only dependencies for running these scripts are PostgreSQL 9.1, postgis 2.0,
and unzip.

For postgis installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:
sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
* Note additional install instructions for postgis-2.0 on Ubuntu 13 found here:
* http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
* Currently, the ubuntugis-unstable sources list needs to change to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts
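
After installing, a quick check along these lines could confirm that the
command-line tools are on the PATH. This is only a sketch: check_deps is a
hypothetical helper (not part of the biengeo scripts), and it only looks for
executables by name; it does not verify the PostgreSQL or postgis versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the biengeo checkout.
# Reports each command that cannot be found on the PATH and returns
# non-zero if at least one is missing.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example: check_deps psql unzip && echo "dependencies found"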

***** Notes on running the shell scripts:
Running any script with the -? or --help option will print a usage message with
all available options for that script, and then exit.

The following scripts accept database connection options, similar to psql.
For example: setup.sh -d dbname -h hostname -U username
The defaults passed to psql commands, if no options are given to the shell
scripts, are 'geoscrub' for dbname and 'bien' for username. The -h option is
not passed to psql commands by default.
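
The defaulting behaviour described above can be illustrated roughly as
follows. This is a sketch, not the code in the actual scripts:
build_psql_args is a hypothetical function name, and the real scripts may
parse their options differently. It only shows the documented defaults
('geoscrub', 'bien') and that -h is added only when given.

```bash
#!/usr/bin/env bash
# Hypothetical illustration of the documented defaults; the real
# biengeo scripts may be implemented differently.
build_psql_args() {
    local dbname="geoscrub" username="bien" host=""
    while [ $# -gt 0 ]; do
        case "$1" in
            -d|-U|-h)
                # Each of these options needs a value after it.
                [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                case "$1" in
                    -d) dbname="$2" ;;
                    -U) username="$2" ;;
                    -h) host="$2" ;;
                esac
                shift 2 ;;
            *)
                shift ;;   # ignore anything else in this sketch
        esac
    done
    local args="-d $dbname -U $username"
    # -h is only passed through when the caller supplied it.
    if [ -n "$host" ]; then
        args="$args -h $host"
    fi
    printf '%s\n' "$args"
}
```

For example: build_psql_args -h dbhost would print the two defaults followed
by -h dbhost.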

The update_validation_data.sh scripts will only download fresh validation data
if the directories for GADM data (~/gadm_v2_shp by default) and geonames.org
data (~/geonames by default) do not already exist or do not contain that data.

The geoscrub scripts will only download fresh input data if the directory for
the input data (~/geoscrub_input by default) does not already exist or does not
contain the geoscrub-corpus.csv input file.
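
The "download only if missing" rule can be pictured with a check like the one
below. It is a sketch under assumptions: needs_download is a hypothetical
name, and the actual scripts may test for the data in another way (in
particular, the file names checked for the GADM and geonames.org data are not
stated here).

```bash
#!/usr/bin/env bash
# Hypothetical check, not taken from the biengeo scripts.
# Succeeds (returns 0) when a download would be needed: the directory
# does not exist, or it exists but lacks the expected file.
needs_download() {
    local dir="$1" required_file="$2"
    [ ! -d "$dir" ] || [ ! -e "$dir/$required_file" ]
}
```

For example: needs_download ~/geoscrub_input geoscrub-corpus.csv && echo "would download"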

***** initialize the DB:
cd <svn_biengeo_root>
1. setup.sh
   - creates postgis DB and all base tables

***** update geoscrub validation data:
runtime: ~40 minutes
cd <svn_biengeo_root>
2. update_validation_data.sh [--gadm-data=gadm_dir] [--geonames-data=geonames_dir]
   - runs the following scripts in order to load validation data:
   * update_gadm_data.sh
     runtime: ~15 minutes (not including download time)
     - loads GADM2 data into a new (or re-created) gadm2 table
   * update_geonames_data.sh
     runtime: ~25 minutes (not including download time)
     - loads geonames.org data and adds some custom mapping logic
   * geonames-to-gadm.*.sql
     runtime: ~1 minute
     - contains SQL statements that build linkages between geonames.org
       names and GADM2 names

***** geoscrub new data:
WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h
cd <svn_biengeo_root>
3. geoscrub.sh [--geoscrub-input=input_dir] [--output-file=geoscrub-output.csv]
   - runs the following scripts in order to load and scrub vegbien input data:
   * load-geoscrub-input.sh
     - dumps geoscrub_input from vegbien and loads it into the geoscrub db
   * geonames.sql
     - contains SQL statements that scrub asserted names and (to the
       extent possible) map them to GADM2
   * geovalidate.sql
     runtime: 5.5 h
     - contains (postgis-extended) SQL statements that score the validity
       of GADM2-scrubbed names against given point coordinates
   - If the --output-file (or -o) option is given, then the final geoscrub table
     will be dumped to the specified output file in CSV format.
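
Steps 1-3 above could be chained with a small wrapper like the following.
This is a sketch, not a script that exists in the checkout: run_step, run_all
and the DRY_RUN variable are hypothetical, and it assumes that all three
scripts accept the -d and -U connection options. With DRY_RUN=1 it only
prints the commands it would run, which is the safer way to try it out.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the biengeo checkout.
# Run from <svn_biengeo_root>.

# Print the command when DRY_RUN=1, otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run the three documented steps in order, stopping at the first failure.
# Arguments (all optional): dbname, username, output file.
run_all() {
    local db="${1:-geoscrub}" user="${2:-bien}" out="${3:-geoscrub-output.csv}"
    run_step ./setup.sh -d "$db" -U "$user" &&
    run_step ./update_validation_data.sh -d "$db" -U "$user" &&
    run_step ./geoscrub.sh -d "$db" -U "$user" --output-file="$out"
}
```

For example: DRY_RUN=1 run_all
Keep in mind that a real (non-dry) run of geoscrub.sh deletes any previous
geoscrubbing results, as warned above.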

[Also see comments embedded in specific scripts in this directory.]

The bash and SQL statements contained in the files as ordered above
should be applied to carry out geographic name scrubbing and
geovalidation on a given corpus of BIEN location records.

That said, given the tight deadline under which this was done in order
to produce a geovalidated BIEN3 corpus in advance of the Nov 2013
working group meeting, and the corresponding manner in which much of
this was actually executed piecemeal in an iterative and interactive
fashion within a bash shell and psql session, I can't guarantee that the
code in its current state could be run end-to-end without intervention.
It's close, but probably not bulletproof.

The resulting 'geoscrub' table is what contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.

Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it might be worth trying to refactor once we're happy
  with the particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)