BIEN geovalidation notes
========================

***** obtain source code:
svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/

***** install dependencies:
The only dependencies for running these scripts are PostgreSQL 9.1, PostGIS 2.0,
and unzip.

For PostGIS installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:
sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
  * Note: additional install instructions for postgis-2.0 on Ubuntu 13 are here:
  * http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
  * Currently, the ubuntugis-unstable sources list entry needs to be changed
    to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts

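A quick way to confirm that the command-line tools are installed before
running anything is sketched below. The missing_commands helper is
hypothetical (it is not part of biengeo), and it only checks that the psql
and unzip commands are on the PATH; it does not check the PostgreSQL or
PostGIS versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of biengeo): print the name of every command
# in the argument list that cannot be found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# psql and unzip are the client tools these notes depend on.
missing=$(missing_commands psql unzip)
if [ -n "$missing" ]; then
    echo "missing commands: $missing" >&2
fi
```
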
***** Notes on running the shell scripts:
Running any script with the -? or --help option will print a usage message with
all available options for that script, and then exit.

The following scripts accept database connection options, similar to psql.
For example: setup.sh -d dbname -h hostname -U username
The defaults passed to psql commands, if no options are given to the shell
scripts, are 'geoscrub' for dbname and 'bien' for username. The -h option is
not passed to psql commands by default.

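The defaulting rule described above can be sketched as follows. This is an
illustration only, not the code used by the scripts: psql_conn_opts is a
hypothetical helper, and it assumes the values contain no whitespace.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the biengeo scripts): print the psql
# connection options for the given dbname, username, and hostname, applying
# the documented defaults of 'geoscrub' and 'bien'.
psql_conn_opts() {
    dbname=${1:-geoscrub}
    username=${2:-bien}
    hostname=${3:-}
    opts="-d $dbname -U $username"
    # -h is only added when a hostname was given explicitly.
    if [ -n "$hostname" ]; then
        opts="$opts -h $hostname"
    fi
    printf '%s\n' "$opts"
}

psql_conn_opts                          # prints: -d geoscrub -U bien
psql_conn_opts mydb me db.example.org   # prints: -d mydb -U me -h db.example.org
```
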
The update_validation_data.sh scripts will only download fresh validation data
if the directories for GADM data (~/gadm_v2_shp by default) and geonames.org
data (~/geonames by default) do not already exist or do not contain that data.

The geoscrub scripts will only download fresh input data if the directory for
the input data (~/geoscrub_input by default) does not already exist or does not
contain the geoscrub-corpus.csv input file.

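The "download only if missing" rule for the input data amounts to a check
like the one below. The needs_download helper is hypothetical and only
mirrors the behavior described above; the actual test in the scripts may
differ.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the biengeo scripts): succeed (exit
# status 0) when a download is needed, i.e. when the directory does not exist
# or does not contain the expected file.
needs_download() {
    dir=$1
    file=$2
    [ ! -d "$dir" ] || [ ! -f "$dir/$file" ]
}

# Check the default input location named in these notes.
if needs_download "$HOME/geoscrub_input" geoscrub-corpus.csv; then
    echo "geoscrub-corpus.csv not found; fresh input data would be downloaded"
else
    echo "geoscrub-corpus.csv present; the download would be skipped"
fi
```

To force a fresh download under this rule, move or remove the existing
directory (or the geoscrub-corpus.csv file in it) before running the scripts.
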
***** initialize the DB:
cd <svn_biengeo_root>
1. setup.sh
   - creates postgis DB and all base tables

***** update geoscrub validation data:
runtime: ~40 minutes
cd <svn_biengeo_root>
2. update_validation_data.sh [--gadm-data=gadm_dir] [--geonames-data=geonames_dir]
   - runs the following scripts, in order, to load validation data:
     * update_gadm_data.sh
       runtime: ~15 minutes (not including download time)
       - loads GADM2 data into a new (or re-created) gadm2 table
     * update_geonames_data.sh
       runtime: ~25 minutes (not including download time)
       - loads geonames.org data and adds some custom mapping logic
     * geonames-to-gadm.*.sql
       runtime: ~1 minute
       - contains SQL statements that build linkages between geonames.org
         names and GADM2 names

***** geoscrub new data:
WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h
cd <svn_biengeo_root>
3. geoscrub.sh [--geoscrub-input=input_dir] [--output-file=geoscrub-output.csv]
   - runs the following scripts, in order, to load and scrub vegbien input data:
     * load-geoscrub-input.sh
       - dumps geoscrub_input from vegbien and loads it into the geoscrub db
     * geonames.sql
       - contains SQL statements that scrub asserted names and (to the
         extent possible) map them to GADM2
     * geovalidate.sql
       runtime: ~5.5 h
       - contains (postgis-extended) SQL statements that score the validity
         of GADM2-scrubbed names against given point coordinates
   - If the --output-file (or -o) option is given, then the final geoscrub table
     will be dumped to the specified output file in CSV format.

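Steps 1-3 above can be chained in a small wrapper, sketched here under
stated assumptions: run_step and run_pipeline are hypothetical names, the
wrapper is run from <svn_biengeo_root>, and all three scripts are assumed to
accept the same connection options. With DRY_RUN=1 it only prints the
commands it would run.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of biengeo). run_step either prints a
# command (when DRY_RUN=1) or executes it.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run the three top-level scripts in the documented order, stopping at the
# first failure. Assumes each script accepts the same connection options.
run_pipeline() {
    run_step ./setup.sh "$@" &&
    run_step ./update_validation_data.sh "$@" &&
    run_step ./geoscrub.sh "$@"
}

# Print the commands instead of running them.
DRY_RUN=1
run_pipeline -d geoscrub -U bien
```

Unset DRY_RUN (or set it to 0) to run the steps for real; remember that
geoscrub.sh deletes any previous geoscrubbing results.
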
[Also see comments embedded in specific scripts in this directory.]

The bash and SQL statements contained in the files as ordered above
should be applied to carry out geographic name scrubbing and
geovalidation on a given corpus of BIEN location records.

That said, given the tight deadline under which this was done in order
to produce a geovalidated BIEN3 corpus in advance of the Nov 2013
working group meeting, and the corresponding manner in which much of
this was actually executed piecemeal in an iterative and interactive
fashion within a bash shell and psql session, I can't guarantee that the
code in its current state could be run end-to-end without intervention.
It's close, but probably not bulletproof.

The resulting 'geoscrub' table is what contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.

Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it might be worth trying to refactor once we're happy
  with the particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)