BIEN geovalidation notes
========================

***** obtain source code:

svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/

***** install dependencies:

The only dependencies for running these scripts are PostgreSQL 9.1,
PostGIS 2.0, and unzip.

For PostGIS installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:

sudo apt-get install unzip
sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
  * Note additional install instructions for postgis-2.0 on Ubuntu 13.04 here:
    http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
  * Currently, the ubuntugis-unstable sources list needs to change to 'quantal'.
sudo apt-get update
sudo apt-get install postgresql-9.1-postgis-2.0-scripts

***** Notes on running the shell scripts:

Running any script with the -? or --help option will print a usage message
with all available options for that script, and then exit.

The following scripts accept database connection options, similar to psql.
For example:

setup.sh -d dbname -h hostname -U username

If no options are given to the shell scripts, the defaults passed to psql
commands are 'geoscrub' for dbname and 'bien' for username. The -h option is
not passed to psql commands by default.

The update_validation_data.sh script will only download fresh validation data
if the directories for GADM data (~/gadm_v2_shp by default) and geonames.org
data (~/geonames by default) do not already exist or do not contain that data.

The geoscrub scripts will only download fresh input data if the directory for
the input data (~/geoscrub_input by default) does not already exist or does
not contain the geoscrub-corpus.csv input file.
(A sketch of this skip-if-present behavior appears at the end of these notes.)

***** initialize the DB:

cd
1. setup.sh
   - creates postgis DB and all base tables

***** update geoscrub validation data:

runtime: ~40 minutes

cd
2. update_validation_data.sh [--gadm-data=gadm_dir] [--geonames-data=geonames_dir]
   - runs the following scripts in order to load validation data:
     * update_gadm_data.sh
       runtime: ~15 minutes (not including download time)
       - loads GADM2 data into a new (or re-created) gadm2 table
     * update_geonames_data.sh
       runtime: ~25 minutes (not including download time)
       - loads geonames.org data and adds some custom mapping logic
     * geonames-to-gadm.*.sql
       runtime: ~1 minute
       - contains SQL statements that build linkages between geonames.org
         names and GADM2 names

***** geoscrub new data:

WARNING: deletes any previous geoscrubbing results!
runtime: ~2.5 h

cd
3. geoscrub.sh [--geoscrub-input=input_dir] [--output-file=geoscrub-output.csv]
   - runs the following scripts in order to load and scrub vegbien input data:
     * load-geoscrub-input.sh
       - dumps geoscrub_input from vegbien and loads it into the geoscrub db
     * geonames.sql
       - contains SQL statements that scrub asserted names and (to the extent
         possible) map them to GADM2
     * geovalidate.sql
       runtime: 2.5 h
       - contains (postgis-extended) SQL statements that score the validity
         of GADM2-scrubbed names against given point coordinates
   - If the --output-file (or -o) option is given, the final geoscrub table
     will be dumped to the specified output file in CSV format.

[Also see comments embedded in specific scripts in this directory.]

The bash and SQL statements contained in the files as ordered above should be
applied to carry out geographic name scrubbing and geovalidation on a given
corpus of BIEN location records. An example end-to-end invocation is sketched
below.
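For reference, the skip-if-present download behavior described in the notes
above amounts to a guard along these lines. This is only a sketch of the idea,
not the scripts' actual code; the directory is the documented GADM default,
and the *.shp check is illustrative:

# Sketch: re-download GADM shapefiles only if the default data directory
# is missing or does not yet contain any shapefiles.
GADM_DIR=${GADM_DIR:-$HOME/gadm_v2_shp}
if [ ! -d "$GADM_DIR" ] || ! ls "$GADM_DIR"/*.shp >/dev/null 2>&1; then
    echo "No GADM data found in $GADM_DIR; downloading a fresh copy..."
    # (the download and unzip steps would go here)
fi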
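For orientation, an end-to-end run might look roughly like the following.
This is a sketch assembled from the options documented above, assuming the
commands are run from the biengeo checkout directory against a local
'geoscrub' database; the -d/-U values shown are just the documented defaults
made explicit:

# 1. create the postgis DB and all base tables
./setup.sh -d geoscrub -U bien

# 2. download (if needed) and load GADM2 and geonames.org validation data
./update_validation_data.sh -d geoscrub -U bien \
    --gadm-data=$HOME/gadm_v2_shp --geonames-data=$HOME/geonames

# 3. load the input corpus, scrub names, geovalidate, and dump the results
./geoscrub.sh -d geoscrub -U bien \
    --geoscrub-input=$HOME/geoscrub_input --output-file=geoscrub-output.csv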
That said, given the tight deadline under which this was done in order to
produce a geovalidated BIEN3 corpus in advance of the Nov 2013 working group
meeting, and the correspondingly piecemeal, iterative, and interactive fashion
in which much of this was actually executed within a bash shell and psql
session, I can't guarantee that the code in its current state could be run
end-to-end without intervention. It's close, but probably not bulletproof.

The resulting 'geoscrub' table is what contains the scrubbed (i.e.,
GADM2-matched) names and the various geovalidation scores. (A quick way to
spot-check it is sketched below.)

Notes/Caveats/Todos:
* Clearly the SQL statements used in this procedure suffer from a lot of
  redundancy, and it may be worth refactoring them once we're happy with the
  particular approach taken.
* Need to pull out more known notes/caveats/todos and highlight them :)
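To spot-check the geoscrub table mentioned above without re-running the whole
pipeline, something like the following should work. This is a sketch only:
the table name and connection defaults are as documented above, but the
table's columns are not listed here, so it simply selects everything:

# dump a small sample of the geoscrub results to CSV for inspection
psql -d geoscrub -U bien \
    -c "\copy (SELECT * FROM geoscrub LIMIT 20) TO 'geoscrub-sample.csv' CSV HEADER"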