BIEN geovalidation notes
========================

***** obtain source code:

svn co https://code.nceas.ucsb.edu/code/projects/bien/derived/biengeo/

additional, in-progress files are at
sftp://vegbiendev.nceas.ucsb.edu/home/psarando/src/bien/derived/biengeo/

***** install dependencies:

The only dependencies for running these scripts are PostgreSQL 9.1,
postgis 2.0, and unzip.

For postgis installation on Ubuntu 12.04 see:
http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1204

Installing these packages on Ubuntu 13.04 should only require these commands:

  sudo apt-get install unzip
  sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable
    * Note additional install instructions for postgis-2.0 on Ubuntu 13 found here:
    * http://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS20Ubuntu1304
    * Currently, the ubuntugis-unstable sources list needs to change to 'quantal'.
  sudo apt-get update
  sudo apt-get install postgresql-9.1-postgis-2.0-scripts

***** initialize the DB:

cd

1. geovalidate.sh
   - creates postgis DB and loads GADM2 data
2. geonames.sh
   - loads geonames.org data and adds some custom mapping logic
3. geonames-to-gadm.sql
   - contains SQL statements that build linkages between geonames.org
     names and GADM2 names

***** geoscrub new data:

WARNING: deletes any previous geoscrubbing results!
runtime: ~5.5 h

cd

4. load-geoscrub-input.sh
   - dumps geoscrub_input from vegbien and loads it into the geoscrub db
5. geonames.sql
   sudo -u postgres psql -e --set ON_ERROR_STOP=1 -d geoscrub < geonames.sql
   - contains SQL statements that scrub asserted names and (to the
     extent possible) map them to GADM2
6. geovalidate.sql
   runtime: 5.5 h
   sudo -u postgres psql -e --set ON_ERROR_STOP=1 -d geoscrub < geovalidate.sql
   - contains (postgis-extended) SQL statements that score the validity
     of GADM2-scrubbed names against given point coordinates

[Also see comments embedded in specific scripts in this directory.]
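
The geoscrub steps (4-6) above could be chained in a small wrapper so that
a failure in one step stops the rest. The following is only a sketch, not
part of the biengeo code: the function names and the DRY_RUN variable are
made up for illustration, and it assumes the scripts are executable and
sit in the current directory. It prints the commands by default rather
than running them, since the real steps delete any previous geoscrubbing
results.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper for steps 4-6; prints the commands unless DRY_RUN=0.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # assumption: set DRY_RUN=0 to actually execute

# Run (or just print) one SQL file against the geoscrub DB, using the same
# psql invocation given in the steps above.
run_sql() {
    local sql_file="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "sudo -u postgres psql -e --set ON_ERROR_STOP=1 -d geoscrub < $sql_file"
    else
        sudo -u postgres psql -e --set ON_ERROR_STOP=1 -d geoscrub < "$sql_file"
    fi
}

# Run (or just print) one of the shell scripts.
# Assumption: the script is executable and lives in the current directory.
run_sh() {
    local script="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "./$script"
    else
        "./$script"
    fi
}

# Steps 4-6 in order; with set -e, the first failing step aborts the run.
geoscrub_new_data() {
    run_sh  load-geoscrub-input.sh   # step 4
    run_sql geonames.sql             # step 5
    run_sql geovalidate.sql          # step 6
}

geoscrub_new_data
```

Saved under a name of your choosing, running it with DRY_RUN=0 would
execute the three steps for real; expect the ~5.5 h runtime noted above.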

The bash and SQL statements contained in the files should be applied in
the numbered order given in these notes to carry out geographic name
scrubbing and geovalidation on a given corpus of BIEN location records.

That said, this work was done under a tight deadline to produce a
geovalidated BIEN3 corpus in advance of the Nov 2013 working group
meeting, and much of it was actually executed piecemeal, in an iterative
and interactive fashion within a bash shell and psql session. I therefore
can't guarantee that the code in its current state could be run
end-to-end without intervention. It's close, but probably not
bulletproof.

The resulting 'geoscrub' table contains the scrubbed (i.e.,
GADM2-matched) names and various geovalidation scores.

Notes/Caveats/Todos:
 * The SQL statements used in this procedure suffer from a lot of
   redundancy, and it might be worth refactoring them once we're happy
   with the particular approach taken.
 * Need to pull out more known notes/caveats/todos and highlight them :)