2012-11-29 web interface breakout group

how to walk away from BIEN so it runs itself
web service
user interface needs to use API
HTML form that calls API
TROPICOS has web interface to create SQL queries
users familiar w/ command line, users who understand content
ordering
website, core requirements
factual website
visualizations
data requests
data uploads
data architecture
series of use cases
high-level user story for purpose of website
similar to ecommerce shopping site
gives the data, not analysis
passive interface
also interface to do something to data
data entry tool to update data in BIEN
provenance issue: how to get correct data back to original provider
expert users of BIEN who are allowed to manipulate data
plot datasets are more complex, but smaller, than specimen datasets
mechanism to collect thousands of plots on Excel spreadsheets
contracts: must make data publicly available
role of VegBank?
NVS broader than VegBank
VegCore can accommodate these changes
implications for components the website would have
who are the users? what products/analyses they need?
should BIEN become data repo for plots data in the U.S./the world?
make BIEN modular? each organization has data in empty schema
BIEN is method, not data
Africa with BIEN structure
beyond sci community or researchers, who has interest in BIEN data?
- general public? consultants?
- scientific method not in consulting
assessment: what's potentially there in terms of species
various Latin American repos have started charging consultants
stopblock
what mechanisms? companies make donation
NBG (NY?) has contracts with mining companies
other things BIEN produces: plots are input, ranges are output
horticultural community: what could grow in person's area
get and contribute plot data
native plant society-type groups
agriculture: iPlant funding because useful for crop science
plant groups, education
package data in simple way for students: modules
IUCN classification
NGOs: range models, raw occurrence data
select interfaces to get data
challenge is interfaces that change data
TROPICOS experience: takes years for user community to be happy with forms, steps
data entry interfaces are highly programmer-intensive, lower priority
TROPICOS has had a group of programmers doing web interfaces for 4-5 years
Eric Fegraus (Conservation International): unified schema
BIEN not involved in interface development, CI would do that
independent data entry tool which can push data into BIEN
continued funding for UI development?
expose web interface for uploading data, but not data entry tools

Download tracking

track who downloads data
can't just make all data public, because some of it has access restrictions
not data entry/correction interface
SALVIAS a good model for interface
monitor who downloads the data
don't need graphical interface
datasets in SALVIAS can be tagged in 3 access ways: totally hidden, metadata-only, public
logging of downloads
who user was, IP address
users tagged as belonging to dataset
mechanism to send someone an e-mail when someone downloads their data (opt-out)
providing this through a web service
anonymous downloads
capture timestamp, IP address, username if logged in
authorization of access to level 2 data
owner grants access to datasets
peer-to-peer access mediated by database
SALVIAS maintains itself
if build infrastructure that supports this, other things come with it:
- can tell data provider who downloaded their data
particular functionality that repo should support
requirements
potentially have a RESTful API
a URL to do any action you want, then a UI on top of that
request access to dataset
grant API keys
authenticate access
infrastructure exposed to people who don't know RESTful APIs
interface issue
data entry and correction: nice to have
download/logging: required
control of data access by owner
- avoid TRY's headaches of having to mediate this
e-mail changes
window after which dataset goes public: 5 years or 10 unanswered requests
build in e-mail pinger
SALVIAS has dead-end e-mails
data that's not fully public->ensure no data spills with minimal future work needed
identify visibility of records
what gets coalesced back into analytical DB
accesslevel field in analytical DB
track provenance in analytical DB
plot data has species name, place
elements assembled
access at owner/plot/date level
queries could bypass levels in core DB
one challenge is fuzziness
hierarchy of top-level and ultimate data provider
allow for fuzziness in identifying data provider
who owns plot data in public repo?
proximal entity
Conabio/REMIB
- were open to sharing with TROPICOS?
error reports, range maps
another user community: data providers
data providers serve range maps created by BIEN
estimate of cost?
developing TROPICOS DB: paid developer
TNRS paid developer for a year
4000-5000 rare species names to exclude
web service mechanism to request data products
provenance functionality, data ownership
data exploration: Brian's web service
put user interface over web service
more user-friendly
HTML form with picklists, GIS maps
built into web interface
web interface and web service would match
different groups doing different services, need to collaborate: "eat your own dogfood"
every group provides data to other groups via API
RESTful API
work to make interface more robust: security, authentication
BIEN-specific requests->queries
translate higher level request to SQL query
cached queries
indexes on analytical DB
index every field in analytical DB
TROPICOS reporting DB regenerated nightly
capture administrative data: additional schema elements
SALVIAS: when user signs up, add name, e-mail
- user (human/institution) linked to data
NVS has party concept to manage ownership and participation on plots
application and permissions
external authentication
federated security
Google sign-in
Shibboleth
everyone else: needs new account
using DataONE
if we do something, prefer to use out-of-the-box security
identity research
use own credentials to sign in on another site
ad-hoc user needs own account
complex model
need authentication of some kind
iPlant has approaches?
what is procedure to get an account?
sign up link
verifying that not a bot
passive interface that doesn't require human approval
need to be identified somehow
need user to track downloads
also internal mechanism for data access
require log in
anonymous user -> access public data
this is just for read transactions
log IPs to determine hits/user
data packages: how many times read?
logging table for Python table
straightforward?
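
The access and logging requirements above (three SALVIAS-style access tags, anonymous users limited to public data, and a log of timestamp, IP address, and username if logged in) could be sketched roughly as follows. This is a minimal illustration only, assuming an SQLite table; the names `download_log`, `can_download`, and `log_download` are hypothetical, and treating every non-public dataset as requiring an owner grant is an assumption, not a group decision.

```python
import sqlite3
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AccessLevel(Enum):
    """The three dataset tags described for SALVIAS in the notes."""
    HIDDEN = "hidden"
    METADATA_ONLY = "metadata-only"
    PUBLIC = "public"


def can_download(level: AccessLevel, username: Optional[str], granted_users: set) -> bool:
    """Anonymous users get public data only; anything else needs an owner grant (assumption)."""
    if level is AccessLevel.PUBLIC:
        return True
    return username is not None and username in granted_users


def init_log(conn: sqlite3.Connection) -> None:
    """Create the hypothetical download_log table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS download_log (
               id INTEGER PRIMARY KEY,
               dataset TEXT NOT NULL,
               username TEXT,              -- NULL for anonymous downloads
               ip_address TEXT NOT NULL,
               downloaded_at TEXT NOT NULL -- UTC timestamp, ISO 8601
           )"""
    )


def log_download(conn, dataset, ip_address, username=None):
    """Record one download: dataset, IP address, username if logged in, timestamp."""
    conn.execute(
        "INSERT INTO download_log (dataset, username, ip_address, downloaded_at) "
        "VALUES (?, ?, ?, ?)",
        (dataset, username, ip_address, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def downloads_for(conn, dataset):
    """Rows a data provider could be shown: who downloaded their dataset, and when."""
    return conn.execute(
        "SELECT username, ip_address, downloaded_at FROM download_log "
        "WHERE dataset = ? ORDER BY id",
        (dataset,),
    ).fetchall()
```

A web service would call `can_download` before serving a file and `log_download` afterwards; `downloads_for` is the kind of query that could feed the e-mail notification to data owners.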

what about update mechanism
takes a month and many e-mails to load data for other DBs
what is a mechanism to upload data?
published schema to use?
CSV file like on TDWG site?
increasingly automate pipeline
need human being to be comfortable that incoming data meets DB's quality standards
compare to global jellyfish database (JeDI)
upload CSV: potentially VegCSV
spec of what upload needs to look like
datatypes
if import fails, provide feedback to user
mechanism to send data: drop box, harvester, etc.
managed to staging system
data validations
feedback to provider about data quality, valid mappings
data w/ frequent updates (active datasets)
immediate feedback
balance between strict vs. loose VegCSV
- where possible, use well-known standards
- but also allow similar data
metadata catalog
GIVD: items that didn't apply, e.g. # of relevés
vocabularies w/ common elements
core elements that everyone recognizes
optional elements
minimal required elements
weakly-typed table->define datatypes
once user's data is good, PDF report generated
successful upload->moderation queue
who submitted, when
table with unique ID associated with upload
deletion of inserted records if error
custom mapping saved in background
track submission as bundle
how to know when to delete something?
TNRS model: don't rebuild whole database, grows or contracts on dataset basis
versioning database: rollback to previous version
TNRS fkey walk: ON DELETE CASCADE
but leaves NCBI taxonomy
embed as much within database structure as possible
which are shared, which are unique keys
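
The idea of tracking a submission as a bundle and deleting its records through the foreign-key walk can be illustrated with a small SQLite sketch. The table and column names (`upload`, `plot_record`, `upload_id`) are hypothetical, not the BIEN or TNRS schema; the only point shown is that `ON DELETE CASCADE` lets one delete on the upload row remove everything inserted with it.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build an in-memory database with one table per upload bundle and its records."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this is switched on per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE upload (
            upload_id INTEGER PRIMARY KEY,
            submitted_by TEXT NOT NULL,
            submitted_at TEXT NOT NULL
        );
        CREATE TABLE plot_record (
            record_id INTEGER PRIMARY KEY,
            upload_id INTEGER NOT NULL
                REFERENCES upload(upload_id) ON DELETE CASCADE,
            species_name TEXT NOT NULL
        );
        """
    )
    return conn


def delete_upload(conn: sqlite3.Connection, upload_id: int) -> None:
    """Remove one upload; its plot_record rows are deleted by the cascade."""
    conn.execute("DELETE FROM upload WHERE upload_id = ?", (upload_id,))
    conn.commit()


def record_count(conn: sqlite3.Connection) -> int:
    """Number of records currently stored across all uploads."""
    return conn.execute("SELECT COUNT(*) FROM plot_record").fetchone()[0]
```

Tables shared between datasets would need to stay outside the cascade path, which is why the notes ask which keys are shared and which are unique to an upload.
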
validations, reporting
frequency of plots
sparkline things
class of data captured by NVS
upload data->analysis of internal quality
2nd-level validations
how data compares to population statistics
put in taxonomic name
early validations->flag for user to check
meet back at 11:15am
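
The upload steps discussed before the break (a spec of required and optional elements, declared datatypes, and feedback to the provider when an import fails) might look roughly like the sketch below. The column names in `SPEC` are invented for illustration and are not the actual VegCSV fields.

```python
import csv
import io

# Hypothetical column spec: name -> (parser, required). Not the real VegCSV fields.
SPEC = {
    "plot_id": (str, True),
    "species_name": (str, True),
    "latitude": (float, False),
    "longitude": (float, False),
}


def validate_upload(text: str) -> list:
    """Return human-readable problems; an empty list means the file passed."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [
        f"missing required column: {name}"
        for name, (_, required) in SPEC.items()
        if required and name not in header
    ]
    if problems:
        return problems  # no point checking rows without the required columns
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, (parse, required) in SPEC.items():
            value = (row.get(name) or "").strip()
            if not value:
                if required:
                    problems.append(f"line {line_no}: {name} is empty")
                continue
            try:
                parse(value)  # datatype check for this weakly-typed field
            except ValueError:
                problems.append(
                    f"line {line_no}: {name}={value!r} is not a valid {parse.__name__}"
                )
    return problems
```

Extra columns are ignored rather than rejected, which is one way to lean toward the "loose" end of the strict vs. loose balance while still enforcing the minimal required elements.
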
end-of-pipeline crowdsourcing and user feedback issues
corrections to data
once maps visible, find issues
part of DB to store user feedback, to filter/improve data
- exclude data marked as wrong
3-4 categories to tag data with
human/automated layers to filter data
hide incorrect maps?
how to track all range maps
API level, how to store bits of feedback, annotations on content
annotations on data object in BIEN, even if data itself is not in BIEN
click range map point to mark as incorrect
users mark specimens as cultivated
visual interactions with data
collections as cultivated
every herbarium from Index Herbariorum, mark w/in radius as cultivated
shapefile with boundaries of botanical gardens
weighting by population (cities)
geospatial component, but lose info corrected around small cities
web interface: view data record, add specific comment
look at range map->validate, star rating, which are incorrect
- rating a book
occurrence, point data
cluster of points->flag as questionable
simple interface->develop more w/ feedback
people request datasets, find issues
feedback as dataset comment
talked about versioning downloads
snapshot of BIEN
comments about data within snapshot
remind user to give feedback on downloaded data
complete the loop on how data used
collate answer
don't build large infrastructure, start small
way to get feedback on invalid records
first year we have range map data for all species
BIEN 2 data->BIEN 3
which species to rerun
frameworks so don't need to build infrastructure
one-click flagging of points
feedback about record
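
A small start on storing user feedback and filtering on it could be as simple as the sketch below. The three tags come from examples in the discussion (incorrect, cultivated, questionable); the final set of 3-4 categories was not decided, and the class and method names are hypothetical.

```python
from collections import defaultdict

# Tags drawn from examples in the notes; the final categories were left open.
TAGS = {"incorrect", "cultivated", "questionable"}


class FeedbackStore:
    """In-memory stand-in for the part of the DB that stores user feedback."""

    def __init__(self):
        self._flags = defaultdict(list)  # record_id -> [(user, tag, comment)]

    def flag(self, record_id, user, tag, comment=""):
        """One-click style flag on a record, with an optional comment."""
        if tag not in TAGS:
            raise ValueError(f"unknown tag: {tag}")
        self._flags[record_id].append((user, tag, comment))

    def tags_for(self, record_id):
        """All distinct tags any user has put on this record."""
        return {tag for _, tag, _ in self._flags.get(record_id, [])}

    def filter_records(self, record_ids, exclude=("incorrect",)):
        """Drop records carrying any excluded tag, e.g. data marked as wrong."""
        excluded = set(exclude)
        return [r for r in record_ids if not (self.tags_for(r) & excluded)]
```

Because flags are keyed only by a record identifier, the same structure could hold annotations on objects whose data is not itself stored in BIEN.
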
original list
automatic downloads: downloading portions of the data
mechanism to associate user with provider
log IP address
search for things
data provider flips switch to make data public
data exports in same format
user interface for visualizing the data
hard to gauge what user community would want
needs-driven
CVS data exploration visualization

data discovery
hardening the range modeling algorithms
layer that sits on top of range modeling applications
things that people can say
filter on mapped areas, things, species
select specific fields that user wants to look at
picklist
something much more detailed: shapefile of Nat'l Park
- would be very nice app
Java
what are main search/discovery axes?
- country, spatial, temporal, taxonomic, trait, plot size, size range, habit
- BIEN2 doesn't have temporal data, because old collections are handwritten->range of years
need start/end date for collected date
date ranges: good thing about VegBank
TROPICOS also has start/end dates
fields for D/M/Y->display date
legacy data
spatial query: what's at a point
family or habit
flag what is co-occurrence data
level of granularity
how many axes to subset
- one for each column
also support ANDs/ORs
filter on axes
rainfall > x
climate filters
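
Filtering on several axes with ANDs/ORs, including comparisons such as rainfall > x, amounts to translating a higher-level request into a SQL query. A minimal sketch of that translation follows, assuming a whitelist of filterable columns and a table named `analytical_plot`; both are hypothetical. Values are passed as bound parameters rather than spliced into the SQL string.

```python
# Hypothetical whitelist of filterable columns in an analytical table.
ALLOWED_COLUMNS = {"country", "family", "habit", "rainfall", "year_start", "year_end"}
ALLOWED_OPS = {"=", "<", ">", "<=", ">="}


def build_where(node, params):
    """Recursively turn a filter tree into a parameterized SQL fragment.

    A node is either ("and" | "or", child, child, ...) or a leaf (column, op, value).
    Bound values are appended to params in the order their placeholders appear.
    """
    kind = node[0]
    if kind in ("and", "or"):
        parts = [build_where(child, params) for child in node[1:]]
        if not parts:
            raise ValueError(f"empty {kind} group")
        return "(" + f" {kind.upper()} ".join(parts) + ")"
    column, op, value = node
    if column not in ALLOWED_COLUMNS or op not in ALLOWED_OPS:
        raise ValueError(f"filter not allowed: {column} {op}")
    params.append(value)
    return f"{column} {op} ?"


def build_query(filter_tree):
    """Return (sql, params) for a SELECT-only query against the analytical table."""
    params = []
    where = build_where(filter_tree, params)
    return f"SELECT * FROM analytical_plot WHERE {where}", params
```

For example, `build_query(("and", ("country", "=", "Peru"), ("rainfall", ">", 2000)))` yields a `WHERE (country = ? AND rainfall > ?)` clause with the parameters `["Peru", 2000]`. Restricting columns and operators to a whitelist is one way to keep a public form or API limited to read-only queries.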

TROPICOS query builder
different levels of access through web interfaces
SELECT access only
interface only queries analytical database
NVS interface doesn't query core DB, instead analytical DB and metadata
avoid need to e-mail Brad to request extract
