2012-11-29 breakout groups¶

2 short presentations: Brian McGill, Mark

Brian McGill: work started before last BIEN meeting w/ geospatial portal¶

climate data
ENO
targeted to scientists
API
URL
species or lat/long list
species valid -> environmental data
corrected by TNRS
web server w/ PostGIS
species list
authorities, names
Cyril's trait data
extract
2 species, traits
switch to return all values
turn it into lat/longs
species occurrence prepended with lat/long
Quercus rubra
denormalized table
start with species -> validation -> environment layers
mean annual temperature
annual minimum temperature, precipitation
compare to abundance of another species
field: #s at points in space
raster: pixels with values filled in
anything that's spatially sampled
different kinds of interpolation
Quercus alba abundances
design diagram
get lat/long oriented list
environmental data
species list->validation
occurrence data->lat/long list
lat/long point->list of species at point (for trees)
extract from phylogeny->shapefile w/ ranges
boxes
species that intersect box
simple architecture
thin front-end on top of fancy data
targeted to scientists
practical because it returns CSV files
pipeline architecture
interpolation: linear, distance-weighted
Python
include metadata of database version
available now on server
word document
Python already has phylogeny libraries
front-end on data
use cases developed
augments BIEN->more accessible and usable
niche tool
API vs. GUI
GUI that uses web service?
point-and-click
HTML code
SB: 35 N, 120 W
RESTful API
design to go for
document that lists URL strings
will list feedback
associate with BIEN 3 website
move to NCEAS server
needs PostGIS, Python
quick demo
integrate into codebase
work with Jim's code
upgrade to FastCGI
WorldClim layers, not ENO yet
temperature, precipitation
min, max temperature, precipitation, degree days, growing season
water balance
30m digital RTM model
slope, aspect
where to serve layers
on Plone site under Tools > BIEN > informatics
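
The URL-based interface described above (species or lat/long list in, CSV out) could be sketched roughly as follows. This is a minimal illustration, not the actual service: the base URL, the parameter names (`layers`, `species`, `points`, `format`) and the layer name are all assumptions, since the notes do not record the real URL strings.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real URL strings are listed in a separate document.
BASE_URL = "http://example.org/bien/env"


def build_query_url(layers, species=None, points=None):
    """Build a GET URL for either a species list or a lat/long list."""
    if (species is None) == (points is None):
        raise ValueError("give exactly one of species or points")
    # Parameter names are assumptions made for this sketch.
    params = {"layers": ",".join(layers), "format": "csv"}
    if species is not None:
        params["species"] = ",".join(species)
    else:
        # Each point is (lat, lon) in decimal degrees; west and south are negative.
        params["points"] = ";".join(f"{lat},{lon}" for lat, lon in points)
    return BASE_URL + "?" + urlencode(params)


# The Santa Barbara example from the notes: 35 N, 120 W.
url = build_query_url(["mean_annual_temperature"], points=[(35.0, -120.0)])
print(url)
```

Keeping the whole request in one URL string is what lets the same call be documented, pasted into a browser, or scripted.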

BIEN 3 access, data sharing

Mark¶

metadata: the implicit context about the dataset
no distinct line between data and metadata
metadata is additional information you need to repurpose information
Dublin Core metadata
dataset
name, when created, where got it from
consistent names for fields for who, what, where, when, why
agencies had to have metadata
Internet was bucket of info
Federal Geographic Data Committee (FGDC)->ISO standard 1991
- geospatial standard
technically savvy researchers
spreadsheets complement one another
outreach to community on metadata standard by NCEAS, LTER (ecological stations)
- LTER needed metadata standards
can't fit metadata into filename
formal language in XML
endorsed by NCEAS, LTER, ILTER
ingestion of documents
KNB
DataONE initiative: NSF funded datanet
- supports creation of robust repositories for NSF-funded research projects
ONE = Observation Network for Earth
incorporate entire framework into DataONE
express metadata in EML, contribute it to KNB where ecologists look for data
Dryad a member of DataONE
Dryad metadata spec vs. DataONE
sufficient metadata to ingest digital object into application
more useful if had better metadata framework
BIEN is dynamic resource
exists, has this kind of content, is at this location
when any researcher publishes result
reports based on BIEN data: which version?
when get result set, unambiguous what the contents are
results can change overnight when data reloaded
versioning clarification: engineer into BIEN when data downloaded
semiannual or quarterly releases
start/stop dates
dynamic list of occurrences
timestamped update field
do for core set of occurrences
BIEN metadata on KNB site
planned level of documentation for NYBG, VegBank
how much is aggregator responsible for individualizing contributors
kinds of things that wouldn't necessarily come with the data
when upload data to GBIF, metadata provided: description, last updated date
misc info field
1st-class fields
community of developers watching for how metadata evolves
EML metadata often stored in relational database
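
One way to make a result set unambiguous about which version of BIEN it came from is to stamp every download with the database version and retrieval time. The sketch below assumes a CSV download with comment-style header lines; the function name, the header keys and the version label `2012-Q4` are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def stamp_download(rows, fieldnames, db_version, retrieved=None):
    """Return CSV text preceded by lines recording DB version and retrieval time."""
    retrieved = retrieved or datetime.now(timezone.utc)
    out = io.StringIO()
    # Header keys are assumptions for this sketch, not an agreed BIEN format.
    out.write(f"# bien_db_version: {db_version}\n")
    out.write(f"# retrieved_utc: {retrieved.isoformat()}\n")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


text = stamp_download(
    [{"species": "Quercus rubra", "lat": 35.0}],
    ["species", "lat"],
    db_version="2012-Q4",
    retrieved=datetime(2012, 11, 29, tzinfo=timezone.utc),
)
print(text)
```

A report based on the file can then cite the version line even if the data are reloaded the next day.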

main focus: BIEN 3 web interface
Brad: intro to current web interface
subgroups draft documents to inform BIEN website
4 specific subgroups at 9:15am
- web interface: 2h
- data serving policy: 2h
  - how serving data, interfacing w/ other groups
- authorship and data use agreement: 1h
  - Bob, Barbara Thiers
- traits: 1h
draft document based on discussion
data-serving policy
independent researchers, map of file
smaller groups, could be in same room
policy
herbaria data
- formal goals for serving data
credit for data contributed

Brad's presentation¶

based on experience with SALVIAS
what do we need to know to serve the data
data providers, data ownership, data participation, methodology, versioning
approximate source of data: e.g. ARIZ or GBIF
may want to know how often data was used in publications
may want to be on publications
interest in knowing errors found or enhancements to data
data ownership
- subset of people who created it
- people with continuing interest in data
who redistributed the data, who participated in publications
errors with data, changes/enhancements
conditions of use: e.g. acknowledgment, option of participating in publications
specimens: who collected the specimen data, who will want to control use of it
data participation
data quality
methodology: things you need to know about how data was collected
plots collected with one methodology can't be combined with data of another methodology
trees 10m, 50m in diameter: units?
versioning: put dataset somewhere and not change it
formal infrastructure for metadata
metadata->provenance
where data came from, may have been passed around multiple times
provenance: notion of tracking things to a source, and prior to that
determiner in NYBG, but not BIEN 3? maybe only in core DB
web interface group, data serving/metadata: 2h
BIEN authorship/data use, traits: 1h
then repopulate data serving and metadata or web interface groups

interface is what's needed to make database sustainable
primarily geared to short-term (1-2 years)
separate short-, long-term components
metadata: Barbara, Bill, Martha
authorship: Susan, Bob, Brian M
trait, authorship shorter groups->will be joining large groups
web interface in separate room
back at 11:15am
metadata group in lounge
authorship, traits in main room

Traits¶

web search tool for shapefiles
provide data in shapefile instead of raster->change coordinate system
user community wants JPEG?
what kind of data downloadable to get points used to create range map
download data inside BIEN or go back to original data
use model as search criteria
geographic ranges
issue with endangered species: have small ranges, making publicly available
does IUCN need to control something for this?
whether to embargo data
go to publication where species was described
GBIF best practices
generalized to 10km grid square
centroid
only species w/ minimum of 5 collections
could do raster grid
can't control how data is ultimately used
hide endangered species in web interface
- or don't release localities->won't be included in shapefiles
include fuzzing in coordinate uncertainty
different color of point for fuzzed points
issue for small convex hulls
everything that's not a maxent model
who sharing shapefiles with?
information available one way or another, not to hide information from researchers
provide capability of fuzzing new data
need metadata to describe that point was fuzzed
don't make endangered species publicly accessible
contact to get full data
societies list or subset
local endemism
embargo initial data
CA endangered plants vs. US
providing individual data
rasters of big data
core level needs
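
The fuzzing idea discussed above (generalize a locality to the centroid of a 10km grid square and record in metadata that the point was fuzzed) might look roughly like this. It is a simplified sketch: it uses a spherical-Earth approximation rather than a projected grid, is not valid near the poles, and the field names `fuzzed` and `coordinate_uncertainty_km` are assumptions.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def fuzz_point(lat, lon, cell_km=10.0):
    """Snap a point to the centre of its grid cell (approximate)."""
    dlat = cell_km / KM_PER_DEG_LAT
    lat_c = (math.floor(lat / dlat) + 0.5) * dlat
    # A degree of longitude shrinks with latitude; use the cell's centre latitude.
    dlon = cell_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat_c)))
    lon_c = (math.floor(lon / dlon) + 0.5) * dlon
    return lat_c, lon_c


def fuzz_record(record, cell_km=10.0):
    """Return a copy of the record with fuzzed coordinates and a metadata flag."""
    lat_c, lon_c = fuzz_point(record["lat"], record["lon"], cell_km)
    return dict(
        record,
        lat=lat_c,
        lon=lon_c,
        fuzzed=True,  # lets a map draw fuzzed points in a different colour
        # Largest distance from the cell centre to a corner (assumed definition).
        coordinate_uncertainty_km=cell_km * math.sqrt(2) / 2,
    )


original = {"species": "Quercus rubra", "lat": 35.0, "lon": -120.0}
released = fuzz_record(original)
print(released)
```

Nearby points in the same cell collapse to the same coordinates, so the exact locality cannot be recovered from the released record.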

Traits¶

traits, what to do w/ data
plots DB has public, by request, private
real, public data
link to original datasets, TRY
not TRY's dataset, but matrix with trait
trait value -> provide linkage to it
use case: what to do with public trait data
serving data in raw form
if analysis incorporates >x% of points
goodwill towards data providers
put trait data in paper
about trait data
cite BIEN whitepaper
contact info for traits
direct link to data
same data provider
Glopnet datasets access? not web access
- link to paper instead
10 years after publication?
also data bigger than Glopnet
discussion from large amount of trait community
contributing highly valuable data?
# datasets contributing: new datasets vs. appending to one big file

Metadata¶

whose data is it?
people running analyses across community
enabling data: clean, scrubbed, stdized data
providing mechanism to integrate data
don't put too many hurdles in front of data
cognizant of who data is coming from

Download tracking¶

when someone downloads info from web, count total records downloaded by community
identify provenance down to source
BIEN download format
understand relative proportions of indiv contributors
push button to notify that a paper has used someone's data
if registered to BIEN account, annual reporting
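
As a rough illustration of the download tracking above, the relative proportion of each contributor in a download can be computed from the provenance of each record. The `source` field name and the sample data are assumptions for this sketch.

```python
from collections import Counter


def contributor_shares(downloaded_records):
    """Map each source to (record count, share of the download)."""
    counts = Counter(r["source"] for r in downloaded_records)
    total = sum(counts.values())
    return {src: (n, n / total) for src, n in counts.items()}


# Made-up download: three records from ARIZ and one from GBIF.
download = [{"source": "ARIZ"}] * 3 + [{"source": "GBIF"}]
shares = contributor_shares(download)
print(shares)
```

Summing these per-download counts over a year would give the figures for an annual report to each contributor.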

NVS: dataset permissions: public, private, metadata only
case-by-case basis
statement accompanying data explaining rationale
specifications of how existing sources acknowledged
supplementary materials w/ data
citation to NVS adequate or need acknowledgment to specific data provider

back at a little past 1:30pm

Authorship¶

Susan, Bob, Brian E

get conditions from data providers
access to raw data vs derived products
mean abundance
open data: used for research but not served forward
reserved
derived products
downstream use of data
# records -> amount of involvement
BIEN-derived products: when to make accessible
for 2013, usage is exclusive to BIEN group
2014-2016: by permission usage
coauthorship requirements: active involvement in paper
final list
who from BIEN working on what
by 2017, use and cite data in derived products or level 0/1 data
need to not serve the endangered species
2020: acknowledgment of versions
straightforward levels, timeframes
in acknowledgment, have URI pointing to contributors
- put on BIEN website
contact Brian E with comments
3 years before data is completely public (w/o asking for coauthorship)
year 1: internal only embargo
level 0, level 1 subject to quantity limit
workflow to enable use of scrubbed, corrected, stdized data
embargo restrictions: need to provide data to support analysis
other data networks: need to provide data, but data was part of other network
find out how global collections repository will work
link to repository record
contact people if datasets change
maintain own contact info
phylogenetic tree is a derived product
ENO data behind website
where contact information should be stored: one place w/ contact info
- bit tags?
running list of everyone w/ contact info
e.g. using Gmail for identity
file of contact info on website: participants > people > spreadsheet
- HTML form instead of CSV
LDAP info: keeping it up to date
what DataONE is doing: create, store, authenticate
- Google account, CI login
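
The timeframes discussed above could be encoded as a simple lookup so that a web interface applies them consistently. This is only one reading of the notes: the function name and the returned labels are hypothetical, and years before 2013 are treated like 2013 as an assumption.

```python
def usage_policy(year):
    """Return the usage level for BIEN-derived products in a given year."""
    if year <= 2013:
        return "exclusive to BIEN group"  # year 1: internal-only embargo
    if year <= 2016:
        return "by permission"            # 2014-2016
    return "use and cite"                 # by 2017


for y in (2013, 2015, 2017):
    print(y, usage_policy(y))
```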

Vision for BIEN user interface¶

self-sustaining interface we can walk away from
many consultants would find this information valuable
- payment? herbaria charge for identifications
NGOs
role for education
providing data back to MO and NY
return data products back to providers
2-layer architecture: API, UI
single entry point: inner level API
higher level UI built on top of API
what is out of scope?
- data entry/correction tool
data management tool using schema
online mapping tool
in scope
- authentication
  - shouldn't have to reinvent, confederated approaches we can follow
tracking and recording of provenance
content access and control
public info, but still log IP address, date, dataset
means of logging and reporting to data owner
allow data owner to set access to data on case by case basis?
data loading and validation process
harvest DwC data
plot data: online mapping tool
- guides user through mapping process
map data provider's data to BIEN schema
save mappings in user's schema
user-defined fields to put custom data somewhere
dataset basis changes: reload whole dataset, don't edit BIEN live
report frequency distributions
valuable feedback
record initial date of upload
existing versioning tools
automatic versioning
upload plots incrementally
if existing plot dataset
immediate validations
core DB, analytical DB
automated download->error log
quality control, crowdsourcing->increase utility, accuracy of data
- mechanism to record good/bad coords
- submit comment
what functionality envisioned for crowdsourcing?
- range maps
- ranked scale for map quality
- comments for maps
raw observation->which points cultivated
expertise of outside users
annotate products
allow input data into BIEN
map star rating
search and discovery of data
location: clicking on map, defining rectangle
build custom queries
save queries to run again
need expanded schema w/ users, passwords, data access levels
ideal scenario
priorities:
1. access controls/authentication
investment in possibility of entire world commenting on 3 maps
crowdsourcing maps: use Map of Life
feedback from crowdsourcing effort
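
The 2-layer architecture above (a single inner API with a UI built on top, and logging of IP address, date and dataset even for public data) can be sketched as follows. The class, method and field names are hypothetical, and an in-memory list stands in for whatever storage the real system would use.

```python
from datetime import datetime, timezone


class BienApi:
    """Inner layer: the single entry point to the data (hypothetical class)."""

    def __init__(self, datasets):
        self._datasets = datasets  # dataset name -> list of row dicts
        self._access_log = []      # one entry per successful request

    def get_dataset(self, name, ip_address):
        rows = self._datasets[name]  # raises KeyError for unknown datasets
        self._access_log.append({
            "ip": ip_address,
            "date": datetime.now(timezone.utc).isoformat(),
            "dataset": name,
        })
        return rows

    def report_for_owner(self, name):
        """Log entries for one dataset, for reporting to its owner."""
        return [e for e in self._access_log if e["dataset"] == name]


def render_dataset_page(api, name, ip_address):
    """Outer layer: a thin UI that only talks to the API."""
    rows = api.get_dataset(name, ip_address)
    return "\n".join(", ".join(f"{k}={v}" for k, v in row.items()) for row in rows)


api = BienApi({"plots": [{"species": "Quercus alba", "count": 12}]})
page = render_dataset_page(api, "plots", "192.0.2.1")
print(page)
print(api.report_for_owner("plots"))
```

Because the UI never touches the data directly, every access path, scripted or point-and-click, goes through the same logging and access control.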

Metadata¶

document on Redmine site
table of data contributors
- link to reference website, aggregator vs. primary
refresh, develop tags/metadata associated w/ data contribution
keep in separate table on Redmine site
insiders vs. outsiders: access levels
table of data providers
cycle of handoffs of data policy restrictions
dig through several layers of strata to get data usage policies and associate w/ appropriate records in BIEN
standards about who is able to provide data to BIEN
current sources credible, but becomes an issue if public allowed to provide data
provide a template the data providers will fill out
finite # of contributors
look through websites for that info
clarify challenges, perceptions, expectations of how to adhere to data usage policies
set example by being attentive to data usage policies
relative to data providers
preservation and dissemination service
counters of downloads by record ID
subsetting of data
summarizations of data, monitors of usage
who owns data? IP law differs between countries
no copyright for data in US
15 yrs on data in EU
registries: e.g. Index Herbariorum
GRBio
unique ID in DB
get code from Index Herbariorum
information that resolves where info is from
when click URL, expect to go somewhere/get something
source of data

Methodologies¶

short list of critical method parameters
- spatial area circumscribed
talk to Bob: VegBank discussion about methods
need versioning
to what extent should BIEN be live vs. static?
correcting an error or adding new data
robust versioning
regular snapshots
- quarterly
if major corrections, new data -> new version
feedback is major added feature
motivation for people to submit data
quality control pipeline
high priority, high value enticement
W3C standards for web objects
e-mailing the error logs
methodology, versioning, conditions of use
data policies
- many different data contributors
- aggregators
GBIF: disclaimer: up to user to check special conditions
methodology, complexity described in VegX paper
presentation on the Plone site

filtered push (James Macklin)
- AppleCore
when annotate record
different DB systems, portals
consistent syntax for providing feedback to data providers

Authorship group¶

who should be included on papers
very interesting, daunting problem of who should be included
how to sort through this
Jens, Brian McGill, Barbara, Bob
who would want to use the data and be authors on it
over 50 people involved at various times
primary BIEN contact (Brian E) approves request to do a paper
if not included on core list, let Brian know
profile on core DB
login/password
who should we contact in addition
not have core, non-core separation
need way forward, core or not?
20-25 people
weight core people by involvement in last 1-2 years
can seem arbitrary who's included, also for honorary reasons
paper will say how much each author's contribution was
discussion on clarifying roles attributed to aspects of authorship

5:15pm Elsie's 4th meeting room
6:30pm group reservation at Chuck's (near Brophy's, Marina)
tomorrow morning: group picture
homework:
- what you would prioritize for drafts, analysis
last afternoon group meeting
- future of BIEN
- next meeting?
- priorities for 2013
marching orders for 2013
what to prioritize for Friday

trait paper
whitepaper
prioritize goals, wishlist for metadata group
walk through schema
after 2:30pm meeting

Potential validations¶

See attached Potential validations on BIEN data 29 Nov.docx

no summary today
meet tomorrow at 8:30am
