2012-11-29 breakout groups
- 2 short presentations: Brian McGill, Mark
Brian McGill: work started before last BIEN meeting w/ geospatial portal
- climate data
- ENO
- targeted to scientists
- API
- URL
- species or lat/long list
- species valid -> environmental data
- corrected by TNRS
- web server w/ PostGIS
- species list
- authorities, names
- Cyril's trait data
- extract
- 2 species, traits
- switch to return all values
- turn it into lat/longs
- species occurrence prepended with lat/long
- Quercus rubra
- denormalized table
- start with species -> validation -> environment layers
- mean annual temperature
- annual minimum temperature, precipitation
- compare to abundance of another species
- field: #s at points in space
- raster: pixels with values filled in
- anything that's spatially sampled
- different kinds of interpolation
- Quercus alba abundances
- design diagram
- get lat/long oriented list
- environmental data
- species list->validation
- occurrence data->lat/long list
- lat/long point->list of species at point (for trees)
- extract from phylogeny->shapefile w/ ranges
- boxes
- species that intersect box
- simple architecture
- thin front-end on top of fancy data
- targeted to scientists
- practical because it returns CSV files
- pipeline architecture
- interpolation: linear, distance-weighted
- Python
- include metadata of database version
- available now on server
- word document
- Python already has phylogeny libraries
- front-end on data
- use cases developed
- augments BIEN->more accessible and usable
- niche tool
- API vs. GUI
- GUI that uses web service?
- point-and-click
- HTML code
- SB: 35 N, 120 W
- RESTful API
- design to go for
- document that lists URL strings
- will list feedback
- associate with BIEN 3 website
- move to NCEAS server
- needs PostGIS, Python
- quick demo
- integrate into codebase
- work with Jim's code
- upgrade to FastCGI
- WorldClim layers, not ENO yet
- temperature, precipitation
- min, max temperature, precipitation, degree days, growing season
- water balance
- 30m digital RTM model
- slope, aspect
- where to serve layers
- on Plone site under Tools > BIEN > informatics
- BIEN 3 access, data sharing
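A minimal sketch of the distance-weighted interpolation mentioned above, for estimating a value at a lat/long point from a field of sampled points. The function name, the sample tuple layout, and the use of planar distance in degrees are all assumptions made for illustration; nothing here is taken from the actual portal code.

```python
import math


def idw_interpolate(samples, lat, lon, power=2.0):
    """Inverse-distance-weighted estimate at (lat, lon).

    samples: iterable of (lat, lon, value) tuples (hypothetical layout).
    Distance is planar, in degrees, which is only reasonable over small areas.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for s_lat, s_lon, value in samples:
        dist = math.hypot(s_lat - lat, s_lon - lon)
        if dist == 0.0:
            # The query point coincides with a sample: return it directly.
            return value
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one sample is required")
    return weighted_sum / weight_total
```

A point midway between two samples gets the mean of their values; a raster lookup would replace this with a direct pixel read.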
Mark
- metadata: the implicit context about the dataset
- no distinct line between data and metadata
- metadata is additional information you need to repurpose information
- Dublin Core metadata
- dataset
- name, when created, where got it from
- consistent names for fields for who, what, where, when, why
- agencies had to have metadata
- Internet was bucket of info
- Federal Geographic Data Committee (FGDC)->ISO standard 1991
- geospatial standard
- technically savvy researchers
- spreadsheets complement one another
- outreach to community on metadata standard by NCEAS, LTER (ecological stations)
- LTER needed metadata standards
- can't fit metadata into filename
- formal language in XML
- endorsed by NCEAS, LTER, ITER
- ingestion of documents
- KNB
- DataONE initiative: NSF funded datanet
- supports creation of robust repositories for NSF-funded research projects
- ONE = Observation Network for Earth
- incorporate entire framework into DataONE
- express metadata in EML, contribute it to KNB where ecologists look for data
- Dryad a member of DataONE
- Dryad metadata spec vs. DataONE
- sufficient metadata to ingest digital object into application
- more useful if had better metadata framework
- BIEN is dynamic resource
- exists, has this kind of content, is at this location
- when any researcher publishes result
- reports based on BIEN data: which version?
- when get result set, unambiguous what the contents are
- results can change overnight when data reloaded
- versioning clarification: engineer into BIEN when data downloaded
- semiannual or quarterly releases
- start/stop dates
- dynamic list of occurrences
- timestamped update field
- do for core set of occurrences
- BIEN metadata on KNB site
- planned level of documentation for NYBG, VegBank
- how much is aggregator responsible for individualizing contributors
- kinds of things that wouldn't necessarily come with the data
- when upload data to GBIF, metadata provided: description, last updated date
- misc info field
- 1st-class fields
- community of developers watching for how metadata evolves
- EML metadata often stored in relational database
- main focus: BIEN 3 web interface
- Brad: intro to current web interface
- subgroups draft documents to inform BIEN website
- 4 specific subgroups at 9:15am
- web interface: 2h
- data serving policy: 2h
- how serving data, interfacing w/ other groups
- authorship and data use agreement: 1h
- Bob, Barbara Thiers
- traits: 1h
- draft document based on discussion
- data-serving policy
- independent researchers, map of file
- smaller groups, could be in same room
- policy
- herbaria data
- formal goals for serving data
- credit for data contributed
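One way the start/stop dates discussed above could make a result set reproducible is sketched here: each occurrence record carries a validity interval, and a query asks for the records that were current on a given release date. The field names `start` and `stop` and the function name are hypothetical, not part of the BIEN schema.

```python
from datetime import date


def snapshot(records, as_of):
    """Return the records that were valid on the date `as_of`.

    Each record is a dict with a `start` date and a `stop` date
    (hypothetical field names); `stop` is None while the record is current.
    """
    return [
        r for r in records
        if r["start"] <= as_of and (r["stop"] is None or as_of < r["stop"])
    ]
```

Citing the `as_of` date alongside a result would then answer the "which version?" question without freezing the whole database.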
Brad's presentation
- based on experience with SALVIAS
- what do we need to know to serve the data
- data providers, data ownership, data participation, methodology, versioning
- approximate source of data: e.g. ARIZ or GBIF
- may want to know how often data was used in publications
- may want to be on publications
- interest in knowing errors found or enhancements to data
- data ownership
- subset of people who created it
- people with continuing interest in data
- who redistributed the data, who participated in publications
- errors with data, changes/enhancements
- conditions of use: e.g. acknowledgment, option of participating in publications
- specimens: who collected the specimen data, who will want to control use of it
- data participation
- data quality
- methodology: things you need to know about how data was done
- plots collected with one methodology can't be combined with data of another methodology
- trees 10m, 50m in diameter: units?
- versioning: put dataset somewhere and not change it
- formal infrastructure for metadata
- metadata->provenance
- where data came from, may have been passed around multiple times
- provenance: notion of tracking things to a source, and prior to that
- determiner in NYBG, but not BIEN 3? maybe only in core DB
- web interface group, data serving/metadata: 2h
- BIEN authorship/data use, traits: 1h
- then repopulate data serving and metadata or web interface groups
- interface is what's needed to make database sustainable
- primarily geared to short-term (1-2 years)
- separate short-, long-term components
- metadata: Barbara, Bill, Martha
- authorship: Susan, Bob, Brian M
- trait, authorship shorter groups->will be joining large groups
- web interface in separate room
- back at 11:15am
- metadata group in lounge
- authorship, traits in main room
Traits
- web search tool for shapefiles
- provide data in shapefile instead of raster->change coordinate system
- user community wants JPEG?
- what kind of data downloadable to get points used to create range map
- download data inside BIEN or go back to original data
- use model as search criteria
- geographic ranges
- issue with endangered species: have small ranges, making publicly available
- does IUCN need to control something for this?
- whether to embargo data
- go to publication where species was described
- GBIF best practices
- generalized to 10km grid square
- centroid
- only species w/ minimum of 5 collections
- could do raster grid
- can't control how data is ultimately used
- hide endangered species in web interface
- or don't release localities->won't be included in shapefiles
- include fuzzing in coordinate uncertainty
- different color of point for fuzzed points
- issue for small convex hulls
- everything that's not a maxent model
- who sharing shapefiles with?
- information available one way or another, not to hide information from researchers
- provide capability of fuzzing new data
- need metadata to describe that point was fuzzed
- don't make endangered species publicly accessible
- contact to get full data
- societies list or subset
- local endemism
- embargo initial data
- CA endangered plants vs. US
- providing individual data
- rasters of big data
- core level needs
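The fuzzing ideas above (generalize to a grid square, report the centroid, flag the point as fuzzed, and only release species with at least 5 collections) could look roughly like this. The 0.1 degree cell is a stand-in assumption for the 10 km grid square, since a true 10 km grid needs a projected coordinate system; the function and field names are also hypothetical.

```python
import math

CELL_DEG = 0.1  # assumption: rough stand-in for a 10 km grid square


def fuzz_point(lat, lon, cell=CELL_DEG):
    """Snap a point to the centroid of its grid cell and mark it as fuzzed."""
    def centre(x):
        return round(math.floor(x / cell) * cell + cell / 2, 6)

    return {
        "lat": centre(lat),
        "lon": centre(lon),
        "fuzzed": True,    # metadata so users know the point was generalized
        "cell_deg": cell,  # cell size, usable as a coordinate uncertainty hint
    }


def releasable(collection_counts, minimum=5):
    """Return species with at least `minimum` collections, sorted by name."""
    return sorted(s for s, n in collection_counts.items() if n >= minimum)
```

Keeping the `fuzzed` flag with each point is what would let a map draw fuzzed points in a different color.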
Traits
- traits, what to do w/ data
- plots DB has public, by request, private
- real, public data
- link to original datasets, TRY
- not TRY's dataset, but matrix with trait
- trait value -> provide linkage to it
- use case: what to do with public trait data
- serving data in raw form
- if analysis incorporates >x% of points
- goodwill towards data providers
- put trait data in paper
- about trait data
- cite BIEN whitepaper
- contact info for traits
- direct link to data
- same data provider
- Glopnet datasets access? not web access
- link to paper instead
- 10 years after publication?
- also data bigger than Glopnet
- discussion from large amount of trait community
- contributing highly valuable data?
- # datasets contributing: new datasets vs. appending to one big file
Metadata
- whose data is it?
- people running analyses across community
- enabling data: clean, scrubbed, stdized data
- providing mechanism to integrate data
- don't put too many hurdles in front of data
- cognizant of who data is coming from
Download tracking
- when someone downloads info from web, count total records downloaded by community
- identify provenance down to source
- BIEN download format
- understand relative proportions of indiv contributors
- push button to notify that a paper has used someone's data
- if registered to BIEN account, annual reporting
- NVS: dataset permissions: public, private, metadata only
- case-by-case basis
- statement accompanying data explaining rationale
- specifications of how existing sources acknowledged
- supplementary materials w/ data
- citation to NVS adequate or need acknowledgment to specific data provider
- back at a little past 1:30pm
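A small sketch of how the relative proportions of individual contributors in a download could be computed, assuming each downloaded record carries a `source` key naming its provider. That key name and the function name are assumptions for illustration, not the BIEN download format.

```python
from collections import Counter


def contributor_shares(downloaded_records):
    """Return each source's share of the downloaded records.

    downloaded_records: iterable of dicts with a `source` key
    (hypothetical field name). Returns {source: fraction of records}.
    """
    counts = Counter(r["source"] for r in downloaded_records)
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}
```

The same counts, accumulated per account, could feed the annual reporting to data providers.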
Authorship
Susan, Bob, Brian E
- get conditions from data providers
- access to raw data vs derived products
- mean abundance
- open data: used for research but not served forward
- reserved
- derived products
- downstream use of data
- # records -> amount of involvement
- BIEN-derived products: when to make accessible
- for 2013, usage is exclusive to BIEN group
- 2014-2016: by permission usage
- coauthorship requirements: active involvement in paper
- final list
- who from BIEN working on what
- by 2017, use and cite data in derived products or level 0/1 data
- need to not serve the endangered species
- 2020: acknowledgment of versions
- straightforward levels, timeframes
- in acknowledgment, have URI pointing to contributors
- put on BIEN website
- contact Brian E with comments
- 3 years before data is completely public (w/o asking for coauthorship)
- year 1: internal only embargo
- level 0, level 1 subject to quantity limit
- workflow to enable use of scrubbed, corrected, stdized data
- embargo restrictions: need to provide data to support analysis
- other data networks: need to provide data, but data was part of other network
- find out how global collections repository will work
- link to repository record
- contact people if datasets change
- maintain own contact info
- phylogenetic tree is a derived product
- ENO data behind website
- where contact information should be stored: one place w/ contact info
- bit tags?
- running list of everyone w/ contact info
- e.g. using Gmail for identity
- file of contact info on website: participants > people > spreadsheet
- HTML form instead of CSV
- LDAP info: keeping it up to date
- what DataONE is doing: create, store, authenticate
- Google account, CI login
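The access timeframes above could be encoded as a simple lookup so that the website and the API apply the same rule. This is only one reading of the notes (2013 exclusive, 2014-2016 by permission, 2017 onward use and cite); the function name and the returned labels are hypothetical.

```python
def access_policy(year):
    """Return the access rule for BIEN-derived products in a given year.

    Assumption: years before 2013 are treated like 2013 (internal only).
    """
    if year <= 2013:
        return "exclusive to BIEN group"
    if year <= 2016:
        return "by permission"
    return "use and cite"
```

Endangered species would still need to be excluded separately, whatever the year.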
Vision for BIEN user interface
- self-sustaining interface can walk away from
- many consultants would find this information valuable
- payment? herbaria charge for identifications
- NGOs
- role for education
- providing data back to MO and NY
- return data products back to providers
- 2-layer architecture: API, UI
- single entry point: inner level API
- higher level UI built on top of API
- what is out of scope?
- data entry/correction tool
- data management tool using schema
- online mapping tool
- in scope
- authentication
- shouldn't have to reinvent, confederated approaches we can follow
- authentication
- tracking and recording of provenance
- content access and control
- public info, but still log IP address, date, dataset
- means of logging and reporting to data owner
- allow data owner to set access to data on case by case basis?
- data loading and validation process
- harvest DwC data
- plot data: online mapping tool
- guides user through mapping process
- map data provider's data to BIEN schema
- save mappings in user's schema
- user-defined fields to put custom data somewhere
- dataset basis changes: reload whole dataset, don't edit BIEN live
- report frequency distributions
- valuable feedback
- record initial date of upload
- existing versioning tools
- automatic versioning
- upload plots incrementally
- if existing plot dataset
- immediate validations
- core DB, analytical DB
- automated download->error log
- quality control, crowdsourcing->increase utility, accuracy of data
- mechanism to record good/bad coords
- submit comment
- what functionality envisioned for crowdsourcing?
- range maps
- ranked scale for map quality
- comments for maps
- raw observation->which points cultivated
- expertise of outside users
- annotate products
- allow input data into BIEN
- map star rating
- search and discovery of data
- location: clicking on map, defining rectangle
- build custom queries
- save queries to run again
- need expanded schema w/ users, passwords, data access levels
- ideal scenario
- priorities:
- access controls/authentication
- investment in possibility of entire world commenting on 3 maps
- crowdsourcing maps: use Map of Life
- feedback from crowdsourcing effort
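The rectangle search described above (define a box on the map, get the species found inside it) reduces to a bounding-box filter over occurrence points. The sketch below is in plain Python with a hypothetical tuple layout; a real implementation would more likely push this into a PostGIS query, and it does not handle boxes that cross the antimeridian.

```python
def species_in_box(occurrences, south, west, north, east):
    """Return the sorted species names with a point inside the rectangle.

    occurrences: iterable of (species, lat, lon) tuples (hypothetical layout).
    Bounds are inclusive and given in decimal degrees.
    """
    found = set()
    for species, lat, lon in occurrences:
        if south <= lat <= north and west <= lon <= east:
            found.add(species)
    return sorted(found)
```

Saving the four bounds together with the query would be enough to rerun the same search later.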
Metadata
- document on Redmine site
- table of data contributors
- link to reference website, aggregator vs. primary
- refresh, develop tags/metadata associated w/ data contribution
- keep in separate table on Redmine site
- insiders vs. outsiders: access levels
- table of data providers
- cycle of handoffs of data policy restrictions
- dig through several layers of strata to get data usage policies and associate w/ appropriate records in BIEN
- standards about who is able to provide data to BIEN
- current sources credible, but if allow public to provide data becomes an issue
- provide a template the data providers will fill out
- finite # of contributors
- look through websites for that info
- clarify challenges, perceptions, expectations how adhere to data usage policies
- set example by being attentive to data usage policies
- relative to data providers
- preservation and dissemination service
- counters of downloads by record ID
- subsetting of data
- summarizations of data, monitors of usage
- who owns data? IP law differs between countries
- no copyright for data in US
- 15 yrs on data in EU
- registries: e.g. Index Herbariorum
- GRBio
- unique ID in DB
- get code from Index Herbariorum
- information that resolves where info is from
- when click URL, expect go somewhere/get something
- source of data
Methodologies
- short list of critical method parameters
- spatial area circumscribed
- talk to Bob: VegBank discussion about methods
- need versioning
- to what extent should BIEN be live vs. static?
- correcting an error or adding new data
- robust versioning
- regular snapshots
- quarterly
- if major corrections, new data -> new version
- feedback is major added feature
- motivation for people to submit data
- quality control pipeline
- high priority, high value enticement
- W3C standards for web objects
- e-mailing the error logs
- methodology, versioning, conditions of use
- data policies
- many different data contributors
- aggregators
- GBIF: disclaimer: up to user to check special conditions
- methodology, complexity described in VegX paper
- presentation on the Plone site
- filtered push (James Macklin)
- AppleCore
- when annotate record
- different DB systems, portals
- consistent syntax for providing feedback to data providers
Authorship group
- who should be included on papers
- very interesting, daunting problem of who should be included
- how to sort through this
- Jens, Brian McGill, Barbara, Bob
- who would want to use the data and be authors on it
- over 50 people involved at various times
- primary BIEN contact (Brian E) approves request to do a paper
- if not included on core list, let Brian know
- profile on core DB
- login/password
- who should we contact in addition
- not have core, non-core separation
- need way forward, core or not?
- 20-25 people
- weight core people by involvement in last 1-2 years
- can seem arbitrary who's included, also for honorary reasons
- paper will say how much each author's contribution was
- discussion on clarifying roles attributed to aspects of authorship
- 5:15pm Elsie's 4th meeting room
- 6:30pm group reservation at Chuck's (near Brophy's, Marina)
- tomorrow morning: group picture
- homework:
- what you would prioritize for drafts, analysis
- last afternoon group meeting
- future of BIEN
- next meeting?
- priorities for 2013
- marching orders for 2013
- what to prioritize for Friday
- trait paper
- whitepaper
- prioritize goals, wishlist for metadata group
- walk through schema
- after 2:30pm meeting
Potential validations
See attached Potential validations on BIEN data 29 Nov.docx
- no summary today
- meet tomorrow at 8:30am