2012-11-29 breakout groups

  • 2 short presentations: Brian McGill, Mark

Brian McGill: work started before last BIEN meeting w/ geospatial portal

  • climate data
  • ENO
  • targeted to scientist
  • API
  • URL
  • species or lat/long list
  • species valid -> environmental data
  • corrected by TNRS
  • web server w/ PostGIS
  • species list
  • authorities, names
  • Cyril's trait data
  • extract
  • 2 species, traits
  • switch to return all values
  • turn it into lat/longs
  • species occurrence prepended with lat/long
  • Quercus rubra
  • denormalized table
  • start with species -> validation -> environment layers
  • mean annual temperature
  • annual minimum temperature, precipitation
  • compare to abundance of another species
  • field: #s at points in space
  • raster: pixels with values filled in
  • anything that's spatially sampled
  • different kinds of interpolation
  • Quercus alba abundances
  • design diagram
  • get lat/long oriented list
  • environmental data
  • species list->validation
  • occurrence data->lat/long list
  • lat/long point->list of species at point (for trees)
  • extract from phylogeny->shapefile w/ ranges
  • boxes
  • species that intersect box
  • simple architecture
  • thin front-end on top of fancy data
  • targeted to scientists
  • practical because it returns CSV files
  • pipeline architecture
  • interpolation: linear, distance-weighted
  • Python
  • include metadata of database version
  • available now on server
  • word document
  • Python already has phylogeny libraries
  • front-end on data
  • use cases developed
  • augments BIEN->more accessible and usable
  • niche tool
  • API vs. GUI
  • GUI that uses web service?
  • point-and-click
  • HTML code
  • SB: 35 N, 120 W
  • RESTful API
  • design to go for
  • document that lists URL strings
  • will list feedback
  • associate with BIEN 3 website
  • move to NCEAS server
  • needs PostGIS, Python
  • quick demo
  • integrate into codebase
  • work with Jim's code
  • upgrade to FastCGI
  • WorldClim layers, not ENO yet
  • temperature, precipitation
  • min, max temperature, precipitation, degree days, growing season
  • water balance
  • 30m digital RTM model
  • slope, aspect
  • where to serve layers
  • on Plone site under Tools > BIEN > informatics
  • BIEN 3 access, data sharing
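
As a rough sketch of the URL-and-CSV pattern described above, a client builds a query URL from a species list and parses the returned denormalized table. The endpoint, parameter names, and layer codes here are hypothetical illustrations, not the actual service:

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical base URL -- the real endpoint lives on the NCEAS server
BASE_URL = "http://example.org/bien/api/extract"

def build_query_url(species, layers=("MAT", "precip")):
    """Build a RESTful query URL from a species list and environmental layers."""
    params = {"species": ",".join(species), "layers": ",".join(layers)}
    return BASE_URL + "?" + urlencode(params)

def parse_csv_response(text):
    """Parse the CSV body returned by the service into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

url = build_query_url(["Quercus rubra", "Quercus alba"])
# A response might look like this denormalized table:
sample = "species,lat,long,MAT,precip\nQuercus rubra,35.0,-120.0,14.2,510\n"
rows = parse_csv_response(sample)
```

Returning plain CSV keeps the front end thin, per the "thin front-end on top of fancy data" design above.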

Mark

  • metadata: the implicit context about the dataset
  • no distinct line between data and metadata
  • metadata is additional information you need to repurpose information
  • Dublin Core metadata
  • dataset
  • name, when created, where got it from
  • consistent names for fields for who, what, where, when, why
  • agencies had to have metadata
  • Internet was bucket of info
  • Federal Geographic Data Committee (FGDC)->ISO standard 1991
    • geospatial standard
  • technically savvy researchers
  • spreadsheets complement one another
  • outreach to community on metadata standard by NCEAS, LTER (ecological stations)
    • LTER needed metadata standards
  • can't fit metadata into filename
  • formal language in XML
  • endorsed by NCEAS, LTER, ILTER
  • ingestion of documents
  • KNB
  • DataONE initiative: NSF-funded DataNet
    • supports creation of robust repositories for NSF-funded research projects
  • DataONE = Data Observation Network for Earth
  • incorporate entire framework into DataONE
  • express metadata in EML, contribute it to KNB where ecologists look for data
  • Dryad a member of DataONE
  • Dryad metadata spec vs. DataONE
  • sufficient metadata to ingest digital object into application
  • more useful if had better metadata framework
  • BIEN is dynamic resource
  • exists, has this kind of content, is at this location
  • when any researcher publishes result
  • reports based on BIEN data: which version?
  • when get result set, unambiguous what the contents are
  • results can change overnight when data reloaded
  • versioning clarification: engineer into BIEN when data downloaded
  • semiannual or quarterly releases
  • start/stop dates
  • dynamic list of occurrences
  • timestamped update field
  • do for core set of occurrences
  • BIEN metadata on KNB site
  • planned level of documentation for NYBG, VegBank
  • how much is aggregator responsible for individualizing contributors
  • kinds of things that wouldn't necessarily come with the data
  • when upload data to GBIF, metadata provided: description, last updated date
  • misc info field
  • 1st-class fields
  • community of developers watching for how metadata evolves
  • EML metadata often stored in relational database
  • main focus: BIEN 3 web interface
  • Brad: intro to current web interface
  • subgroups draft documents to inform BIEN website
  • 4 specific subgroups at 9:15am
    • web interface: 2h
    • data serving policy: 2h
      • how serving data, interfacing w/ other groups
    • authorship and data use agreement: 1h
      • Bob, Barbara Thiers
    • traits: 1h
  • draft document based on discussion
  • data-serving policy
  • independent researchers, map of file
  • smaller groups, could be in same room
  • policy
  • herbaria data
    • formal goals for serving data
  • credit for data contributed
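
The EML/Dublin Core idea above — formal metadata expressed in XML, contributed to KNB — can be sketched in Python. The element names here are simplified illustrations, not the real EML schema:

```python
import xml.etree.ElementTree as ET

# Schematic EML-style dataset metadata: name, when created, where it came
# from, geographic coverage. Real EML is defined by the KNB schema.
dataset = ET.Element("dataset")
ET.SubElement(dataset, "title").text = "BIEN occurrence core, release 2012-Q4"
creator = ET.SubElement(dataset, "creator")
ET.SubElement(creator, "organizationName").text = "BIEN working group"
ET.SubElement(dataset, "pubDate").text = "2012-11-29"
coverage = ET.SubElement(dataset, "coverage")
ET.SubElement(coverage, "geographicDescription").text = "New World vascular plants"

xml_text = ET.tostring(dataset, encoding="unicode")
```

Because the metadata is a formal XML document rather than a filename convention, it can be ingested, indexed, and searched by repositories such as KNB.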

Brad's presentation

  • based on experience with SALVIAS
  • what do we need to know to serve the data
  • data providers, data ownership, data participation, methodology, versioning
  • approximate source of data: e.g. ARIZ or GBIF
  • may want to know how often data was used in publications
  • may want to be on publications
  • interest in knowing errors found or enhancements to data
  • data ownership
    • subset of people who created it
    • people with continuing interest in data
  • who redistributed the data, who participated in publications
  • errors with data, changes/enhancements
  • conditions of use: e.g. acknowledgment, option of participating in publications
  • specimens: who collected the specimen data, who will want to control use of it
  • data participation
  • data quality
  • methodology: things you need to know about how data was done
  • plots collected with one methodology can't be combined with data of another methodology
  • trees 10m, 50m in diameter: units?
  • versioning: put dataset somewhere and not change it
  • formal infrastructure for metadata
  • metadata->provenance
  • where data came from; may have been passed around multiple times
  • provenance: notion of tracking things to a source, and prior to that
  • determiner in NYBG, but not BIEN 3? maybe only in core DB
  • web interface group, data serving/metadata: 2h
  • BIEN authorship/data use, traits: 1h
  • then repopulate data serving and metadata or web interface groups
  • interface is what's needed to make database sustainable
  • primarily geared to short-term (1-2 years)
  • separate short-, long-term components
  • metadata: Barbara, Bill, Martha
  • authorship: Susan, Bob, Brian M
  • trait, authorship shorter groups->will be joining large groups
  • web interface in separate room
  • back at 11:15am
  • metadata group in lounge
  • authorship, traits in main room

Traits

  • web search tool for shapefiles
  • provide data in shapefile instead of raster->change coordinate system
  • user community wants JPEG?
  • what kind of data downloadable to get points used to create range map
  • download data inside BIEN or go back to original data
  • use model as search criteria
  • geographic ranges
  • issue with endangered species: have small ranges, making publicly available
  • does IUCN need to control something for this?
  • whether to embargo data
  • go to publication where species was described
  • GBIF best practices
  • generalized to 10km grid square
  • centroid
  • only species w/ minimum of 5 collections
  • could do raster grid
  • can't control how data is ultimately used
  • hide endangered species in web interface
    • or don't release localities->won't be included in shapefiles
  • include fuzzing in coordinate uncertainty
  • different color of point for fuzzed points
  • issue for small convex hulls
  • everything that's not a maxent model
  • who sharing shapefiles with?
  • information available one way or another, not to hide information from researchers
  • provide capability of fuzzing new data
  • need metadata to describe that point was fuzzed
  • don't make endangered species publicly accessible
  • contact to get full data
  • societies list or subset
  • local endemism
  • embargo initial data
  • CA endangered plants vs. US
  • providing individual data
  • rasters of big data
  • core level needs
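
The GBIF-style generalization discussed above — snapping sensitive localities to a 10km grid centroid and recording the fuzzing in the coordinate uncertainty — might look like the sketch below. The 0.1-degree cell size and metres-per-degree constant are rough assumptions:

```python
import math

def fuzz_to_grid(lat, lon, cell_deg=0.1):
    """Snap a coordinate to the centroid of its grid cell (~10 km at the
    equator for 0.1 degrees) and return an uncertainty radius so the
    fuzzing is documented in metadata rather than hidden."""
    lat_c = (math.floor(lat / cell_deg) + 0.5) * cell_deg
    lon_c = (math.floor(lon / cell_deg) + 0.5) * cell_deg
    # rough half-cell size in metres, assuming ~111 km per degree
    uncertainty_m = cell_deg * 111_000 / 2
    return round(lat_c, 6), round(lon_c, 6), uncertainty_m
```

Downstream, fuzzed points can then be flagged (e.g. shown in a different color, per the notes) because the uncertainty field marks them explicitly.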

Traits

  • traits, what to do w/ data
  • plots DB has public, by request, private
  • real, public data
  • link to original datasets, TRY
  • not TRY's dataset, but matrix with trait
  • trait value -> provide linkage to it
  • use case: what to do with public trait data
  • serving data in raw form
  • if analysis incorporates >x% of points
  • goodwill towards data providers
  • put trait data in paper
  • about trait data
  • cite BIEN whitepaper
  • contact info for traits
  • direct link to data
  • same data provider
  • Glopnet datasets access? not web access
    • link to paper instead
  • 10 years after publication?
  • also data bigger than Glopnet
  • discussion from large amount of trait community
  • contributing highly valuable data?
  • # datasets contributing: new datasets vs. appending to one big file

Metadata

  • whose data is it?
  • people running analyses across community
  • enabling data: clean, scrubbed, standardized data
  • providing mechanism to integrate data
  • don't put too many hurdles in front of data
  • cognizant of who data is coming from

Download tracking

  • when someone downloads info from web, count total records downloaded by community
  • identify provenance down to source
  • BIEN download format
  • understand relative proportions of indiv contributors
  • push button to notify that a paper has used someone's data
  • if registered to BIEN account, annual reporting
  • NVS: dataset permissions: public, private, metadata only
  • case-by-case basis
  • statement accompanying data explaining rationale
  • specifications of how existing sources acknowledged
  • supplementary materials w/ data
  • citation to NVS adequate or need acknowledgment to specific data provider
  • back at a little past 1:30pm
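
A minimal sketch of the download tracking described above — counting records downloaded by the community, with provenance resolved to each contributor so relative proportions can be reported annually. Field names here are hypothetical:

```python
from collections import Counter
from datetime import datetime, timezone

def log_download(log, user, records):
    """Record one download event: who, when, and how many records per provider."""
    log.append({
        "user": user,
        "when": datetime.now(timezone.utc).isoformat(),
        "by_provider": Counter(r["provider"] for r in records),
    })

def provider_totals(log):
    """Total records downloaded by the community, per contributor --
    the basis for reporting back to data owners."""
    totals = Counter()
    for entry in log:
        totals.update(entry["by_provider"])
    return totals

log = []
records = [{"id": 1, "provider": "ARIZ"}, {"id": 2, "provider": "GBIF"},
           {"id": 3, "provider": "ARIZ"}]
log_download(log, "user@example.org", records)
```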

Authorship

Susan, Bob, Brian E

  • get conditions from data providers
  • access to raw data vs derived products
  • mean abundance
  • open data: used for research but not served forward
  • reserved
  • derived products
  • downstream use of data
  • # records -> amount of involvement
  • BIEN-derived products: when to make accessible
  • for 2013, usage is exclusive to BIEN group
  • 2014-2016: by permission usage
  • coauthorship requirements: active involvement in paper
  • final list
  • who from BIEN working on what
  • by 2017, use and cite data in derived products or level 0/1 data
  • need to not serve the endangered species
  • 2020: acknowledgment of versions
  • straightforward levels, timeframes
  • in acknowledgment, have URI pointing to contributors
    • put on BIEN website
  • contact Brian E with comments
  • 3 years before data is completely public (w/o asking for coauthorship)
  • year 1: internal only embargo
  • level 0, level 1 subject to quantity limit
  • workflow to enable use of scrubbed, corrected, standardized data
  • embargo restrictions: need to provide data to support analysis
  • other data networks: need to provide data, but data was part of other network
  • find out how global collections repository will work
  • link to repository record
  • contact people if datasets change
  • maintain own contact info
  • phylogenetic tree is a derived product
  • ENO data behind website
  • where contact information should be stored: one place w/ contact info
    • bit tags?
  • running list of everyone w/ contact info
  • e.g. using Gmail for identity
  • file of contact info on website: participants > people > spreadsheet
    • HTML form instead of CSV
  • LDAP info: keeping it up to date
  • what DataONE is doing: create, store, authenticate
    • Google account, CI login

Vision for BIEN user interface

  • self-sustaining interface can walk away from
  • many consultants would find this information valuable
    • payment? herbaria charge for identifications
  • NGOs
  • role for education
  • providing data back to MO and NY
  • return data products back to providers
  • 2-layer architecture: API, UI
  • single entry point: inner level API
  • higher level UI built on top of API
  • what is out of scope?
    • data entry/correction tool
  • data management tool using schema
  • online mapping tool
  • in scope
    • authentication
      • shouldn't have to reinvent, confederated approaches we can follow
  • tracking and recording of provenance
  • content access and control
  • public info, but still log IP address, date, dataset
  • means of logging and reporting to data owner
  • allow data owner to set access to data on case by case basis?
  • data loading and validation process
  • harvest DwC data
  • plot data: online mapping tool
    • guides user through mapping process
  • map data provider's data to BIEN schema
  • save mappings in user's schema
  • user-defined fields to put custom data somewhere
  • dataset basis changes: reload whole dataset, don't edit BIEN live
  • report frequency distributions
  • valuable feedback
  • record initial date of upload
  • existing versioning tools
  • automatic versioning
  • upload plots incrementally
  • if existing plot dataset
  • immediate validations
  • core DB, analytical DB
  • automated download->error log
  • quality control, crowdsourcing->increase utility, accuracy of data
    • mechanism to record good/bad coords
    • submit comment
  • what functionality envisioned for crowdsourcing?
    • range maps
    • ranked scale for map quality
    • comments for maps
  • raw observation->which points cultivated
  • expertise of outside users
  • annotate products
  • allow input data into BIEN
  • map star rating
  • search and discovery of data
  • location: clicking on map, defining rectangle
  • build custom queries
  • save queries to run again
  • need expanded schema w/ users, passwords, data access levels
  • ideal scenario
  • priorities:
    1. access controls/authentication
  • investment in possibility of entire world commenting on 3 maps
  • crowdsourcing maps: use Map of Life
  • feedback from crowdsourcing effort
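
The loading pipeline ideas above — map a provider's columns to the BIEN schema, route unmapped columns into user-defined fields, and run immediate validations that feed an error log back to the provider — could be sketched like this. Field names and ranges are illustrative assumptions:

```python
# Hypothetical column mapping from one provider's fields to BIEN schema
# names; mappings would be saved per user, as the notes describe.
MAPPING = {"SpeciesName": "scientific_name", "Lat": "latitude", "Lon": "longitude"}

def map_row(row, mapping=MAPPING):
    """Rename mapped fields; stash everything else as user-defined data."""
    mapped, custom = {}, {}
    for field, value in row.items():
        (mapped if field in mapping else custom)[mapping.get(field, field)] = value
    mapped["user_defined"] = custom
    return mapped

def validate(row, errors):
    """Immediate validations on upload; failures go to an error log
    that is reported back to the provider."""
    try:
        lat, lon = float(row["latitude"]), float(row["longitude"])
    except (KeyError, ValueError):
        errors.append((row, "bad or missing coordinates"))
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        errors.append((row, "coordinates out of range"))
        return False
    return True
```

Reloading a whole dataset through this mapping (rather than editing BIEN live) keeps the core DB consistent with the "don't edit BIEN live" principle above.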

Metadata

  • document on Redmine site
  • table of data contributors
    • link to reference website, aggregator vs. primary
  • refresh, develop tags/metadata associated w/ data contribution
  • keep in separate table on Redmine site
  • insiders vs. outsiders: access levels
  • table of data providers
  • cycle of handoffs of data policy restrictions
  • dig through several layers of strata to get data usage policies and associate w/ appropriate records in BIEN
  • standards about who is able to provide data to BIEN
  • current sources credible, but if allow public to provide data becomes an issue
  • provide a template the data providers will fill out
  • finite # of contributors
  • look through websites for that info
  • clarify challenges, perceptions, expectations how adhere to data usage policies
  • set example by being attentive to data usage policies
  • relative to data providers
  • preservation and dissemination service
  • counters of downloads by record ID
  • subsetting of data
  • summarizations of data, monitors of usage
  • who owns data? IP law differs between countries
  • no copyright for data in US
  • 15 yrs on data in EU
  • registries: e.g. Index Herbariorum
  • GRBio
  • unique ID in DB
  • get code from Index Herbariorum
  • information that resolves where info is from
  • when click URL, expect go somewhere/get something
  • source of data

Methodologies

  • short list of critical method parameters
    • spatial area circumscribed
  • talk to Bob: VegBank discussion about methods
  • need versioning
  • to what extent should BIEN be live vs. static?
  • correcting an error or adding new data
  • robust versioning
  • regular snapshots
    • quarterly
  • if major corrections, new data -> new version
  • feedback is major added feature
  • motivation for people to submit data
  • quality control pipeline
  • high priority, high value enticement
  • W3C standards for web objects
  • e-mailing the error logs
  • methodology, versioning, conditions of use
  • data policies
    • many different data contributors
    • aggregators
  • GBIF: disclaimer: up to user to check special conditions
  • methodology, complexity described in VegX paper
  • presentation on the Plone site
  • filtered push (James Macklin)
    • AppleCore
  • when annotate record
  • different DB systems, portals
  • consistent syntax for providing feedback to data providers
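
The snapshot-style versioning discussed here (timestamped update field, quarterly releases, new version on major corrections) can be sketched as follows — record structure and field names are assumptions for illustration:

```python
from datetime import date

def snapshot(records, cutoff):
    """Return the latest state of each record as of a release cutoff date,
    so a published result can cite an unambiguous dataset version."""
    latest = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        if rec["updated"] <= cutoff:
            latest[rec["id"]] = rec  # later updates overwrite earlier ones
    return list(latest.values())

records = [
    {"id": 1, "updated": date(2012, 3, 1), "value": "a"},
    {"id": 1, "updated": date(2012, 8, 1), "value": "b"},  # correction
    {"id": 2, "updated": date(2012, 10, 1), "value": "c"},
]
q2_release = snapshot(records, date(2012, 6, 30))
```

Because each release is derived from timestamps rather than copied by hand, regenerating any past quarterly snapshot stays cheap even as the live data changes.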

Authorship group

  • who should be included on papers
  • very interesting, daunting problem of who should be included
  • how to sort through this
  • Jens, Brian McGill, Barbara, Bob
  • who would want to use the data and be authors on it
  • over 50 people involved at various times
  • primary BIEN contact (Brian E) approves request to do a paper
  • if not included on core list, let Brian know
  • profile on core DB
  • login/password
  • who should we contact in addition
  • not have core, non-core separation
  • need way forward, core or not?
  • 20-25 people
  • weight core people by involvement in last 1-2 years
  • can seem arbitrary who's included, also for honorary reasons
  • paper will say how much each author's contribution was
  • discussion on clarifying roles attributed to aspects of authorship
  • 5:15pm Elsie's 4th meeting room
  • 6:30pm group reservation at Chuck's (near Brofy's, Marina)
  • tomorrow morning: group picture
  • homework:
    • what you would prioritize for drafts, analysis
  • last afternoon group meeting
    • future of BIEN
    • next meeting?
    • priorities for 2013
  • marching orders for 2013
  • what to prioritize for Friday
  • trait paper
  • whitepaper
  • prioritize goals, wishlist for metadata group
  • walk through schema
  • after 2:30pm meeting

Potential validations

See attached Potential validations on BIEN data 29 Nov.docx

  • no summary today
  • meet tomorrow at 8:30am