2011 working group, Thursday: BIEN Components
- different classes of validation
- required formats for data to be enterable into the database: db constraints (see the constraint sketch below)
- use cases related to analytical database: what does analyst want to get out of db
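A rough illustration of the db-constraints idea; this is not the actual BIEN schema, the table, column names, and formats are hypothetical, and SQLite stands in for the real backend so the example is self-contained:

```python
# Hypothetical sketch: enforcing "required formats" as database constraints,
# so malformed records are rejected at load time (not real BIEN schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE occurrence (
    occurrence_id   INTEGER PRIMARY KEY,
    scientific_name TEXT NOT NULL,                       -- a name must be present
    latitude        REAL CHECK (latitude  BETWEEN -90  AND 90),
    longitude       REAL CHECK (longitude BETWEEN -180 AND 180),
    event_date      TEXT CHECK (event_date IS NULL OR
                                event_date GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
);
""")

# A record violating the latitude constraint is rejected by the database itself.
try:
    conn.execute("INSERT INTO occurrence VALUES (1, 'Quercus alba', 123.0, -71.1, '2011-05-14')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```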
- user interface in scope?
- more interest from sci community w/ interface
- different from having a web map server serve shapefiles
- huge pipeline to John's map-generating iPlant component
- easy to use interface -> more interest -> more data
- users -> API -> data -> API -> tools, users
- decouple impl from database
- robust core db w/ clean APIs
- UI: tools themselves?
- web-based search tool?
- map to look at locations
- UI implements API
- UI for data upload
- direct MySQL access vs. an API
- need public access point in some form
- API abstracts database backend: whether it's MySQL or Postgres, etc.
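A minimal sketch of what decoupling the public access point from the storage engine could look like; the class and function names are hypothetical, and SQLite stands in for whatever backend (MySQL, Postgres, etc.) is actually used:

```python
# Hypothetical sketch: the API layer depends on an interface, not on a
# specific database engine, so the backend can be swapped without touching
# UIs or tools built on the API.
import sqlite3
from abc import ABC, abstractmethod

class OccurrenceStore(ABC):
    """What the API layer depends on -- not a specific database."""
    @abstractmethod
    def occurrences_by_species(self, name):
        ...

class SQLiteStore(OccurrenceStore):
    def __init__(self, conn):
        self.conn = conn

    def occurrences_by_species(self, name):
        rows = self.conn.execute(
            "SELECT latitude, longitude FROM occurrence WHERE scientific_name = ?",
            (name,))
        return [{"latitude": lat, "longitude": lon} for lat, lon in rows]

def api_get_occurrences(store, species):
    """Public endpoint: talks only to the interface, never to MySQL/Postgres directly."""
    return store.occurrences_by_species(species)

# Demo wiring with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrence (scientific_name TEXT, latitude REAL, longitude REAL)")
conn.execute("INSERT INTO occurrence VALUES ('Quercus alba', 42.3, -71.1)")
print(api_get_occurrences(SQLiteStore(conn), "Quercus alba"))
```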
- range maps: already run and produced as an endpoint
- cron job to produce the analytical db regularly (behind the API)
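A possible shape for that scheduled rebuild, assuming hypothetical table names and an SQLite stand-in for the real backend; cron would simply invoke a script like this:

```python
# Hypothetical sketch of a nightly job that regenerates the analytical
# (reporting) db behind the API. Not BIEN code; table names are made up.
# example crontab line:  0 2 * * *  python3 refresh_analytical.py
import sqlite3
from datetime import date

def refresh_analytical(conn):
    # Rebuild the reporting table from the core data...
    conn.executescript("""
        DROP TABLE IF EXISTS species_summary;
        CREATE TABLE species_summary AS
            SELECT scientific_name,
                   COUNT(*)      AS n_occurrences,
                   MIN(latitude) AS min_lat,
                   MAX(latitude) AS max_lat
            FROM occurrence
            GROUP BY scientific_name;
    """)
    # ...and stamp the build date so each regeneration is an identifiable version.
    conn.execute("CREATE TABLE IF NOT EXISTS analytical_build (built_on TEXT)")
    conn.execute("INSERT INTO analytical_build VALUES (?)", (date.today().isoformat(),))
    conn.commit()
```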
- use cases are what we want to retrieve from db
- timestamping and versioning
- analytical dbs have versions generated at regular intervals
- timestamp and archive download
- continuous taxonomic updating of core db: track changes?
- timestamp as much as possible, but sometimes data is dynamic (GBIF query)
- if people are only getting data through the endpoint, don't need minute-to-minute versioning
- reporting db is from day before (generated daily)
- don't keep old versions of it
- requirement: data used in papers must have stamped versions
- each refresh of the endpoint creates a new version
- do users need to wait to see new data entered?
- allow querying both the live and the snapshotted db (snapshot sketch below)
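One way the live-vs-snapshot idea could work, sketched with hypothetical table names: each refresh can also write a dated copy that stays fixed and citable while the live table keeps changing:

```python
# Hypothetical sketch: keep a citable, timestamped snapshot alongside the
# live data so a paper can reference the exact version it used.
import sqlite3
from datetime import datetime, timezone

def snapshot_table(conn, table="species_summary"):
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d")
    snap = "{}_{}".format(table, stamp)                    # e.g. species_summary_20111015
    conn.execute("CREATE TABLE {} AS SELECT * FROM {}".format(snap, table))
    conn.commit()
    return snap                                            # this dated name is what gets cited

# Queries then target the live table for the newest data, or a snapshot table
# for repeatable, citable results.
```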
- version data points rather than whole db
- e.g. species lat/longs
- track record-by-record changes
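A hypothetical schema sketch for versioning individual data points (e.g. a corrected lat/long) instead of the whole db: every revision of a record is kept, and the current one is flagged:

```python
# Hypothetical record-level versioning sketch (not the BIEN schema):
# corrections append a new version; old versions are never overwritten.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE occurrence_version (
        occurrence_id INTEGER NOT NULL,
        version       INTEGER NOT NULL,
        latitude      REAL,
        longitude     REAL,
        changed_on    TEXT,
        is_current    INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (occurrence_id, version)
    )""")

def correct_coordinates(conn, occ_id, lat, lon, changed_on):
    # Retire the current version, then append the correction as a new version.
    conn.execute("UPDATE occurrence_version SET is_current = 0 "
                 "WHERE occurrence_id = ? AND is_current = 1", (occ_id,))
    (next_ver,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) + 1 FROM occurrence_version "
        "WHERE occurrence_id = ?", (occ_id,)).fetchone()
    conn.execute("INSERT INTO occurrence_version VALUES (?, ?, ?, ?, ?, 1)",
                 (occ_id, next_ver, lat, lon, changed_on))
    conn.commit()

correct_coordinates(conn, 1, 42.3, -71.1, "2011-10-15")   # initial record
correct_coordinates(conn, 1, 42.4, -71.2, "2011-11-02")   # correction; old row kept
```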
- refresh dataset -> auto refresh endpoints
- mirror of core db to query vs products put out every quarter
- need to cite exact version in paper
- having real-time queries to other data sources?
- bandwidth problems
- piping in other data sources challenging if dynamic
- build TNRS into database?
- names can be validated on the fly but then names change from query to query
- sometimes want repeatability, but then can only use snapshotable data
- key elements
  - core db
  - loading modules
  - validation
  - analytical database
  - public access point
  - versioning
- analytical end products are views of db
- not directly in raw data
- data summaries/end products
- raw data vs calculated values
- normalization, aggregation
- derived data products range from raw data to highly derived analytical products (e.g. range maps); see the view sketch below
- user just needs traits, range map as products
- identify commonly desired end products
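A sketch of a derived product kept as a view over the core data rather than stored in the raw records; the table, view, and column names are made up for illustration:

```python
# Hypothetical sketch: an analytical end product as a database view, so the
# normalization/aggregation lives in the db and the raw data stays untouched.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trait_measurement (scientific_name TEXT, trait TEXT, value REAL)")
conn.execute("""
    CREATE VIEW species_trait_means AS
        SELECT scientific_name,
               trait,
               AVG(value) AS mean_value,
               COUNT(*)   AS n_measurements
        FROM trait_measurement
        GROUP BY scientific_name, trait
""")
# A user who "just needs traits" queries the view; the derived values are
# always computed from the raw measurements, never stored in them.
```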
- reasons for derived products
  - versioning
  - performance (range maps take a long time on a personal computer, vs. ~6 hrs on a high-performance machine)
  - convenience
  - repeatability
  - simplifies data distribution UIs
- query builder
  - single table to pick and choose search criteria for what to download
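A sketch of the query-builder idea: the user picks criteria from one flat table of fields and a parameterized query is assembled from them; the field names are hypothetical:

```python
# Hypothetical query-builder sketch: each selectable criterion maps to a SQL
# fragment; unknown fields are ignored so arbitrary text never reaches the SQL.
CRITERIA = {
    "scientific_name": "scientific_name = ?",
    "country":         "country = ?",
    "min_latitude":    "latitude >= ?",
    "max_latitude":    "latitude <= ?",
}

def build_query(selected):
    clauses, params = [], []
    for field, value in selected.items():
        if field in CRITERIA:
            clauses.append(CRITERIA[field])
            params.append(value)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    return "SELECT * FROM occurrence" + where, params

# e.g. build_query({"country": "Peru", "min_latitude": -20})
# -> ('SELECT * FROM occurrence WHERE country = ? AND latitude >= ?', ['Peru', -20])
```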
- relationships among data elements that are not inherent in the data
- info, algs, software
- assembly of info creates more info than component parts
- platform doesn't matter as long as doesn't become obsolete/blocker
- e.g. if MySQL can't do geo, switch
- e.g. TNRS is a scrubbing algorithm
- some of the additional info comes from TNRS and validation: combining existing info with external info
- what to do to ensure user can get range maps
- is validation in scope?
- validation is something data passes through on the way from core data to analytical data
- validation, range mapping are processes applied to data on the way out
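A sketch of validation and name scrubbing as a pass applied on the way from core to analytical data; resolve_name() is a stand-in for a TNRS-style lookup, not the real TNRS API:

```python
# Hypothetical sketch: scrubbing/validation applied to records on their way
# out to the analytical db, leaving the raw core data untouched.
def resolve_name(raw_name):
    """Placeholder for a TNRS-style name resolution; here it only normalizes
    whitespace and capitalization."""
    return " ".join(raw_name.split()).capitalize()

def coordinates_valid(lat, lon):
    return (lat is not None and lon is not None
            and -90 <= lat <= 90 and -180 <= lon <= 180)

def to_analytical(core_records):
    """Yield scrubbed copies of core records; invalid ones are dropped
    (or could be flagged instead)."""
    for rec in core_records:
        if not coordinates_valid(rec.get("latitude"), rec.get("longitude")):
            continue
        yield {**rec, "scientific_name": resolve_name(rec["scientific_name"])}

# e.g. list(to_analytical([{"scientific_name": "  quercus   alba",
#                           "latitude": 42.3, "longitude": -71.1}]))
```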
- priority workflow/timeline diagram: where we are, what plan to produce and when
- TNRS, GNRS also useful for data providers: get something in return for giving us data
- mechanism to give the data provider complete access to their own data
- timeline
- FIA has an FTP site where all their data and metadata can be downloaded
- reacquiring data from data sources is in scope
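A sketch of pulling a provider's published files over FTP; the host, directory, and file names below are placeholders, not FIA's actual addresses:

```python
# Hypothetical sketch of reacquiring a provider's data from a public FTP site.
from ftplib import FTP

def fetch_provider_file(host, remote_dir, filename, local_path):
    with FTP(host) as ftp:                      # anonymous login
        ftp.login()
        ftp.cwd(remote_dir)
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + filename, out.write)

# e.g. fetch_provider_file("ftp.example.org", "/pub/provider_data", "TREE.csv", "TREE.csv")
```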
- use cases -> need metadata