Content management

Informal requirements

  • Need a clear home for "official" data
  • Need clear protocol for adding to holdings
  • Need clear protocol for modifying holdings
  • Need consistent documentation for all sanctioned data resources
  • Need consistent documentation of workflows
  • Documentation must be well-structured to facilitate machine processing
  • Documentation should be human-readable
  • Clear organization of data resources
    • Store data separately from processing code
    • ...but also maintain association between data resource and steps to acquire or create it
    • Gatekeeper to ensure any add/modify/remove actions are deliberate and documented
    • Mechanism to verify holdings on demand
    • Avoid (minimize?) data duplication
  • Version control for:
    • processing scripts
    • metadata
    • other documentation

Proposed components (draft)

Redmine issue tracker

  • listing of all action items
    • things to be investigated
    • things to be decided
    • things to be implemented
    • things to be fixed
    • things to be tested
  • each task entry captures
    • description of task
    • who is doing it
    • estimate of how long it will take
    • estimated % done
    • history of prose discussion and notes

Version control system

  • Options:
    • Subversion (SVN)
      • Pure centralized repository
      • Easier to learn
      • Less flexible
      • Branching and merging are awkward, so we would probably just keep a linear history
    • Git
      • Slightly harder to learn (though basics are about the same)
      • Can have development model whereby:
        • we keep a pristine 'production' branch that only has vetted contents
        • all work happens on 'working' branches
        • designated gatekeeper manages pushes to the production branch (see the command sketch at the end of this section)
        • resource registry points to items on the production branch in the repo
  • Contents
    • all 'production' procedures
    • experimental, in-development work
    • utilities
      • shared code libraries (bits of common functionality, etc)
      • project mgmt scripts (content validation, etc)
    • metadata documents we're creating
    • manuscripts we're writing
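
  A very rough sketch of how the gatekeeper model above could look as git
  commands (the branch names, the 'origin' remote, and the commit message are
  placeholders, not decisions):

  # contributor: start a working branch from the current production state
  git checkout -b work/some-topic production
  # ...edit, then record and publish the working branch for review
  git add .
  git commit -m "draft of some new procedure"
  git push origin work/some-topic

  # gatekeeper: review, merge the vetted work into production, and publish
  git checkout production
  git merge --no-ff work/some-topic
  git push origin production

  Whether pushes to 'production' are restricted by repository permissions or
  only by convention is still to be decided.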

Wiki area

  • Misc notes
  • Links
  • other odds and ends

Organized on-disk data repository

  • Designated, read-only area on our file system
  • Gatekeeper manages adds/deletes/changes
  • Each dataset must have a unique identifier (see data registry document, below)
  • Require an aux subdirectory with each data source, containing:
    1. README.txt file adhering to our metadata format (below)
    2. checksum file or other means of verifying contents
      • md5sums are probably the best option, though they could be slow for large data holdings
      • for speed, could also compare by basename+size+modtime
    3. [ if external data ] other metadata, pubs, or other usage documents obtained from provider
      • we should have a 'verify-data' script that can operate on this (see the sketch at the end of this section)
      • archival bundle of acquisition/creation/verification workflow
  • probably fairly flat directory hierarchy?
    • instead rely on metadata documents for categorizing, filtering, etc.
    • ...this should make paths more stable
    • maybe just top level split into:
      1. source data (acquired by us; irreproducible by us)
      2. intermediate data (produced by us but not intended for distribution)
      3. products (produced by us and intended for distribution)
  • useful to think about the possible data generation processes?
    • acquisition: obtained from elsewhere
    • transformation: {single input} -> {single output}
    • derivation: {multiple inputs} -> {single output}
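
  A minimal sketch of the 'verify-data' idea, assuming checksums live in an
  aux/MD5SUMS file inside each dataset directory (the dataset path and file
  names here are made up for illustration):

  # gatekeeper records checksums when a dataset is added to the repository area
  cd /data-repo/source/example-dataset
  find . -type f -not -path './aux/*' -exec md5sum {} + > aux/MD5SUMS

  # anyone can later verify the holdings on demand (or from a cron job)
  cd /data-repo/source/example-dataset
  md5sum --check --quiet aux/MD5SUMS && echo "contents OK" || echo "MISMATCH"

  The same script could fall back to the basename+size+modtime comparison
  mentioned above when full checksumming is too slow.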

Resource registry

  • serves as definitive listing of sanctioned resources
  • contains an entry for every important...
    • dataset
      • the data must be in our sanctioned data repository
    • procedure (i.e., something that produces output of value)
      • the procedure (script, or prose protocol as second option) must be in our VCS
      • maybe also exported to disk 'near' data? (with checksum)
    • possibly critical, expensive-to-reproduce reports and other outputs?
  • registration is intended to provide a contract that:
    • the resource is what its document says it is
    • the resource exists where its document says it is
    • the resource in that location has not changed since it was registered
      • thus its relationships to other resources remain accurate as stated
  • allow decoupling of dataset identity and location
    • scripts could look up locations here rather than using hard-coded paths?
  • can run a daily(?) script to check the validity of each entry document (sketched at the end of this section)
    • is it well structured?
    • does the content on disk match?
  • each registry entry is a document with certain minimal information
    • unique id for the document itself
    • unique id for the resource it describes
    • current location of authoritative copy (i.e. in our sanctioned data 'repository' area)
    • short name
    • description
    • related resources
      • for data, id of dataset(s) from which it was created, if any
      • for procs, id of input dataset(s) and output dataset(s)
  • Links:
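
  A sketch of what the daily validity check mentioned above could look like,
  assuming one plain-text registry entry per file with a 'location:' field and
  the aux/MD5SUMS convention from the data repository section (the entry
  format, field names, and paths are all still to be decided):

  # check every registry entry: does its location exist, and is it unchanged?
  for entry in registry/*.txt; do
      loc=$(awk -F': *' '$1 == "location" {print $2}' "$entry")
      if [ ! -d "$loc" ]; then
          echo "BROKEN: $entry points at missing $loc"
      elif ! (cd "$loc" && md5sum --check --quiet aux/MD5SUMS); then
          echo "CHANGED: contents of $loc no longer match registration"
      fi
  done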

Text-based metadata document format

  • Not necessarily full metadata, just what's relevant to the layer production pipeline
  • Possible components (see the mock-up at the end of this section):
    • Short title
    • Description
    • [ if external data ] Explicit reference(s) to source provider(s)/URLs and any acquisition scripts and/or related details
    • [ if derived data ] Explicit reference(s) to input data (using IDs as given in registry)
    • Contributors
    • Tags
    • File format
    • Spatial coverage
    • Spatial resolution
    • Temporal coverage
    • Temporal resolution
    • Variables?
    • Usage tags:
      • data source
      • validation
      • testing
      • not currently used
    • Status tags:
      • complete
      • unknown
      • unvalidated
      • missing data
      • obsolete
    • Revision history entries
      • What was changed or otherwise done
      • Who did it
      • When
  • Things we can get from this [write some scripts to do this!]:
    • filtered report of available datasets based on various criteria
    • lineage of any given dataset w.r.t. other data (both upstream and downstream)
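
  A purely illustrative mock-up of such a README.txt (every field value below
  is made up, and the exact syntax, e.g. simple key: value lines vs. something
  more formal like YAML, is still an open question):

  title:               Example derived temperature layer
  description:         hypothetical 1-km mean annual temperature grid, shown
                       only to illustrate the proposed fields
  inputs:              DS-0042    (registry id of the made-up source dataset)
  contributors:        A. Person
  tags:                climate, temperature
  format:              GeoTIFF
  spatial-coverage:    global
  spatial-resolution:  30 arc-seconds
  temporal-coverage:   1950-2000
  temporal-resolution: climatological mean
  usage:               data source
  status:              unvalidated
  history:
    YYYY-MM-DD  created from DS-0042  (A. Person)

  With a simple line-oriented format like this, the filtered reports could
  start out as little more than grep, e.g.
  grep -l 'status: *obsolete' */aux/README.txt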

Organized workflow repository

  • Designated, read-only area on our file system

SVN->Git repo migration

# dump the svn repo
svnadmin dump /var/code/nceas-projects/environment-and-organisms > e-and-o.svn.dump

# use filter to get rid of marine directories (now stored elsewhere)
svndumpfilter include /terrestrial --drop-empty-revs --renumber-revs < e-and-o.svn.dump > terrestrial.svn.dump
## ...manually edit terrestrial.svn.dump to remove the initial terrestrial/ node
##    addition and strip the leading 'terrestrial/' from subsequent paths

# create new svn repo and load the hand-edited dump file
svnadmin create layers-svn
svnadmin load layers-svn < terrestrial.svn.dump-edited

# create a users.txt file to identify all unique committers like so:
#    
# user1 = First Last <user@address.com>
# user2 = ...
#
# now git-svn clone the svn repo
git svn clone file://`readlink -f layers-svn` --no-metadata -A users.txt layers-tmp

# finally, clone this to get a repo devoid of git-svn cruft
git clone layers-tmp/ layers
# ...and remove reference to git-svn origin
cd layers
git remote rm origin
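
# optionally, sanity-check the conversion: the git commit count should roughly
# match the youngest revision in the filtered svn repo (git-svn skips empty
# revisions, so the two numbers may differ slightly)
git rev-list --count HEAD
svnlook youngest ../layers-svn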