Content management

Informal requirements

  • Need a clear home for "official" data
  • Need clear protocol for adding to holdings
  • Need clear protocol for modifying holdings
  • Need consistent documentation for all sanctioned data resources
  • Need consistent documentation of workflows
  • Documentation must be well-structured to facilitate machine processing
  • Documentation should be human-readable
  • Clear organization of data resources
    • Store data separately from processing code
    • ...but also maintain association between data resource and steps to acquire or create it
    • Gatekeeper to ensure any add/modify/remove actions are deliberate and documented
    • Mechanism to verify holdings on demand
    • Avoid (minimize?) data duplication
  • Version control for:
    • processing scripts
    • metadata
    • other documentation

Proposed components (draft)

Redmine issue tracker

  • listing of all action items
    • things to be investigated
    • things to be decided
    • things to be implemented
    • things to be fixed
    • things to be tested
  • each task entry captures
    • description of task
    • who is doing it
    • estimate of how long it will take
    • estimated % done
    • history of prose discussion and notes

Version control system

  • Options:
    • Subversion (SVN)
      • Pure centralized repository
      • Easier to learn
      • Less flexible
      • Branching and merging are awkward, so we would probably just keep a linear history
    • Git
      • Slightly harder to learn (though basics are about the same)
      • Can have development model whereby:
        • we keep a pristine 'production' branch that only has vetted contents
        • all work happens on 'working' branches
        • designated gatekeeper manages pushes to the production branch (see the command sketch at the end of this section)
        • resource registry points to items on the production branch in the repo
  • Contents
    • all 'production' procedures
    • experimental, in-development work
    • utilities
      • shared code libraries (bits of common functionality, etc)
      • project mgmt scripts (content validation, etc)
    • metadata documents we're creating
    • manuscripts we're writing
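
  A very rough sketch of how the gatekeeper model above could look as git
  commands (the branch names, the 'origin' remote, and the commit message are
  placeholders, not decisions):

  # contributor: start a working branch from the current production state
  git checkout -b work/some-topic production
  # ...edit, then record and publish the working branch for review
  git add .
  git commit -m "draft of some new procedure"
  git push origin work/some-topic

  # gatekeeper: review, merge the vetted work into production, and publish
  git checkout production
  git merge --no-ff work/some-topic
  git push origin production

  Whether pushes to 'production' are restricted by repository permissions or
  only by convention is still to be decided.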

Wiki area

  • Misc notes
  • Links
  • other odds and ends

Organized on-disk data repository

  • Designated, read-only area on our file system
  • Gatekeeper manages adds/deletes/changes
  • Each dataset must have a unique identifier (see data registry document, below)
  • Require an aux subdirectory with each data source, containing:
    1. README.txt file adhering to our metadata format (below)
    2. checksum file or other means of verifying contents
      • md5sums are probably the best option, though they could be slow for large data holdings
      • for speed, could also compare by basename+size+modtime
    3. [ if external data ] other metadata, pubs, or other usage documents obtained from provider
      • we should have a 'verify-data' script that can operate on this (see the sketch at the end of this section)
      • archival bundle of acquisition/creation/verification workflow
  • probably fairly flat directory hierarchy?
    • instead rely on metadata documents for categorizing, filtering, etc.
    • ...this should make paths more stable
    • maybe just top level split into:
      1. source data (acquired by us; irreproducible by us)
      2. intermediate data (produced by us but not intended for distribution)
      3. products (produced by us and intended for distribution)
  • useful to think about the possible data generation processes?
    • acquisition: obtained from elsewhere
    • transformation: {single input} -> {single output}
    • derivation: {multiple inputs} -> {single output}
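
  A minimal sketch of the 'verify-data' idea, assuming checksums live in an
  aux/MD5SUMS file inside each dataset directory (the dataset path and file
  names here are made up for illustration):

  # gatekeeper records checksums when a dataset is added to the repository area
  cd /data-repo/source/example-dataset
  find . -type f -not -path './aux/*' -exec md5sum {} + > aux/MD5SUMS

  # anyone can later verify the holdings on demand (or from a cron job)
  cd /data-repo/source/example-dataset
  md5sum --check --quiet aux/MD5SUMS && echo "contents OK" || echo "MISMATCH"

  The same script could fall back to the basename+size+modtime comparison
  mentioned above when full checksumming is too slow.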

Resource registry

  • serves as definitive listing of sanctioned resources
  • contains an entry for every important...
    • dataset
      • the data must be in our sanctioned data repository
    • procedure (i.e., something that produces output of value)
      • the procedure (script, or prose protocol as second option) must be in our VCS
      • maybe also exported to disk 'near' data? (with checksum)
    • possibly critical, expensive-to-reproduce reports and other outputs?
  • registration is intended to provide a contract that:
    • the resource is what its document says it is
    • the resource exists where its document says it is
    • the resource in that location has not changed since it was registered
      • thus its relationships to other resources remain accurate as stated
  • allow decoupling of dataset identity and location
    • scripts could look up locations here rather than using hard-coded paths?
  • can run a daily(?) script to check the validity of each entry document (sketched at the end of this section)
    • is it well structured?
    • does the content on disk match?
  • each registry entry is a document with certain minimal information
    • unique id for the document itself
    • unique id for the resource it describes
    • current location of authoritative copy (i.e. in our sanctioned data 'repository' area)
    • short name
    • description
    • related resources
      • for data, id of dataset(s) from which it was created, if any
      • for procs, id of input dataset(s) and output dataset(s)
  • Links:
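
  A sketch of what the daily validity check mentioned above could look like,
  assuming one plain-text registry entry per file with a 'location:' field and
  the aux/MD5SUMS convention from the data repository section (the entry
  format, field names, and paths are all still to be decided):

  # check every registry entry: does its location exist, and is it unchanged?
  for entry in registry/*.txt; do
      loc=$(awk -F': *' '$1 == "location" {print $2}' "$entry")
      if [ ! -d "$loc" ]; then
          echo "BROKEN: $entry points at missing $loc"
      elif ! (cd "$loc" && md5sum --check --quiet aux/MD5SUMS); then
          echo "CHANGED: contents of $loc no longer match registration"
      fi
  done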

Text-based metadata document format

  • Not necessarily full metadata, just what's relevant to the layer production pipeline
  • Possible components (see the mock-up at the end of this section):
    • Short title
    • Description
    • [ if external data ] Explicit reference(s) to source provider(s)/URLs and any acquisition scripts and/or related details
    • [ if derived data ] Explicit reference(s) to input data (using IDs as given in registry)
    • Contributors
    • Tags
    • File format
    • Spatial coverage
    • Spatial resolution
    • Temporal coverage
    • Temporal resolution
    • Variables?
    • Usage tags:
      • data source
      • validation
      • testing
      • not currently used
    • Status tags:
      • complete
      • unknown
      • unvalidated
      • missing data
      • obsolete
    • Revision history entries
      • What was changed or otherwise done
      • Who did it
      • When
  • Things we can get from this [write some scripts to do this!]:
    • filtered report of available datasets based on various criteria
    • lineage of any given dataset w.r.t. other data (both upstream and downstream)
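
  A purely illustrative mock-up of such a README.txt (every field value below
  is made up, and the exact syntax, e.g. simple key: value lines vs. something
  more formal like YAML, is still an open question):

  title:               Example derived temperature layer
  description:         hypothetical 1-km mean annual temperature grid, shown
                       only to illustrate the proposed fields
  inputs:              DS-0042    (registry id of the made-up source dataset)
  contributors:        A. Person
  tags:                climate, temperature
  format:              GeoTIFF
  spatial-coverage:    global
  spatial-resolution:  30 arc-seconds
  temporal-coverage:   1950-2000
  temporal-resolution: climatological mean
  usage:               data source
  status:              unvalidated
  history:
    YYYY-MM-DD  created from DS-0042  (A. Person)

  With a simple line-oriented format like this, the filtered reports could
  start out as little more than grep, e.g.
  grep -l 'status: *obsolete' */aux/README.txt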

Organized workflow repository

  • Designated, read-only area on our file system

SVN->Git repo migration

# dump the svn repo
svnadmin dump /var/code/nceas-projects/environment-and-organisms > e-and-o.svn.dump

# use filter to get rid of marine directories (now stored elsewhere)
svndumpfilter include /terrestrial --drop-empty-revs --renumber-revs < e-and-o.svn.dump > terrestrial.svn.dump
## ...manually edit terrestrial.svn.dump to remove the initial terrestrial/ node
##    addition and strip the leading 'terrestrial/' from subsequent paths

# create new svn repo and load the hand-edited dump file
svnadmin create layers-svn
svnadmin load layers-svn < terrestrial.svn.dump-edited

# create a users.txt file to identify all unique committers like so:
#    
# user1 = First Last <user@address.com>
# user2 = ...
#
# now git-svn clone the svn repo
git svn clone file://`readlink -f layers-svn` --no-metadata -A users.txt layers-tmp

# finally, clone this to get a repo devoid of git-svn cruft
git clone layers-tmp/ layers
# ...and remove reference to git-svn origin
cd layers
git remote rm origin
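
# optionally, sanity-check the conversion: the git commit count should roughly
# match the youngest revision in the filtered svn repo (git-svn skips empty
# revisions, so the two numbers may differ slightly)
git rev-list --count HEAD
svnlook youngest ../layers-svn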