Content management¶
Informal requirements¶
- Need a clear home for "official" data
- Need clear protocol for adding to holdings
- Need clear protocol for modifying holdings
- Need consistent documentation for all sanctioned data resources
- Need consistent documentation of workflows
- Documentation must be well-structured to facilitate machine processing
- Documentation should be human-readable
- Clear organization of data resources
- Store data separately from processing code
- ...but also maintain association between data resource and steps to acquire or create it
- Gatekeeper to ensure any add/modify/remove actions are deliberate and documented
- Mechanism to verify holdings on demand
- Avoid (minimize?) data duplication
- Version control for:
- processing scripts
- metadata
- other documentation
Proposed components (draft)¶
Redmine issue tracker¶
- listing of all action items
- things to be investigated
- things to be decided
- things to be implemented
- things to be fixed
- things to be tested
- each task entry captures
- description of task
- who is doing it
- estimate of how long it will take
- estimated % done
- history of prose discussion and notes
Version control system¶
- Options:
- Subversion (SVN)
- Pure centralized repository
- Easier to learn
- Less flexible
- Branching is hard, so we would probably just keep a linear history
- Git
- Slightly harder to learn (though basics are about the same)
- Can have a development model (sketched at the end of this section) whereby:
- we keep a pristine 'production' branch that only has vetted contents
- all work happens on 'working' branches
- designated gatekeeper manages pushes to the production branch
- resource registry points to items on the production branch in the repo
- Contents
- all 'production' procedures
- experimental, in-development work
- utilities
- shared code libraries (bits of common functionality, etc)
- project mgmt scripts (content validation, etc)
- metadata documents we're creating
- manuscripts we're writing
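A minimal sketch of that gatekeeper flow, assuming the pristine branch is named 'production' and work is published to a shared 'origin' remote (the working-branch name here is a placeholder):

```sh
# contributor: start a working branch off the pristine branch
git checkout -b working/my-task production
# ...commit changes, then publish the branch for review
git push origin working/my-task

# gatekeeper: after vetting, merge into production and publish
git checkout production
git merge --no-ff working/my-task
git push origin production
```

The --no-ff merge keeps each vetted batch of work visible as a single merge commit on the production history.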
Wiki area¶
- Misc notes
- Links
- other odds and ends
Organized on-disk data repository¶
- Designated, read-only area on our file system
- Gatekeeper manages adds/deletes/changes
- Each dataset must have a unique identifier (see data registry document, below)
- Require an 'aux' subdirectory with each data source, containing:
- README.txt file adhering to our metadata format (below)
- checksum file or other means of verifying contents
- md5sums are probably the best option, though they could be slow for large data holdings
- for speed, could also compare by basename+size+modtime
- [ if external data ] other metadata, pubs, or other usage documents obtained from provider
- we should have a 'verify-data' script that can operate on this (see the sketch at the end of this section)
- archival bundle of acquisition/creation/verification workflow
- probably fairly flat directory hierarchy?
- instead rely on metadata documents for categorizing, filtering, etc.
- ...this should mean more stability of paths
- maybe just top level split into:
- source data (acquired by us; irreproducible by us)
- intermediate data (produced by us but not intended for distribution)
- products (produced by us and intended for distribution)
- useful to think about the possible data generation processes?
- acquisition: from elsewhere
- transformation: {single input} -> {single output}
- derivation: {multiple inputs} -> {single output}
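A minimal sketch of the 'verify-data' idea mentioned above, assuming each dataset keeps an aux/checksums.md5 file (that filename and the repository root are placeholders, not decided):

```sh
#!/bin/sh
# verify-data: walk the data repository and confirm each dataset still
# matches its recorded md5 checksums. Assumes paths contain no whitespace.
DATA_ROOT=${1:-/data/holdings}   # placeholder repository root

status=0
for aux in $(find "$DATA_ROOT" -type d -name aux); do
    dataset=$(dirname "$aux")
    if [ ! -f "$aux/checksums.md5" ]; then
        echo "NO CHECKSUM FILE: $dataset"
        status=1
    # md5sum -c re-reads every file; the faster (weaker) check would
    # compare basename+size+modtime instead, as noted above
    elif ! (cd "$dataset" && md5sum --check --quiet aux/checksums.md5); then
        echo "VERIFY FAILED: $dataset"
        status=1
    fi
done
exit $status
```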
Resource registry¶
- serves as definitive listing of sanctioned resources
- contains an entry for every important...
- dataset
- the data must be in our sanctioned data repository
- procedure (i.e., something that produces output of value)
- the procedure (script, or prose protocol as second option) must be in our VCS
- maybe also exported to disk 'near' data? (with checksum)
- possibly critical, expensive-to-reproduce reports and other outputs?
- registration is intended to provide a contract that:
- the resource is what its document says it is
- the resource exists where its document says it is
- the resource in that location has not changed since it was registered
- thus its relationships to other resources are accurate as stated
- allow decoupling of dataset identity and location
- scripts could look up locations here rather than using hard-coded paths?
- can run a daily(?) script to check the validity of each document
- is it well structured?
- does the content on disk match?
- each registry entry is a document with certain minimal information (see the example at the end of this section)
- unique id for the document itself
- unique id for the resource it describes
- current location of authoritative copy (i.e. in our sanctioned data 'repository' area)
- short name
- description
- related resources
- for data, id of dataset(s) from which it was created, if any
- for procs, id of input dataset(s) and output dataset(s)
- Links:
- JSON Schema
- MongoDB
- Graph visualization
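As a sketch, a registry entry document might look like the JSON below; every field name and id is an illustrative placeholder rather than a settled schema. The daily script would validate each document against a JSON Schema and then compare the described resource with what is actually on disk:

```json
{
  "document_id": "reg-000123",
  "resource_id": "ds-landcover-example",
  "resource_type": "dataset",
  "location": "/data/holdings/source/landcover-example",
  "short_name": "landcover-example",
  "description": "Placeholder description of a registered dataset.",
  "related_resources": {
    "derived_from": [],
    "produced_by": ["proc-acquire-landcover"]
  }
}
```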
Text-based metadata document format¶
- Not necessarily full metadata, just what's relevant to the layer production pipeline
- Possible components (see the example document at the end of this section):
- Short title
- Description
- [ if external data ] Explicit reference(s) to source provider(s)/URLs and any acquisition scripts and/or related details
- [ if derived data ] Explicit reference(s) to input data (using IDs as given in registry)
- Contributors
- Tags
- File format
- Spatial coverage
- Spatial resolution
- Temporal coverage
- Temporal resolution
- Variables?
- Usage tags:
- data source
- validation
- testing
- not currently used
- Status tags:
- complete
- unknown
- unvalidated
- missing data
- obsolete
- Revision history entries
- What was changed or otherwise done
- Who did it
- When
- Things we can get from this [write some scripts to do this!]:
- filtered report of available datasets based on various criteria
- lineage of any given dataset w.r.t. other data (both upstream and downstream)
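A sketch of a README.txt in this format; every value below is a made-up placeholder:

```
Short title:         Example land-cover layer
Description:         Placeholder description of the dataset and its role
                     in the layer production pipeline.
Source:              [if external] provider name/URL plus acquisition script
Inputs:              [if derived] registry ids of input dataset(s)
Contributors:        A. Person, B. Person
Tags:                land-cover, raster
File format:         GeoTIFF
Spatial coverage:    global
Spatial resolution:  30 m
Temporal coverage:   2001
Temporal resolution: single snapshot
Usage tags:          data source
Status tags:         unvalidated

Revision history:
  2011-01-01  A. Person  initial acquisition from provider
```

Keeping the format to simple 'key: value' lines should satisfy both the machine-processing and human-readability requirements above.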
Organized workflow repository¶
- Designated, read-only area on our file system
SVN->Git repo migration¶
```sh
# dump the svn repo
svnadmin dump /var/code/nceas-projects/environment-and-organisms > e-and-o.svn.dump

# use svndumpfilter to drop the marine directories (now stored elsewhere)
svndumpfilter include /terrestrial --drop-empty-revs --renumber-revs \
    < e-and-o.svn.dump > terrestrial.svn.dump

## ...manually edit terrestrial.svn.dump to remove the initial terrestrial/
## node addition and subsequent leading appearances of 'terrestrial/' in paths

# create a new svn repo and load the hand-edited dump file
svnadmin create layers-svn
svnadmin load layers-svn < terrestrial.svn.dump-edited

# create a users.txt file identifying all unique committers, like so:
#
#   user1 = First Last <user@address.com>
#   user2 = ...

# now git-svn clone the svn repo
git svn clone file://`readlink -f layers-svn` --no-metadata -A users.txt layers-tmp

# finally, clone this to get a repo devoid of git-svn cruft
git clone layers-tmp/ layers

# ...and remove the reference to the git-svn origin
cd layers
git remote rm origin
```