I also think that this is a great idea, and as you described it, I think it's feasible as a stand-alone Galaxy tool.
Would you eventually consider implementing this as a data manager (https://wiki.galaxyproject.org/Admin/Tools/DataManagers)?


On 23 August 2014 03:24, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
We are about to implement a fasta database (file) versioning system as a Galaxy tool.  I wanted to get interested people's feedback first before we roll ahead with the prototype implementation.  The versioning system aims to:

* Enable reproducible research: To recreate a search result at a certain point in time we need versioning so that search and mapping tools can look at sequence reference databases corresponding to a particular past date.  This recall can also explain the difference between what was known in the past vs. currently.

* Reduce hard drive usage.  Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is, say, 670 MB + 668 MB + 665 MB + ....  But occasionally we want to access past archives fairly quickly.

* Integrate database versioning into Galaxy without adding a lot of complexity.

A bonus would be to enable the efficient sharing of version databases between computers/servers.

The solution we think would work centres around a "Versioned Data Retrieval" tool (draft image attached) that would work as follows:

1) User selects from a list of databases provided by  "Shared Data > Data Libraries > Versioned Data".
  - Each database has a master file that keeps its various versions as a list of time-stamped insert/delete transactions of key-value pairs (key: fasta id; value: description & sequence).
  - Each master file is managed outside of galaxy via a triggered process on regular fasta file imports from data sources like NCBI or other niche sources.
  - We're expecting, due to the nature of fasta archived sequence updates, that our master file would only be about 1.1x the latest version in size (uncompressed).
2) User enters date / version id to retrieve (validated)
3) If a cached version of that database exists, it is linked into user's history.
4) Otherwise a new version of it is created, placed in cache, and linked into history.
  - The cached version itself then shows up as linked data under a Data Library > Versioned Data subfolder.
5) User can select preconfigured workflow(s) to execute on the selected retrieved fasta file to regenerate any database products they need.
  - Workflow output data would also be cached in the same way the fasta data is - by linking the Galaxy Data Library to it.
  - Workflow execution will be skipped if end data already exists in cache.
  - Simple makeblastdb or bowtie-build commands, or more specific workflows that include dustmasker etc can be implemented.
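
To make steps 1 to 4 concrete, here is a minimal sketch of how a version could be rebuilt from a transaction log and then cached. It is an illustration under assumptions, not the planned implementation: the Transaction record, the replay/to_fasta/retrieve function names, the ISO date strings, and the in-memory dict standing in for the cache are all hypothetical, and nothing here touches the Galaxy API or the actual master file format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Record = Tuple[str, str]  # (description, sequence); hypothetical layout


@dataclass(frozen=True)
class Transaction:
    date: str                       # ISO date string, e.g. "2014-07-01"
    op: str                         # "insert" or "delete"
    key: str                        # fasta id
    value: Optional[Record] = None  # None for deletes


def replay(log: Iterable[Transaction], as_of: str) -> Dict[str, Record]:
    """Rebuild the database contents as they stood on the date `as_of`."""
    records: Dict[str, Record] = {}
    # ISO dates sort chronologically as plain strings; sorted() is stable,
    # so same-day transactions keep their order in the log.
    for tx in sorted(log, key=lambda t: t.date):
        if tx.date > as_of:
            break
        if tx.op == "insert":
            records[tx.key] = tx.value
        elif tx.op == "delete":
            records.pop(tx.key, None)
        else:
            raise ValueError(f"unknown operation: {tx.op!r}")
    return records


def to_fasta(records: Dict[str, Record]) -> str:
    """Serialise records as fasta text, ordered by id."""
    return "".join(
        f">{key} {desc}\n{seq}\n" for key, (desc, seq) in sorted(records.items())
    )


def retrieve(cache: Dict[str, str], log: Iterable[Transaction], as_of: str) -> str:
    """Steps 3 and 4: reuse a cached version if present, else build and cache it."""
    if as_of not in cache:
        cache[as_of] = to_fasta(replay(log, as_of))
    return cache[as_of]
```

In this sketch a changed sequence would be logged as a delete followed by an insert of the same id, so the log grows only by what changed between imports, which is the property the 1.1x size estimate relies on.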

Does this sound attractive?

We're hoping such a vision could handle fasta databases from 12 MB to e.g. 200 GB (probably requires makeblastdb in parallel at that scale).

Preliminary work suggests this project is doable via the Galaxy API without galaxy customization - does that sound right?!

Feedback really appreciated!

Regards,

Damion Dooley

Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada