Re: [galaxy-dev] Concept for a Galaxy Versioned Fasta Data Retrieval Tool

23 Aug 2014

Hi Damion,

The idea sounds fantastic!
Can we go a step further and use a specific datatype that keeps entire
FASTA files versioned and lets the user choose which version to use, in
any tool? Please have a look at my talk at GCC2012. Maybe you are
interested in the (old) patches. I would be very interested in
restarting this old project.

https://wiki.galaxyproject.org/Events/GCC2012/Abstracts#Keeping_Track_of_Lif...

On 23.08.2014 at 03:24, Dooley, Damion wrote:
...
We are about to implement a fasta database (file) versioning system as a Galaxy tool.  I wanted to get interested people's feedback first before we roll ahead with the prototype implementation.  The versioning system aims to:
* Enable reproducible research: To recreate a search result at a certain point in time we need versioning so that search and mapping tools can look at sequence reference databases corresponding to a particular past date.  This recall can also explain the difference between what was known in the past vs. currently.
* Reduce hard drive space.  Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is, say, 670 MB + 668 MB + 665 MB + ...  But occasionally we want to access past archives fairly quickly.
* Integrate database versioning into Galaxy without adding a lot of complexity.
A bonus would be to enable the efficient sharing of version databases between computers/servers.
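
To put rough numbers on the space argument, here is a back-of-the-envelope sketch. It assumes every monthly 16S release is about 670 MB (a slight overestimate, since older releases are a little smaller) and uses the roughly 1.1x master-file estimate given further down in this message. The variable names are made up for illustration.

```python
# Rough storage comparison using the approximate figures quoted in this message.
MONTHS = 5 * 12          # 5 years of monthly updates
SNAPSHOT_MB = 670        # assumed size of one 16S release (upper bound)
MASTER_FACTOR = 1.1      # estimated master-file size relative to the latest version

# Option 1: keep every monthly file as a full copy.
full_copies_mb = MONTHS * SNAPSHOT_MB

# Option 2: keep a single master file of insert/delete transactions.
master_file_mb = MASTER_FACTOR * SNAPSHOT_MB

print(f"full copies: ~{full_copies_mb / 1000:.1f} GB")
print(f"master file: ~{master_file_mb / 1000:.2f} GB")
```

Under these assumptions, full copies come to about 40 GB while the master file stays under 1 GB.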
The solution we think would work centres around a "Versioned Data Retrieval" tool (draft image attached) that would work as follows:
1) User selects from a list of databases provided by "Shared Data > Data Libraries > Versioned Data".
   - Each database has a master file that keeps its various versions as a list of time-stamped insert/delete transactions of key (fasta id) / value (description & sequence) pairs.
   - Each master file is managed outside of Galaxy via a triggered process on regular fasta file imports from data sources like NCBI or other niche sources.
   - We're expecting, due to the nature of fasta archived sequence updates, that our master file would only be about 1.1x the latest version in size (uncompressed).
2) User enters date / version id to retrieve (validated)
3) If a cached version of that database exists, it is linked into user's history.
4) Otherwise a new version of it is created, placed in cache, and linked into history.
   - The cached version itself then shows up as linked data under a Data Library > Versioned Data subfolder.
5) User can select preconfigured workflow(s) to execute on the selected retrieved fasta file to regenerate any database products they need.
   - Workflow output data would also be cached in the same way the fasta data is - by linking the Galaxy Data Library to it.
   - Workflow execution will be skipped if end data already exists in cache.
   - Simple makeblastdb or bowtie-build commands, or more specific workflows that include dustmasker etc can be implemented.
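
The retrieval steps above (validate the date, reuse a cached version if present, otherwise rebuild it from the master file) can be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory transaction tuples, the function names and the one-file-per-version cache layout are all hypothetical, since the on-disk master file format is not specified here.

```python
import os
from datetime import date

# Hypothetical in-memory form of a master file: (date, operation, fasta id, value),
# sorted by date. The real master file format is not defined in this message.
TRANSACTIONS = [
    (date(2014, 1, 1), "insert", "seq1", ("first entry", "ACGT")),
    (date(2014, 1, 1), "insert", "seq2", ("second entry", "GGCC")),
    (date(2014, 2, 1), "delete", "seq2", None),
    (date(2014, 2, 1), "insert", "seq3", ("third entry", "TTAA")),
]

def resolve_version(transactions, as_of):
    """Step 2: map a requested date to the latest version on or before it."""
    dates = [when for when, *_ in transactions if when <= as_of]
    if not dates:
        raise ValueError(f"no version exists on or before {as_of}")
    return max(dates)

def replay(transactions, version):
    """Rebuild {fasta_id: (description, sequence)} as of the given version date."""
    records = {}
    for when, op, fasta_id, value in transactions:
        if when > version:
            break  # relies on the transactions being sorted by date
        if op == "insert":
            records[fasta_id] = value
        elif op == "delete":
            records.pop(fasta_id, None)
    return records

def to_fasta(records):
    """Serialize records as FASTA text, one header line and one sequence line each."""
    return "".join(f">{fid} {desc}\n{seq}\n" for fid, (desc, seq) in records.items())

def retrieve(transactions, as_of, cache_dir):
    """Steps 3 and 4: return (path, was_cached) for the requested date."""
    version = resolve_version(transactions, as_of)
    path = os.path.join(cache_dir, f"{version.isoformat()}.fasta")
    if os.path.exists(path):
        return path, True           # step 3: cached version is reused
    with open(path, "w") as handle:  # step 4: build it and place it in the cache
        handle.write(to_fasta(replay(transactions, version)))
    return path, False
```

Keying the cache on the resolved version date rather than on the requested date means two requests that fall between the same pair of updates share one cached file. Linking the file into a Galaxy history or data library is left out of this sketch.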
Does this sound attractive?
I think all of the use cases are covered by the old project mentioned
above. However, I did not create a new tool; I created a new 'select
type' that everyone can use in all tools. It used git underneath (yeah,
I have the entire PDB in git and it is working fine :)), but we can
probably replace git with a database if you like.
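
For illustration only (this is not the code from the old patches), fetching a file as it was at a given date from a git repository could look roughly like the sketch below. The function names, repository path and file path are hypothetical; the git commands themselves (`git rev-list -1 --before=<date>` and `git show <rev>:<path>`) are standard. Note that `--before` filters on commit date, so this assumes each import was committed at the time of the data release.

```python
import subprocess

def rev_before_cmd(repo, iso_date, branch="HEAD"):
    """Command that prints the newest commit on `branch` dated on or before `iso_date`."""
    return ["git", "-C", repo, "rev-list", "-1", f"--before={iso_date}", branch]

def show_cmd(repo, rev, path):
    """Command that prints the content of `path` as stored in commit `rev`."""
    return ["git", "-C", repo, "show", f"{rev}:{path}"]

def fasta_at(repo, path, iso_date):
    """Return the text of `path` as of `iso_date` (requires git on the PATH)."""
    rev = subprocess.run(
        rev_before_cmd(repo, iso_date),
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    if not rev:
        raise LookupError(f"no commit on or before {iso_date}")
    return subprocess.run(
        show_cmd(repo, rev, path),
        check=True, capture_output=True, text=True,
    ).stdout
```

A call such as `fasta_at("/data/pdb-repo", "pdb.fasta", "2014-01-31")` (paths made up) would then return the file as of the last commit in January 2014.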

To answer your question: Yes, very attractive!
...
We're hoping such a vision could handle Fasta databases from 12 MB to, e.g., 200 GB (probably requires makeblastdb in parallel at that scale).
Preliminary work suggests this project is doable via the Galaxy API without Galaxy customization - does that sound right?!
Yes, as long as the user has an API key.
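
As a small sketch of what that means in practice: the Galaxy API accepts the key as a `key` query parameter, so listing the data libraries is a plain GET request. The helper name, server URL and key below are placeholders, not anything from the proposed tool.

```python
from urllib.parse import urlencode, urljoin

def libraries_url(galaxy_url, api_key):
    """Build the URL that lists data libraries, authenticated with an API key."""
    base = galaxy_url.rstrip("/") + "/"
    return urljoin(base, "api/libraries") + "?" + urlencode({"key": api_key})

# Placeholder server and key; fetching the URL requires a running Galaxy instance.
print(libraries_url("http://localhost:8080", "YOUR_API_KEY"))
```

The response is JSON, so it can be read with `urllib.request.urlopen` and `json.load`, or through a client library such as BioBlend.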

Cheers,
Bjoern
...
Feedback really appreciated!
Regards,
Damion Dooley
Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada

Björn Grüning