[galaxy-dev] Implementing dataset collections

27 Feb 2013

Hi John,

Thanks for your interest and willingness to take a look at this. 

I've changed the subject of this thread to what I see as the core issue: implementing dataset collections. The Galaxy team would prefer to see an implementation of dataset collections that can be used going forward for all sorts of things. This would prevent time and energy being devoted to creating unneeded flexibility to accommodate an unknown implementation of dataset collections.

With that in mind, I've spec'ed out an implementation of dataset collections that uses only 3 additional database tables + model objects: https://trello.com/c/325AXIEr (See the first list, where implementation is discussed.) Please take a look (as well as anyone else who's interested) and, either on the card or in this thread, comment on this approach. 

This implementation would not replace composite datatypes, but we expect that it would work for JJ's cummeRbund. The key difference between collections and composite datatypes is that collections include Galaxy datasets that can be used individually, while the files in a composite datatype can only be operated on together. 
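
To make that distinction concrete, here is a minimal Python sketch. It is not the design on the Trello card and not Galaxy's model code; the class names (`Dataset`, `DatasetCollection`, `CompositeDataset`) and their fields are hypothetical and only illustrate that collection members stay ordinary datasets, while a composite's component files sit behind a single dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """A stand-alone history item that a tool can consume (hypothetical)."""
    name: str
    ext: str


@dataclass
class DatasetCollection:
    """Groups existing datasets; each member remains individually usable."""
    name: str
    members: List[Dataset] = field(default_factory=list)

    def get(self, name: str) -> Dataset:
        # Look a member up by name and hand back the dataset itself.
        for ds in self.members:
            if ds.name == name:
                return ds
        raise KeyError(name)


@dataclass
class CompositeDataset(Dataset):
    """One dataset whose component files travel together as extra files."""
    extra_files: Dict[str, str] = field(default_factory=dict)  # file name -> path


genes = Dataset("gene_exp.diff", "tabular")
isoforms = Dataset("isoform_exp.diff", "tabular")

# Collection: members are real datasets, so one can be passed to a tool alone.
collection = DatasetCollection("cuffdiff outputs", [genes, isoforms])

# Composite: a single dataset; its files are only reachable through it.
composite = CompositeDataset(
    "cuffdiff outputs", "cuffdata", {"gene_exp.diff": "extra/gene_exp.diff"}
)
```

In this sketch, `collection.get("gene_exp.diff")` returns the same `Dataset` object that already lives in the history, whereas the composite only exposes paths.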

Once an agreement is reached on an implementation, we would welcome a pull request for this functionality. Alternatively, I expect that the Galaxy team would implement it in the next couple of months.

Best,
J.

On Feb 27, 2013, at 10:05 AM, John Chilton wrote:
...
Hey Jeremy,
I am trying to think about a path forward with this composite
multiple file dataset implementation. It seems there is consensus
among the galaxy team that it shouldn't be included because grouping
actual datasets would be superior. In that light, I am revisiting this
e-mail because, depending on the implementation of what you described,
multiple file datasets are a specific case of this concept with some
likely uncontroversial enhancements for the specific case of composite
datatypes that are a homogeneous list of files. Does that make any
sense?
If I implemented (i) and (ii) in such a way that the multiple file
dataset stuff flowed out more organically, is there any chance that it
could be included in galaxy-central? If not, and the implicit datatypes
and parallelism stuff would remain no-gos, implementing what you
described would still benefit the multiple file datasets
implementation, so I still might do this. Would a clean implementation
of just what you described be accepted?
Any thoughts you or anyone has on the future direction of composite
datatypes in general would be appreciated.
Thanks for your time,
-John
On Fri, Oct 12, 2012 at 2:44 PM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
...
Hi Jim,
This is nice and is a path forward for the immediate future.
That said, a couple extensions to Galaxy to better support composite
datatypes would enable cummerbund without the additional tools:
(i) extending the composite datatype to include definition of individual
outputs in the collection;
(ii) extending the history panel to allow usage/selection of (1) the complete
composite set of files or (2) individual items in a composite datatype.
Once (i) is done, (ii) should be straightforward using the new history panel
code.
Of course, the advantage of these extensions is that they'd address both
cummerbund issues as well as other challenges, such as using output from the
barcode splitter.
J.
On Oct 11, 2012, at 6:14 PM, Jim Johnson wrote:
Checking to see if there is any interest in including a parameter option to
select outputs for cuffdiff,
potentially including a composite output and a cummeRbund sqlite database.
Issues:
 cuffdiff produces 21 output files, which is a little unwieldy in a galaxy
history.
 cummeRbund generates its database when given a cuffdiff output directory,
but manually hooking up 21 outputs to the cummerbund_wrapper is a pain.
I've put demo code in the testtoolshed under the repository name
cummerbund
   http://jjohnson@testtoolshed.g2.bx.psu.edu/repos/jjohnson/cummerbund
This includes new datatypes defined in datatypes_conf.xml and implemented in
cuffdata.py:
     <!-- html composite dataset with cuffdiff outputs in the extra files path -->
     <datatype extension="cuffdata" type="galaxy.datatypes.cuffdata:CuffDiffData"/>
     <!-- cummeRbund SQLite database -->
     <datatype extension="cuffdatadb" type="galaxy.datatypes.cuffdata:CuffDataDB"/>
The cuffdiff wrapper has a multiple select parameter to choose which output
files to put in the history.
In addition to the 21 cuffdiff outputs, the wrapper can also generate:
 cuffdata - which is a composite HTML output with links to the 21 cuffdiff
outputs
 cuffdatadb - which is the cummeRbund SQLite database
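
As a rough illustration of the composite HTML output, the following Python sketch builds an index page with one link per output file. It is not the code in the cummerbund repository; the function name `render_index` and the page layout are assumptions made for this example.

```python
import html


def render_index(names, title="cuffdiff outputs"):
    """Return a small HTML page linking to each file name, sorted by name."""
    items = "\n".join(
        '<li><a href="{0}">{1}</a></li>'.format(
            html.escape(n, quote=True), html.escape(n)
        )
        for n in sorted(names)
    )
    return (
        "<html><head><title>{0}</title></head><body>\n"
        "<h1>{0}</h1>\n<ul>\n{1}\n</ul>\n</body></html>\n"
    ).format(html.escape(title), items)


# Relative links resolve against the dataset's extra files path.
page = render_index(["isoform_exp.diff", "gene_exp.diff"])
```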
I also added utility tools:
 cuffdata_datasets - which will take files from the composite cuffdata and
copy them as datasets into the history
 cuffdata_cummerbund - which generates the cummeRbund cuffdatadb from the
composite cuffdata
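
The idea behind a tool like cuffdata_datasets can be sketched in Python as copying selected files out of the composite's extra-files directory. This is only an illustration under assumed names: `copy_selected` and `demo` are hypothetical helpers, not the tool's actual code, and the directory layout is invented for the example.

```python
import os
import shutil
import tempfile


def copy_selected(extra_files_dir, selected, dest_dir):
    """Copy the named files out of a composite's extra-files directory.

    Returns the names actually copied; names not present are skipped.
    """
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in selected:
        src = os.path.join(extra_files_dir, name)
        if os.path.isfile(src):
            shutil.copy(src, os.path.join(dest_dir, name))
            copied.append(name)
    return copied


def demo():
    """Build a throwaway composite directory and extract one file from it."""
    with tempfile.TemporaryDirectory() as tmp:
        extra = os.path.join(tmp, "extra_files")
        os.makedirs(extra)
        for name in ("gene_exp.diff", "isoform_exp.diff"):
            with open(os.path.join(extra, name), "w") as handle:
                handle.write("test_id\n")
        # "missing.diff" does not exist, so it is skipped silently.
        return copy_selected(
            extra, ["gene_exp.diff", "missing.diff"], os.path.join(tmp, "history")
        )
```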
I updated the cummerbund_wrapper:
 with tryCatch so that an R error on a plot won't exit the Rscript
 to include a small image of each plot on the html page
 added plots for : dispersion, scatter matrix, MDS, and PCA
Thanks,
JJ