using Galaxy for map/reduce

3 Aug 2011

Hi all,

I've been investigating the use of Galaxy for our lab, and it has many
attractive aspects -- a big thank you to all involved.

We still have a couple of related sticking points, however, that I would
like to get the Galaxy developers' feedback on. Basically, I want to use
Galaxy to run Map/Reduce type analysis on many initial data files. What
I mean is that I want to take many initial datasets (e.g. 250 or more),
perhaps already stored in a library, and then apply a workflow to each
and every one of them (the Map step). Then, on the many result datasets
(one from each of the initial datasets), I want to run a Reduce step
which creates a single dataset. I have achieved this in an imperfect and
not-quite-working way with a few tricks, but I hope that with a little
work, Galaxy could be much better for this type of use case.

I have a couple of specific problems and a proposal for a general solution:

1) My first specific problem is that loading many datasets (e.g. 250)
into a history causes the JavaScript running locally within the browser
to be extremely slow.

2) My second specific problem is that applying a workflow with N steps
to many datasets creates even more datasets (Nx250 additional datasets).
In addition to the slow JavaScript problem, there seem to be other
issues I haven't diagnosed further, but the console in which I'm running
run.sh indicates many errors of the type "Exception AssertionError:
AssertionError('State <sqlalchemy.orm.state.MutableAttrInstanceState
object at 0x7f5c18c47990> is not present in this identity map',) in
<bound method MutableAttrInstanceState._cleanup of
<sqlalchemy.orm.state.MutableAttrInstanceState object at
0x7f5c18c47990>> ignored". Furthermore, the web server gets slow and my
nginx frontend proxy gives 504 gateway time-outs.

3) There's no good way to do reduce within Galaxy. Currently I work
around this by having a tool type which takes as an input a dataset and
then uploads this to a self-written webserver, which then collects such
uploads, performs the reduce, and offers a download link for the user to
collect the reduced dataset. The user must then manually upload this
dataset back into Galaxy for further processing.

My proposal for a general solution, and what I'd be interested in
feedback on, is an idea of a "dataset container" (this is just a working
name). It would look and act much like a dataset in the history, but
would in fact be a logical construct that merely bundles together a
homogeneous bunch of datasets. When a tool (or a workflow) is applied to
a dataset container, Galaxy would automatically create a new container
in which each dataset in this new container is the result of running the
tool. (Workflows with N steps would thus generate N new containers.) The
thing I like about this idea is that it preserves the ability to use
tools and workflows on both individual datasets and, with some
additional logic, on these new containers. In particular, I don't think
the tools and workflows themselves would have to be modified. This would
seemingly mitigate the slow JavaScript issue by showing only a few items
in the history window (even though Galaxy may have launched many jobs in
the background). Furthermore, a new Reduce tool type could then act to
take a dataset container as input and output a single dataset.
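
As a rough illustration of the container idea, a minimal Python sketch
might look like the following. The names DatasetContainer, map_tool,
run_workflow and reduce_container are hypothetical and are not part of
the Galaxy code base; datasets are stood in for by plain Python values
and tools by plain functions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


# Hypothetical illustration only: none of these names exist in Galaxy.
@dataclass
class DatasetContainer:
    """A logical bundle of homogeneous datasets, shown as one history item."""
    name: str
    datasets: List[Any] = field(default_factory=list)


def map_tool(tool: Callable[[Any], Any], container: DatasetContainer) -> DatasetContainer:
    """Apply an unmodified single-dataset tool to every member (the Map step)."""
    return DatasetContainer(
        name=f"{tool.__name__} on {container.name}",
        datasets=[tool(d) for d in container.datasets],
    )


def run_workflow(steps: List[Callable[[Any], Any]],
                 container: DatasetContainer) -> List[DatasetContainer]:
    """Run a workflow of N steps; this yields N new containers, one per step."""
    outputs = []
    current = container
    for step in steps:
        current = map_tool(step, current)
        outputs.append(current)
    return outputs


def reduce_container(reducer: Callable[[List[Any]], Any],
                     container: DatasetContainer) -> Any:
    """Collapse all members of a container into a single dataset (the Reduce step)."""
    return reducer(container.datasets)


# Two stand-in "tools" that each work on a single dataset.
def count_lines(text: str) -> int:
    return len(text.splitlines())


def double(n: int) -> int:
    return n * 2


initial = DatasetContainer("raw", ["a\nb", "c", "d\ne\nf"])
results = run_workflow([count_lines, double], initial)
total = reduce_container(sum, results[-1])
```

Here the history would show only three items (the initial container and
one container per workflow step) regardless of how many datasets each
container holds, and the tools themselves are unaware of containers.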

A library doesn't seem a good candidate for the dataset container idea I
have above. I realize that a library also bundles together datasets, but
it has other attributes that don't play well with the above idea
(hierarchically arranged folders and heterogeneous datasets), and it
cannot be represented in the history.

I'm interested in thoughts on this proposal, as I think it would really
help us, and I think our use case may be representative of what others
might also like to do. I realize that in my text above I write "with
some additional logic" to describe the work required to implement this
idea, but the fact is that I have very little idea about how much work
this would be. So, practically speaking, my question boils down to how
hard would implementing this be, given the existing code base and goals?
And would such an implementation - if done to the taste of the Galaxy
devs, of course - have a chance of making it into the Galaxy distribution?

Thanks,
Andrew

-- 
Andrew D. Straw, Ph.D.
Research Institute of Molecular Pathology (IMP)
Vienna, Austria
http://strawlab.org/
