On Tue, Aug 2, 2011 at 3:12 PM, Andrew Straw <andrew.straw@imp.ac.at> wrote:
...
My proposal for a general solution, and what I'd be interested in feedback on, is an idea of a "dataset container" (this is just a working name). It would look and act much like a dataset in the history, but would in fact be a logical construct that merely bundles together a homogeneous bunch of datasets. When a tool (or a workflow) is applied to a dataset container, Galaxy would automatically create a new container in which each dataset in this new container is the result of running the tool. (Workflows with N steps would thus generate N new containers.) The thing I like about this idea is that it preserves the ability to use tools and workflows on both individual datasets and, with some additional logic, on these new containers. In particular, I don't think the tools and workflows themselves would have to be modified. This would seemingly mitigate the slow Javascript issue by only showing a few items in the history window (even though Galaxy may have launched many jobs in the background). Furthermore, a new Reduce tool type could then act to take a dataset container as input and output a single dataset.
...
That is a very interesting idea. Note that in some of the use cases I had in mind the order of the sub-files was important, but in other cases not. So I think that internally Galaxy would have to store a "dataset collection" (a.k.a. "homogeneous filetype collection") as an ordered list of filenames. As you observed, at the level of an individual tool nothing changes - it is given a single input file (or files) as before, but now multiple copies of the tool will be running, each with a different input file (or set of files for more complex tools).

I had been mulling over what is essentially a special case of this - a new datatype for "collection of BLAST XML files" - and debating with myself whether a zip file or simple concatenation would work here. In the case of BLAST XML files, there is precedent from early NCBI BLAST tools outputting concatenated XML files (which are not valid XML).

My motivating example was the embarrassingly parallel task of multi-query BLAST searches. Here we can split up the input query file (*) and run the searches separately (the map step). The potentially hard part is merging the output (the reduce step). Tabular output and plain text can basically be concatenated (taking care to preserve the original query order). For XML (or -shudder- HTML output), a bit of data munging is needed.

Your idea is much more elegant, and to me fits nicely with a general sub-task parallelization framework (as well as your example of running a single workflow on a collection of data files).

Peter

(*) You can also split the BLAST database/subject file, and there are options to adjust the e-value significance accordingly (so it is calculated using the full database size, not the partial database size). The downside is that merging the results is much more complicated.
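P.S. To make the container idea a bit more concrete, here is a rough Python sketch - purely illustrative, not actual Galaxy code, and the names DatasetContainer, run_tool and merge are made up - of mapping an unmodified tool over a container and reducing a container back down to a single dataset:

class DatasetContainer:
    """An ordered bundle of datasets of the same type (working name only)."""
    def __init__(self, datasets):
        self.datasets = list(datasets)  # order is preserved

def map_tool(run_tool, container):
    # Applying an (unmodified) tool to a container just runs the tool
    # once per member dataset, bundling the outputs into a new
    # container of the same size and order.
    return DatasetContainer(run_tool(dataset) for dataset in container.datasets)

def reduce_tool(merge, container):
    # A new Reduce-style tool would instead take the whole container
    # and produce a single dataset, e.g. by concatenating the members.
    return merge(container.datasets)

So a three-step workflow applied to a container of N files would call map_tool three times, giving three new containers of N datasets each, matching Andrew's point that a workflow with N steps generates N new containers.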
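And the split/merge steps for the multi-query BLAST example might look something like this - again only a sketch, assuming Biopython is available for the FASTA splitting and using the standard BLAST+ command line; the chunk size and file names are arbitrary:

from Bio import SeqIO
import subprocess

def split_queries(fasta_filename, chunk_size=1000):
    # The map step's setup: break the multi-query FASTA file into
    # chunks of at most chunk_size records, written in original order.
    filenames, chunk, index = [], [], 0
    for record in SeqIO.parse(fasta_filename, "fasta"):
        chunk.append(record)
        if len(chunk) == chunk_size:
            name = "queries_part%04i.fasta" % index
            SeqIO.write(chunk, name, "fasta")
            filenames.append(name)
            chunk, index = [], index + 1
    if chunk:
        name = "queries_part%04i.fasta" % index
        SeqIO.write(chunk, name, "fasta")
        filenames.append(name)
    return filenames

def run_blast(query_filename, db="nt"):
    # One search per chunk (these could run in parallel as separate jobs).
    out = query_filename + ".tabular"
    subprocess.check_call(["blastn", "-query", query_filename,
                           "-db", db, "-outfmt", "6", "-out", out])
    return out

def merge_tabular(output_filenames, merged_filename):
    # The reduce step for tabular output is just concatenation,
    # iterating over the chunk outputs in order so the original
    # query order is preserved.
    with open(merged_filename, "w") as merged:
        for name in output_filenames:
            with open(name) as handle:
                merged.write(handle.read())

Merging XML (or HTML) output would need a real parser or some data munging rather than plain concatenation, as noted above.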