On Tue, Aug 2, 2011 at 3:12 PM, Andrew Straw <andrew.straw@imp.ac.at> wrote:
...
My proposal for a general solution, and what I'd be interested in feedback on, is an idea of a "dataset container" (this is just a working name). It would look and act much like a dataset in the history, but would in fact be a logical construct that merely bundles together a homogeneous bunch of datasets. When a tool (or a workflow) is applied to a dataset container, Galaxy would automatically create a new container in which each dataset in this new container is the result of running the tool. (Workflows with N steps would thus generate N new containers.) The thing I like about this idea is that it preserves the ability to use tools and workflows on both individual datasets and, with some additional logic, on these new containers. In particular, I don't think the tools and workflows themselves would have to be modified. This would seemingly mitigate the slow Javascript issue by only showing a few items in the history window (even though Galaxy may have launched many jobs in the background). Furthermore, a new Reduce tool type could then act to take a dataset container as input and output a single dataset.
...
That is a very interesting idea. Note that in some of the use cases I had in mind the order of the sub-files was important, but in other cases not. So I think that internally Galaxy would have to store a "dataset collection" (a.k.a. "homogeneous filetype collection") as an ordered list of filenames. As you observed, at the level of an individual tool nothing changes - it is given a single input file (or files) as before, but now multiple copies of the tool will be running, each with a different input file (or set of files for more complex tools).

I had been mulling over what is essentially a special case of this - a new datatype for "collection of BLAST XML files" - and debating with myself whether a zip file or simple concatenation would work here. In the case of BLAST XML files, there is precedent from early NCBI BLAST tools outputting concatenated XML files (which are not valid XML).

My motivating example was the embarrassingly parallel task of multi-query BLAST searches. Here we can split up the input query file (*) and run the searches separately (the map step). The potentially hard part is merging the output (the reduce step). Tabular output and plain text can basically be concatenated (taking care to preserve the original query order). For XML (or -shudder- HTML output), a bit of data munging is needed.

Your idea is much more elegant, and to me fits nicely with a general sub-task parallelization framework (as well as your example of running a single workflow on a collection of data files).

Peter

(*) You can also split the BLAST database/subject file, and there are options to adjust the e-value significance accordingly (so it is calculated using the full database size, not the partial database size). The downside is that merging the results is much more complicated.
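P.S. To make the container idea a bit more concrete, here is a rough Python sketch - purely illustrative, not actual Galaxy code, and the names DatasetContainer, run_tool and merge are made up - of mapping an unmodified tool over a container and reducing a container back down to a single dataset:

class DatasetContainer:
    """An ordered bundle of datasets of the same type (working name only)."""
    def __init__(self, datasets):
        self.datasets = list(datasets)  # order is preserved

def map_tool(run_tool, container):
    # Applying an (unmodified) tool to a container just runs the tool
    # once per member dataset, bundling the outputs into a new
    # container of the same size and order.
    return DatasetContainer(run_tool(dataset) for dataset in container.datasets)

def reduce_tool(merge, container):
    # A new Reduce-style tool would instead take the whole container
    # and produce a single dataset, e.g. by concatenating the members.
    return merge(container.datasets)

So a three-step workflow applied to a container of N files would call map_tool three times, giving three new containers of N datasets each, matching Andrew's point that a workflow with N steps generates N new containers.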
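And the split/merge steps for the multi-query BLAST example might look something like this - again only a sketch, assuming Biopython is available for the FASTA splitting and using the standard BLAST+ command line; the chunk size and file names are arbitrary:

from Bio import SeqIO
import subprocess

def split_queries(fasta_filename, chunk_size=1000):
    # The map step's setup: break the multi-query FASTA file into
    # chunks of at most chunk_size records, written in original order.
    filenames, chunk, index = [], [], 0
    for record in SeqIO.parse(fasta_filename, "fasta"):
        chunk.append(record)
        if len(chunk) == chunk_size:
            name = "queries_part%04i.fasta" % index
            SeqIO.write(chunk, name, "fasta")
            filenames.append(name)
            chunk, index = [], index + 1
    if chunk:
        name = "queries_part%04i.fasta" % index
        SeqIO.write(chunk, name, "fasta")
        filenames.append(name)
    return filenames

def run_blast(query_filename, db="nt"):
    # One search per chunk (these could run in parallel as separate jobs).
    out = query_filename + ".tabular"
    subprocess.check_call(["blastn", "-query", query_filename,
                           "-db", db, "-outfmt", "6", "-out", out])
    return out

def merge_tabular(output_filenames, merged_filename):
    # The reduce step for tabular output is just concatenation,
    # iterating over the chunk outputs in order so the original
    # query order is preserved.
    with open(merged_filename, "w") as merged:
        for name in output_filenames:
            with open(name) as handle:
                merged.write(handle.read())

Merging XML (or HTML) output would need a real parser or some data munging rather than plain concatenation, as noted above.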