Re: [galaxy-dev] Preffered way of running a tool on multiple input files

12 Feb 2013

Hagai,

Jorrit Boekel and I have implemented almost exactly what you described.

https://bitbucket.org/galaxy/galaxy-central/pull-request/116/multiple-file-d...

Merge this into your Galaxy tree:
https://bitbucket.org/jmchilton/galaxy-central-multifiles-feb2013
Then set use_composite_multfiles to true in universe_wsgi.ini. You
automatically get a multiple-file version of each of your datatypes
(m:fastq, m:xls, etc.). Tools that process a singleton version of
a datatype can seamlessly process a multiple-file version of that
datatype in parallel, and the outputs created as a result will be
the multifile versions of the tool's original output types.
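
To make the idea concrete, here is a minimal Python sketch of the
behaviour described above. It is only an illustration, not code from
the pull request: run_on_multifile and count_lines are hypothetical
names, and count_lines merely stands in for a real single-file tool
such as Bowtie.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def run_on_multifile(tool, input_paths, max_workers=4):
    """Apply a single-file `tool` to every file of a multi-file dataset.

    Returns a dict mapping each input file name to that file's result,
    so the input names carry over to the outputs.
    (Hypothetical helper, not part of Galaxy.)
    """
    paths = [Path(p) for p in input_paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in input order, so they line up with paths.
        results = list(pool.map(tool, paths))
    return {path.name: result for path, result in zip(paths, results)}


def count_lines(path):
    """Hypothetical stand-in for a real single-file tool."""
    with open(path) as handle:
        return sum(1 for _ in handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        inputs = []
        for i, n_lines in enumerate([4, 8], start=1):
            path = Path(tmp) / f"sample{i}.fastq"
            path.write_text("x\n" * n_lines)
            inputs.append(path)
        print(run_on_multifile(count_lines, inputs))
```

The tool itself only ever sees one file at a time; the fan-out over
the files and the bookkeeping of names happen around it.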

These datasets can be created using the multifile upload tool, from a
directory on the FTP server, or by library imports through the API.

Input names are preserved, as you described.

Some huge caveats:
 - The Galaxy team has expressed reservations about this particular
implementation, so it will never be officially supported.
 - It's early days and this is very experimental (use at your own risk).
 - I am pretty sure it is not going to work with bed files, since
there is special logic in Galaxy to deal with bed indices. I think we
can work around it by declaring a concrete m:bed type and replicating
that logic; it's on the TODO list, but I'm happy to accept
contributions :)

More discussion of this can be found at these places:
http://www.youtube.com/watch?v=DxJzEkOasu4
https://bitbucket.org/galaxy/galaxy-central/pull-request/116/multiple-file-d...
http://dev.list.galaxyproject.org/pass-more-information-on-a-dataset-merge-t...

-John

On Tue, Feb 12, 2013 at 9:02 AM, Hagai Cohen <hagai26@gmail.com> wrote:
...
Hi,
I'm looking for a preferred way of running Bowtie (or any other tool) on
multiple input files and running statistics on the Bowtie output afterwards.
The input is a directory of files fastq1..fastq100.
The Bowtie output should be bed1..bed100.
The statistics tool should run on bed1..bed100 and return xls1..xls100.
Then I will write a tool which will take xls1..xls100 and merge them into one
final output.
I searched for similar cases, but I couldn't find anyone who has had this
problem before.
I can't use the parallelism tag, because what would the input for each tool be?
It should be a fastq file, not a directory of fastq files.
Nor would I like to run each fastq file in a different workflow, which would
create a mess.
I could think of only two solutions:
1. Implement new datatypes, bed_dir & fastq_dir, and implement new tool
wrappers which take a folder instead of a file.
2. Merge the input files before sending them to Bowtie, and use the parallelism
tag to have them split & merged again on each tool.
Does anyone have a better suggestion?
Thanks,
Hagai

John Chilton