yes, many tools don't read from stdin, you're right. in practice, i actually have each task write its part to the node's local scratch disk and do implicit conversions in the same step (e.g. scatter fastq as fasta). but not all clusters have a local scratch disk. also, as you mentioned, the seek solution wouldn't work for compressed infiles.

as i try to avoid working on the galaxy internals, i implemented this as a command-line utility. e.g.

<command>psub --fastqToFasta $infile --cat $outfile qctool.py $infile $outfile</command>

instead of the nonparallel:

<command>qctool.py $infile $outfile</command>

but it would be nice to see this functionality in galaxy. i thought about reimplementing this as a drmaa_epc.py job runner but noticed there was already tasks.py.

On Fri, Aug 26, 2011 at 12:41 PM, Duddy, John <jduddy@illumina.com> wrote:
Many of the tools out there work on files, and assume they are supposed to work on the whole file (or take arguments for subsets that vary from tool to tool).
I'm working on a way for Galaxy to handle all these tools transparently, even if, as in my case, the files are compressed but the tools cannot read compressed files.
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy@illumina.com
-----Original Message-----
From: Edward Kirton [mailto:eskirton@lbl.gov]
Sent: Friday, August 26, 2011 12:34 PM
To: Duddy, John
Cc: galaxy-dev@bx.psu.edu
Subject: Re: [galaxy-dev] using Galaxy for map/reduce
Not intending to hijack the thread, but in response to John's comment -- I, too, made a general solution for embarrassingly parallel problems, but instead of splitting the large files on disk, I just use seek to move the file pointer so each task can grab its part.
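
For readers following along, the seek idea can be sketched roughly as below. This is only an illustration under assumptions, not the actual utility described in this thread: the function names byte_ranges and read_chunk are hypothetical, the input is assumed to be an uncompressed, line-oriented file, and each task is assumed to own every line whose first byte falls inside its byte range. Multi-line records such as FASTQ would need an extra step to align to record boundaries rather than line boundaries.

```python
import os


def byte_ranges(size, n_tasks):
    """Split [0, size) into n_tasks contiguous (start, end) byte ranges."""
    bounds = [size * i // n_tasks for i in range(n_tasks + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def read_chunk(path, start, end):
    """Return the lines whose first byte lies in [start, end).

    Hypothetical helper: a task calls this with its own range, so no
    pre-split copies of the input are written to disk.
    """
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and finish that line. This lands on the
            # first line that begins at or after `start`, and leaves the
            # straddling line to the previous task.
            f.seek(start - 1)
            f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line)
    return lines


if __name__ == "__main__":
    import sys

    # Usage: python chunk.py INFILE N_TASKS TASK_INDEX
    infile, n_tasks, task = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    start, end = byte_ranges(os.path.getsize(infile), n_tasks)[task]
    sys.stdout.buffer.writelines(read_chunk(infile, start, end))
```

Because every task applies the same ownership rule, each line is read by exactly one task even when a range boundary falls in the middle of a line. As noted elsewhere in the thread, this does not carry over to compressed inputs, where a byte offset in the file does not correspond to a known position in the decompressed stream.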