Re: [galaxy-dev] the multi job splitter

25 Oct 2012

On Thu, Oct 25, 2012 at 10:35 AM, Jorrit Boekel
<jorrit.boekel@scilifelab.se> wrote:
...
On 10/25/2012 11:25 AM, Peter Cock wrote:
...
I don't quite follow your example, but I can see some (simpler?) cases
for sequencing data - paired splitting of a FASTA + QUAL file, or
paired splitting of two FASTQ files (forward and reverse reads). Here
the sequence files can be broken up into any size (e.g. split in four,
or divided into batches of 10000, but not split based on size on disk),
as long as the pairing is preserved.
i.e. Given FASTA and QUAL for read1, read2, ...., read100000 then
if the FASTA file is split into read1, read2, ...., read1000 as the first
chunk, then the first QUAL chunk must also have the same one
thousand reads.
(In these examples the pairing should be verifiable via the read
names, so errors should be easy to catch - I don't know if you have
that luxury in your situation).
What you describe is pretty much the same as my situation, except that I
don't have two large single input files as your fastq files, but two sets of
the same number of files stored in the composite file directories
(galaxy/database/files/000/dataset_x_files ). I keep the files matched by
keeping a _task_%d suffix to their names. So each task is matched with its
correct counterpart with the same number.
My question is still, though, whether it would be bad not to raise an
exception when different filetypes are split in the same job.
In general, splitting multiple files of different types seems dangerous.
That is presumably the point of the Galaxy exception.

In my example of splitting a pair of FASTQ files, they are the same
format, so Galaxy can make assumptions about how they will be
split. Note splitting into chunks based on the size on disk would
be wrong (e.g. if the forward reads in the first file are all longer
than the reverse reads in the second file).
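
To make the point about splitting by record count rather than by
size on disk concrete, here is a minimal sketch. It is not Galaxy's
splitter code: the function names are made up for illustration, it
assumes unwrapped four-line FASTQ records, and it reads everything
into memory for simplicity.

```python
from itertools import islice


def fastq_records(lines):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped FASTQ)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_paired_fastq(forward_lines, reverse_lines, reads_per_chunk):
    """Split two FASTQ inputs so chunk N of each holds the same reads.

    Hypothetical helper: chunk boundaries are counted in reads, never
    in bytes, so differing read lengths cannot break the pairing.
    """
    forward = list(fastq_records(forward_lines))
    reverse = list(fastq_records(reverse_lines))
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse files have different read counts")
    chunks = []
    for start in range(0, len(forward), reads_per_chunk):
        stop = start + reads_per_chunk
        chunks.append((forward[start:stop], reverse[start:stop]))
    return chunks
```

Even if every forward read is longer than its reverse partner, the
chunks still line up read for read.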

In the case of splitting a paired FASTA + QUAL file, these are now
different file formats, so more caution is required. In fact both
can be split at the sequence/read level, so this case can be handled.

I think the key requirement here for 'matched' splitting is that each
file must have the same number of 'records' (in my example,
sequencing reads, in your case sub-files), and can be split into
chunks of the same number of 'records'.
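
As a rough sketch of that requirement for the FASTA + QUAL case: both
formats use '>' header lines, so the record counts and read names can
be compared before any splitting is done. The helper names below are
hypothetical, not part of Galaxy.

```python
def record_ids(lines):
    """Return identifiers from '>' header lines (FASTA and QUAL alike)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            ids.append(parts[0] if parts else "")
    return ids


def check_matched(fasta_lines, qual_lines):
    """Return the shared record count, or raise if the files disagree."""
    fasta_ids = record_ids(fasta_lines)
    qual_ids = record_ids(qual_lines)
    if len(fasta_ids) != len(qual_ids):
        raise ValueError("FASTA and QUAL have different record counts")
    for index, (fasta_id, qual_id) in enumerate(zip(fasta_ids, qual_ids)):
        if fasta_id != qual_id:
            raise ValueError(
                "record {}: {!r} != {!r}".format(index, fasta_id, qual_id)
            )
    return len(fasta_ids)


def chunk_bounds(total_records, records_per_chunk):
    """Return (start, stop) record indices to apply to every matched file."""
    return [
        (start, min(start + records_per_chunk, total_records))
        for start in range(0, total_records, records_per_chunk)
    ]
```

The same bounds would then be applied to each file, so every chunk
holds the same 'records' on both sides.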

Perhaps different file type combinations could be special cases
in the splitter code? If there is no dedicated splitter for a
given combination, then that combination cannot be split.
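
One way this could look, purely as an illustration (the registry,
decorator and function names here are invented, and this is not how
Galaxy's splitter is currently organised):

```python
# Hypothetical registry mapping a combination of datatypes to a splitter.
SPLITTERS = {}


def register_splitter(*datatypes):
    """Register a splitter for one combination of datatypes (any order)."""
    key = tuple(sorted(datatypes))

    def decorator(func):
        SPLITTERS[key] = func
        return func

    return decorator


@register_splitter("fasta", "qual")
def split_fasta_qual(datasets, records_per_chunk):
    # Placeholder body: a real splitter would cut both files per read.
    raise NotImplementedError("FASTA + QUAL splitting not written yet")


def get_splitter(datatypes):
    """Return the dedicated splitter, or refuse if none is registered."""
    key = tuple(sorted(datatypes))
    try:
        return SPLITTERS[key]
    except KeyError:
        raise ValueError(
            "no dedicated splitter for: {}".format(", ".join(key))
        )
```

Any combination without a registered splitter is rejected, which
keeps the current cautious behaviour as the default.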

Peter