Re: [galaxy-dev] the multi job splitter

25 Oct 2012


      On 10/25/2012 12:02 PM, Peter Cock wrote:
...
On Thu, Oct 25, 2012 at 10:35 AM, Jorrit Boekel
<jorrit.boekel@scilifelab.se> wrote:
...
My question is still, though, whether it would be bad not to raise an
exception when different file types are split in the same job.
In general splitting multiple files of different types seems dangerous.
That is presumably the point of the Galaxy exception.
In my example of splitting a pair of FASTQ files, they are the same
format, so Galaxy can make assumptions about how they will be
split. Note splitting into chunks based on the size on disk would
be wrong (e.g. if the forward reads in the first file are all longer
than the reverse reads in the second file).
In the case of splitting a paired FASTA + QUAL file, these are now
different file formats, so more caution is required. In fact both
can be split at the sequence/read level, so they can be processed.
I think the key requirement here for 'matched' splitting is that each
file must have the same number of 'records' (in my example,
sequencing reads, in your case sub-files), and can be split into
chunks of the same number of 'records'.
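
To make the 'matched' splitting requirement concrete, here is a minimal
sketch in Python. It is not Galaxy's splitter code: the function names
(read_fastq_records, matched_chunks) are hypothetical, and it assumes
simple four-line FASTQ records held in memory. It chunks by record count
rather than by size on disk, and refuses to split if the two files have
different numbers of records.

```python
from itertools import islice


def read_fastq_records(lines):
    """Yield FASTQ records as 4-line tuples (assumes unwrapped 4-line FASTQ)."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def matched_chunks(forward_lines, reverse_lines, records_per_chunk):
    """Yield (forward_chunk, reverse_chunk) pairs with equal record counts.

    Hypothetical helper: chunks are cut by record index, never by byte size,
    so read i of the forward file always travels with read i of the reverse.
    """
    forward = list(read_fastq_records(forward_lines))
    reverse = list(read_fastq_records(reverse_lines))
    if len(forward) != len(reverse):
        raise ValueError(
            "record counts differ: %d vs %d" % (len(forward), len(reverse))
        )
    for start in range(0, len(forward), records_per_chunk):
        stop = start + records_per_chunk
        yield forward[start:stop], reverse[start:stop]
```

Called with two three-read files and records_per_chunk=2, this would give
one chunk pair of two reads and one of a single read, regardless of how
long the individual reads are.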
Perhaps different file type combinations could be special cases
in the splitter code? If there is no dedicated splitter for a
given combination, then that combination cannot be split.
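
As a rough illustration of the special-case idea, the sketch below keeps
a registry of splitters keyed by the combination of file types. Every name
in it (SPLITTERS, register_splitter, find_splitter,
UnsplittableCombination) is made up for the example and is not part of
Galaxy; the splitter bodies are placeholders.

```python
# Hypothetical registry: maps a set of file extensions to a splitter function.
SPLITTERS = {}


class UnsplittableCombination(Exception):
    """Raised when no dedicated splitter exists for a file type combination."""


def register_splitter(*extensions):
    """Decorator registering a splitter for one combination of file types."""
    key = frozenset(extensions)

    def decorator(func):
        SPLITTERS[key] = func
        return func

    return decorator


@register_splitter("fastq")
def split_fastq_pair(datasets):
    # Placeholder: a real splitter would cut every file at the same reads.
    return "fastq pair splitter"


@register_splitter("fasta", "qual")
def split_fasta_qual(datasets):
    # Placeholder: FASTA and QUAL can both be cut at the sequence level.
    return "fasta + qual splitter"


def find_splitter(extensions):
    """Return the dedicated splitter, or refuse if the combination is unknown."""
    key = frozenset(extensions)
    try:
        return SPLITTERS[key]
    except KeyError:
        raise UnsplittableCombination(
            "no splitter for combination: %s" % ", ".join(sorted(key))
        )
```

The design choice here is that an unknown combination fails loudly instead
of falling back to a generic split, which keeps the safety of the current
exception while allowing known-good pairs through.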
Peter
I could imagine the multi splitter calling some sort of validating
method on the different datatypes to gather information about the
different datasets (e.g. split size, number of splits, matching file
types) before executing a split. There may be more and better ways to
get around it, though. I'll settle for disabling the check for now. If
mainline Galaxy would be interested, we could look at it further, I guess.
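
A minimal sketch of what such a validation step might look like, in
Python. This is only an assumption about the shape of the idea, not
existing Galaxy code: SplitInfo and validate_split are hypothetical names,
and the record count is assumed to be something each datatype could
report about its dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitInfo:
    """Hypothetical summary a datatype reports about one dataset."""
    extension: str
    record_count: int


def validate_split(infos, records_per_chunk):
    """Check that all datasets can be split in step; return the chunk count.

    Raises ValueError if the datasets disagree on the number of records,
    since a matched split would then be impossible.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    counts = {info.record_count for info in infos}
    if len(counts) != 1:
        details = ", ".join(
            "%s=%d" % (info.extension, info.record_count) for info in infos
        )
        raise ValueError("record counts differ: " + details)
    total = counts.pop()
    # Ceiling division: the last chunk may hold fewer records.
    return -(-total // records_per_chunk)
```

With this, the multi splitter would only go ahead when every dataset
agrees on the number of records, whatever the file types are.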

cheers,
jorrit


-- 
Scientific programmer
Mass spec analysis support @ BILS
Janne Lehtiö / Lukas Käll labs
SciLifeLab Stockholm