Dear list,

In my galaxy fork, I extensively use the job splitters. Sometimes though, I have to split to different file types for the same job. That raises an exception in the lib/galaxy/jobs/splitters/multi.py module.

I have turned this behaviour off for my own work, but am now wondering whether this is very bad practice. In other words, does somebody know why the multi splitter does not support multiple file type splitting?

cheers,
jorrit

--
Scientific programmer
Mass spec analysis support @ BILS
Janne Lehtiö / Lukas Käll labs
SciLifeLab Stockholm
On Thu, Oct 25, 2012 at 9:36 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
Could you clarify what you mean by showing some of your tool's XML file, i.e. how the input and its splitting are defined? Are you asking about splitting two input files at the same time?

Peter
On 10/25/2012 10:54 AM, Peter Cock wrote:
Hi Peter,

Something like the following:

<command interpreter="python">bullseye.py $hardklor_results $ms2_in.extension $ms2_in $output $use_nonmatch</command>
<parallelism method="multi" split_inputs="hardklor_results,ms2_in" shared_inputs="config_file" split_mode="from_composite" merge_outputs="output"/>
<inputs>

The tool takes two datasets of different formats, which are to be split into the same number of files, which belong together as pairs.

Note that I have implemented an odd way of splitting, which is from a number of files in the dataset.extra_files_path to symlinks in the task working dirs. The number of files is thus equal to the number of parts resulting from a split, and I have ensured that each part is paired correctly. I assume this hasn't been necessary in the genomics field, but for proteomics, at least in our lab, multiple-file datasets are the standard.

My fork is at http://bitbucket.org/glormph/adapt if you want to check more closely.

cheers,
jorrit
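[For readers following along: the "from_composite" splitting described above could be sketched roughly as below. This is not the code from the fork; the function and argument names are illustrative, and only the idea (one task per file in extra_files_path, exposed via a symlink) comes from the message.]

```python
import os

def split_from_composite(extra_files_path, task_base_dir, link_name):
    """Create one task dir per file in extra_files_path, symlinking
    the n-th file into the n-th task dir under a fixed name, so the
    number of split parts equals the number of composite files."""
    files = sorted(os.listdir(extra_files_path))
    task_dirs = []
    for i, fname in enumerate(files):
        task_dir = os.path.join(task_base_dir, "task_%d" % i)
        os.makedirs(task_dir, exist_ok=True)
        src = os.path.abspath(os.path.join(extra_files_path, fname))
        os.symlink(src, os.path.join(task_dir, link_name))
        task_dirs.append(task_dir)
    return task_dirs
```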
On Thu, Oct 25, 2012 at 10:00 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
> Something like the following:
>
> <command interpreter="python">bullseye.py $hardklor_results $ms2_in.extension $ms2_in $output $use_nonmatch</command>
> <parallelism method="multi" split_inputs="hardklor_results,ms2_in" shared_inputs="config_file" split_mode="from_composite" merge_outputs="output"/>
> <inputs>
>
> The tool takes two datasets of different formats, which are to be split into the same number of files, which belong together as pairs.
So the inputs are $hardklor_results and $ms2_in (which should be split in a paired manner) and there is one output $output to merge? What is shared_inputs="config_file" for, as that isn't in the <command> tag anywhere?
> Note that I have implemented an odd way of splitting, which is from a number of files in the dataset.extra_files_path to symlinks in the task working dirs. The number of files is thus equal to the number of parts resulting from a split, and I have ensured that each part is paired correctly.
>
> My fork is at http://bitbucket.org/glormph/adapt if you want to check more closely.
I don't quite follow your example, but I can see some (simpler?) cases for sequencing data - paired splitting of a FASTA + QUAL file, or paired splitting of two FASTQ files (forward and reverse reads). Here the sequence files can be broken up into chunks of any size (e.g. split in four, or divided into batches of 10000, but not split based on size on disk), as long as the pairing is preserved.

i.e. Given FASTA and QUAL for read1, read2, ...., read100000, then if the first FASTA chunk is read1, read2, ...., read1000, the first QUAL chunk must contain the same one thousand reads.

(In these examples the pairing should be verifiable via the read names, so errors should be easy to catch - I don't know if you have that luxury in your situation.)

Peter
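[A minimal sketch of the paired, record-count-based splitting Peter describes - not the actual Galaxy splitter code. Both inputs are chunked after the same number of records, so the pairing survives regardless of how large individual records are on disk.]

```python
def chunk_records(records, chunk_size):
    """Yield lists of at most chunk_size records, in order."""
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def split_paired(records_a, records_b, chunk_size):
    """Chunk two parallel record streams identically; the i-th chunk
    of A pairs with the i-th chunk of B. Raises if the inputs do not
    have matching record counts (pairing would be broken)."""
    chunks_a = list(chunk_records(records_a, chunk_size))
    chunks_b = list(chunk_records(records_b, chunk_size))
    if [len(c) for c in chunks_a] != [len(c) for c in chunks_b]:
        raise ValueError("inputs have different record counts")
    return list(zip(chunks_a, chunks_b))
```

Splitting a FASTA + QUAL pair this way only needs each file parsed into its own record type first; the chunk boundaries are defined by record count, never by byte offsets.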
On 10/25/2012 11:25 AM, Peter Cock wrote:
> So the inputs are $hardklor_results and $ms2_in (which should be split in a paired manner) and there is one output $output to merge?

Exactly. The tool uses results from a tool called hardklor to adjust the mass spectra contained in the ms2_in input.

> What is shared_inputs="config_file" for, as that isn't in the <command> tag anywhere?

And whoops, I haven't taken out the now obsolete config file. Thanks for spotting that.
> I don't quite follow your example, but I can see some (simpler?) cases for sequencing data - paired splitting of a FASTA + QUAL file, or paired splitting of two FASTQ files (forward and reverse reads). Here the sequence files can be broken up into chunks of any size, as long as the pairing is preserved.
What you describe is pretty much the same as my situation, except that I don't have two large single input files like your FASTQ files, but two sets of the same number of files stored in the composite file directories (galaxy/database/files/000/dataset_x_files). I keep the files matched by giving their names a _task_%d suffix, so each task is matched with its correct counterpart with the same number.

My question still stands, though: would it be bad not to raise an exception when different file types are split in the same job?

cheers,
jorrit
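[The _task_%d suffix pairing described above could look roughly like this. This is a sketch, not code from the fork; the function name and the exact suffix regex are assumptions based only on the description in the message.]

```python
import re

def pair_by_task_suffix(files_a, files_b):
    """Match files from two lists on their shared _task_<n> suffix,
    returning (file_a, file_b) pairs ordered by task number."""
    def index(files):
        out = {}
        for f in files:
            m = re.search(r"_task_(\d+)", f)
            if m is None:
                raise ValueError("no _task_ suffix in %r" % f)
            out[int(m.group(1))] = f
        return out
    idx_a, idx_b = index(files_a), index(files_b)
    if set(idx_a) != set(idx_b):
        raise ValueError("task numbers do not match between inputs")
    return [(idx_a[n], idx_b[n]) for n in sorted(idx_a)]
```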
On Thu, Oct 25, 2012 at 10:35 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
> My question is still though if it would be bad to not raise an exception when different filetypes are split in the same job.
In general, splitting multiple files of different types seems dangerous. That is presumably the point of the Galaxy exception.

In my example of splitting a pair of FASTQ files, they are the same format, so Galaxy can make assumptions about how they will be split. Note that splitting into chunks based on the size on disk would be wrong (e.g. if the forward reads in the first file are all longer than the reverse reads in the second file).

In the case of splitting a paired FASTA + QUAL file, these are now different file formats, so more caution is required. In fact both can be split at the sequence/read level, so they can still be processed.

I think the key requirement here for 'matched' splitting is that each file must have the same number of 'records' (in my example, sequencing reads; in your case, sub-files), and can be split into chunks of the same number of 'records'.

Perhaps different file type combinations could be special cases in the splitter code? Then if there is no dedicated splitter for a given combination, that combination cannot be split.

Peter
On 10/25/2012 12:02 PM, Peter Cock wrote:
> Perhaps different file type combinations could be special cases in the splitter code? Then if there is no dedicated splitter for a given combination, that combination cannot be split.
I could imagine the multi splitter calling some sort of validating method on the different datatypes to gather information about the different datasets (e.g. split size, split numbers, matching file types) before executing a split. There may be more and better ways to get around it, though. I'll settle for disabling the check for now; if mainline galaxy would be interested, we could look at it further, I guess.

cheers,
jorrit
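[The validation idea above might reduce to something this simple: ask each input dataset how many parts it would split into, and only proceed when all inputs agree. This is a sketch of the check, not Galaxy code; how each datatype reports its part count is left open.]

```python
def validate_split(part_counts):
    """part_counts: list of (dataset_name, number_of_parts) gathered
    from each input's datatype before splitting. Returns the agreed
    part count, or raises if the inputs would split unevenly."""
    counts = set(n for _, n in part_counts)
    if len(counts) != 1:
        raise ValueError("inputs split into different part counts: %r"
                         % part_counts)
    return counts.pop()
```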
participants (2)
- Jorrit Boekel
- Peter Cock