Re: [galaxy-dev] problems splitting

24 Feb 2015

Hello again,

First of all, thanks for your help; it has been very useful.

What I have done so far is copy this method into the Sequence class:

    def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
        """
        Does a brain-dead sequential scan & extract of certain sequences
        >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
        ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
        >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
        ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
        """
        start_line = start_sequence * 4
        line_count = sequence_count * 4
        # TODO: verify that tail can handle 64-bit numbers
        if is_compressed:
            cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
        else:
            cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s'  % (start_line+1, input_name, line_count)
        cmd += ' > "%s"' % output_name

        return [cmd]
    get_split_commands_sequential = staticmethod(get_split_commands_sequential)
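
To see what this method produces outside Galaxy, here is a minimal standalone sketch. `split_command` mirrors the method above; `part_ranges` is a hypothetical helper of my own that divides the records into contiguous ranges, and it is only an assumption about how a split into a fixed number of parts might be laid out, not Galaxy's actual logic.

```python
def split_command(is_compressed, input_name, output_name,
                  start_sequence, sequence_count):
    """Build the shell command that extracts `sequence_count` FASTQ records
    starting at record `start_sequence` (a FASTQ record is 4 lines)."""
    start_line = start_sequence * 4
    line_count = sequence_count * 4
    if is_compressed:
        cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (
            input_name, start_line + 1, line_count)
    else:
        cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (
            start_line + 1, input_name, line_count)
    return cmd + ' > "%s"' % output_name


def part_ranges(total_sequences, parts):
    """Hypothetical helper (not part of Galaxy): divide the records into
    `parts` contiguous (start_sequence, sequence_count) ranges."""
    base, extra = divmod(total_sequences, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` parts take one additional record each.
        count = base + (1 if i < extra else 0)
        ranges.append((start, count))
        start += count
    return ranges


if __name__ == "__main__":
    # Print the commands for a 10-record file split into 3 parts.
    for i, (start, count) in enumerate(part_ranges(10, 3)):
        print(split_command(False, "./input.fastq",
                            "./part_%d.fastq" % i, start, count))
```

Each command only reads and writes FASTQ lines, so the read names in every part should be identical to those in the input file.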

This is what you suggested.
When I run the tool with this configuration:

<tool id="bwa_mio" name="map with bwa">
  <description>map with bwa</description>
  <parallelism method="basic" split_size="3" split_mode="number_of_parts"></parallelism>

  <command>
      bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null
  </command>
  <inputs>
    <param format="fastqsanger" name="input" type="data" label="fastq"/>
  </inputs>
  <outputs>
      <data format="sam" name="output" />
  </outputs>

  <help>
  bwa
  </help>

</tool>

Everything finishes OK, but when I check the resulting SAM file, I see that
the read names in the alignments are prefixed with the path of the file, for example:
example_split.sam:
/home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446
4 * 0 0 * * 0 0
TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT
############################################################################
AS:i:0 XS:i:0

Do you know what may be going on?
If I don't split the file, everything works correctly.
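
To check how many records are affected, a rough diagnostic can be sketched as below. It assumes the spurious prefix is always an absolute path followed by a colon in front of the read name, as in the example above; the function names are hypothetical and this is only a check and workaround, not a fix for whatever step adds the prefix. A `file:` prefix of this shape is what grep prints when it is given more than one input file without `-h`, so the step that merges the per-task SAM files may be worth a look, but that is only a guess.

```python
def strip_path_prefix(sam_line):
    """Return (line, was_prefixed) for one SAM line.

    Assumption: a bad record starts with '/some/path:' directly in front
    of the real read name. Header lines ('@...') are left untouched.
    """
    if not sam_line.startswith("/"):
        return sam_line, False
    # The first whitespace-delimited field is the (prefixed) read name.
    qname = sam_line.split(None, 1)[0]
    prefix, sep, real_name = qname.partition(":")
    if not sep or not real_name:
        return sam_line, False
    return real_name + sam_line[len(qname):], True


def count_prefixed(lines):
    """Count how many lines carry a path prefix on the read name."""
    return sum(1 for line in lines if strip_path_prefix(line)[1])


if __name__ == "__main__":
    example = ("/home/ralonso/galaxy-dist/database/job_working_directory/"
               "000/90/task_2/dataset_91.dat:SRR098409.1113446\t4\t*\t0\t0")
    print(strip_path_prefix(example))
```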

Best regards

On 13 February 2015 at 13:39, Peter Cock <p.j.a.cock@googlemail.com> wrote:
...
On Fri, Feb 13, 2015 at 11:38 AM, Nicola Soranzo <nsoranzo@tiscali.it>
wrote:
...
On 13.02.2015 03:17 Peter Cock wrote:
...
Hi Roberto,
It looks like this is a known issue with FASTQ splitting,
https://trello.com/c/qRHLFSzd/1522-issues-with-tasked-jobs-parallelism
I originally broke it during a refactor, but it looks like the
discussion about what that method was meant to do died out
(e.g. FQTOC = FASTQ table of contents?):
https://bitbucket.org/galaxy/galaxy-central/commits/76277761807306ec2be3f1e4...
...
I'm away from the office so can't try this, but probably all
that is needed is to copy and paste the old method
get_split_commands_sequential and the old method
get_split_commands_with_toc (removed from the
base Sequence class in the above commit) into the
base Fastq class instead.

Nicola - did you fix this locally after noticing the
problem last year?

No, sorry, we disabled Galaxy parallelism because it was using
too many cluster nodes.

Nicola

I had similar comments from some of the cluster users
after getting it working here - but on balance a well used
cluster helps justify future investment in maintaining it.

Sorry about not following up on this - I think I might have
assumed you would take care of it. Unfortunately I won't
be able to test the obvious fix until at least a week later...

Peter

-- 
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es
