problems splitting


Roberto Alonso CIPF

11 Feb 2015

4:45 p.m.

Hello,

I am trying to map a fastqsanger file with bwa. My bwa tool config file is this:

    <tool id="bwa_mio" name="map with bwa">
      <description>map with bwa</description>
      <parallelism method="basic" split_size="2" split_mode="number_of_parts"></parallelism>
      <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>xx</command>
      <inputs>
        <param format="fastqsanger" name="input" type="data" label="fastq"/>
      </inputs>
      <outputs>
        <data format="sam" name="output" />
      </outputs>
      <help> bwa </help>
    </tool>

When I check the stderr, I see this error:

    type object 'Sequence' has no attribute 'get_split_commands_sequential'

It seems that this command, which I see in the log, is not working:

    galaxy.jobs.runners DEBUG 2015-02-11 16:33:48,738 (74) command is: /home/ralonso/galaxy-dist/extract_dataset_parts.sh /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0; bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0/dataset_8.dat
    ...
    /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0/dataset_75.dat

When I go directly to the code, around line 559 of the module galaxy.datatypes.sequence, I can't find the function get_split_commands_sequential anywhere. Any idea?

Thank you very much.

Regards

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es


Peter Cock

13 Feb

3:17 a.m.

Hi Roberto,

It looks like this is a known issue with FASTQ splitting:

https://trello.com/c/qRHLFSzd/1522-issues-with-tasked-jobs-parallelism

I originally broke it during a refactor, but it looks like the discussion about what that method was meant to do died out (e.g. FQTOC = FASTQ table of contents?):

https://bitbucket.org/galaxy/galaxy-central/commits/76277761807306ec2be3f1e4...

I'm away from the office so can't try this, but probably all that is needed is to copy and paste the old method get_split_commands_sequential and the old method get_split_commands_with_toc (removed from the base Sequence class in the above commit) into the base Fastq class instead.

Nicola - did you fix this locally after noticing the problem last year?

Peter

___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
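The error text in the report can be reproduced with a small stand-alone sketch. It assumes, based only on the wording of the error, that the splitting code looks the helper up on the Sequence class itself. The two classes below are empty stand-ins, not Galaxy's real datatypes, and lookup_error is a hypothetical helper added for illustration.

```python
class Sequence:
    """Stand-in for the base sequence datatype; the split helper is missing here."""


class Fastq(Sequence):
    """Stand-in for the FASTQ datatype, carrying a restored helper."""

    @staticmethod
    def get_split_commands_sequential(is_compressed, input_name, output_name,
                                      start_sequence, sequence_count):
        # Placeholder body: the real method builds shell commands.
        return []


def lookup_error(cls):
    """Return the AttributeError text if cls lacks the helper, else None."""
    try:
        cls.get_split_commands_sequential
    except AttributeError as exc:
        return str(exc)
    return None


print(lookup_error(Sequence))
print(lookup_error(Fastq))
```

A method defined on a subclass is not visible through the base class, so where the helper has to live depends on which class the calling code names.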

Nicola Soranzo

12:38 p.m.

On 13.02.2015 03:17 Peter Cock wrote:

> Nicola - did you fix this locally after noticing the problem last year?

No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.

Nicola

Peter Cock

1:39 p.m.

On Fri, Feb 13, 2015 at 11:38 AM, Nicola Soranzo <nsoranzo@tiscali.it> wrote:

> No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.

I had similar comments from some of the cluster users after getting it working here - but on balance a well used cluster helps justify future investment in maintaining it.

Sorry about not following up on this - I think I might have assumed you would take care of it. Unfortunately I won't be able to test the obvious fix for at least a week...

Peter

Roberto Alonso CIPF

24 Feb

5:43 p.m.

Hello again,

first of all thanks for your help, it has been very useful.

What I have done up to now is to copy this method to the class Sequence:

    def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
        """
        Does a brain-dead sequential scan & extract of certain sequences
        >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
        ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
        >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
        ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
        """
        start_line = start_sequence * 4
        line_count = sequence_count * 4
        # TODO: verify that tail can handle 64-bit numbers
        if is_compressed:
            cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
        else:
            cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
        cmd += ' > "%s"' % output_name
        return [cmd]
    get_split_commands_sequential = staticmethod(get_split_commands_sequential)

This is something that you suggested.

When I run the tool with this configuration:

    <tool id="bwa_mio" name="map with bwa">
      <description>map with bwa</description>
      <parallelism method="basic" split_size="3" split_mode="number_of_parts"></parallelism>
      <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
      <inputs>
        <param format="fastqsanger" name="input" type="data" label="fastq"/>
      </inputs>
      <outputs>
        <data format="sam" name="output" />
      </outputs>
      <help> bwa </help>
    </tool>

Everything finishes OK, but when I check the SAM output, I see that the alignments are prefixed with the path of the file, e.g. in example_split.sam:

    /home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0

Do you know what may be going on? If I don't split the file, everything works correctly.

Best regards

-- Roberto Alonso Functional Genomics Unit Bioinformatics and Genomics Department Prince Felipe Research Center (CIPF) C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico) 46012 Valencia, Spain Tel: +34 963289680 Ext. 1021 Fax: +34 963289574 E-Mail: ralonso@cipf.es
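The line arithmetic in the restored method can be checked without running any shell commands. The sketch below assumes, as the method itself does by multiplying by 4, that every FASTQ record occupies exactly four lines. Both function names are hypothetical and are not part of Galaxy.

```python
def shell_window(start_sequence, sequence_count):
    """Return (first line for `tail -n +K`, line count for `head -N`)."""
    start_line = start_sequence * 4   # zero-based index of the first wanted line
    line_count = sequence_count * 4   # four lines per FASTQ record
    return start_line + 1, line_count  # tail counts lines from 1


def extract_fastq_records(lines, start_sequence, sequence_count):
    """Pure-Python equivalent of the tail | head pipeline on a list of lines."""
    first, count = shell_window(start_sequence, sequence_count)
    return lines[first - 1:first - 1 + count]


records = ["@r0", "AAAA", "+", "IIII",
           "@r1", "CCCC", "+", "IIII",
           "@r2", "GGGG", "+", "IIII"]
print(shell_window(10, 10))
print(extract_fastq_records(records, 1, 1))
```

For start_sequence=10 and sequence_count=10 this gives the same numbers as the second doctest in the method (tail -n +41, head -40).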

Peter Cock

5:49 p.m.

On Tue, Feb 24, 2015 at 4:43 PM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:

> What I have done up to now is to copy this method to the class Sequence
>
> def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
>     ...
>     return [cmd]
> get_split_commands_sequential = staticmethod(get_split_commands_sequential)
>
> This is something that you suggested.

Good.

> When I run the tool with this configuration:
>
> <tool id="bwa_mio" name="map with bwa">
> ...
> <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
> ...
> </tool>

One minor improvement would be to escape the ">" as "&gt;" in your XML, or use the CDATA approach documented here:

https://wiki.galaxyproject.org/Tools/BestPractices

> Everything ends ok, but when I go to check how is the sam, I see that in the alignments it is the path of the file, i.e example_split.sam:
>
> /home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0
>
> you know what may be going on? If i don't split the file, everything goes correctly.

This sounds to me like there may be a problem with SAM merging? Could you share the entire example_split.sam file (e.g. as a gist on GitHub, or via dropbox)? Peter
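The two ways of writing the redirect in the tool XML mentioned in this reply, entity escaping and a CDATA section, can be compared with Python's standard XML parser. This is only a sketch: the tool definition is cut down to the command element, ref.fa is a placeholder path, and command_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Variant 1: ">" written as the entity &gt;
ESCAPED = """<tool id="bwa_mio" name="map with bwa">
  <command>bwa mem ref.fa $input &gt; $output 2&gt;/dev/null</command>
</tool>"""

# Variant 2: the command wrapped in a CDATA section, no escaping needed
CDATA = """<tool id="bwa_mio" name="map with bwa">
  <command><![CDATA[bwa mem ref.fa $input > $output 2>/dev/null]]></command>
</tool>"""


def command_text(tool_xml):
    """Parse a tool definition and return the text of its <command> element."""
    return ET.fromstring(tool_xml).find("command").text.strip()


print(command_text(ESCAPED))
print(command_text(CDATA))
```

Both variants parse to the same command string, so the choice between them is a matter of readability in the XML file.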

Roberto Alonso CIPF

25 Feb

11:13 a.m.

Hello,

I just changed to the CDATA format, but the problem still remains. When I split into 2 parts there is no problem, but when I go for 3, the problem described before happens. Here is the link to the sam/bam file:

https://dl.dropboxusercontent.com/u/1669701/ejemplo_split.bam

Best regards

Roberto Alonso CIPF

11:40 a.m.

Hello again,

this is something that I consider important. When I look at the log I see this output:

    galaxy.jobs.runners.tasks DEBUG 2015-02-25 11:33:30,989 execution finished - beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

I think the merge should be done with samtools. I don't know how this is programmed in Galaxy, but I didn't indicate the path to samtools anywhere. Is the problem maybe related to this?

Thanks a lot,

Regards

-- Roberto Alonso Functional Genomics Unit Bioinformatics and Genomics Department Prince Felipe Research Center (CIPF) C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico) 46012 Valencia, Spain Tel: +34 963289680 Ext. 1021 Fax: +34 963289574 E-Mail: ralonso@cipf.es

Roberto Alonso CIPF

12:31 p.m.

OK, I think I understand the line:

    beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

It refers to the original command, so everything is fine with this line. The other problem still remains.

Regards, and sorry for the confusion

-- Roberto Alonso Functional Genomics Unit Bioinformatics and Genomics Department Prince Felipe Research Center (CIPF) C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico) 46012 Valencia, Spain Tel: +34 963289680 Ext. 1021 Fax: +34 963289574 E-Mail: ralonso@cipf.es

Roberto Alonso CIPF

3:07 p.m.

Hello again :),

I have found the problem. The code that merges the files is this:

    galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )

This prepends the file name to each line written to the SAM file. Just adding "h" is enough, so it will look like this:

    galaxy/datatypes/tabular.py:484: cmd = 'egrep -hv "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )

Thanks all for your help,

best regards

-- Roberto Alonso Functional Genomics Unit Bioinformatics and Genomics Department Prince Felipe Research Center (CIPF) C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico) 46012 Valencia, Spain Tel: +34 963289680 Ext. 1021 Fax: +34 963289574 E-Mail: ralonso@cipf.es
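The fix also fits the earlier observation that two parts worked while three did not. grep only prefixes output lines with the file name when it is given more than one file, and with two parts split_files[1:] holds a single file. The sketch below builds the corrected command string and shows a pure-Python equivalent of the merge. It assumes the first part is written to the output in full, header included, before the remaining parts are appended. Both function names are hypothetical, not Galaxy's.

```python
import os
import tempfile


def build_merge_command(split_files, output_file):
    """Build the shell command that appends non-header lines of the later parts."""
    # -h suppresses the "filename:" prefix grep adds when given several files.
    return 'egrep -hv "^@" %s >> %s' % (' '.join(split_files[1:]), output_file)


def merge_sam_parts(split_files, output_file):
    """Pure-Python merge: keep the @ header lines only from the first part."""
    with open(output_file, "w") as out:
        for index, path in enumerate(split_files):
            with open(path) as part:
                for line in part:
                    if index > 0 and line.startswith("@"):
                        continue  # drop repeated headers from later parts
                    out.write(line)


# Small demonstration with three throwaway SAM-like parts.
workdir = tempfile.mkdtemp()
parts = []
for i in range(3):
    path = os.path.join(workdir, "part%d.sam" % i)
    with open(path, "w") as handle:
        handle.write("@HD\tVN:1.0\nread%d\t4\t*\n" % i)
    parts.append(path)
merged_path = os.path.join(workdir, "merged.sam")
merge_sam_parts(parts, merged_path)
with open(merged_path) as handle:
    merged = handle.read()
print(merged)
```

The pure-Python version never passes file names through grep, so the prefix problem cannot occur there at all.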

Nicola Soranzo

3:32 p.m.

Hi Roberto, I'm happy you solved your issue, thanks for sharing the solution! I'd suggest you open a pull request with the fixes at https://github.com/galaxyproject/galaxy . Cheers, Nicola Il 25.02.2015 15:07 Roberto Alonso CIPF ha scritto:

...

Hello again :), I have found the problem, the code that merge the files is this:

...

This concatenates the file name into the sam file. Just adding "h" it is enough, so it will be like

galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file ) this:

...

galaxy/datatypes/tabular.py:484: cmd = 'egrep -Hv "^@" %s >>

%s' % ( ' '.join(split_files[1:]), output_file )

...

Thanks all for your help, best regards

On 25 February 2015 at 12:31, Roberto Alonso CIPF wrote:

...
Ok, I think I understand the line: beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

...

...
it refers to the original command, so everything is fine with this line. The other problem still remains Regards, sorry for the confusion

On 25 February 2015 at 11:40, Roberto Alonso CIPF wrote:

...
Hello again, this is something that I consider important, when I see the log I see this output:

galaxy.jobs.runners.tasks DEBUG 2015-02-25 11:33:30,989 execution finished - BEGINNING MERGE: BWA MEM /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

...

...
...
I think the merge should be done with samtools. I don't know how is this programmed in Galaxy, but I didn't indicate anywhere the path to samtools, is it maybe the problem related with this? Thanks a lot,

...

...
...
Regards

On 25 February 2015 at 11:13, Roberto Alonso CIPF wrote:

...
Hello, I just changed for the CDATA format, but the problem still remains. When I split by 2, there is no problem, but when I go for 3, it happens the problem commented before. Here it is the link to the sam/bam file:

https://dl.dropboxusercontent.com/u/1669701/ejemplo_split.bam [3]

...
Best regards

...
On 24 February 2015 at 17:49, Peter Cock

wrote:

...
...
On Tue, Feb 24, 2015 at 4:43 PM, Roberto Alonso CIPF

wrote:

...
...
...
Hello again,

first of all thanks for your help, it is being very useful.

What I have done up to now is to copy this method into the class Sequence:

def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
    ...
    return [cmd]
get_split_commands_sequential = staticmethod(get_split_commands_sequential)

This is something that you suggested.

Good.

...
When I run the tool with this configuration:

map with bwa > split_mode="number_of_parts">

bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null

bwa

...
One minor improvement would be to escape the ">" as "&gt;" in your XML, or use the CDATA approach documented here:

https://wiki.galaxyproject.org/Tools/BestPractices
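
To illustrate the two options mentioned here, the following is a minimal sketch using Python's standard xml.etree.ElementTree parser. It only shows that an escaped "&gt;" and a CDATA section yield the same command text once parsed. The file name ref.fa and the helper name command_text are placeholders invented for this example, and the snippets are cut-down fragments, not a complete Galaxy tool file.

```python
import xml.etree.ElementTree as ET

# Two hypothetical <command> fragments: one escapes ">" as "&gt;",
# the other wraps the whole shell command in a CDATA section.
ESCAPED = "<command>bwa mem ref.fa $input &gt; $output 2&gt;/dev/null</command>"
CDATA = "<command><![CDATA[bwa mem ref.fa $input > $output 2>/dev/null]]></command>"


def command_text(xml_fragment):
    """Parse a <command> fragment and return its text content."""
    return ET.fromstring(xml_fragment).text


escaped_text = command_text(ESCAPED)
cdata_text = command_text(CDATA)

# Both spellings decode to the same shell command.
print(escaped_text)
print(escaped_text == cdata_text)
```

The CDATA form tends to be easier to read when a command contains several redirections, since nothing inside it needs escaping.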

...
...
Everything ends OK, but when I check the SAM file, I see that the alignments contain the path of the file, i.e. in example_split.sam:

/home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0

Do you know what may be going on?

If I don't split the file, everything goes correctly.

This sounds to me like there may be a problem with SAM merging? Could you share the entire example_split.sam file (e.g. as a gist on GitHub, or via dropbox)?

Peter

--

Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es


Peter Cock

3:38 p.m.

On Wed, Feb 25, 2015 at 2:07 PM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:

Hello again :),

I have found the problem, the code that merge the files is this:

galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )

This concatenates the file name into the sam file. Just adding "h" it is enough, so it will be like this:

galaxy/datatypes/tabular.py:484: cmd = 'egrep -hv "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )

Thanks all for your help, best regards

Well done :)

It looks like the SAM merge needs fixing then,

$ man egrep
...
-h, --no-filename
    Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search.

I filed a pull request adding the -h option to egrep, crediting you: https://github.com/galaxyproject/galaxy/pull/4

Peter
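
For readers who want to see the intended merge behaviour without relying on grep flags, here is a minimal pure-Python sketch of the same idea: copy the first part whole, then append every later part while skipping header lines that start with "@", and never prefix a file name. This is not Galaxy's implementation; the function name merge_sam_parts and its arguments are assumptions made for illustration.

```python
def merge_sam_parts(part_paths, output_path):
    """Concatenate SAM parts, keeping header lines from the first part only.

    Hypothetical helper, not part of Galaxy. It mirrors what
    `egrep -hv "^@"` does for the second and later parts: header lines
    are dropped and no file name is added in front of any record.
    """
    with open(output_path, "w") as merged:
        for index, path in enumerate(part_paths):
            with open(path) as part:
                for line in part:
                    # Lines starting with "@" are SAM header lines; only the
                    # first part is allowed to contribute them.
                    if index > 0 and line.startswith("@"):
                        continue
                    merged.write(line)
```

This also suggests why the problem only showed up with three parts: grep prefixes file names only when it is given more than one file, and with a split of 2 the merge command receives a single file after the first part.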

Roberto Alonso CIPF

4:34 p.m.

Perfect, Galaxy will also need to add the function that was deleted by merge, in galaxy/datatypes/sequence.py:206:

    def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
        """
        Does a brain-dead sequential scan & extract of certain sequences
        >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
        ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
        >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
        ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
        """
        start_line = start_sequence * 4
        line_count = sequence_count * 4
        # TODO: verify that tail can handle 64-bit numbers
        if is_compressed:
            cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
        else:
            cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
        cmd += ' > "%s"' % output_name
        return [cmd]
    get_split_commands_sequential = staticmethod(get_split_commands_sequential)

Best regards
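
The arithmetic behind get_split_commands_sequential can be sketched in plain Python. The tail/head pipeline selects a window of lines, on the assumption that every FASTQ record occupies exactly four lines. The helper below, slice_fastq_records, is a hypothetical name used only for this illustration and is not part of Galaxy. It applies the same start_line and line_count calculation to an in-memory sequence of lines.

```python
from itertools import islice

# Assumption: unwrapped FASTQ, i.e. exactly four lines per record.
LINES_PER_RECORD = 4


def slice_fastq_records(lines, start_sequence, sequence_count):
    """Return the lines for `sequence_count` records starting at record
    `start_sequence` (0-based).

    Equivalent in spirit to `tail -n +(start_line + 1) | head -line_count`.
    """
    start_line = start_sequence * LINES_PER_RECORD
    line_count = sequence_count * LINES_PER_RECORD
    return list(islice(lines, start_line, start_line + line_count))
```

With start_sequence=10 and sequence_count=10 the window starts at line 41 (1-based) and spans 40 lines, which matches the `tail -n +41 ... | head -40` command shown in the function's doctest.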

Peter Cock

4:38 p.m.

On Wed, Feb 25, 2015 at 3:34 PM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:

Perfect, Galaxy will also need to add the function that was deleted by merge, in galaxy/datatypes/sequence.py

Yes - if you want to do a pull request with that, please go ahead. Otherwise I hope to do it later this week...

Your egrep fix has been applied to the main repository now: https://github.com/galaxyproject/galaxy/pull/4

Peter

Roberto Alonso CIPF

4:40 p.m.

Perfect, I will do the pull request. Thanks!!


participants (3)

Nicola Soranzo
Peter Cock
Roberto Alonso CIPF
