Hello,

I am trying to map a fastqsanger file with bwa. My bwa tool config file is this:

<tool id="bwa_mio" name="map with bwa">
  <description>map with bwa</description>
  <parallelism method="basic" split_size="2" split_mode="number_of_parts"></parallelism>
  <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>xx</command>
  <inputs>
    <param format="fastqsanger" name="input" type="data" label="fastq"/>
  </inputs>
  <outputs>
    <data format="sam" name="output" />
  </outputs>
  <help> bwa </help>
</tool>

When I look at the stderr, I see this error:

type object 'Sequence' has no attribute 'get_split_commands_sequential'

It seems that this command, which I see in the log, is not working:

galaxy.jobs.runners DEBUG 2015-02-11 16:33:48,738 (74) command is: /home/ralonso/galaxy-dist/extract_dataset_parts.sh /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0; bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0/dataset_8.dat
/home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0/dataset_75.dat

When I go directly to the code, around line 559 of galaxy.datatypes.sequence, I can't find the function get_split_commands_sequential anywhere. Any idea?

Thank you very much.

Regards

--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es
Hi Roberto,

It looks like this is a known issue with FASTQ splitting:
https://trello.com/c/qRHLFSzd/1522-issues-with-tasked-jobs-parallelism

I originally broke it during a refactor, but it looks like the discussion died about what that method was meant to do (e.g. FQTOC = FASTQ table of contents?):
https://bitbucket.org/galaxy/galaxy-central/commits/76277761807306ec2be3f1e4...

I'm away from the office so can't try this, but probably all that is needed is to copy and paste the old methods get_split_commands_sequential and get_split_commands_with_toc (removed from the base Sequence class in the above commit) into the base Fastq class instead.

Nicola - did you fix this locally after noticing the problem last year?

Peter
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
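For readers following the thread, the error message can be reproduced in isolation. The sketch below is not Galaxy code: `Sequence` and `Fastq` are empty stand-ins named after the classes mentioned above, the method body is a placeholder, and `describe_lookup` is a hypothetical helper. It only illustrates that a static method defined on a subclass is not visible through the base class, which is the kind of lookup failure the error message reports.

```python
class Sequence(object):
    """Stand-in for the base datatype; the helper is missing here."""


class Fastq(Sequence):
    """Stand-in subclass that carries the helper as a static method."""

    @staticmethod
    def get_split_commands_sequential(is_compressed, input_name, output_name,
                                      start_sequence, sequence_count):
        # Placeholder body: the real method builds shell commands.
        return []


def describe_lookup(cls):
    """Hypothetical helper: report whether the attribute lookup succeeds."""
    try:
        cls.get_split_commands_sequential
    except AttributeError as exc:
        return str(exc)
    return "found"


base_result = describe_lookup(Sequence)  # lookup through the base class fails
sub_result = describe_lookup(Fastq)      # lookup through the subclass succeeds
print(base_result)
print(sub_result)
```

If the caller refers to `Sequence` explicitly, the method has to exist on `Sequence` itself, or the call has to go through the subclass. Which of the two applies in Galaxy is not confirmed here.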
On 13.02.2015 03:17, Peter Cock wrote:
> Nicola - did you fix this locally after noticing the problem last year?

No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.

Nicola
On Fri, Feb 13, 2015 at 11:38 AM, Nicola Soranzo <nsoranzo@tiscali.it> wrote:
> No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.

I had similar comments from some of the cluster users after getting it working here - but on balance a well-used cluster helps justify future investment in maintaining it.

Sorry about not following up on this - I think I might have assumed you would take care of it. Unfortunately I won't be able to test the obvious fix until at least a week later...

Peter
Hello again,

First of all, thanks for your help, it has been very useful.

What I have done up to now is to copy this method to the class Sequence:

    def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
        """
        Does a brain-dead sequential scan & extract of certain sequences
        >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
        ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
        >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
        ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
        """
        start_line = start_sequence * 4
        line_count = sequence_count * 4
        # TODO: verify that tail can handle 64-bit numbers
        if is_compressed:
            cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
        else:
            cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
        cmd += ' > "%s"' % output_name
        return [cmd]
    get_split_commands_sequential = staticmethod(get_split_commands_sequential)

This is something that you suggested.

When I run the tool with this configuration:

<tool id="bwa_mio" name="map with bwa">
  <description>map with bwa</description>
  <parallelism method="basic" split_size="3" split_mode="number_of_parts"></parallelism>
  <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
  <inputs>
    <param format="fastqsanger" name="input" type="data" label="fastq"/>
  </inputs>
  <outputs>
    <data format="sam" name="output" />
  </outputs>
  <help> bwa </help>
</tool>

everything finishes OK, but when I check the SAM file, I see that the alignments start with the path of the file, e.g. in example_split.sam:

/home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0

Do you know what may be going on? If I don't split the file, everything works correctly.

Best regards

--
Roberto Alonso
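The line arithmetic in get_split_commands_sequential (four lines per FASTQ record, so record 10 starts at line 41) can be checked without a shell. The following is a minimal sketch, not Galaxy code: `extract_fastq_records` is a hypothetical helper, the read names are invented, and it assumes unwrapped four-line FASTQ records, as the original method does.

```python
from itertools import islice

LINES_PER_RECORD = 4  # assumption: unwrapped FASTQ, four lines per read


def extract_fastq_records(lines, start_sequence, sequence_count):
    """Return the lines of sequence_count records, starting at start_sequence.

    Mirrors the 'tail -n +(start_line + 1) | head -line_count' arithmetic.
    """
    start_line = start_sequence * LINES_PER_RECORD
    line_count = sequence_count * LINES_PER_RECORD
    return list(islice(lines, start_line, start_line + line_count))


# Five invented reads, four lines each.
reads = []
for i in range(5):
    reads += ["@read%d" % i, "ACGT", "+", "IIII"]

# Take two records starting at record index 2 (zero-based).
part = extract_fastq_records(reads, start_sequence=2, sequence_count=2)
print(part)
```

The slice starts at zero-based line 8, which corresponds to `tail -n +9` in the shell version, since `tail -n +N` counts lines from 1.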
On Tue, Feb 24, 2015 at 4:43 PM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:
> What I have done up to now is to copy this method to the class Sequence
>
> def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
>     ...
>     return [cmd]
> get_split_commands_sequential = staticmethod(get_split_commands_sequential)
>
> This is something that you suggested.

Good.

> When I run the tool with this configuration:
> [...]
> <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
> [...]

One minor improvement would be to escape the ">" as "&gt;" in your XML, or use the CDATA approach documented here:
https://wiki.galaxyproject.org/Tools/BestPractices

> Everything ends ok, but when I go to check how is the sam, I see that in the alignments it is the path of the file
> [...]
> If I don't split the file, everything goes correctly.

This sounds to me like there may be a problem with SAM merging? Could you share the entire example_split.sam file (e.g. as a gist on GitHub, or via Dropbox)?

Peter
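To illustrate the escaping advice, here is a small sketch using Python's standard xml.etree.ElementTree parser rather than Galaxy's own tool loader, so it only shows how the XML itself is read. The reference name ref.fa is a placeholder. All three spellings of the command yield the same text, which is consistent with the change being a readability improvement rather than a fix.

```python
import xml.etree.ElementTree as ET

# Three ways of writing the same <command>; ref.fa is a placeholder name.
raw = '<command>bwa mem ref.fa $input > $output 2>/dev/null</command>'
escaped = '<command>bwa mem ref.fa $input &gt; $output 2&gt;/dev/null</command>'
cdata = '<command><![CDATA[bwa mem ref.fa $input > $output 2>/dev/null]]></command>'

# Parse each variant and read back the element text.
raw_text = ET.fromstring(raw).text
escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

print(escaped_text)
print(raw_text == escaped_text == cdata_text)
```

CDATA becomes more useful once a command contains "<" or "&", which cannot appear unescaped in XML text.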
Hello,

I have just changed to the CDATA format, but the problem remains. When I split into 2 parts there is no problem, but when I split into 3, the problem described before appears. Here is the link to the SAM/BAM file:
https://dl.dropboxusercontent.com/u/1669701/ejemplo_split.bam

Best regards

--
Roberto Alonso
Hello again,

There is something that I consider important. When I look at the log, I see this output:

galaxy.jobs.runners.tasks DEBUG 2015-02-25 11:33:30,989 execution finished - beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

I think the merge should be done with samtools. I don't know how this is programmed in Galaxy, but I didn't indicate the path to samtools anywhere. Could the problem be related to this?

Thanks a lot,

Regards

--
Roberto Alonso
OK, I think I understand the line:

beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null

It refers to the original command, so everything is fine with this line. The other problem still remains.

Regards, and sorry for the confusion

--
Roberto Alonso
Hello again :),

I have found the problem. The code that merges the files is this:

galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )

This writes the file name into the SAM file. Adding "h" is enough, so it becomes:

galaxy/datatypes/tabular.py:484: cmd = 'egrep -hv "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )

Thanks all for your help, best regards

--
Roberto Alonso
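Some background on why the prefix appeared only with three parts: grep prints the file name before each matching line when it is given more than one file, and -h suppresses that. With two parts, split_files[1:] holds a single file, so no prefix is printed. The same merge can also be sketched in pure Python, which avoids the issue altogether. This is an illustration only, not the code in tabular.py: `merge_sam_parts` is a hypothetical helper, the file contents are invented, and it assumes the header of the first part is the one to keep.

```python
import os
import tempfile


def merge_sam_parts(split_files, output_file):
    """Hypothetical helper: concatenate SAM parts, keeping one header.

    The first part is copied whole; '@' header lines are dropped from
    the remaining parts, like 'egrep -hv "^@"' applied to the rest.
    """
    with open(output_file, "w") as out:
        for index, path in enumerate(split_files):
            with open(path) as part:
                for line in part:
                    if index > 0 and line.startswith("@"):
                        continue  # skip repeated header lines
                    out.write(line)


# Build three tiny invented SAM parts, each with its own header line.
tmpdir = tempfile.mkdtemp()
parts = []
for i in range(3):
    path = os.path.join(tmpdir, "part_%d.sam" % i)
    with open(path, "w") as handle:
        handle.write("@SQ\tSN:scaffold_1\tLN:100\n")
        handle.write("read%d\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n" % i)
    parts.append(path)

merged_path = os.path.join(tmpdir, "merged.sam")
merge_sam_parts(parts, merged_path)

with open(merged_path) as handle:
    merged_lines = handle.read().splitlines()
print([line.split("\t")[0] for line in merged_lines])
```

Every read name in the merged file starts the line directly, with no file path in front of it.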
Hi Roberto, I'm happy you solved your issue, thanks for sharing the solution! I'd suggest you open a pull request with the fixes at https://github.com/galaxyproject/galaxy . Cheers, Nicola Il 25.02.2015 15:07 Roberto Alonso CIPF ha scritto:
Hello again :), I have found the problem, the code that merge the files is this:
This concatenates the file name into the sam file. Just adding "h" it is enough, so it will be like
galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file ) this:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -Hv "^@" %s >>
%s' % ( ' '.join(split_files[1:]), output_file )
Thanks all for your help, best regards
On 25 February 2015 at 12:31, Roberto Alonso CIPF wrote:
Ok, I think I understand the line: beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null
it refers to the original command, so everything is fine with this line. The other problem still remains Regards, sorry for the confusion
On 25 February 2015 at 11:40, Roberto Alonso CIPF wrote:
Hello again, this is something that I consider important, when I see the log I see this output:
galaxy.jobs.runners.tasks DEBUG 2015-02-25 11:33:30,989 execution finished - BEGINNING MERGE: BWA MEM /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null
I think the merge should be done with samtools. I don't know how is this programmed in Galaxy, but I didn't indicate anywhere the path to samtools, is it maybe the problem related with this? Thanks a lot,
Regards
On 25 February 2015 at 11:13, Roberto Alonso CIPF wrote:
Hello, I just changed for the CDATA format, but the problem still remains. When I split by 2, there is no problem, but when I go for 3, it happens the problem commented before. Here it is the link to the sam/bam file:
https://dl.dropboxusercontent.com/u/1669701/ejemplo_split.bam [3]
Best regards
On 24 February 2015 at 17:49, Peter Cock
wrote:
On Tue, Feb 24, 2015 at 4:43 PM, Roberto Alonso CIPF
wrote:
Hello again,
first of all, thanks for your help; it has been very useful.
What I have done so far is copy this method into the class Sequence:
def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
    ...
    return [cmd]
get_split_commands_sequential = staticmethod(get_split_commands_sequential)
This is something that you suggested.
Good.
When I run the tool with this configuration:
[tool XML, markup lost in the archive: name "map with bwa", parallelism split_mode="number_of_parts", help text "bwa"; the command is:]
bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null
One minor improvement would be to escape the ">" as "&gt;" in your XML, or use the CDATA approach documented here:
https://wiki.galaxyproject.org/Tools/BestPractices
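To illustrate the two options, here is a small Python sketch showing that an XML parser yields the same command text whether ">" is written as "&gt;" or wrapped in CDATA. The shortened reference name ref.fa and the helper command_text are placeholders for illustration only, not part of Galaxy.

```python
import xml.etree.ElementTree as ET

# Two ways of writing the same <command> body in a tool XML file.
# "ref.fa" is a placeholder for the real reference path.
ESCAPED = '<command>bwa mem ref.fa $input &gt; $output 2&gt;/dev/null</command>'
CDATA = '<command><![CDATA[bwa mem ref.fa $input > $output 2>/dev/null]]></command>'


def command_text(xml_snippet):
    """Parse a <command> element and return its text content."""
    return ET.fromstring(xml_snippet).text.strip()


if __name__ == '__main__':
    print(command_text(ESCAPED))
    print(command_text(CDATA))
```

Both forms parse to the same shell command; CDATA just avoids having to escape each ">" by hand.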
Everything finishes OK, but when I check the SAM file, I see that the alignments contain the path of the file, i.e.
example_split.sam:
/home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0
Do you know what may be going on?
If I don't split the file, everything works correctly.
This sounds to me like there may be a problem with SAM merging? Could you share the entire example_split.sam file (e.g. as a gist on GitHub, or via dropbox)?
Peter
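As a stopgap for a SAM file that has already been merged with the path stuck in front of the read name, as in the record quoted above, something like the following Python sketch could strip it again. This is only a workaround under stated assumptions: the prefix is an absolute path starting with "/" that contains no ":" itself, and strip_filename_prefix is a hypothetical helper, not Galaxy code.

```python
def strip_filename_prefix(sam_line):
    """Remove a leading 'path:' prefix from one SAM alignment line.

    Assumption: the prefix is an absolute path (starts with '/') that has
    no ':' in it. Header lines (starting with '@') and records that do not
    start with '/' are returned unchanged.
    """
    if sam_line.startswith('@') or not sam_line.startswith('/'):
        return sam_line
    prefix, sep, rest = sam_line.partition(':')
    return rest if sep else sam_line


if __name__ == '__main__':
    broken = '/tmp/task_2/dataset_91.dat:SRR098409.1113446\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t####\n'
    print(strip_filename_prefix(broken), end='')
```

Read names that legitimately begin with "/" would be mangled by this, so it is safer to fix the merge itself and rerun the job.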
On Wed, Feb 25, 2015 at 2:07 PM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:
Hello again :),
I have found the problem, the code that merge the files is this:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )
This concatenates the file name into the sam file. Just adding "h" it is enough, so it will be like this:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -hv "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )
Thanks all for your help, best regards
Well done :)
It looks like the SAM merge needs fixing then,
$ man egrep
...
-h, --no-filename
    Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search.
I filed a pull request adding the -h option to egrep, crediting you:
https://github.com/galaxyproject/galaxy/pull/4
Peter
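The merge step discussed above can be sketched in Python as follows. The command template is the one quoted from tabular.py with -h added; the function names build_merge_command and merge_sam_parts are hypothetical, and the pure-Python variant assumes the first part is kept whole (headers included) while later parts only contribute their non-header lines. The man page excerpt also suggests why a two-part split looked fine: split_files[1:] then holds a single file, and egrep does not prefix file names when it searches only one file.

```python
def build_merge_command(split_files, output_file):
    """Build the shell command from the thread, with the -h fix applied."""
    return 'egrep -hv "^@" %s >> %s' % (' '.join(split_files[1:]), output_file)


def merge_sam_parts(split_files, output_file):
    """Pure-Python equivalent of the merge (an assumption-based sketch).

    The first part is copied whole; for the remaining parts, lines that
    start with '@' (SAM headers) are skipped.
    """
    with open(output_file, 'w') as out:
        for index, part in enumerate(split_files):
            with open(part) as handle:
                for line in handle:
                    if index > 0 and line.startswith('@'):
                        continue
                    out.write(line)


if __name__ == '__main__':
    print(build_merge_command(['part0.sam', 'part1.sam', 'part2.sam'], 'merged.sam'))
```

Doing the merge in Python sidesteps the file-name prefix question entirely, at the cost of reading every part line by line.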
Perfect, Galaxy will also need to add the function that was deleted by merge, in galaxy/datatypes/sequence.py:206:
def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
    """
    Does a brain-dead sequential scan & extract of certain sequences
    >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
    ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
    >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
    ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
    """
    start_line = start_sequence * 4
    line_count = sequence_count * 4
    # TODO: verify that tail can handle 64-bit numbers
    if is_compressed:
        cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
    else:
        cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
    cmd += ' > "%s"' % output_name
    return [cmd]
get_split_commands_sequential = staticmethod(get_split_commands_sequential)
Best regards
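For readers who want to see what the tail/head pipeline in get_split_commands_sequential selects, here is a rough pure-Python equivalent for the uncompressed case. It assumes strict four-line FASTQ records, as the original function does; extract_fastq_records is a hypothetical name and not part of Galaxy.

```python
from itertools import islice


def extract_fastq_records(input_name, output_name, start_sequence, sequence_count):
    """Copy sequence_count records, starting at record start_sequence (0-based).

    Mirrors 'tail -n +(start_line+1) | head -line_count' for an
    uncompressed FASTQ file with exactly four lines per record.
    """
    start_line = start_sequence * 4
    line_count = sequence_count * 4
    with open(input_name) as source, open(output_name, 'w') as target:
        target.writelines(islice(source, start_line, start_line + line_count))
```

For example, start_sequence=10 and sequence_count=10 copies lines 41 to 80, matching the "tail -n +41 ... | head -40" doctest in the quoted function.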
On Wed, Feb 25, 2015 at 3:34 PM, Roberto Alonso CIPF <ralonso@cipf.es> wrote:
Perfect, Galaxy will also need to add the function that was deleted by merge, in galaxy/datatypes/sequence.py
Yes - if you want to do a pull request with that, please go ahead. Otherwise I hope to do it later this week...
Your egrep fix has been applied to the main repository now:
https://github.com/galaxyproject/galaxy/pull/4
Peter
Perfect, I will do the pull request. Thanks!!
participants (3)
- Nicola Soranzo
- Peter Cock
- Roberto Alonso CIPF