Hello,
I am trying to map a fastqsanger file with bwa. My bwa tool config file is this:
<tool id="bwa_mio" name="map with bwa">
  <description>map with bwa</description>
  <parallelism method="basic" split_size="2" split_mode="number_of_parts"></parallelism>
  <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>xx</command>
  <inputs>
    <param format="fastqsanger" name="input" type="data" label="fastq"/>
  </inputs>
  <outputs>
    <data format="sam" name="output" />
  </outputs>
  <help> bwa </help>
</tool>
When I look at the stderr, I see this error:
type object 'Sequence' has no attribute 'get_split_commands_sequential'
It seems that this command, which I see in the log, is not working:
galaxy.jobs.runners DEBUG 2015-02-11 16:33:48,738 (74) command is: /home/ralonso/galaxy-dist/extract_dataset_parts.sh /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0; bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0/dataset_8.dat
/home/ralonso/galaxy-dist/database/job_working_directory/000/74/task_0/dataset_75.dat
When I look at the code itself, around line 559 of galaxy.datatypes.sequence, I can't find the function get_split_commands_sequential anywhere. Any ideas?
Thank you very much
Regards
Hi Roberto,
It looks like this is a known issue with FASTQ splitting, https://trello.com/c/qRHLFSzd/1522-issues-with-tasked-jobs-parallelism
I originally broke it during a refactor, but it looks like the discussion about what that method was meant to do died out (e.g. FQTOC = FASTQ table of contents?):
https://bitbucket.org/galaxy/galaxy-central/commits/76277761807306ec2be3f1e4...
I'm away from the office so can't try this, but probably all that is needed is to copy and paste the old method get_split_commands_sequential and the old method get_split_commands_with_toc (removed from the base Sequence class in the above commit) into the base Fastq class instead.
Nicola - did you fix this locally after noticing the problem last year?
Peter
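The error message in the original report is what Python raises when code looks up a method on a class that no longer defines it. A minimal sketch of that situation follows; the two classes are hypothetical stand-ins, not Galaxy's actual datatype code, and `split_command` is an invented name used only for illustration.

```python
class Sequence(object):
    """Stand-in for the base datatype after the refactor: the helper is gone."""


class Fastq(Sequence):
    """Stand-in for a FASTQ datatype that still expects the helper on Sequence."""

    @classmethod
    def split_command(cls, input_name, output_name):
        # The attribute lookup on Sequence fails before any call happens,
        # because the base class no longer defines the method.
        return Sequence.get_split_commands_sequential(
            False, input_name, output_name, 0, 10)


def reproduce_error():
    """Return the text of the AttributeError raised by the stale call."""
    try:
        Fastq.split_command("in.fastq", "out.fastq")
    except AttributeError as exc:
        return str(exc)
    return None


print(reproduce_error())
```

Restoring the method on whichever class the caller actually references makes the lookup succeed again, which is the fix suggested above.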
On Wed, Feb 11, 2015 at 3:45 PM, Roberto Alonso CIPF ralonso@cipf.es wrote:
-- Roberto Alonso Functional Genomics Unit Bioinformatics and Genomics Department Prince Felipe Research Center (CIPF) C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico) 46012 Valencia, Spain Tel: +34 963289680 Ext. 1021 Fax: +34 963289574 E-Mail: ralonso@cipf.es
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
On 13.02.2015 03:17, Peter Cock wrote:
Nicola - did you fix this locally after noticing the problem last year?
No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.
Nicola
On Fri, Feb 13, 2015 at 11:38 AM, Nicola Soranzo nsoranzo@tiscali.it wrote:
No, sorry, we disabled Galaxy parallelism because it was using too many cluster nodes.
Nicola
I had similar comments from some of the cluster users after getting it working here, but on balance a well-used cluster helps justify future investment in maintaining it.
Sorry about not following up on this. I think I might have assumed you would take care of it. Unfortunately I won't be able to test the obvious fix for at least another week...
Peter
Hello again,
First of all, thanks for your help; it has been very useful.
What I have done so far is copy this method into the Sequence class:
def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
    """
    Does a brain-dead sequential scan & extract of certain sequences
    >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
    ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
    >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
    ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
    """
    start_line = start_sequence * 4
    line_count = sequence_count * 4
    # TODO: verify that tail can handle 64-bit numbers
    if is_compressed:
        cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
    else:
        cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
    cmd += ' > "%s"' % output_name

    return [cmd]
get_split_commands_sequential = staticmethod(get_split_commands_sequential)
This is something that you suggested. When I run the tool with this configuration:
<tool id="bwa_mio" name="map with bwa">
  <description>map with bwa</description>
  <parallelism method="basic" split_size="3" split_mode="number_of_parts"></parallelism>
  <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
  <inputs>
    <param format="fastqsanger" name="input" type="data" label="fastq"/>
  </inputs>
  <outputs>
    <data format="sam" name="output" />
  </outputs>
  <help> bwa </help>
</tool>
everything finishes OK, but when I check the SAM output, I see that the alignments are prefixed with the path of the file, e.g. in example_split.sam:
/home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0
Do you know what may be going on? If I don't split the file, everything works correctly.
Best regards
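The method quoted above turns a record range into `tail`/`head` line numbers, using the fact that each FASTQ record occupies four lines. The sketch below copies that arithmetic and pairs it with `plan_parts`, a hypothetical helper (not part of Galaxy) that divides a known number of records into roughly equal parts. How Galaxy itself chooses the ranges is not shown in this thread, so that part is an assumption.

```python
def get_split_commands_sequential(is_compressed, input_name, output_name,
                                  start_sequence, sequence_count):
    # Same arithmetic as the method quoted in the thread: 4 lines per record.
    start_line = start_sequence * 4
    line_count = sequence_count * 4
    if is_compressed:
        cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (
            input_name, start_line + 1, line_count)
    else:
        cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (
            start_line + 1, input_name, line_count)
    cmd += ' > "%s"' % output_name
    return [cmd]


def plan_parts(total_sequences, number_of_parts):
    """Hypothetical helper: yield (start_sequence, sequence_count) per part."""
    base, extra = divmod(total_sequences, number_of_parts)
    start = 0
    for part in range(number_of_parts):
        # The first `extra` parts take one additional record each.
        count = base + (1 if part < extra else 0)
        yield start, count
        start += count


commands = []
for index, (start, count) in enumerate(plan_parts(10, 3)):
    commands.extend(get_split_commands_sequential(
        False, "input.fastq", "part_%d.fastq" % index, start, count))

for cmd in commands:
    print(cmd)
```

For 10 records in 3 parts, the planned ranges are 4, 3 and 3 records, so the second command starts at line 17 and keeps 12 lines.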
On Tue, Feb 24, 2015 at 4:43 PM, Roberto Alonso CIPF ralonso@cipf.es wrote:
Hello again,
first of all thanks for your help, it is being very useful.
What I have done up to now is to copy this method to the class Sequence
def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count): ... return [cmd] get_split_commands_sequential = staticmethod(get_split_commands_sequential)
This is something that you suggested.
Good.
When I run the tool with this configuration:
<tool id="bwa_mio" name="map with bwa">
  <description>map with bwa</description>
  <parallelism method="basic" split_size="3" split_mode="number_of_parts"></parallelism>
  <command> bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa $input > $output 2>/dev/null</command>
  <inputs>
    <param format="fastqsanger" name="input" type="data" label="fastq"/>
  </inputs>
  <outputs>
    <data format="sam" name="output" />
  </outputs>
  <help> bwa </help>
</tool>
One minor improvement would be to escape the ">" as "&gt;" in your XML, or use the CDATA approach documented here:
https://wiki.galaxyproject.org/Tools/BestPractices
Everything finishes OK, but when I check the SAM output, I see that the alignments are prefixed with the path of the file, e.g. in example_split.sam:
/home/ralonso/galaxy-dist/database/job_working_directory/000/90/task_2/dataset_91.dat:SRR098409.1113446 4 * 0 0 * * 0 0 TCTGGGTGAGGGAGTGGGGAGTGGGTTTTTGAGGGTGTGTGAGGATGTGTAAGTGGATGGAAGTAGATTGAATGTT ############################################################################ AS:i:0 XS:i:0
Do you know what may be going on? If I don't split the file, everything works correctly.
This sounds to me like there may be a problem with SAM merging? Could you share the entire example_split.sam file (e.g. as a gist on GitHub, or via dropbox)?
Peter
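The two options mentioned for the command text, escaping ">" as an entity or wrapping the command in CDATA, give an XML parser the same string. A small check with Python's standard library is sketched below; the shortened `ref.fa` path is a placeholder, and this only illustrates XML parsing, not how Galaxy loads tool files.

```python
import xml.etree.ElementTree as ET

# Variant 1: the redirections are written with the &gt; entity.
escaped = ('<command>bwa mem ref.fa $input &gt; $output '
           '2&gt;/dev/null</command>')

# Variant 2: the same command wrapped in a CDATA section, left unescaped.
cdata = ('<command><![CDATA[bwa mem ref.fa $input > $output '
         '2>/dev/null]]></command>')

escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

print(escaped_text)
print(escaped_text == cdata_text)
```

Either form is a matter of readability; CDATA tends to be easier to maintain when a command contains many shell metacharacters.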
Hello,
I just changed to the CDATA format, but the problem remains. When I split into 2 parts there is no problem, but when I split into 3, the problem described before occurs. Here is the link to the SAM/BAM file: https://dl.dropboxusercontent.com/u/1669701/ejemplo_split.bam
Best regards
Hello again,
There is something that I consider important. In the log I see this output:
galaxy.jobs.runners.tasks DEBUG 2015-02-25 11:33:30,989 execution finished - beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null
I think the merge should be done with samtools. I don't know how this is implemented in Galaxy, but I didn't specify the path to samtools anywhere. Could the problem be related to this?
Thanks a lot,
Regards
OK, I think I understand the line:
beginning merge: bwa mem /home/ralonso/BiB/Galaxy/data/Cclementina_v1.0_scaffolds.fa /home/ralonso/galaxy-dist/database/files/000/dataset_8.dat > /home/ralonso/galaxy-dist/database/files/000/dataset_94.dat 2> /dev/null
It refers to the original command, so everything is fine with this line. The other problem still remains.
Regards, and sorry for the confusion
Hello again :),
I have found the problem. The code that merges the files is this:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )
This writes the file name into the SAM file. Adding "h" is enough, so it becomes:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -hv "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )
Thanks all for your help, best regards
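This also accounts for the two-part case working: with two parts, `split_files[1:]` holds a single file, and grep does not prefix file names when it searches only one file. With three parts it searches two files and prefixes every line unless `-h` is given. The sketch below is a pure-Python equivalent of what the merge is meant to produce, assuming the first part is copied whole and only non-header lines are appended from the rest. `merge_sam_parts` is a hypothetical name, not Galaxy's implementation, and the SAM records are made-up test data.

```python
import os
import tempfile


def merge_sam_parts(split_files, output_file):
    """Copy the first part whole, then append non-header lines of the rest."""
    with open(output_file, "w") as out:
        with open(split_files[0]) as first:
            out.write(first.read())
        for path in split_files[1:]:
            with open(path) as part:
                for line in part:
                    # Header lines start with "@"; skip them in later parts.
                    if not line.startswith("@"):
                        out.write(line)


# Build three tiny SAM parts, each with one header line and one read.
workdir = tempfile.mkdtemp()
parts = []
for index in range(3):
    path = os.path.join(workdir, "part_%d.sam" % index)
    with open(path, "w") as handle:
        handle.write("@SQ\tSN:scaffold_1\tLN:100\n")
        handle.write("read_%d\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t####\n" % index)
    parts.append(path)

merged = os.path.join(workdir, "merged.sam")
merge_sam_parts(parts, merged)

with open(merged) as handle:
    merged_lines = handle.read().splitlines()
print(merged_lines)
```

The merged file keeps one header and three reads, and no line carries a file path in front of the read name.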
Hi Roberto, I'm happy you solved your issue, thanks for sharing the solution! I'd suggest you open a pull request with the fixes at https://github.com/galaxyproject/galaxy .
Cheers, Nicola
On Wed, Feb 25, 2015 at 2:07 PM, Roberto Alonso CIPF ralonso@cipf.es wrote:
Hello again :),
I have found the problem. The code that merges the files is this:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -v "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )
This writes the file name into the SAM file. Adding "h" is enough, so it becomes:
galaxy/datatypes/tabular.py:484: cmd = 'egrep -hv "^@" %s >> %s' % ( ' '.join(split_files[1:]), output_file )
Thanks all for your help, best regards
Well done :)
It looks like the SAM merge needs fixing then,
$ man egrep
...
-h, --no-filename
    Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search.
I filed a pull request adding the -h option to egrep, crediting you: https://github.com/galaxyproject/galaxy/pull/4
Peter
Perfect. Galaxy will also need to add back the function that was deleted in the merge, in galaxy/datatypes/sequence.py:206:
def get_split_commands_sequential(is_compressed, input_name, output_name, start_sequence, sequence_count):
    """
    Does a brain-dead sequential scan & extract of certain sequences
    >>> Sequence.get_split_commands_sequential(True, './input.gz', './output.gz', start_sequence=0, sequence_count=10)
    ['zcat "./input.gz" | ( tail -n +1 2> /dev/null) | head -40 | gzip -c > "./output.gz"']
    >>> Sequence.get_split_commands_sequential(False, './input.fastq', './output.fastq', start_sequence=10, sequence_count=10)
    ['tail -n +41 "./input.fastq" 2> /dev/null | head -40 > "./output.fastq"']
    """
    start_line = start_sequence * 4
    line_count = sequence_count * 4
    # TODO: verify that tail can handle 64-bit numbers
    if is_compressed:
        cmd = 'zcat "%s" | ( tail -n +%s 2> /dev/null) | head -%s | gzip -c' % (input_name, start_line+1, line_count)
    else:
        cmd = 'tail -n +%s "%s" 2> /dev/null | head -%s' % (start_line+1, input_name, line_count)
    cmd += ' > "%s"' % output_name

    return [cmd]
get_split_commands_sequential = staticmethod(get_split_commands_sequential)
Best regards
On Wed, Feb 25, 2015 at 3:34 PM, Roberto Alonso CIPF ralonso@cipf.es wrote:
Perfect, Galaxy will also need to add the function that was deleted by merge, in galaxy/datatypes/sequence.py
Yes - if you want to do a pull request with that, please go ahead. Otherwise I hope to do it later this week...
Your egrep fix has been applied to the main repository now: https://github.com/galaxyproject/galaxy/pull/4
Peter
Perfect, I will do the pull request.
Thanks!!
galaxy-dev@lists.galaxyproject.org