samtools sam-to-bam problem
I've performed an alignment using BWA on a file of paired-end illumina reads. The SAM file looks fine, and contains header information. I'm converting it to BAM using the sam to bam converter, however it consistently errors out after running for a while. The error is: "Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), " but no error is provided. Looking at the sam_to_bam.py on line 156 is where the error is thrown. Nothing is in e (I think). BTW - If I run the samtools command from the shell by hand, the BAM file is created properly. I do see information on stderr: $ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o /tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat [samopen] SAM header is present: 25 sequences. I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit. What do I do? Ryan
Any ideas why I would get this? If I run the sam_to_bam python script from the shell, I get the same error: (galaxy_env)[galaxy@vail pbs]$ sh 471.sh Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8 6_64 x86_64 x86_64 GNU/Linux Samtools Version: 0.1.14 (r933:170) Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), However running the samtools command works fine On 4/5/11 5:58 PM, Ryan Golhar wrote:
I've performed an alignment using BWA on a file of paired-end illumina reads. The SAM file looks fine, and contains header information. I'm converting it to BAM using the sam to bam converter, however it consistently errors out after running for a while. The error is:
"Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), "
but no error is provided. Looking at the sam_to_bam.py on line 156 is where the error is thrown. Nothing is in e (I think).
BTW - If I run the samtools command from the shell by hand, the BAM file is created properly. I do see information on stderr:
$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o /tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat [samopen] SAM header is present: 25 sequences.
I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.
What do I do?
Ryan ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
So it looks like I can get small sam files converted to bam files, but not large sam files (~50GB-80GB). I'm still trying to debug this, but not sure what's going on. Has anyone else run into anything like this? On 4/6/11 10:08 AM, Ryan Golhar wrote:
Any ideas why I would get this? If I run the sam_to_bam python script from the shell, I get the same error:
(galaxy_env)[galaxy@vail pbs]$ sh 471.sh Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8 6_64 x86_64 x86_64 GNU/Linux Samtools Version: 0.1.14 (r933:170) Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
However running the samtools command works fine
On 4/5/11 5:58 PM, Ryan Golhar wrote:
I've performed an alignment using BWA on a file of paired-end illumina reads. The SAM file looks fine, and contains header information. I'm converting it to BAM using the sam to bam converter, however it consistently errors out after running for a while. The error is:
"Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), "
but no error is provided. Looking at the sam_to_bam.py on line 156 is where the error is thrown. Nothing is in e (I think).
BTW - If I run the samtools command from the shell by hand, the BAM file is created properly. I do see information on stderr:
$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o /tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat [samopen] SAM header is present: 25 sequences.
I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.
What do I do?
Ryan ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Alright, I'm at a loss I can run the sam to bam converter on a small sam file but not a big sam file. The small SAM file is only 65K, the big SAM file is 44G. I have more than 8TB of free space. Running the job script from the shell results in the small conversion succeeding and the big one failing. The return code from samtools in both instances in 0 so I can't for any reason think of why there the script is getting caught in an exception. I even added a write statement to stdout to double-check the return code and stderr message and they are the same in both cases. Why is this failing in one case and not the other? I'm stuck. Help.... Ryan On 4/6/11 4:58 PM, Ryan Golhar wrote:
So it looks like I can get small sam files converted to bam files, but not large sam files (~50GB-80GB). I'm still trying to debug this, but not sure what's going on.
Has anyone else run into anything like this?
On 4/6/11 10:08 AM, Ryan Golhar wrote:
Any ideas why I would get this? If I run the sam_to_bam python script from the shell, I get the same error:
(galaxy_env)[galaxy@vail pbs]$ sh 471.sh Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8 6_64 x86_64 x86_64 GNU/Linux Samtools Version: 0.1.14 (r933:170) Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
However running the samtools command works fine
On 4/5/11 5:58 PM, Ryan Golhar wrote:
I've performed an alignment using BWA on a file of paired-end illumina reads. The SAM file looks fine, and contains header information. I'm converting it to BAM using the sam to bam converter, however it consistently errors out after running for a while. The error is:
"Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), "
but no error is provided. Looking at the sam_to_bam.py on line 156 is where the error is thrown. Nothing is in e (I think).
BTW - If I run the samtools command from the shell by hand, the BAM file is created properly. I do see information on stderr:
$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o /tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat [samopen] SAM header is present: 25 sequences.
I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.
What do I do?
Ryan ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Ryan, Since we're shooting in the dark here, best to try and understand what's the exception. Add the following line to the beginning of "sam_to_bam.py": import traceback and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc() Hopefully this will print out which exception you're getting, and where is it thrown from. -gordon Ryan Golhar wrote, On 04/06/2011 06:05 PM:
Alright, I'm at a loss
I can run the sam to bam converter on a small sam file but not a big sam file. The small SAM file is only 65K, the big SAM file is 44G. I have more than 8TB of free space.
Running the job script from the shell results in the small conversion succeeding and the big one failing. The return code from samtools in both instances in 0 so I can't for any reason think of why there the script is getting caught in an exception.
I even added a write statement to stdout to double-check the return code and stderr message and they are the same in both cases.
Why is this failing in one case and not the other? I'm stuck. Help....
Ryan
On 4/6/11 4:58 PM, Ryan Golhar wrote:
So it looks like I can get small sam files converted to bam files, but not large sam files (~50GB-80GB). I'm still trying to debug this, but not sure what's going on.
Has anyone else run into anything like this?
On 4/6/11 10:08 AM, Ryan Golhar wrote:
Any ideas why I would get this? If I run the sam_to_bam python script from the shell, I get the same error:
(galaxy_env)[galaxy@vail pbs]$ sh 471.sh Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8 6_64 x86_64 x86_64 GNU/Linux Samtools Version: 0.1.14 (r933:170) Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
However running the samtools command works fine
On 4/5/11 5:58 PM, Ryan Golhar wrote:
I've performed an alignment using BWA on a file of paired-end illumina reads. The SAM file looks fine, and contains header information. I'm converting it to BAM using the sam to bam converter, however it consistently errors out after running for a while. The error is:
"Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), "
but no error is provided. Looking at the sam_to_bam.py on line 156 is where the error is thrown. Nothing is in e (I think).
BTW - If I run the samtools command from the shell by hand, the BAM file is created properly. I do see information on stderr:
$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o /tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat [samopen] SAM header is present: 25 sequences.
I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.
What do I do?
Ryan
Here's what I get: (galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$ On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
Ryan Golhar wrote, On 04/06/2011 06:05 PM:
Alright, I'm at a loss
I can run the sam to bam converter on a small sam file but not a big sam file. The small SAM file is only 65K, the big SAM file is 44G. I have more than 8TB of free space.
Running the job script from the shell results in the small conversion succeeding and the big one failing. The return code from samtools in both instances in 0 so I can't for any reason think of why there the script is getting caught in an exception.
I even added a write statement to stdout to double-check the return code and stderr message and they are the same in both cases.
Why is this failing in one case and not the other? I'm stuck. Help....
Ryan
On 4/6/11 4:58 PM, Ryan Golhar wrote:
So it looks like I can get small sam files converted to bam files, but not large sam files (~50GB-80GB). I'm still trying to debug this, but not sure what's going on.
Has anyone else run into anything like this?
On 4/6/11 10:08 AM, Ryan Golhar wrote:
Any ideas why I would get this? If I run the sam_to_bam python script from the shell, I get the same error:
(galaxy_env)[galaxy@vail pbs]$ sh 471.sh Linux vail 2.6.18-194.3.1.el5xen #1 SMP Sun May 2 04:26:43 EDT 2010 x8 6_64 x86_64 x86_64 GNU/Linux Samtools Version: 0.1.14 (r933:170) Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat),
However running the samtools command works fine
On 4/5/11 5:58 PM, Ryan Golhar wrote:
I've performed an alignment using BWA on a file of paired-end illumina reads. The SAM file looks fine, and contains header information. I'm converting it to BAM using the sam to bam converter, however it consistently errors out after running for a while. The error is:
"Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), "
but no error is provided. Looking at the sam_to_bam.py on line 156 is where the error is thrown. Nothing is in e (I think).
BTW - If I run the samtools command from the shell by hand, the BAM file is created properly. I do see information on stderr:
$ samtools view -bt /data/genomes/H_sapiens/hg19/hg19.fa.fai -o /tmp/killme.bam /home/galaxy/galaxy-dist/database/files/000/dataset_785.dat [samopen] SAM header is present: 25 sequences.
I'm using samtools version 0.1.14 (r933:170) on Linux, 64-bit.
What do I do?
Ryan
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that). Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not... As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0: Which will read up to the first 10 bytes (instead of the entire file). A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0: But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes. Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ... $ cat 1.sam Hello World This is not a SAM file $ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file. $ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ======== So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files. -gordon On 04/07/2011 12:30 AM, Ryan Golhar wrote:
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
<shameless plug> If your sam file already contains header lines, you can use our version of the sam-to-bam wrapper. It works without python and without writing a temporary (non-sorted) bam file to disk. Not so fast, and with minimal error checking - but it mostly works. http://cancan.cshl.edu/labmembers/gordon/files/cshl_sam_to_bam.tar.bz2 </shameless plug> On 04/07/2011 01:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
On 04/07/2011 12:30 AM, Ryan Golhar wrote:
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then. It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file). It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file. Kelly On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
On 04/07/2011 12:30 AM, Ryan Golhar wrote:
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Kelly Vincent wrote, On 04/07/2011 11:35 AM:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then.
Fixed is one thing, pushed to "dist" is another :) ==== $ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2 requesting all changes adding changesets adding manifests adding file changes added 5071 changesets with 21893 changes to 4940 files updating to branch default 3256 files updated, 0 files merged, 0 files removed, 0 files unresolved $ cd galaxy_test2/ $ hg tip changeset: 5070:ca0c4ad2bb39 tag: tip user: Greg Von Kuster <greg@bx.psu.edu> date: Wed Feb 16 10:14:24 2011 -0500 summary: When rendering the contents of a data library, make sure a LibraryDataset has an associated LibraryDatasetDatasetAssociation. There should always be an ldda, but some users running their own instances have reported that some of their LibraryDatasets are missing these associations. $ sed -n '148,152p' tools/samtools/sam_to_bam.py if returncode != 0: raise Exception, stderr if len( open( tmp_aligns_file_name ).read() ) == 0: raise Exception, 'Initial BAM file empty' except Exception, e: ==== It only exists in "galaxy-central", not in "galaxy-dist" which is what you recommend people to use. Both me and Ryan have this line of code, and I'm sure we've both pulled since April 2010.
It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file).
Nope - look at my example below - an invalid sam file still generates a non-empty BAM file. This happens because of the "-t" parameter: samtools takes the header information from another file and generates the BAM file before even reading the SAM file.
It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file.
Depends on your definition of "informative" - in this case the error message was not so informative at all, to a point a traceback was needed... -gordon
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy
On 04/07/2011 12:30 AM, Ryan Golhar wrote: lists, please use the interface at:
Speaking of which, the Wiki site hasn't been update to reflect that step 1 uses "hg clone https://bitbucket.org/galaxy/galaxy-dist/" and step 5 moves to galaxy-central. Is this supposed to be the case? On 4/7/11 12:13 PM, Assaf Gordon wrote:
Kelly Vincent wrote, On 04/07/2011 11:35 AM:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then.
Fixed is one thing, pushed to "dist" is another :) ==== $ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2 requesting all changes adding changesets adding manifests adding file changes added 5071 changesets with 21893 changes to 4940 files updating to branch default 3256 files updated, 0 files merged, 0 files removed, 0 files unresolved $ cd galaxy_test2/ $ hg tip changeset: 5070:ca0c4ad2bb39 tag: tip user: Greg Von Kuster<greg@bx.psu.edu> date: Wed Feb 16 10:14:24 2011 -0500 summary: When rendering the contents of a data library, make sure a LibraryDataset has an associated LibraryDatasetDatasetAssociation. There should always be an ldda, but some users running their own instances have reported that some of their LibraryDatasets are missing these associations. $ sed -n '148,152p' tools/samtools/sam_to_bam.py if returncode != 0: raise Exception, stderr if len( open( tmp_aligns_file_name ).read() ) == 0: raise Exception, 'Initial BAM file empty' except Exception, e: ==== It only exists in "galaxy-central", not in "galaxy-dist" which is what you recommend people to use.
Both me and Ryan have this line of code, and I'm sure we've both pulled since April 2010.
It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file).
Nope - look at my example below - an invalid sam file still generates a non-empty BAM file. This happens because of the "-t" parameter: samtools takes the header information from another file and generates the BAM file before even reading the SAM file.
It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file.
Depends on your definition of "informative" - in this case the error message was not so informative at all, to a point a traceback was needed...
-gordon
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy
On 04/07/2011 12:30 AM, Ryan Golhar wrote: lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
Ryan Golhar wrote:
Speaking of which, the Wiki site hasn't been update to reflect that step 1 uses "hg clone https://bitbucket.org/galaxy/galaxy-dist/"
and step 5 moves to galaxy-central. Is this supposed to be the case?
That was just a bit of incorrect output from 'hg incoming', but the instructions were still correct. I've fixed the output, thanks for the heads up.
On 4/7/11 12:13 PM, Assaf Gordon wrote:
Kelly Vincent wrote, On 04/07/2011 11:35 AM:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then.
Fixed is one thing, pushed to "dist" is another :) ==== $ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2 requesting all changes adding changesets adding manifests adding file changes added 5071 changesets with 21893 changes to 4940 files updating to branch default 3256 files updated, 0 files merged, 0 files removed, 0 files unresolved $ cd galaxy_test2/ $ hg tip changeset: 5070:ca0c4ad2bb39 tag: tip user: Greg Von Kuster<greg@bx.psu.edu> date: Wed Feb 16 10:14:24 2011 -0500 summary: When rendering the contents of a data library, make sure a LibraryDataset has an associated LibraryDatasetDatasetAssociation. There should always be an ldda, but some users running their own instances have reported that some of their LibraryDatasets are missing these associations. $ sed -n '148,152p' tools/samtools/sam_to_bam.py if returncode != 0: raise Exception, stderr if len( open( tmp_aligns_file_name ).read() ) == 0: raise Exception, 'Initial BAM file empty' except Exception, e: ==== It only exists in "galaxy-central", not in "galaxy-dist" which is what you recommend people to use.
Both me and Ryan have this line of code, and I'm sure we've both pulled since April 2010.
It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file).
Nope - look at my example below - an invalid sam file still generates a non-empty BAM file. This happens because of the "-t" parameter: samtools takes the header information from another file and generates the BAM file before even reading the SAM file.
It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file.
Depends on your definition of "informative" - in this case the error message was not so informative at all, to a point a traceback was needed...
-gordon
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy
On 04/07/2011 12:30 AM, Ryan Golhar wrote: lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
begin:vcard fn:Ryan Golhar, Ph.D. n:Golhar;Ryan org:The Cancer Institute of NJ;Cancer Informatics Core/Bioinformatics adr:5th floor;;120 Albany St;New Brunswick;NJ;08901;USA email;internet:golharam@umdnj.edu title:NGS Bioinformatics Specialist tel;work:(732) 235-6613 tel;fax:(732) 235-6267 tel;cell:(732) 236-1176 x-mozilla-html:FALSE url:http://www.cinj.org version:2.1 end:vcard
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Yeah, I installed back on Feb 18th, and just did a "hg incoming". hg reports nothing has changed. That doesn't sound correct. On 4/7/11 12:13 PM, Assaf Gordon wrote:
Kelly Vincent wrote, On 04/07/2011 11:35 AM:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then.
Fixed is one thing, pushed to "dist" is another :) ==== $ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2 requesting all changes adding changesets adding manifests adding file changes added 5071 changesets with 21893 changes to 4940 files updating to branch default 3256 files updated, 0 files merged, 0 files removed, 0 files unresolved $ cd galaxy_test2/ $ hg tip changeset: 5070:ca0c4ad2bb39 tag: tip user: Greg Von Kuster<greg@bx.psu.edu> date: Wed Feb 16 10:14:24 2011 -0500 summary: When rendering the contents of a data library, make sure a LibraryDataset has an associated LibraryDatasetDatasetAssociation. There should always be an ldda, but some users running their own instances have reported that some of their LibraryDatasets are missing these associations. $ sed -n '148,152p' tools/samtools/sam_to_bam.py if returncode != 0: raise Exception, stderr if len( open( tmp_aligns_file_name ).read() ) == 0: raise Exception, 'Initial BAM file empty' except Exception, e: ==== It only exists in "galaxy-central", not in "galaxy-dist" which is what you recommend people to use.
Both me and Ryan have this line of code, and I'm sure we've both pulled since April 2010.
It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file).
Nope - look at my example below - an invalid sam file still generates a non-empty BAM file. This happens because of the "-t" parameter: samtools takes the header information from another file and generates the BAM file before even reading the SAM file.
It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file.
Depends on your definition of "informative" - in this case the error message was not so informative at all, to a point a traceback was needed...
-gordon
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy
On 04/07/2011 12:30 AM, Ryan Golhar wrote: lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
Ryan Golhar wrote:
Yeah, I installed back on Feb 18th, and just did a "hg incoming". hg reports nothing has changed. That doesn't sound correct.
Actually, February 18 was the date of our last stable release. The next one is scheduled for tomorrow. --nate
On 4/7/11 12:13 PM, Assaf Gordon wrote:
Kelly Vincent wrote, On 04/07/2011 11:35 AM:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then.
Fixed is one thing, pushed to "dist" is another :) ==== $ hg clone https://bitbucket.org/galaxy/galaxy-dist galaxy_test2 requesting all changes adding changesets adding manifests adding file changes added 5071 changesets with 21893 changes to 4940 files updating to branch default 3256 files updated, 0 files merged, 0 files removed, 0 files unresolved $ cd galaxy_test2/ $ hg tip changeset: 5070:ca0c4ad2bb39 tag: tip user: Greg Von Kuster<greg@bx.psu.edu> date: Wed Feb 16 10:14:24 2011 -0500 summary: When rendering the contents of a data library, make sure a LibraryDataset has an associated LibraryDatasetDatasetAssociation. There should always be an ldda, but some users running their own instances have reported that some of their LibraryDatasets are missing these associations. $ sed -n '148,152p' tools/samtools/sam_to_bam.py if returncode != 0: raise Exception, stderr if len( open( tmp_aligns_file_name ).read() ) == 0: raise Exception, 'Initial BAM file empty' except Exception, e: ==== It only exists in "galaxy-central", not in "galaxy-dist" which is what you recommend people to use.
Both me and Ryan have this line of code, and I'm sure we've both pulled since April 2010.
It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file).
Nope - look at my example below - an invalid sam file still generates a non-empty BAM file. This happens because of the "-t" parameter: samtools takes the header information from another file and generates the BAM file before even reading the SAM file.
It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file.
Depends on your definition of "informative" - in this case the error message was not so informative at all, to a point a traceback was needed...
-gordon
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy
On 04/07/2011 12:30 AM, Ryan Golhar wrote: lists, please use the interface at:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
begin:vcard fn:Ryan Golhar, Ph.D. n:Golhar;Ryan org:The Cancer Institute of NJ;Cancer Informatics Core/Bioinformatics adr:5th floor;;120 Albany St;New Brunswick;NJ;08901;USA email;internet:golharam@umdnj.edu title:NGS Bioinformatics Specialist tel;work:(732) 235-6613 tel;fax:(732) 235-6267 tel;cell:(732) 236-1176 x-mozilla-html:FALSE url:http://www.cinj.org version:2.1 end:vcard
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
I really need to update my local copy. I must admit, I'm a bit afraid to do so. On 4/7/11 11:35 AM, Kelly Vincent wrote:
That opening-the-entire-file inefficiency in our sam_to_bam.py was fixed quite some time ago (4/30/2010, in 3724:007f175c8b88). It has used getsize since then. It's really just intended to catch weird errors that don't throw an actual error (also, I think that if the sam file has no header, no info would be output into the bam file). It seems somewhat more informative to tell the user the file is empty rather than just "successfully" output an empty file.
Kelly
On Apr 7, 2011, at 1:05 AM, Assaf Gordon wrote:
Just another example why python's misleadingly simple idioms are quite dangerous in production code (couldn't help myself from teasing about python... sorry about that).
Seems like line 150 in "sam_to_bam.py" tries to read the entire BAM file into memory just to find out if it's empty or not...
As a stop gap solution with minimal changes, change line 150 from: if len( open( tmp_aligns_file_name ).read() ) == 0: to if len( open( tmp_aligns_file_name ).read(10) ) == 0:
Which will read up to the first 10 bytes (instead of the entire file).
A slightly better (but still wrong) solution is to simply check the file size, with: if os.path.getsize(tmp_aligns_file_name) == 0:
But it's still wrong because even an invalid sam file will create a non-empty BAM file (when using "samtools view -bt") - the BAM file will still contain the chromosome names and sizes.
Example: ======== $ cat mm9.fa.fai chr1 197195432 6 50 51 chr10 129993255 201139354 50 51 chr11 121843856 333732482 50 51 chr12 121257530 458013223 50 51 chr13 120284312 581695911 50 51 chr13_random 400311 704385924 50 51 chr14 125194864 704794249 50 51 chr15 103494974 832493018 50 51 ... ...
$ cat 1.sam Hello World This is not a SAM file
$ samtools view -bt mm9.fa.fai -o 1.bam 1.sam [sam_header_read2] 35 sequences loaded. [sam_read1] reference 'This is not a SAM file' is recognized as '*'. [main_samview] truncated file.
$ ls -l 1.* -rw-r--r-- 1 gordon hannon 348 Apr 7 00:57 1.bam -rw-r--r-- 1 gordon hannon 35 Apr 7 00:57 1.sam ========
So in short, this whole sam-to-bam wrapper tool is not suitable for large SAM files (if they don't fit entirely in memory), and not for error checking of invalid SAM files.
-gordon
On 04/07/2011 12:30 AM, Ryan Golhar wrote:
Here's what I get:
(galaxy_env)[galaxy@vail pbs]$ sh ./big.sh Samtools Version: 0.1.14 (r933:170) Traceback (most recent call last): File "/home/galaxy/galaxy-dist/tools/samtools/sam_to_bam.py", line 150, in __main__ if len( open( tmp_aligns_file_name ).read() ) == 0: MemoryError Error extracting alignments from (/home/galaxy/galaxy-dist/database/files/000/dataset_785.dat), (galaxy_env)[galaxy@vail pbs]$
On 4/6/11 7:29 PM, Assaf Gordon wrote:
Ryan,
Since we're shooting in the dark here, best to try and understand what's the exception.
Add the following line to the beginning of "sam_to_bam.py": import traceback
and add the following line to "sam_to_bam.py" line 156 (before the call to "stop_err"): traceback.print_exc()
Hopefully this will print out which exception you're getting, and where is it thrown from.
-gordon
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- CONFIDENTIALITY NOTICE: This email communication may contain private, confidential, or legally privileged information intended for the sole use of the designated and/or duly authorized recipient(s). If you are not the intended recipient or have received this email in error, please notify the sender immediately by email and permanently delete all copies of this email including all attachments without reading them. If you are the intended recipient, secure the contents in a manner that conforms to all applicable state and/or federal requirements related to privacy and confidentiality of such information.
participants (4)
-
Assaf Gordon
-
Kelly Vincent
-
Nate Coraor
-
Ryan Golhar