Error importing BAM file into Library
Hi all,

I think I have fallen over the same problem Glen Beane reported in Nov 2010:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-November/003943.html

I recall from reading the mailing list that when you import a BAM file into Galaxy, it gets sorted and indexed. That makes sense since most tools need that to be done, and re-sorting an already sorted file should be quick.

However, I'm trying to import some presorted and indexed BAM files into a library in Galaxy via the Admin settings, linking to the file not copying it. I'm getting this:

<quote>
File size: 4.4 Gb
Data type: auto
Build: ?
Miscellaneous information: uploaded bam file
Traceback (most recent call last):
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 447, in <module>
    __main__()
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 439, in __main__
    add_file( dataset, registry, j

Job Standard Output
[bam_sort_core] merging from 28 files...

Job Standard Error
Traceback (most recent call last):
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 447, in <module>
    __main__()
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 439, in __main__
    add_file( dataset, registry, json_file, output_path )
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 381, in add_file
    datatype.groom_dataset_content( output_path )
  File "/opt/galaxy-dist/lib/galaxy/datatypes/binary.py", line 98, in groom_dataset_content
    shutil.move( samtools_created_sorted_file_name, file_name )
  File "/usr/local/lib/python2.6/shutil.py", line 260, in move
    copy2(src, real_dst)
  File "/usr/local/lib/python2.6/shutil.py", line 95, in copy2
    copyfile(src, dst)
  File "/usr/local/lib/python2.6/shutil.py", line 51, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/data/XXX-bwa-out.sorted.bam'

error
Database/Build: ?
Number of data lines: None
Disk file: /data/XXX-bwa-out.sorted.bam
</quote>

Clearly from the error message, Galaxy is trying to edit the source file (and the Unix account it is running in only has read permission for this file and its containing folder). From the stdout message "[bam_sort_core] merging from 28 files..." it looks like Galaxy is trying to (re)sort my file, and may well attempt to reindex it. Is that likely to be the case?

If the "copy" option had been used, then sorting and indexing should work - but I want Galaxy to link to the file as it is.

If however "copy" was not selected, then I don't want Galaxy trying to alter the file like this. Could the sort+index be disabled in this mode? I think it is reasonable to expect administrators trying to import BAM files from the local file system to take care of this.

Alternatively, you could actually check if the BAM file is pre-sorted or not.

Peter
Hello Peter,

Breaking this issue into the following 2 parts, here is the status.

1. Don't alter the contents of files being uploaded to a data library if using the "upload_directory" or "upload_paths" options in conjunction with the "Link to files without copying into Galaxy" option. This issue has been resolved in change set 5221:b5ecb8f4839d.

2. Determine if a BAM file is sorted before it is introduced into the Galaxy environment so that it will only be sorted if necessary. We have a very simple test for this in the Bam class's _is_coordinate_sorted() method in ~/lib/galaxy/datatypes/binary.py, but this method obviously needs improvements. The improved implementation is a bit non-trivial, but it is high priority, so should be completed soon. In the meantime, Bam files cannot be uploaded to a data library using the combinations of options described in 1 above if they do not pass the current simple, rigid test in the Bam class's method.

Thanks for your message,

Greg Von Kuster

On Mar 10, 2011, at 1:18 PM, Peter Cock wrote:
Hi all,
I think I have fallen over the same problem Glen Beane reported in Nov 2010, http://lists.bx.psu.edu/pipermail/galaxy-dev/2010-November/003943.html
I recall from reading the mailing list, that when you import a BAM file into Galaxy, it gets sorted and indexed. That makes sense since most tools need that to be done, and resorting an already sorted file should be quick.
However, I'm trying to import some presorted and indexed BAM files into a library in Galaxy via the Admin settings, linking to the file not copying it. I'm getting this:
<quote>
File size: 4.4 Gb
Data type: auto
Build: ?
Miscellaneous information: uploaded bam file
Traceback (most recent call last):
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 447, in <module>
    __main__()
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 439, in __main__
    add_file( dataset, registry, j

Job Standard Output
[bam_sort_core] merging from 28 files...

Job Standard Error
Traceback (most recent call last):
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 447, in <module>
    __main__()
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 439, in __main__
    add_file( dataset, registry, json_file, output_path )
  File "/opt/galaxy-dist/tools/data_source/upload.py", line 381, in add_file
    datatype.groom_dataset_content( output_path )
  File "/opt/galaxy-dist/lib/galaxy/datatypes/binary.py", line 98, in groom_dataset_content
    shutil.move( samtools_created_sorted_file_name, file_name )
  File "/usr/local/lib/python2.6/shutil.py", line 260, in move
    copy2(src, real_dst)
  File "/usr/local/lib/python2.6/shutil.py", line 95, in copy2
    copyfile(src, dst)
  File "/usr/local/lib/python2.6/shutil.py", line 51, in copyfile
    with open(dst, 'wb') as fdst:
IOError: [Errno 13] Permission denied: '/data/XXX-bwa-out.sorted.bam'

error
Database/Build: ?
Number of data lines: None
Disk file: /data/XXX-bwa-out.sorted.bam
</quote>
Clearly from the error message, Galaxy is trying to edit the source file (and the Unix account it is running in only has read permission for this file and its containing folder). From the stdout message "[bam_sort_core] merging from 28 files..." it looks like Galaxy is trying to (re)sort my file, and may well attempt to reindex it. Is that likely to be the case?
If the "copy" option had been used, then sorting and indexing should work - but I want Galaxy to link to the file as it is.
If however "copy" was not selected, then I don't want Galaxy trying to alter the file like this. Could the sort+index be disabled in this mode? I think it is reasonable to expect administrators trying to import BAM files from the local file system to take care of this.
Alternatively, you could actually check if the BAM file is pre-sorted or not.
Peter

___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
On Tue, Mar 15, 2011 at 1:51 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hello Peter,
Breaking this issue into the following 2 parts, here is the status.
1. Don't alter the contents of files being uploaded to a data library if using the "upload_directory" or "upload_paths" options in conjunction with the "Link to files without copying into Galaxy" option. This issue has been resolved in change set 5221:b5ecb8f4839d.
2. Determine if a BAM file is sorted before it is introduced into the Galaxy environment so that it will only be sorted if necessary. We have a very simple test for this in the Bam class's _is_coordinate_sorted() method in ~/lib/galaxy/datatypes/binary.py, but this method obviously needs improvements. The improved implementation is a bit non-trivial, but it is high priority, so should be completed soon. In the meantime, Bam files cannot be uploaded to a data library using the combinations of options described in 1 above if they do not pass the current simple, rigid test in the Bam class's method.
Thanks for your message,
Greg Von Kuster
Thanks Greg - I'll have to retest with that update.

I was thinking about this over the weekend, and perhaps you could assume (for the special case of a library import where the file is being linked to) that if the BAI index file already exists then the BAM file should be sorted already, i.e. use both the BAM and BAI files as provided.

Peter
Peter and Greg;
2. Determine if a BAM file is sorted before it is introduced into the Galaxy environment so that it will only be sorted if necessary. We have a very simple test for this in the Bam class's _is_coordinate_sorted() method in ~/lib/galaxy/datatypes/binary.py, but this method obviously needs improvements. The improved implementation is a bit non-trivial, but it is high priority, so should be completed soon. In the meantime, Bam files cannot be uploaded to a data library using the combinations of options described in 1 above if they do not pass the current simple, rigid test in the Bam class's method.
I was thinking about this over the weekend, and perhaps you could assume (for the special case of a library import where the file is being linked to) that if the BAI index file already exists then the BAM file should be sorted already. i.e. Use both the BAM and BAI files as provided.
I added in that initial sorted test and agree that it is imperfect. Several tools sort the files but do not set the SO: header since it's not required by the spec. We recently had a discussion about this:

http://biostar.stackexchange.com/questions/5273/is-my-bam-file-sorted

I believe the new 0.1.13 samtools has the fixes Heng mentioned in the comments thread, so a good process to check for sorting is to do 'samtools index your.bam' and check the error code. It will complain for non-sorted files.

The only disadvantage is that you need a new samtools for it to work in 100% of cases, but that seems like a good choice moving forward.

Brad
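The check suggested here (run 'samtools index' and look at the exit status) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Galaxy's code: the function name `looks_coordinate_sorted` and the injectable `run` parameter are invented for the example, and it assumes samtools 0.1.13 or newer is on the PATH, since older versions may not fail on unsorted input.

```python
import subprocess


def looks_coordinate_sorted(bam_path, index_path=None, run=subprocess.call):
    """Return True if 'samtools index' exits with status 0 for bam_path.

    Hypothetical helper for illustration only.
    index_path: optional location for the index file. If omitted, samtools
        writes the index next to the BAM file, which fails when that
        directory is read-only (as with linked library files).
    run: callable that takes an argument list and returns an exit status.
        It defaults to subprocess.call and exists so the logic can be
        exercised without samtools installed.
    """
    cmd = ["samtools", "index", bam_path]
    if index_path is not None:
        cmd.append(index_path)
    # A non-zero status means indexing failed. With a recent samtools that
    # includes the unsorted case, but it can also mean a missing or corrupt
    # file, so False should be read as "could not index", not "unsorted".
    return run(cmd) == 0
```

Passing `index_path` pointing at a scratch location keeps the check from touching the directory that holds the linked BAM file.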
Greg wrote:
Breaking this issue into the following 2 parts, here is the status.
1. Don't alter the contents of files being uploaded to a data library if using the "upload_directory" or "upload_paths" options in conjunction with the "Link to files without copying into Galaxy" option. This issue has been resolved in change set 5221:b5ecb8f4839d.
I've just updated my test Galaxy instance to get the 5221:b5ecb8f4839d fix, and I now get a different behaviour - still an error state.

Data type: auto
Build: ?
Miscellaneous information: The uploaded files need grooming, so change your Copy data into Galaxy? selection to be Copy files into Galaxy instead of Link to files without copying into Galaxy so grooming can be performed.
error

Presumably Galaxy uses 'Grooming' in several settings (e.g. FASTQ) to mean 'data sanitising', and what that message is trying to tell me is Galaxy doesn't think my BAM file is sorted (and therefore needs 'grooming'). Right?

On Tue, Mar 15, 2011 at 2:39 PM, Brad Chapman <chapmanb@50mail.com> wrote:
Peter and Greg;
2. Determine if a BAM file is sorted before it is introduced into the Galaxy environment so that it will only be sorted if necessary. We have a very simple test for this in the Bam class's _is_coordinate_sorted() method in ~/lib/galaxy/datatypes/binary.py, but this method obviously needs improvements. The improved implementation is a bit non-trivial, but it is high priority, so should be completed soon. In the meantime, Bam files cannot be uploaded to a data library using the combinations of options described in 1 above if they do not pass the current simple, rigid test in the Bam class's method.
I was thinking about this over the weekend, and perhaps you could assume (for the special case of a library import where the file is being linked to) that if the BAI index file already exists then the BAM file should be sorted already. i.e. Use both the BAM and BAI files as provided.
I added in that initial sorted test and agree that it is imperfect. Several tools sort the files but do not set the SO: header since it's not required by the spec.
Having checked my BAM files with samtools, I can confirm they don't have the SO header:

samtools view -H myfile.bam | grep "SO:"

They were generated with BWA in a split+merge pipeline to use multiple cores. I suppose I could run samtools reheader on them... but it would be nice to avoid that.
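The same header check can be done on the text printed by 'samtools view -H'. The following is a minimal sketch, with the function name `header_sort_order` chosen just for this example. It relies only on the SAM header layout, where the @HD line carries tab-separated TAG:VALUE fields and SO is optional.

```python
def header_sort_order(header_text):
    """Return the SO value from the @HD line of SAM header text, or None.

    Hypothetical helper for illustration. header_text is the output of
    'samtools view -H file.bam'. None means there is no @HD line or the
    line has no SO field, which says nothing about whether the records
    are actually sorted.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Fields after the record type are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            return None
    return None
```

A return value of "coordinate" means the header claims coordinate order. A missing SO field, as with the BWA split+merge files described here, leaves the question open and calls for a check on the records themselves.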
We recently had a discussion about this:
http://biostar.stackexchange.com/questions/5273/is-my-bam-file-sorted
I believe the new 0.1.13 samtools has the fixes Heng mentioned in the comments thread so a good process to check for sorting is to do 'samtools index your.bam' and check the error code. It will complain for non-sorted files.
Did you see Pierre's little C tool using the samtools API to do this? http://plindenbaum.blogspot.com/2011/02/testing-if-bam-file-is-sorted-using....
The only disadvantage is that you need a new samtools for it to work on 100% of cases but that seems like a good choice moving forward.
Yes, since Galaxy will typically sort and index anyway, it makes sense to try the indexing immediately, and thus find out if a sort is required or not.

Meanwhile, the following trivial patch resolves my problem with getting pre-existing BAM files loaded into Galaxy:

https://bitbucket.org/peterjc/galaxy-central/changeset/7f17701740b2

As a follow up, Galaxy doesn't need to re-index the file if there is already a BAI index. However, making it do this seems to mean knowing a bit more about how Galaxy deals with its metadata.

Peter
Hello Peter,

On Mar 18, 2011, at 11:26 AM, Peter Cock wrote:
I've just updated my test Galaxy instance to get the 5221:b5ecb8f4839d fix, and I now get a different behaviour - still an error state.
Data type: auto Build: ? Miscellaneous information: The uploaded files need grooming, so change your Copy data into Galaxy? selection to be Copy files into Galaxy instead of Link to files without copying into Galaxy so grooming can be performed. error
Presumably Galaxy uses 'Grooming' in several settings (e.g. FASTQ) to mean 'data sanitising', and what that message is trying to tell me is Galaxy doesn't think my BAM file is sorted (and therefore needs 'grooming'). Right?
This is correct.
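The behaviour discussed so far (linked files are never altered, and a linked file that needs grooming is rejected with an error) can be summarised as a small decision function. This is a sketch of the policy as described in this thread, not Galaxy's actual implementation. The function name and the returned strings are hypothetical.

```python
def library_upload_action(link_only, needs_grooming):
    """Decide what to do with a BAM file added to a data library.

    Hypothetical summary of the policy described in this thread.
    link_only: True for "Link to files without copying into Galaxy".
    needs_grooming: True if the sortedness check failed.
    """
    if not needs_grooming:
        # Already sorted: nothing has to be rewritten in either mode.
        return "use_as_is"
    if link_only:
        # The source file must not be modified, so report an error and
        # ask the user to switch to copying instead.
        return "error"
    # Galaxy owns the copy, so it is free to sort and index it.
    return "groom"
```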
Having checked my BAM files with samtools, I can confirm they don't have the SO header.
samtools view -H myfile.bam | grep "SO:"
They were generated with BWA in a split+merge pipeline to use multiple cores. I suppose I could run samtools reheader on them... but it would be nice to avoid that.
Change set 5256:4acde9321b63 now includes more robust checking if a bam file is sorted. If using a version of samtools 0.1.13 or newer, an error condition occurs if attempting to index an unsorted bam file. We take advantage of this in our checks.
Did you see Pierre's little C tool using the samtools API to do this? http://plindenbaum.blogspot.com/2011/02/testing-if-bam-file-is-sorted-using....
Yes, however in testing, a 6.6GB BAM file took 138 seconds to check with the posted 'bamsorted' code that uses the SAMtools API and 128 seconds to index with SAMtools, so we're using samtools for the check.
The only disadvantage is that you need a new samtools for it to work on 100% of cases but that seems like a good choice moving forward.
Yes, since Galaxy will typically sort and index anyway, it makes sense to try the indexing immediately, and thus find out if a sort is required or not.
Meanwhile, the following trivial patch resolves my problem with getting pre-existing BAM files loaded into Galaxy:
https://bitbucket.org/peterjc/galaxy-central/changeset/7f17701740b2
As a follow up, Galaxy doesn't need to re-index the file if there is already a BAI index. However, making it do this seems to mean knowing a bit more about how Galaxy deals with its metadata.
Peter
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
On Wed, Mar 23, 2011 at 4:24 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
Hello Peter,
On Mar 18, 2011, at 11:26 AM, Peter Cock wrote:
Having checked my BAM files with samtools, I can confirm they don't have the SO header.
samtools view -H myfile.bam | grep "SO:"
They were generated with BWA in a split+merge pipeline to use multiple cores. I suppose I could run samtools reheader on them... but it would be nice to avoid that.
Change set 5256:4acde9321b63 now includes more robust checking if a bam file is sorted.
Thanks!
If using a version of samtools 0.1.13 or newer, an error condition occurs if attempting to index an unsorted bam file. We take advantage of this in our checks.
Yes, that is what Heng Li recently recommended on the samtools mailing list.
Did you see Pierre's little C tool using the samtools API to do this? http://plindenbaum.blogspot.com/2011/02/testing-if-bam-file-is-sorted-using....
Yes, however in testing, a 6.6GB BAM file took 138 seconds to check with the posted 'bamsorted' code that uses the SAMtools API and 128 seconds to index with SAMtools, so we're using samtools for the check.
Given Heng Li's recommendation to use indexing as a check for being sorted, I would do the same.
The only disadvantage is that you need a new samtools for it to work on 100% of cases but that seems like a good choice moving forward.
Yes, since Galaxy will typically sort and index anyway, it makes sense to try the indexing immediately, and thus find out if a sort is required or not.
I see that is still pending, but since indexing is quite fast, doing it twice isn't the end of the world. Do you think it is worth trying to reuse a provided BAI file when linking to a BAM file rather than copying it into Galaxy?

Peter
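As a rough illustration of what reusing a provided BAI file could involve, the sketch below looks for an index sitting next to a linked BAM file. It is an assumption-laden example rather than anything in Galaxy: the function name `find_existing_bai` is invented, and the two file names tried ("x.bam.bai" and "x.bai") are simply the common naming conventions. How Galaxy would then register the index in its metadata is the open question raised above and is not covered.

```python
import os


def find_existing_bai(bam_path):
    """Return the path of an index file next to bam_path, or None.

    Hypothetical helper for illustration. Tries 'x.bam.bai' first, then
    'x.bai' when the BAM file name ends in '.bam'.
    """
    candidates = [bam_path + ".bai"]
    root, ext = os.path.splitext(bam_path)
    if ext == ".bam":
        candidates.append(root + ".bai")
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

The existence of an index does not prove it matches the current BAM contents, so a careful version might also compare modification times before trusting it.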
participants (3)
- Brad Chapman
- Greg Von Kuster
- Peter Cock