Hi Peter,

On Nov 18, 2013, at 10:33 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Mon, Nov 18, 2013 at 2:24 PM, Dave Bouvier <dave@bx.psu.edu> wrote:
Peter,
It turns out there were two problems. First, the test environment was not resolving the upload tool's dependency on samtools, which I've now corrected.
Excellent.
On a closely related point, I understand Galaxy likes to store all BAM files coordinate-sorted and indexed. When a tool produces a BAM file, where does this happen? That is, is it the individual tool's responsibility, or the framework's (e.g. while setting metadata)? I assume the latter, in which case is there still an implicit samtools dependency there?
This is (unfortunately) performed in multiple methods of the Bam class in ~/galaxy/datatypes/binary.py. The comments in the Bam class's dataset_content_needs_grooming() method (pasted here), which include an old "TODO", clarify some of the reasons for this:

# Samtools version 0.1.13 or newer produces an error condition when attempting to index an
# unsorted bam file - see http://biostar.stackexchange.com/questions/5273/is-my-bam-file-sorted.
# So when using a newer version of samtools, we'll first check if the input BAM file is sorted
# from the header information. If the header is present and sorted, we do nothing by returning False.
# If it's present and unsorted or if it's missing, we'll index the bam file to see if it produces the
# error. If it does, sorting is needed so we return True (otherwise False).
#
# TODO: we're creating an index file here and throwing it away. We then create it again when
# the set_meta() method below is called later in the job process. We need to enhance this overall
# process so we don't create an index twice. In order to make it worth the time to implement the
# upload tool / framework to allow setting metadata from directly within the tool itself, it should be
# done generically so that all tools will have the ability. In testing, a 6.6 gb BAM file took 128
# seconds to index with samtools, and 45 minutes to sort, so indexing is relatively inexpensive.
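
The decision described in those comments can be restated as a rough sketch. This is not the code in binary.py: the function names and the try_index callable are hypothetical, and the header check assumes the usual SAM convention of an @HD line carrying an SO:coordinate tag.

```python
def header_says_coordinate_sorted(header_text):
    """Return True if the @HD header line declares SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Fields after the record type are tab-separated TAG:VALUE pairs.
            return "SO:coordinate" in line.split("\t")[1:]
    return False  # no @HD line, so the sort order is unknown


def needs_grooming(header_text, try_index):
    """Hypothetical restatement of the logic in the pasted comments.

    header_text: the BAM header as text, or "" if it is missing.
    try_index:   hypothetical callable that attempts to index the file
                 and returns True on success, False on error.
    """
    if header_says_coordinate_sorted(header_text):
        return False  # header present and sorted: nothing to do
    # Header unsorted or missing: fall back to a trial index.
    # If indexing fails, the file needs sorting.
    return not try_index()
```

In real use try_index would wrap a call to samtools; it is left as a parameter here so the decision itself can be read (and tested) without samtools installed.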
Second, the BAM file detection on upload was broken due to a bug in Python 2.7.4's gzip module, which I've also corrected.
You mean http://bugs.python.org/issue17666 fixed in 2.7.5?
Yes
I reported that when Biopython's BGZF support broke (BGZF being the gzip flavour used for BAM and tabix style indexed files).
Thanks!
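
For anyone following along, here is a minimal sketch of why BAM detection depends on the gzip module. It is not Galaxy's actual sniffer: looks_like_bam is a hypothetical helper, written for Python 3, and it relies only on the fact that a BAM file is a BGZF (gzip-compatible) stream whose decompressed content starts with the magic bytes "BAM\1".

```python
import gzip
import io
import zlib

BAM_MAGIC = b"BAM\x01"  # first four bytes of the decompressed BAM stream


def looks_like_bam(raw_bytes):
    """Return True if raw_bytes is gzip data that decompresses to a BAM magic."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip at all, truncated, or corrupt: treat as "not BAM".
        return False
```

If the gzip module misreads the stream, a check like this fails even for a valid BAM file, which is the kind of breakage discussed above.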
I have re-run the test framework on samtools_idxstats, and it has now passed its test.
--Dave B.
Thanks Dave :)
Peter