BAM to BigWig (and tool ID clashes)
Hi Brad & Lance,
I've been using Brad's bam_to_bigwig tool in Galaxy but realized today (with a new dataset using a splice-aware mapper) that it doesn't seem to be ignoring CIGAR N operators where a read is split over an intron. Looking over Brad's Python script which calculates the coverage to write an intermediate wiggle file, this is done with the samtools via pysam. It is not obvious to me if this can be easily modified to ignore introns. Is this possible Brad?
I wasn't aware of Lance's rival bam_to_bigwig tool in the ToolShed till now, and that does talk about this issue. It has a boolean option to ignore gaps when computing coverage, recommended for RNA-Seq where reads are mapped across long splice junctions.
Lance, from your tool's help it sounds like it needs a genome database build filled in. I don't understand this requirement - Brad's tool works just fine for standalone BAM files (for example reads mapped to an in house assembly). Is that not supported in your tool?
Galaxy team - why does the ToolShed allow duplicate repository names (here bam_to_bigwig) AND duplicate tool IDs (again, here bam_to_bigwig)? Won't this cause chaos when sharing workflows? I would suggest checking this when a tool is uploaded and rejecting repository name or tool ID clashes.
Regards,
Peter
P.S.
Brad, your tool is missing an explicit <requirements> tag listing the UCSC binary wigToBigWig, and the Python library pysam.
Lance, your tool doesn't seem to include any author information like your name or email address. I'm inferring it is yours from the Galaxy tool shed user id, lparsons.
On Thu, Apr 19, 2012 at 1:55 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi Brad & Lance,
I've been using Brad's bam_to_bigwig tool in Galaxy but realized today (with a new dataset using a splice-aware mapper) that it doesn't seem to be ignoring CIGAR N operators where a read is split over an intron. Looking over Brad's Python script which calculates the coverage to write an intermediate wiggle file, this is done with the samtools via pysam. It is not obvious to me if this can be easily modified to ignore introns. Is this possible Brad?
Looking into this a bit more, perhaps 'samtools depth' might be useful (bam2depth.c), maybe we can use this code to update your python+pysam code? Peter
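To make the CIGAR N issue concrete, here is a minimal Python sketch of coverage counting that skips introns. It is not Brad's pysam code nor the samtools depth implementation; the function names are hypothetical, read positions are assumed to be 0-based, and treating D (deletion) like N (advance without counting) is an assumption, since tools differ on that point.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def covered_blocks(pos, cigar):
    """Return the reference intervals (0-based, half-open) a read really covers.

    M, = and X consume the reference and count as coverage.
    N (intron skip) and D (deletion) advance along the reference without
    adding coverage (the D handling is an assumption of this sketch).
    I, S, H and P do not consume the reference at all.
    """
    blocks = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            blocks.append((ref, ref + length))
            ref += length
        elif op in "DN":
            ref += length
    return blocks


def coverage(reads):
    """Count per-base depth for an iterable of (pos, cigar) pairs."""
    depth = Counter()
    for pos, cigar in reads:
        for start, end in covered_blocks(pos, cigar):
            for base in range(start, end):
                depth[base] += 1
    return depth


if __name__ == "__main__":
    # A spliced read: 50 aligned bases, a 1000 bp intron, 50 more bases.
    print(covered_blocks(100, "50M1000N50M"))
```

With this approach the intron between the two aligned blocks gets no coverage, which is the behaviour wanted for RNA-Seq data from a splice-aware mapper.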
The tool shed forces unique repository names per user account, allowing for uniqueness with that combination. All tools uploaded into a tool shed repository are assigned a unique id called a guid, which is unique for all tools across all possible tool sheds. These guids follow a namespacing convention that ensures that any tool installed into any Galaxy instance will be uniquely identified regardless of "old" tool ids or tool versions.
For example, the guid for version 0.0.2 of Brad's tool is
toolshed.g2.bx.psu.edu/repos/brad-chapman/bam_to_bigwig/bam_to_bigwig/0.0.2
while the guid for version 0.1 of Lance's tool is
toolshed.g2.bx.psu.edu/repos/lparsons/bam_to_bigwig/bam_to_bigwig/0.1
This information can be seen when viewing the tool's metadata in the tool shed.
When these tools are installed into a local Galaxy instance, this guid is the tool's id in Galaxy rather than the "old" id (e.g., tool id="bam_to_bigwig"). The "old" id is still important and must be included in the tool config as usual, but is not used to identify a tool that is installed in a repository from the tool shed.
All of these details are explained in the tool shed wiki in the following section.
http://wiki.g2.bx.psu.edu/Tool%20Shed#Automatic_installation_of_Galaxy_tool_...
This section is also relevant to this discussion.
http://wiki.g2.bx.psu.edu/Tool%20Shed#Galaxy_Tool_Versions
On Apr 19, 2012, at 8:55 AM, Peter Cock wrote:
Galaxy team - why does the ToolShed allow duplicate repository names (here bam_to_bigwig) AND duplicate tool IDs (again, here bam_to_bigwig)? Won't this cause chaos when sharing workflows? I would suggest checking this when a tool is uploaded and rejecting repository name or tool ID clashes.
Regards,
Peter
On Thu, Apr 19, 2012 at 2:32 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
The tool shed forces unique repository names per user account, allowing for uniqueness with that combination. All tools uploaded into a tool shed repository are assigned a unique id called a guid, which is unique for all tools across all possible tool sheds. These guids follow a namespacing convention that ensures that any tool installed into any Galaxy instance will be uniquely identified regardless of "old" tool ids or tool versions.
... The "old" id is still important and must be included in the tool config as usual, but is not used to identify a tool that is installed in a repository from the tool shed.
Ah - so the "old" tool ID clashes are only going to be a problem with Galaxy servers where the tools were installed 'the old fashioned way' (like ours). So there is still scope for clashes with shared workflows - but this will be less and less of a problem as local Galaxy installs switch to installing tools via the Tool Shed? What happens if (for example) Brad gives Lance commit rights to his repository (or the other way round)? Then you'd have a clash.
All of these details are explained in the tool shed wiki in the following section.
http://wiki.g2.bx.psu.edu/Tool%20Shed#Automatic_installation_of_Galaxy_tool_...
This section is also relevant to this discussion.
Thanks for the background. Peter
On Apr 19, 2012, at 10:04 AM, Peter Cock wrote:
On Thu, Apr 19, 2012 at 2:32 PM, Greg Von Kuster <greg@bx.psu.edu> wrote:
The tool shed forces unique repository names per user account, allowing for uniqueness with that combination. All tools uploaded into a tool shed repository are assigned a unique id called a guid, which is unique for all tools across all possible tool sheds. These guids follow a namespacing convention that ensures that any tool installed into any Galaxy instance will be uniquely identified regardless of "old" tool ids or tool versions.
... The "old" id is still important and must be included in the tool config as usual, but is not used to identify a tool that is installed in a repository from the tool shed.
Ah - so the "old" tool ID clashes are only going to be a problem with Galaxy servers where the tools were installed 'the old fashioned way' (like ours).
Yes, it is highly recommended to install tool shed repositories using the installation process that has been implemented rather than downloading the repository contents as an archive and manually manipulating it to be incorporated into your Galaxy instance. Using the installation process includes many benefits in addition to eliminating the potential tool id clashes. Examples of benefits include not having to stop / restart your Galaxy server in order to use freshly installed tools, being able to deactivate / uninstall tools on-the-fly when finished with them, being able to run multiple versions of the same tool simultaneously in the same Galaxy instance, etc.
So there is still scope for clashes with shared workflows - but this will be less and less of a problem as local Galaxy installs switch to installing tools via the Tool Shed?
Correct - if you manually download the contents of a repository and install it into your local Galaxy instance, there is no way to eliminate the potential for tool id / version clashes. In fact, it may be beneficial to eliminate the feature enabling users to manually download repository contents, but we'll leave it there as long as the community wants it.
What happens if (for example) Brad gives Lance commit rights to his repository (or the other way round)? Then you'd have a clash.
Assuming automatic installation using the tool shed install process, no clashes will occur in this scenario, because no matter who pushes changes to the repository, it is still "name spaced" by the original owner, which can never change. The only part of the guid that could potentially change is the tool version component (e.g., toolshed.g2.bx.psu.edu/repos/brad-chapman/bam_to_bigwig/bam_to_bigwig/0.0.2 becomes toolshed.g2.bx.psu.edu/repos/brad-chapman/bam_to_bigwig/bam_to_bigwig/0.0.3 if Brad gives Lance the ability to push to his repository and Lance changes the tool version).
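The guid layout described above can be illustrated with a short Python sketch. This only mirrors the pattern visible in the example guids in this thread (host, "repos", owner, repository name, "old" tool id, version); it is not the Tool Shed's own code, and the function names are hypothetical.

```python
def tool_guid(host, owner, repo, tool_id, version):
    """Compose a guid following the layout of the examples in this thread."""
    return "/".join([host, "repos", owner, repo, tool_id, version])


def parse_guid(guid):
    """Split a guid of that layout back into its named components."""
    host, marker, owner, repo, tool_id, version = guid.split("/")
    if marker != "repos":
        raise ValueError("unexpected guid layout: %r" % guid)
    return {
        "host": host,
        "owner": owner,
        "repo": repo,
        "tool_id": tool_id,
        "version": version,
    }


if __name__ == "__main__":
    host = "toolshed.g2.bx.psu.edu"
    brad = tool_guid(host, "brad-chapman", "bam_to_bigwig", "bam_to_bigwig", "0.0.2")
    lance = tool_guid(host, "lparsons", "bam_to_bigwig", "bam_to_bigwig", "0.1")
    # Same repository name and same "old" tool id, yet the guids differ
    # because the owner is part of the identifier.
    print(brad)
    print(lance)
    print(brad != lance)
```

Because the owner component stays fixed no matter who pushes to the repository, only the trailing version component changes between releases.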
All of these details are explained in the tool shed wiki in the following section.
http://wiki.g2.bx.psu.edu/Tool%20Shed#Automatic_installation_of_Galaxy_tool_...
This section is also relevant to this discussion.
Thanks for the background.
Peter
Hi Peter,
Thanks for the thoughtful comments. I believe the requirement for the genome was imposed by the use of an underlying BedTools utility. I also think that in a newer version of that tool, the requirement was removed, since you correctly point out it is not really necessary.
I will see if I can update the tool to remove that requirement and also see about changing the tool id. Sorry for the conflict, that was an oversight on my part, though it would be nice if the Tool Shed could check and warn when someone tries to create a new tool. I would suggest flagging the new repo as invalid until the id is updated instead of outright rejection.
As for the author info, you're right, I should really add that as well. That tool was put together very quickly to meet the need of a customer and I didn't properly clean things up before I uploaded. I'll let you know once I get an update out. Of course, any patches etc. are welcome. ;-)
Lance
Peter Cock wrote:
Hi Brad & Lance,
I've been using Brad's bam_to_bigwig tool in Galaxy but realized today (with a new dataset using a splice-aware mapper) that it doesn't seem to be ignoring CIGAR N operators where a read is split over an intron. Looking over Brad's Python script which calculates the coverage to write an intermediate wiggle file, this is done with the samtools via pysam. It is not obvious to me if this can be easily modified to ignore introns. Is this possible Brad?
I wasn't aware of Lance's rival bam_to_bigwig tool in the ToolShed till now, and that does talk about this issue. It has a boolean option to ignore gaps when computing coverage, recommended for RNA-Seq where reads are mapped across long splice junctions.
Lance, from your tool's help it sounds like it needs a genome database build filled in. I don't understand this requirement - Brad's tool works just fine for standalone BAM files (for example reads mapped to an in house assembly). Is that not supported in your tool?
Galaxy team - why does the ToolShed allow duplicate repository names (here bam_to_bigwig) AND duplicate tool IDs (again, here bam_to_bigwig)? Won't this cause chaos when sharing workflows? I would suggest checking this when a tool is uploaded and rejecting repository name or tool ID clashes.
Regards,
Peter
P.S.
Brad, your tool is missing an explicit <requirements> tag listing the UCSC binary wigToBigWig, and the Python library pysam.
Lance, your tool doesn't seem to include any author information like your name or email address. I'm inferring it is yours from the Galaxy tool shed user id, lparsons.
-- Lance Parsons - Scientific Programmer 134 Carl C. Icahn Laboratory Lewis-Sigler Institute for Integrative Genomics Princeton University
On Apr 19, 2012, at 10:37 AM, Lance Parsons wrote:
and also see about changing the tool id.
I would recommend NOT doing this - see the separate thread for this message that describes how this works in the tool shed.
Sorry for the conflict, that was an oversight on my part, though it would be nice if the Tool Shed could check and warn when someone tries to create a new tool. I would suggest flagging the new repo as invalid until the id is updated instead of outright rejection.
Again, see the separate thread for this message - the tool shed does correctly handle this when the automatic installation process is used.
Lance and Peter;
Peter, thanks for noticing the problem and duplicate tools. Lance, I'm happy to merge these so there are not two different versions out there.
I prefer your use of genomeCoverageBed over my custom hacks. That's a nice approach I totally missed.
I avoid the need for the sam indexes by creating the file directly from the information in the BAM header. I don't think there is any way around creating it since it's required by the UCSC tools as well, but everything you need is in the BAM header.
There might be a sneaky way to do this with samtools view -H and awk but I'm not nearly skilled enough to pull that out.
Let me know what you think. I can also update my python wrapper script to use the genomeCoverageBed approach instead if you think that's easier.
Brad
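As a rough illustration of building the chromosome sizes file from the BAM header, the sketch below parses SAM header text (as printed by samtools view -H) and pulls the name and length from each @SQ line. This is not Brad's pysam-based code; the function names are hypothetical, and obtaining the header text is left to the caller.

```python
def chrom_sizes_from_header(header_text):
    """Return [(name, length), ...] from the @SQ lines of a SAM header."""
    sizes = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab separated TAG:VALUE pairs after the record type.
        fields = dict(
            field.split(":", 1)
            for field in line.split("\t")[1:]
            if ":" in field
        )
        sizes.append((fields["SN"], int(fields["LN"])))
    return sizes


def format_chrom_sizes(sizes):
    """Format as the two-column, tab separated text the UCSC tools expect."""
    return "".join("%s\t%d\n" % (name, length) for name, length in sizes)


if __name__ == "__main__":
    example = (
        "@HD\tVN:1.0\tSO:coordinate\n"
        "@SQ\tSN:chr1\tLN:1000\n"
        "@SQ\tSN:chr2\tLN:500\n"
        "@PG\tID:mapper\n"
    )
    print(format_chrom_sizes(chrom_sizes_from_header(example)), end="")
```

Since the sizes come from the BAM file itself, this works for reads mapped to an in house assembly with no genome build registered in Galaxy.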
Hi Peter,
Thanks for the thoughtful comments. I believe the requirement for the genome was imposed by the use of an underlying BedTools utility. I also think that in a newer version of that tool, the requirement was removed, since you correctly point out it is not really necessary.
I will see if I can update the tool to remove that requirement and also see about changing the tool id. Sorry for the conflict, that was an oversight on my part, though it would be nice if the Tool Shed could check and warn when someone tries to create a new tool. I would suggest flagging the new repo as invalid until the id is updated instead of outright rejection.
As for the author info, you're right, I should really add that as well. That tool was put together very quickly to meet the need of a customer and I didn't properly clean things up before I uploaded. I'll let you know once I get an update out. Of course, any patches etc. are welcome. ;-)
Lance
Peter Cock wrote:
Hi Brad & Lance,
I've been using Brad's bam_to_bigwig tool in Galaxy but realized today (with a new dataset using a splice-aware mapper) that it doesn't seem to be ignoring CIGAR N operators where a read is split over an intron. Looking over Brad's Python script which calculates the coverage to write an intermediate wiggle file, this is done with the samtools via pysam. It is not obvious to me if this can be easily modified to ignore introns. Is this possible Brad?
I wasn't aware of Lance's rival bam_to_bigwig tool in the ToolShed till now, and that does talk about this issue. It has a boolean option to ignore gaps when computing coverage, recommended for RNA-Seq where reads are mapped across long splice junctions.
Lance, from your tool's help it sounds like it needs a genome database build filled in. I don't understand this requirement - Brad's tool works just fine for standalone BAM files (for example reads mapped to an in house assembly). Is that not supported in your tool?
Galaxy team - why does the ToolShed allow duplicate repository names (here bam_to_bigwig) AND duplicate tool IDs (again, here bam_to_bigwig)? Won't this cause chaos when sharing workflows? I would suggest checking this when a tool is uploaded and rejecting repository name or tool ID clashes.
Regards,
Peter
P.S.
Brad, your tool is missing an explicit <requirements> tag listing the UCSC binary wigToBigWig, and the Python library pysam.
Lance, your tool doesn't seem to include any author information like your name or email address. I'm inferring it is yours from the Galaxy tool shed user id, lparsons.
-- Lance Parsons - Scientific Programmer 134 Carl C. Icahn Laboratory Lewis-Sigler Institute for Integrative Genomics Princeton University
On Fri, Apr 20, 2012 at 2:17 AM, Brad Chapman <chapmanb@50mail.com> wrote:
Lance and Peter; Peter, thanks for noticing the problem and duplicate tools. Lance, I'm happy to merge these so there are not two different versions out there.
I prefer your use of genomeCoverageBed over my custom hacks. That's a nice approach I totally missed.
I avoid the need for the sam indexes by creating the file directly from the information in the BAM header. I don't think there is any way around creating it since it's required by the UCSC tools as well, but everything you need is in the BAM header.
Indeed - I remember looking at that with you back in March 2011, including the special case of BAM files lacking an embedded SAM header (where the BAM header alone suffices).
There might be a sneaky way to do this with samtools view -H and awk but I'm not nearly skilled enough to pull that out.
Using pysam works nicely, and therefore I stuck with Python ;)
Let me know what you think. I can also update my python wrapper script to use the genomeCoverageBed approach instead if you think that's easier.
I've made the update to Brad's script from the Tool Shed (attached), switching to using genomeCoverageBed and bedGraphToBigWig (based on the approach used in Lance's script), although in doing so I dropped the region support (which wasn't exposed to the Galaxy interface anyway). Since genomeCoverageBed doesn't support this directly, we could use samtools view for this I think - if you want this functionality.
Sadly then I noticed that the Tool Shed version was out of date - lacking the new normalization option added here: https://github.com/chapmanb/bcbb/commits/master/nextgen/scripts/bam_to_wiggl...
This was enough for my immediate needs today, but I'd happily try and merge this into the git version and update the XML file to match and add the new split option.
We could list this as three contributing authors if you both like?
Peter
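For readers without the attachment, the two-step approach described above might look roughly like the following Python sketch. It is not the attached script: the function names and file names are hypothetical, and it assumes genomeCoverageBed (BEDTools) and bedGraphToBigWig (UCSC) are on the PATH and that the BAM file is coordinate sorted.

```python
import subprocess


def build_commands(bam, chrom_sizes, bedgraph, bigwig, split=True):
    """Return the two command lines as argument lists.

    genomeCoverageBed writes bedGraph to stdout (-bg); with -split, reads
    containing CIGAR N operators are broken into blocks so introns get no
    coverage. bedGraphToBigWig then converts the bedGraph using the
    chromosome sizes file.
    """
    coverage_cmd = ["genomeCoverageBed", "-ibam", bam, "-g", chrom_sizes, "-bg"]
    if split:
        coverage_cmd.append("-split")
    convert_cmd = ["bedGraphToBigWig", bedgraph, chrom_sizes, bigwig]
    return coverage_cmd, convert_cmd


def bam_to_bigwig(bam, chrom_sizes, bedgraph, bigwig, split=True):
    """Run both steps, redirecting the coverage output into the bedGraph file."""
    coverage_cmd, convert_cmd = build_commands(bam, chrom_sizes, bedgraph, bigwig, split)
    with open(bedgraph, "w") as handle:
        subprocess.check_call(coverage_cmd, stdout=handle)
    subprocess.check_call(convert_cmd)


if __name__ == "__main__":
    # Only print the commands here; nothing is executed.
    for cmd in build_commands("reads.bam", "chrom.sizes", "reads.bedgraph", "reads.bw"):
        print(" ".join(cmd))
```

Region support could in principle be restored by first extracting the region with samtools view, as suggested above, and feeding the result into the same two steps.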
I'm happy to have a merged version and put us all down as contributing authors. I won't have a chance to look at the code until next week, but I'd be happy to help out with any merging, etc. Thanks to both of you for your help and input.
Lance
Peter Cock wrote:
On Fri, Apr 20, 2012 at 2:17 AM, Brad Chapman <chapmanb@50mail.com> wrote:
Lance and Peter; Peter, thanks for noticing the problem and duplicate tools. Lance, I'm happy to merge these so there are not two different versions out there.
I prefer your use of genomeCoverageBed over my custom hacks. That's a nice approach I totally missed.
I avoid the need for the sam indexes by creating the file directly from the information in the BAM header. I don't think there is any way around creating it since it's required by the UCSC tools as well, but everything you need is in the BAM header.
Indeed - I remember looking at that with you back in March 2011, including the special case of BAM files lacking an embedded SAM header (where the BAM header alone suffices).
There might be a sneaky way to do this with samtools view -H and awk but I'm not nearly skilled enough to pull that out.
Using pysam works nicely, and therefore I stuck with Python ;)
Let me know what you think. I can also update my python wrapper script to use the genomeCoverageBed approach instead if you think that's easier.
I've made the update to Brad's script from the Tool Shed (attached), switching to using genomeCoverageBed and bedGraphToBigWig (based on the approach used in Lance's script), although in doing so I dropped the region support (which wasn't exposed to the Galaxy interface anyway). Since genomeCoverageBed doesn't support this directly, we could use samtools view for this I think - if you want this functionality.
Sadly then I noticed that the Tool Shed version was out of date - lacking the new normalization option added here: https://github.com/chapmanb/bcbb/commits/master/nextgen/scripts/bam_to_wiggl...
This was enough for my immediate needs today, but I'd happily try and merge this into the git version and update the XML file to match and add the new split option.
We could list this as three contributing authors if you both like?
Peter
-- Lance Parsons - Scientific Programmer 134 Carl C. Icahn Laboratory Lewis-Sigler Institute for Integrative Genomics Princeton University
Peter and Lance;
I've made the update to Brad's script from the Tool Shed (attached), switching to using genomeCoverageBed and bedGraphToBigWig (based on the approach used in Lance's script), although in doing so I dropped the region support (which wasn't exposed to the Galaxy interface anyway). Since genomeCoverageBed doesn't support this directly, we could use samtools view for this I think - if you want this functionality.
Awesome, thanks for doing this so quickly. Leaving out the region support for now is fine with me.
I gave both of you write access to that repo on the Toolshed so feel free to check in and edit away. Thanks again for tackling this,
Brad
On Fri, Apr 20, 2012 at 4:12 PM, Brad Chapman <chapmanb@50mail.com> wrote:
Peter and Lance;
I've made the update to Brad's script from the Tool Shed (attached), switching to using genomeCoverageBed and bedGraphToBigWig (based on the approach used in Lance's script), although in doing so I dropped the region support (which wasn't exposed to the Galaxy interface anyway). Since genomeCoverageBed doesn't support this directly, we could use samtools view for this I think - if you want this functionality.
Awesome, thanks for doing this so quickly. Leaving out the region support for now is fine with me.
I gave both of you write access to that repo on the Toolshed so feel free to check in and edit away. Thanks again for tackling this, Brad
Lovely. I propose to initially update the Python script as described, with minor changes to the XML to add the new option and change the dependencies. After that we should probably talk about merging the changes already committed on your github repo (normalization). Peter
participants (4)
- Brad Chapman
- Greg Von Kuster
- Lance Parsons
- Peter Cock