Parallelism using metadata
Hello Galaxy Dev,
I have a question regarding parallelism on a BAM file.
I have currently implemented 3 split options for the BAM datatype:
1) by_rname -> splits the BAM into files based on the chromosome
2) by_interval -> splits the BAM into files based on a defined bp length, and does so across the entire genome present in the BAM file
3) by_read -> splits the BAM into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)
Now, as you can imagine, reading and writing large BAM files is a pain, and I personally think this is not the best solution for Galaxy. What I was hoping to implement (but don't know how) is a new metadata option on BAM (bam.metadata.bam_interval) that would define the interval without creating a new file. Essentially, I would create a symbolic link to the old large file and then update metadata.bam_interval, which would contain a string of the form chrom:start-end. That string could then be used in a variety of tools which accept an interval as an option (for example samtools view).
This would be far more efficient than my first implementation, but the thing I don't know how to do is specify some kind of metadata at the split level. I was hoping maybe you could direct me to an example that does this?
I have added the following to my metadata.py file:
    class IntervalParameter( MetadataParameter ):
        def __init__( self, spec ):
            MetadataParameter.__init__( self, spec )
            self.rname = self.spec.get( "rname" )
            self.start = self.spec.get( "start" )
            self.end = self.spec.get( "end" )
        def to_string( self ):
            if self.rname == 'all':
                return ''
            else:
                return ''.join( [ self.rname, ':', self.start, '-', self.end ] )
And the following to my binary.py file:
    ### UNDER THE BAM CLASS
    MetadataElement( name="bam_interval", desc="BAM Interval", param=metadata.IntervalParameter, rname="all", start="", end="", visible=False, optional=True )
I somehow want rname="all" to be the default, but upon parallelism, I want to be able to adjust this parameter in the split functions. So,
    <parallelism method="multi" split_inputs="normal,tumour" split_mode="by_interval" split_size="50000000" merge_outputs="output"/>
would actually change the metadata of each file, and not create sub-BAMs.
PLEASE HELP!!!
Marco
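As a rough illustration of Marco's by_interval idea, the following standalone Python sketch computes fixed-size windows over each reference sequence and renders them as chrom:start-end strings of the kind samtools view accepts. It does not use Galaxy's metadata classes: the names `Interval`, `fixed_size_intervals`, and the `ref_lengths` mapping are hypothetical, and 1-based inclusive coordinates are assumed.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for the proposed bam_interval metadata value."""
    rname: str = "all"          # "all" means no restriction (whole file)
    start: Optional[int] = None
    end: Optional[int] = None

    def to_string(self) -> str:
        # Same rule as the to_string() in the post: empty string for "all".
        if self.rname == "all":
            return ""
        return f"{self.rname}:{self.start}-{self.end}"


def fixed_size_intervals(ref_lengths: Dict[str, int], size: int) -> Iterator[Interval]:
    """Yield consecutive windows of at most `size` bp over every reference."""
    if size <= 0:
        raise ValueError("size must be positive")
    for rname, length in ref_lengths.items():
        for start in range(1, length + 1, size):
            yield Interval(rname, start, min(start + size - 1, length))


if __name__ == "__main__":
    # Toy reference lengths; real values would come from the BAM header (@SQ lines).
    for interval in fixed_size_intervals({"chr1": 120, "chr2": 50}, 50):
        print(interval.to_string())
```

The last window on each reference is clipped to the reference length, so no region string points past the end of a chromosome.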
Hey Marco,
Thanks for the e-mail. This is an awesome idea, but I am worried it is very hard to do well in Galaxy. If you create symbolic links to the original file, Galaxy might delete the original file and the derived files would all break without warning. Galaxy does have separate concepts of datasets and history dataset associations, so that a dataset can exist in more than one place simultaneously. One could imagine sticking this metadata there and dynamically splitting up the BAM file whenever it is used in a tool or served out over the API, but this would be a large effort and would require modifications to many parts of Galaxy.
Something worth looking at is this work by Kyle Ellrott:
https://bitbucket.org/galaxy/galaxy-central/pull-request/175/parameter-based-bam-file-parallelization
This was a very localized effort to work some of these ideas into just the task splitting framework in Galaxy. This has the advantage of not needing to mess with metadata, datasets, etc. all the way up the chain.
Kyle has abandoned that approach, however. I still think it is a promising start to something like this, and it would be much less disruptive than doing this with metadata and datasets (though admittedly more limited as well).
If this will be primarily used for workflows, there are a couple of recent developments that might make splitting more feasible. Dannon introduced the ability to delete intermediate outputs from workflows a few releases ago, and the upcoming release (15.03) will introduce the ability to write tools that split up a single input into a collection. The existing workflow and dataset collection framework can then apply normal tools over every element of the collection, and you can write a tool to merge the results. More information can be found here:
https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-explicitly-produce-dataset
These common pipelines, where you split up a BAM file, run a bunch of steps, and then merge the results, will be executable in the near future. 15.03 won't have workflow editor support for this (I will try to get to it by the following release), but you can manually build up workflows to do it:
https://bitbucket.org/galaxy/galaxy-central/src/0468d285f89c799559926c94f300c42d05e8c47a/test/api/test_workflows.py?at=default#cl-544
Thanks again,
-John
(In reply to Marco Albuquerque <marcoalbuquerque.sfu@gmail.com>, Fri, Feb 27, 2015 at 10:04 PM.)
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
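The split, apply, merge pattern John describes can be sketched outside Galaxy as a list of samtools command lines: one samtools view call per region, then one samtools merge over the per-region outputs. This is only an illustration and not how Galaxy's collection framework is implemented. The function names and the intermediate file naming scheme are hypothetical, the input BAM is assumed to be coordinate-sorted and indexed, and the commands are built but never executed.

```python
from typing import List, Sequence


def view_command(bam_path: str, region: str, out_path: str) -> List[str]:
    """Command that extracts one region (needs an indexed BAM) into out_path."""
    return ["samtools", "view", "-b", "-o", out_path, bam_path, region]


def merge_command(out_path: str, part_paths: Sequence[str]) -> List[str]:
    """Command that merges the per-region BAM files back into one file."""
    return ["samtools", "merge", out_path, *part_paths]


def plan_scatter_gather(bam_path: str, regions: Sequence[str], merged_path: str) -> List[List[str]]:
    """Return the split commands followed by the final merge command."""
    # Hypothetical naming scheme for the intermediate per-region files.
    parts = [f"part_{i}.bam" for i in range(len(regions))]
    commands = [view_command(bam_path, r, p) for r, p in zip(regions, parts)]
    commands.append(merge_command(merged_path, parts))
    return commands


if __name__ == "__main__":
    plan = plan_scatter_gather(
        "tumour.bam", ["chr1:1-50000000", "chr2:1-50000000"], "merged.bam"
    )
    for cmd in plan:
        print(" ".join(cmd))
```

In a real pipeline the per-region analysis tool would run on each part between the split and merge stages; that step is omitted here because it depends on the tool being parallelized.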
Hi John,
Thanks for your reply.
I think for the time being, I will simply create a tool that creates an interval file, and then parallelize on this interval file. I agree this would be a useful feature to include, but I don't think I am anywhere near ready to start dabbling in Galaxy's core filesystem, as I have only been developing for a short while. I'm hoping that I can learn more about how Galaxy works at the GCC, and maybe then I will know how to efficiently adjust Galaxy code.
I am actually hoping to present my work at GCC15; I am part of a project that is adding a variety of Cancer Genomic Tools into Galaxy.
And thanks so much for the resources, I will surely look into all of these.
Much appreciated,
Marco
(In reply to John Chilton <jmchilton@gmail.com>, 2015-03-02 9:22 AM.)
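Marco's interim plan, a tool that writes an interval file which is then used for parallelization, might look roughly like the sketch below. The file format (one chrom:start-end region per line) and all function names are assumptions made for illustration; the thread does not specify them. The per-region worker is a placeholder that only computes the region length, where a real tool would process the reads in that region.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def write_interval_file(path: str, regions: Iterable[str]) -> None:
    """Write one region string per line (assumed format)."""
    with open(path, "w") as handle:
        for region in regions:
            handle.write(region + "\n")


def read_interval_file(path: str) -> List[str]:
    """Read regions back, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def region_length(region: str) -> int:
    """Placeholder worker: length in bp of a 1-based inclusive region."""
    # rsplit keeps reference names that themselves contain ':' intact.
    _, span = region.rsplit(":", 1)
    start, end = span.split("-")
    return int(end) - int(start) + 1


def process_in_parallel(path: str, workers: int = 4) -> List[int]:
    """Apply the worker to every region; results keep the file order."""
    regions = read_interval_file(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(region_length, regions))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        interval_path = os.path.join(tmp, "intervals.txt")
        write_interval_file(interval_path, ["chr1:1-50", "chr1:51-100", "chr2:1-30"])
        print(process_in_parallel(interval_path))
```

Because the large BAM is never rewritten, only the small interval file is split, which is the efficiency gain the original proposal was after.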
Hello all,
Marco, Jon and I are going to try to get together during GCC2015 for a BOF (a "Birds of a Feather" informal meeting) to talk about Galaxy parallelisation (and collections). We've not yet picked a time, but details should be on the wiki shortly:
https://wiki.galaxyproject.org/Events/GCC2015/BoFs
If anyone else is interested, please get in touch (e.g. if Roberto Alonso is here in Norwich, his input would be very valuable).
Thanks,
Peter
(In reply to Marco Albuquerque <marcoalbuquerque.sfu@gmail.com>, Mon, Mar 2, 2015 at 6:58 PM.)
BOF wiki page here, will add time and place once settled:
https://wiki.galaxyproject.org/Events/GCC2015/BoFs/DataSplittingAndParalleli...
(In reply to Peter Cock <p.j.a.cock@googlemail.com>, Mon, Jul 6, 2015 at 6:22 PM.)
participants (3)
- John Chilton
- Marco Albuquerque
- Peter Cock