Parallelism using metadata

28 Feb 2015

Hello Galaxy Dev,

I have a question regarding parallelism on a BAM file.

I have currently implemented 3 split options for the BAM datatype:

1) by_rname -> splits the BAM into files based on the chromosome
2) by_interval -> splits the BAM into files based on a defined bp length,
and does so across the entire genome present in the BAM file
3) by_read -> splits the BAM into files based on the number of reads
encountered (if there are multiple input files, all other files are split
on the same intervals as the first)
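
To make option 2 concrete, here is a minimal sketch of how fixed-width intervals could be generated from reference names and lengths. The function name `fixed_width_intervals` and the toy reference lengths are made up for illustration; in practice the lengths would come from the BAM header.

```python
def fixed_width_intervals(ref_lengths, split_size):
    """Yield (rname, start, end) tuples, 1-based and inclusive,
    covering every reference in windows of at most split_size bp."""
    for rname, length in ref_lengths:
        start = 1
        while start <= length:
            # The last window on a reference is truncated at its end.
            end = min(start + split_size - 1, length)
            yield (rname, start, end)
            start = end + 1


# Toy reference lengths (hypothetical, not real chromosome sizes).
ref_lengths = [("chr1", 120), ("chr2", 50)]
intervals = list(fixed_width_intervals(ref_lengths, 50))
```

With these toy values, chr1 is covered by three windows and chr2 by one.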

Now, as you can imagine, reading and writing large BAM files is a pain, and
I personally think this is not the best solution for Galaxy.
What I was hoping to implement (but don't know how) is a new metadata
option in bam (bam.metadata.bam_interval) that would define the interval
without creating a new file. Essentially, I would create a symbolic link to
the old large file and then update metadata.bam_interval, which would
contain a string of the form chrom:start-end. That string could then be
used by the variety of tools that accept an interval as an option (for
example, samtools view).
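
As a rough sketch of how a tool wrapper might consume such an interval string, the helper below builds a samtools view command line and appends the region only when one is set. The helper name `samtools_view_command` and the file name are hypothetical; the command is only constructed, not executed.

```python
def samtools_view_command(bam_path, region=""):
    """Build an argument list for `samtools view -b`.

    An empty region string means "the whole file", mirroring the idea
    of a default interval that restricts nothing.
    """
    cmd = ["samtools", "view", "-b", bam_path]
    if region:
        # Region syntax is chrom:start-end, e.g. chr1:1-50000000
        cmd.append(region)
    return cmd


whole_file = samtools_view_command("tumour.bam")
one_region = samtools_view_command("tumour.bam", "chr1:1-50000000")
```

Note that samtools can only answer region queries on an indexed BAM, so a symlinked file would need its index to be reachable as well.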

This would be far more efficient than my first implementation, but the thing
I don't know how to do is specify some kind of metadata at the split level.
I was hoping maybe you could direct me to an example that does this?

I have added the following to my metadata.py file:

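class IntervalParameter( MetadataParameter ):
    """Metadata parameter holding a BAM region (rname, start, end)."""

    def __init__( self, spec ):
        MetadataParameter.__init__( self, spec )
        self.rname = self.spec.get( "rname" )
        self.start = self.spec.get( "start" )
        self.end = self.spec.get( "end" )

    def to_string( self ):
        # rname == 'all' means no restriction, so emit an empty region
        if self.rname == 'all':
            return ''
        else:
            # str() so that integer coordinates also join cleanly
            return ''.join( [ self.rname, ':', str( self.start ), '-', str( self.end ) ] )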

And the following to my binary.py file:

### UNDER THE BAM CLASS

MetadataElement( name="bam_interval", desc="BAM Interval",
param=metadata.IntervalParameter, rname="all", start="", end="",
visible=False, optional=True)

I somehow want rname="all" to be the default, but upon parallelism, I want
to be able to adjust this parameter in the split functions.

So,

<parallelism method="multi" split_inputs="normal,tumour"
split_mode="by_interval" split_size="50000000" merge_outputs="output"/>

would actually change the metadata of each file, and not create sub-BAMs.
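
To illustrate the behaviour I am after, here is a standalone sketch of a split step that creates one directory per interval, symlinks the original BAM into it, and records the interval next to the link. This is not Galaxy's split API: the function name `split_by_symlink`, the `task_N` directory layout, and the `bam_interval.json` sidecar file are all assumptions made up for the example.

```python
import json
import os


def split_by_symlink(bam_path, intervals, work_dir):
    """Create one task directory per interval without copying the BAM.

    Each directory gets a symlink to the original file plus a small
    JSON file describing the interval that task should process.
    """
    task_dirs = []
    for index, (rname, start, end) in enumerate(intervals):
        task_dir = os.path.join(work_dir, "task_%d" % index)
        os.makedirs(task_dir)
        # Link to the original BAM instead of writing a sub-BAM.
        link_path = os.path.join(task_dir, os.path.basename(bam_path))
        os.symlink(os.path.abspath(bam_path), link_path)
        # Hypothetical sidecar standing in for metadata.bam_interval.
        with open(os.path.join(task_dir, "bam_interval.json"), "w") as handle:
            json.dump({"rname": rname, "start": start, "end": end}, handle)
        task_dirs.append(task_dir)
    return task_dirs
```

The cost of the split is then independent of the BAM size, since only links and a few bytes of interval data are written per task.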

PLEASE HELP!!!

Marco

Marco Albuquerque
