[galaxy-dev] splitting bams bai

27 Apr 2015

      Hello,

I am writing some code in Galaxy for splitting bams. Up to know, I am
following the ideas that Marco Albuquerque proposed in this thread
http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html
.
He proposed three ways of splitting:

1) by_rname -> splits the bam into files based on the chromosome
2) by_interval -> splits the bam into files based on  a defined bp length,
and does so across the entire genome present in the BAM file
3) by_read -> splits the bam into files based on the number of reads
encountered (if multiple files, all other files match the interval as the
first)

As I think the easiest is the first one, I started with this option.

First of all , I had to change line 82 of
lib/galaxy/jobs/splitters/multi.py as that "if" didn't let the code to
continue (I talked this in another thread).

Next, I had to do some changes in lib/galaxy/datatypes/binary.py. I added a
method "split" that creates the json for the script
extract_dataset_parts.sh. Here, in the next code you can see that I call
samtools -H in order to get the chromosome names,
now I realized that I can get that information directly from metadatas in
the input_datasets variable, so in the future I will change this.

def split(cls, input_datasets, subdir_generator_function, split_params):

        # 1) by_rname -> splits the bam into files based on the chromosome
        # 2) by_interval -> splits the bam into files based on  a defined
bp length, and does so across the entire genome present in the BAM file
        # 3) by_read -> splits the bam into files based on the number of
reads encountered (if multiple files, all other files match the interval as
the first)

        if split_params is None:
            return
        if len(input_datasets) > 1:
            raise Exception("BAM file splitting does not support multiple
files")
        input_file = input_datasets[0].file_name

        if 'split_mode' not in split_params:
            raise Exception('Tool does not define a split mode')
        elif split_params['split_mode'] == 'by_rname':
            log.debug("Attemping to split BAM file %s by chromosome",
input_file)

            #First get bam header
            params = ["samtools", "view", "-H", input_file]
            output = subprocess.Popen( params, stderr=subprocess.PIPE,
stdout=subprocess.PIPE ).communicate()[0]
            output = output.split("\n")
            chrList = []
    #Get chromosome list from the header.
            for line in output:
                fields = line.strip().split("\t")
                if fields[0].startswith("@SQ") and
fields[1].startswith("SN:"):
                    chrList.append(fields[1].split("SN:")[1])

    # Write json for extract_dataset_parts
            for chrName in chrList:
                try:
                    part_dir = subdir_generator_function()
                    base_name = os.path.basename(input_file)
                    part_path = os.path.join(part_dir, base_name)
                    split_data = dict(class_name='%s.%s' % (cls.__module__,
cls.__name__),
                                      output_name=part_path,
                                      input_name=input_file,
                                      args=dict(chromosome=chrName))
                    f = open(os.path.join(part_dir, 'split_info_%s.json' %
base_name), 'w')
                    json.dump(split_data, f)
                    f.close()

                except Exception, e:
                    log.error("Error: " + str(e))
                    raise
        else:
            raise Exception('Unsupported split mode %s' %
split_params['split_mode'])
    split = classmethod(split)

Well, this works correctly and writes the json as expected. Now I have to
write the code that is called by scripts/extract_dataset_part.py (inside of
extract_dataset_parts.sh) "cls.*process_split_file*(data)".
So I created the next two function in the Bam class:

def *process_split_file*(data):
        """
        This is called in the context of an external process launched by a
Task (possibly not on the Galaxy machine)
        to create the input files for the Task. The parameters:
        data - a dict containing the contents of the split file
        """
        args = data['args']
        input_name = data['input_name']
        output_name = data['output_name']
        chromosome = args['chromosome']

        commands = Bam.get_split_commands_chromosome(input_name,
output_name, chromosome)
        for cmd in commands:
            if 0 != os.system(cmd):
                raise Exception("Executing '%s' failed" % cmd)
        return True
process_split_file = staticmethod(process_split_file)

def get_split_commands_chromosome(input_name, output_name, chromosome):
        params = ["samtools view -h " + input_name + " " + output_name + "
" + chromosome]
        return params
get_split_commands_chromosome = staticmethod(get_split_commands_chromosome)

Which is my problem? That I need the .bai related with that dataset
"input_name", I think is in a metadata table, but I don't know how to get
it, Could you please help me with this?
In any case, if you find that I am doing something wrong, or you have a
better idea of implementing this, please don't hesitate to contact me.

Best regards

-- 
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)
C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es

[galaxy-dev] splitting bams bai

Roberto Alonso CIPF