Hello,

I am writing some code in Galaxy for splitting BAM files. Up to now, I have been following the ideas that Marco Albuquerque proposed in this thread: http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html

He proposed three ways of splitting:

1) by_rname -> splits the bam into files based on the chromosome

2) by_interval -> splits the bam into files based on a defined bp length, and does so across the entire genome present in the BAM file

3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)
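
To make the second mode concrete, here is a minimal sketch of how by_interval regions could be computed from the chromosome lengths. It is only an illustration, not part of Marco's proposal: the function name interval_regions, its arguments and the example lengths are hypothetical.

```python
def interval_regions(chrom_lengths, interval_bp):
    """Return samtools-style region strings (name:start-end, 1-based, inclusive).

    chrom_lengths is a list of (name, length) pairs, e.g. taken from the
    @SQ lines of a BAM header. Both arguments are hypothetical inputs.
    """
    if interval_bp <= 0:
        raise ValueError("interval_bp must be positive")
    regions = []
    for name, length in chrom_lengths:
        start = 1
        while start <= length:
            # The last interval of a chromosome may be shorter than interval_bp.
            end = min(start + interval_bp - 1, length)
            regions.append("%s:%d-%d" % (name, start, end))
            start = end + 1
    return regions


if __name__ == "__main__":
    for region in interval_regions([("chr1", 250), ("chr2", 100)], 100):
        print(region)
```

Each region string could then play the role that the chromosome name plays in the by_rname mode.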

Since I think the first one is the easiest, I started with that option.

First of all, I had to change line 82 of lib/galaxy/jobs/splitters/multi.py, because that "if" did not let the code continue (I discussed this in another thread).

Next, I had to make some changes in lib/galaxy/datatypes/binary.py. I added a "split" method that creates the JSON for the script extract_dataset_parts.sh. In the code below you can see that I call samtools view -H to get the chromosome names. I have since realized that I can get that information directly from the metadata in the input_datasets variable, so I will change this in the future.

    @classmethod
    def split(cls, input_datasets, subdir_generator_function, split_params):
        # 1) by_rname -> splits the bam into files based on the chromosome
        # 2) by_interval -> splits the bam into files based on a defined bp length, and does so across the entire genome present in the BAM file
        # 3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)
        if split_params is None:
            return
        if len(input_datasets) > 1:
            raise Exception("BAM file splitting does not support multiple files")
        input_file = input_datasets[0].file_name

        if 'split_mode' not in split_params:
            raise Exception('Tool does not define a split mode')
        elif split_params['split_mode'] == 'by_rname':
            log.debug("Attempting to split BAM file %s by chromosome", input_file)
            # First get the BAM header
            params = ["samtools", "view", "-H", input_file]
            output = subprocess.Popen(params, stderr=subprocess.PIPE, stdout=subprocess.PIPE).communicate()[0]
            output = output.split("\n")
            chrList = []
            # Get the chromosome list from the @SQ lines of the header
            for line in output:
                fields = line.strip().split("\t")
                if fields[0].startswith("@SQ") and fields[1].startswith("SN:"):
                    chrList.append(fields[1].split("SN:")[1])
            # Write one JSON file per chromosome for extract_dataset_parts
            for chrName in chrList:
                try:
                    part_dir = subdir_generator_function()
                    base_name = os.path.basename(input_file)
                    part_path = os.path.join(part_dir, base_name)
                    split_data = dict(class_name='%s.%s' % (cls.__module__, cls.__name__),
                                      output_name=part_path,
                                      input_name=input_file,
                                      args=dict(chromosome=chrName))
                    with open(os.path.join(part_dir, 'split_info_%s.json' % base_name), 'w') as f:
                        json.dump(split_data, f)
                except Exception as e:
                    log.error("Error: " + str(e))
                    raise
        else:
            raise Exception('Unsupported split mode %s' % split_params['split_mode'])
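
The header parsing step in the method above can also be pulled out into a small helper that works on the header text, which makes it testable without samtools. This is only a sketch: the name parse_sq_names is hypothetical, and unlike the loop above it does not assume that SN: is the first tag on the @SQ line.

```python
def parse_sq_names(header_text):
    """Return the reference sequence names found in the @SQ lines of a SAM header."""
    names = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        # Look for the SN: tag anywhere after the record type.
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[len("SN:"):])
                break
    return names


if __name__ == "__main__":
    example = "@HD\tVN:1.4\n@SQ\tSN:chr1\tLN:1000\n@SQ\tLN:500\tSN:chr2\n"
    print(parse_sq_names(example))
```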

Well, this works correctly and writes the JSON as expected. Now I have to write the code that scripts/extract_dataset_part.py (run inside extract_dataset_parts.sh) calls as "cls.process_split_file(data)".

So I created the following two functions in the Bam class:

    @staticmethod
    def process_split_file(data):
        """
        This is called in the context of an external process launched by a Task (possibly not on the Galaxy machine)
        to create the input files for the Task. The parameters:
        data - a dict containing the contents of the split file
        """
        args = data['args']
        input_name = data['input_name']
        output_name = data['output_name']
        chromosome = args['chromosome']
        commands = Bam.get_split_commands_chromosome(input_name, output_name, chromosome)
        for cmd in commands:
            if 0 != os.system(cmd):
                raise Exception("Executing '%s' failed" % cmd)
        return True

    @staticmethod
    def get_split_commands_chromosome(input_name, output_name, chromosome):
        # -b writes BAM, -h keeps the header, -o names the output file.
        # The trailing region argument requires an index (.bai) for input_name.
        params = ["samtools view -b -h -o " + output_name + " " + input_name + " " + chromosome]
        return params
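
Building the command as an argument list and running it with subprocess would avoid shell quoting problems when a path contains spaces. A minimal sketch, assuming the standard samtools view options -b (BAM output), -h (include the header) and -o (output file); the function names build_view_command and run_split are hypothetical, and the region argument still needs an index for the input file.

```python
import subprocess


def build_view_command(input_name, output_name, chromosome):
    """Return the samtools invocation as an argument list (no shell involved)."""
    return ["samtools", "view", "-b", "-h", "-o", output_name, input_name, chromosome]


def run_split(input_name, output_name, chromosome):
    """Run samtools for one chromosome and raise if it exits with an error."""
    cmd = build_view_command(input_name, output_name, chromosome)
    returncode = subprocess.call(cmd)
    if returncode != 0:
        raise Exception("Executing '%s' failed with exit code %d" % (" ".join(cmd), returncode))
    return True
```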

My problem is that I need the .bai index associated with the dataset "input_name". I think it is stored in a metadata table, but I don't know how to get it. Could you please help me with this?
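
For context, the workaround I am considering is to link the BAM and its index side by side, so that samtools finds the index under the conventional <bam>.bai name. A minimal sketch, assuming the index path is already known (which is exactly the part I am missing); link_bam_with_index, its arguments and the file name input.bam are hypothetical.

```python
import os


def link_bam_with_index(bam_path, index_path, work_dir):
    """Symlink a BAM and its index into work_dir as input.bam and input.bam.bai.

    Returns the path of the linked BAM, which can then be passed to samtools.
    """
    linked_bam = os.path.join(work_dir, "input.bam")
    linked_index = linked_bam + ".bai"
    # Absolute targets keep the links valid regardless of the current directory.
    os.symlink(os.path.abspath(bam_path), linked_bam)
    os.symlink(os.path.abspath(index_path), linked_index)
    return linked_bam
```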

In any case, if you find that I am doing something wrong, or you have a better idea for implementing this, please don't hesitate to contact me.

Best regards

Roberto Alonso

Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)

C./Eduardo Primo Yúfera (Científic), nº 3
(junto Oceanografico)
46012 Valencia, Spain
Tel: +34 963289680 Ext. 1021
Fax: +34 963289574
E-Mail: ralonso@cipf.es