Hello,
He proposed three ways of splitting:
1) by_rname -> splits the bam into files based on the chromosome
2) by_interval -> splits the bam into files based on a defined bp length, and does so across the entire genome present in the BAM file
3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)
As I think the easiest is the first one, I started with this option.
First of all , I had to change line 82 of lib/galaxy/jobs/splitters/multi.py as that "if" didn't let the code to continue (I talked this in another thread).
Next, I had to do some changes in lib/galaxy/datatypes/binary.py. I added a method "split" that creates the json for the script extract_dataset_parts.sh. Here, in the next code you can see that I call samtools -H in order to get the chromosome names,
now I realized that I can get that information directly from metadatas in the input_datasets variable, so in the future I will change this.
def split(cls, input_datasets, subdir_generator_function, split_params):
# 1) by_rname -> splits the bam into files based on the chromosome
# 2) by_interval -> splits the bam into files based on a defined bp length, and does so across the entire genome present in the BAM file
# 3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first)
if split_params is None:
return
if len(input_datasets) > 1:
raise Exception("BAM file splitting does not support multiple files")
input_file = input_datasets[0].file_name
if 'split_mode' not in split_params:
raise Exception('Tool does not define a split mode')
elif split_params['split_mode'] == 'by_rname':
log.debug("Attemping to split BAM file %s by chromosome", input_file)
#First get bam header
params = ["samtools", "view", "-H", input_file]
output = subprocess.Popen( params, stderr=subprocess.PIPE, stdout=subprocess.PIPE ).communicate()[0]
output = output.split("\n")
chrList = []
#Get chromosome list from the header.
for line in output:
fields = line.strip().split("\t")
if fields[0].startswith("@SQ") and fields[1].startswith("SN:"):
chrList.append(fields[1].split("SN:")[1])
# Write json for extract_dataset_parts
for chrName in chrList:
try:
part_dir = subdir_generator_function()
base_name = os.path.basename(input_file)
part_path = os.path.join(part_dir, base_name)
split_data = dict(class_name='%s.%s' % (cls.__module__, cls.__name__),
output_name=part_path,
input_name=input_file,
args=dict(chromosome=chrName))
f = open(os.path.join(part_dir, 'split_info_%s.json' % base_name), 'w')
json.dump(split_data, f)
f.close()
except Exception, e:
log.error("Error: " + str(e))
raise
else:
raise Exception('Unsupported split mode %s' % split_params['split_mode'])
split = classmethod(split)
Well, this works correctly and writes the json as expected. Now I have to write the code that is called by scripts/extract_dataset_part.py (inside of extract_dataset_parts.sh) "cls.process_split_file(data)".
So I created the next two function in the Bam class:
def process_split_file(data):
"""
This is called in the context of an external process launched by a Task (possibly not on the Galaxy machine)
to create the input files for the Task. The parameters:
data - a dict containing the contents of the split file
"""
args = data['args']
input_name = data['input_name']
output_name = data['output_name']
chromosome = args['chromosome']
commands = Bam.get_split_commands_chromosome(input_name, output_name, chromosome)
for cmd in commands:
if 0 != os.system(cmd):
raise Exception("Executing '%s' failed" % cmd)
return True
process_split_file = staticmethod(process_split_file)
def get_split_commands_chromosome(input_name, output_name, chromosome):
params = ["samtools view -h " + input_name + " " + output_name + " " + chromosome]
return params
get_split_commands_chromosome = staticmethod(get_split_commands_chromosome)
Which is my problem? That I need the .bai related with that dataset "input_name", I think is in a metadata table, but I don't know how to get it, Could you please help me with this?
In any case, if you find that I am doing something wrong, or you have a better idea of implementing this, please don't hesitate to contact me.
Best regards
--
Roberto Alonso
Functional Genomics Unit
Bioinformatics and Genomics Department
Prince Felipe Research Center (CIPF)