Hello, I am writing some code in Galaxy for splitting bams. Up to know, I am following the ideas that Marco Albuquerque proposed in this thread http://dev.list.galaxyproject.org/Parallelism-using-metadata-td4666763.html . He proposed three ways of splitting: 1) by_rname -> splits the bam into files based on the chromosome 2) by_interval -> splits the bam into files based on a defined bp length, and does so across the entire genome present in the BAM file 3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first) As I think the easiest is the first one, I started with this option. First of all , I had to change line 82 of lib/galaxy/jobs/splitters/multi.py as that "if" didn't let the code to continue (I talked this in another thread). Next, I had to do some changes in lib/galaxy/datatypes/binary.py. I added a method "split" that creates the json for the script extract_dataset_parts.sh. Here, in the next code you can see that I call samtools -H in order to get the chromosome names, now I realized that I can get that information directly from metadatas in the input_datasets variable, so in the future I will change this. def split(cls, input_datasets, subdir_generator_function, split_params): # 1) by_rname -> splits the bam into files based on the chromosome # 2) by_interval -> splits the bam into files based on a defined bp length, and does so across the entire genome present in the BAM file # 3) by_read -> splits the bam into files based on the number of reads encountered (if multiple files, all other files match the interval as the first) if split_params is None: return if len(input_datasets) > 1: raise Exception("BAM file splitting does not support multiple files") input_file = input_datasets[0].file_name if 'split_mode' not in split_params: raise Exception('Tool does not define a split mode') elif split_params['split_mode'] == 'by_rname': log.debug("Attemping to split BAM file %s by chromosome", input_file) #First get bam header params = ["samtools", "view", "-H", input_file] output = subprocess.Popen( params, stderr=subprocess.PIPE, stdout=subprocess.PIPE ).communicate()[0] output = output.split("\n") chrList = [] #Get chromosome list from the header. for line in output: fields = line.strip().split("\t") if fields[0].startswith("@SQ") and fields[1].startswith("SN:"): chrList.append(fields[1].split("SN:")[1]) # Write json for extract_dataset_parts for chrName in chrList: try: part_dir = subdir_generator_function() base_name = os.path.basename(input_file) part_path = os.path.join(part_dir, base_name) split_data = dict(class_name='%s.%s' % (cls.__module__, cls.__name__), output_name=part_path, input_name=input_file, args=dict(chromosome=chrName)) f = open(os.path.join(part_dir, 'split_info_%s.json' % base_name), 'w') json.dump(split_data, f) f.close() except Exception, e: log.error("Error: " + str(e)) raise else: raise Exception('Unsupported split mode %s' % split_params['split_mode']) split = classmethod(split) Well, this works correctly and writes the json as expected. Now I have to write the code that is called by scripts/extract_dataset_part.py (inside of extract_dataset_parts.sh) "cls.*process_split_file*(data)". So I created the next two function in the Bam class: def *process_split_file*(data): """ This is called in the context of an external process launched by a Task (possibly not on the Galaxy machine) to create the input files for the Task. The parameters: data - a dict containing the contents of the split file """ args = data['args'] input_name = data['input_name'] output_name = data['output_name'] chromosome = args['chromosome'] commands = Bam.get_split_commands_chromosome(input_name, output_name, chromosome) for cmd in commands: if 0 != os.system(cmd): raise Exception("Executing '%s' failed" % cmd) return True process_split_file = staticmethod(process_split_file) def get_split_commands_chromosome(input_name, output_name, chromosome): params = ["samtools view -h " + input_name + " " + output_name + " " + chromosome] return params get_split_commands_chromosome = staticmethod(get_split_commands_chromosome) Which is my problem? That I need the .bai related with that dataset "input_name", I think is in a metadata table, but I don't know how to get it, Could you please help me with this? In any case, if you find that I am doing something wrong, or you have a better idea of implementing this, please don't hesitate to contact me. Best regards -- Roberto Alonso Functional Genomics Unit Bioinformatics and Genomics Department Prince Felipe Research Center (CIPF) C./Eduardo Primo Yúfera (Científic), nº 3 (junto Oceanografico) 46012 Valencia, Spain Tel: +34 963289680 Ext. 1021 Fax: +34 963289574 E-Mail: ralonso@cipf.es