Dear list,

I thought I was working with fairly large datasets, but they have recently started to include ~2 GB files in sets of >50. I have run this sort of thing before as merged data by using tar to roll them up into one set, but when dealing with >100 GB tarfiles, Galaxy on EC2 seems to get very slow, although that's probably because of my implementation of dataset type detection (untar and read through the files).

Since tarring/untarring isn't very clean, I want to switch from tarring to creating composite files on merge, by putting a tool's results into the dataset.extra_files_path. This doesn't seem to be supported yet, because do_merge currently passes only the output dataset's file name to the respective datatype's merge method. I would like to pass more data to the merge method (say, the whole dataset object) so that it can get at the composite files directory and 'merge' the files in there (rough sketch in the P.S. below).

Good idea, bad idea? If anyone has views on this, I'd love to hear them.

cheers,
jorrit
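
P.S. To make the idea a bit more concrete, here's a rough sketch of what such a merge method could look like if do_merge handed over the dataset object instead of just its file name. The class name, the 'part_%d' layout, and that changed signature are all my own placeholders, not anything Galaxy supports today:

    import os
    import shutil

    class CompositeDatatype(object):

        @staticmethod
        def merge(split_files, output_dataset):
            # Hypothetical signature: output_dataset is the whole dataset
            # object, so both file_name and extra_files_path are reachable.
            extra_dir = output_dataset.extra_files_path
            if not os.path.exists(extra_dir):
                os.makedirs(extra_dir)
            # Copy each per-job result into the composite files directory
            # instead of tarring everything into one archive.
            for index, part in enumerate(split_files):
                shutil.copy(part, os.path.join(extra_dir, 'part_%d' % index))
            # Write a small primary file that just lists the composite parts.
            with open(output_dataset.file_name, 'w') as primary:
                for index in range(len(split_files)):
                    primary.write('part_%d\n' % index)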