[galaxy-dev] disk space and file formats

19 Aug 2011

      I'm not a bioinformaticist or programmer so apologies if this is a silly question. I've been occasionally running galaxy on my laptop and on the public server and I love it. The issue that I have is that my workflow requires many steps (what I do is probably very unusual). Each step creates a new large fastq file as the sequences are iteratively trimmed of junk. This fills my laptop and fills the public server with lots of unnecessary very large files.

I've been thinking about the structure of the files and my workflow and it seems to me that a more space efficient system would be to have a single file (or a sql database) on which each tool can work. Most of what I do is remove adapter sequences, extract barcodes, trim by quality, map to the genome and then process my hits by type (exon, intron etc). Since the clean up tools in FASTX aren't written with my problem in mind, it takes several passes to get the sequences trimmed up before mapping. 

If I had a file that had a format something like (here as tab delimited):
Header	Seq	Phred	Start	Len	Barcode	etc
Each tool could read the Seq and Phred starting at Start and running Len nucleotides and work on that. The tool could then write a new Start and Len to reflect the trimming it has done[1]. For convenience let me call this an HSPh format.

So it would be a real pain, no doubt, to rewrite all the tools. The little that I can read the tools it seems that the way the input is handled internally varies quite a bit. But it seems to me (naively?) that it would be relatively easy to write a conversion tool that would take the HSPh format and turn it into fastq or fast on the fly for the tools. Since most tools take fastq or fasta, it should be a write once, use many times, plugin. The harder (and slower) part would be mapping the fastq output back onto HSPh format.  But again, this should be a write once, use for many tools plugin. Both of the intermediating files would be deleted when done. Just as a real quick test I thought I would see how long it takes to run sed on a fastq 1.35GB file and it was so fast on my laptop, < 2 minutes, that it was done before I noticed. 

Then as people are interested, the tools could be converted to take as input the new format.

It may well be true in these days of $100 terabyte drives, this is not useful, that cycles are limiting, not drive space. But I think if the tools were rewritten to take and write to a HSPh format, processing would be faster too. It seems like some effort has been made to create the tab delimited format and maybe someone is already working on something like this (no doubt better designed).

I may have a comp sci undergrad working in the lab this fall. With help we (well, he) might manage some parts of this. He is apparently quite a talented and hard working C++ programmer. Is it worth while? 

thanks

[1] It could even do something like:
Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len etc
Tool is the tool name, Parameter a list of parameters used, Start and Len would be the latest trim positions. And the last Start Len pair would be the one to use by default for the next tool, but this would keep an edit history without doubling the space needs with each processing cycle. I wouldn't need this but it might be more friendly for users, an "undo" means removing 4 columns. A format like this would probably be better as a sql database.

[galaxy-dev] disk space and file formats

Patrick Page-McCaw