Read QC intermediate files account for most of the storage used on our Galaxy site, and it's a real problem that I need to solve soon.

My first attempt at taming the beast was to create a single read QC tool that handles the very basic operations: quality-encoding conversion, quality-based end trimming, and so on.  Such a tool could simply be a wrapper around your favorite existing tools, but one that doesn't keep the intermediate files.  The added benefit is that it runs faster, because it only has to queue onto the cluster once.
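
For concreteness, here's a minimal Python sketch of what I have in mind, assuming two hypothetical command-line tools (convert_qual and trim_ends are placeholders, not real programs) that each read stdin and write stdout:

    #!/usr/bin/env python
    # Sketch of a combined read-QC wrapper: the steps are chained through
    # pipes, so no intermediate file ever touches the disk, and the whole
    # thing queues onto the cluster as a single job.
    import subprocess
    import sys

    def run_qc_pipeline(in_fastq, out_fastq):
        with open(in_fastq, "rb") as src, open(out_fastq, "wb") as dst:
            # Step 1: convert the quality encoding (hypothetical tool).
            p1 = subprocess.Popen(["convert_qual", "--to", "sanger"],
                                  stdin=src, stdout=subprocess.PIPE)
            # Step 2: quality-based end trimming, fed directly from step 1.
            p2 = subprocess.Popen(["trim_ends", "--min-qual", "20"],
                                  stdin=p1.stdout, stdout=dst)
            p1.stdout.close()  # so p1 gets SIGPIPE if p2 exits early
            if p2.wait() != 0 or p1.wait() != 0:
                sys.exit("read QC pipeline failed")

    if __name__ == "__main__":
        run_qc_pipeline(sys.argv[1], sys.argv[2])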

Sure, one might argue that it's nice to have all the intermediate files around just in case you wish to review them, but in practice I have found that this happens relatively infrequently and is too expensive to justify.  If you're a small lab, maybe that's fine; but if you generate a lot of sequence, a more production-line approach is reasonable.

I've been toying with the idea of replacing all the fastq datatypes with a single fastq datatype that is sanger-encoded and gzipped.  I think gzipped read files are about a quarter the size of the uncompressed version.  Of course, many tools will need a wrapper if they don't accept gzipped input, but that's trivial (and many already support compressed reads).
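
For what it's worth, handling gzipped input transparently is only a few lines in a tool written in Python; a rough sketch, assuming the usual four-line fastq layout:

    import gzip

    def fastq_records(path):
        """Yield (header, seq, qual) tuples from a plain or gzipped fastq."""
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as fh:
            while True:
                header = fh.readline().rstrip()
                if not header:
                    break
                seq = fh.readline().rstrip()
                fh.readline()              # skip the '+' separator line
                qual = fh.readline().rstrip()
                yield header[1:], seq, qual

    # Example: count reads without ever writing an uncompressed copy.
    n = sum(1 for _ in fastq_records("reads.fq.gz"))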

However, the import tool automatically uncompresses uploaded files, so I'd need to do some hacking there to prevent this.

Heck, what we really need is a nice compact binary format for reads, perhaps one that doesn't even store read IDs (although pairing would still need to be recorded).
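
As a back-of-the-envelope illustration (a thought experiment, not a format proposal): packing A/C/G/T into two bits per base quarters the sequence storage before any general-purpose compression is even applied.  Ns, qualities, and read lengths would all need separate handling:

    # Hypothetical 2-bit base packing: four bases per byte, no read IDs;
    # pairing could be recorded implicitly by interleaving mates.  Sequence
    # lengths must be stored elsewhere, since the last byte may be partial.
    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack_seq(seq):
        """Pack a DNA string (A/C/G/T only) into bytes, 4 bases per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | CODE[base]
            out.append(byte)
        return bytes(out)

    assert pack_seq("ACGT") == bytes([0b00011011])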

Thoughts?

On Fri, Aug 19, 2011 at 11:43 AM, Jelle Scholtalbers <j.scholtalbers@gmail.com> wrote:
Hi Patrick,

the issue you are having is partly a consequence of Galaxy's goal of
ensuring reproducible science by saving each intermediate step and its
output files. For example, in your current workflow in Galaxy you can
easily do something else with each intermediate file - feed it to a
different tool just to check what the average read length is after
filtering - even two months after your run.
If, however, you insist on keeping disk usage low and don't want to
start programming - as your proposed solutions would require - and
aren't too afraid of the command line, you might want to start there.

The thing is, a lot of tools accept either an input file or an input
stream, and these same tools can likewise write either to an output
file or to an output stream. This way you can "pipe" the tools
together, e.g.:

    trimMyFq -i rawinput.fq | removeBarcode -i - -n optionN | filterJunk -i - -o finalOutput.fq

I don't know which programs you actually use, but the principle is
probably the same (as long as the tools actually accept streams).
This example saves you disk space because, of the three tools run,
only one actually writes to disk. On the downside, it also means you
don't have an output file from removeBarcode that you can look at to
check that everything went OK - although a small pass-through filter,
as sketched below, can give you a sanity check without storing the file.
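
A sketch of such a filter in Python (the surrounding tool names are still the hypothetical ones from the example above): it sits in the middle of the pipe, copying the fastq stream unchanged while reporting a few summary numbers to stderr:

    #!/usr/bin/env python
    # inspect.py - copy a fastq stream from stdin to stdout unchanged,
    # reporting read count and mean read length on stderr, e.g.:
    #   trimMyFq -i rawinput.fq | inspect.py | removeBarcode -i - -n optionN | ...
    import sys

    reads, total_len = 0, 0
    while True:
        record = [sys.stdin.readline() for _ in range(4)]
        if not record[0]:
            break
        reads += 1
        total_len += len(record[1].rstrip())
        sys.stdout.writelines(record)

    mean = total_len / reads if reads else 0.0
    print("reads: %d  mean length: %.1f" % (reads, mean), file=sys.stderr)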

If you do want to program, or someone else wants to do it, I could
imagine a tool that combines your iterative steps and runs as one tool
- you could even wrap up your 'pipeline' in a script and put that as a
tool in your Galaxy instance and/or in the Tool Shed.

Cheers,
Jelle



On Fri, Aug 19, 2011 at 6:29 PM, Patrick Page-McCaw
<ppagemccaw@gmail.com> wrote:
> I'm not a bioinformatician or a programmer, so apologies if this is a silly question. I've been occasionally running Galaxy on my laptop and on the public server, and I love it. The issue I have is that my workflow requires many steps (what I do is probably very unusual). Each step creates a new large fastq file as the sequences are iteratively trimmed of junk. This fills my laptop, and the public server, with lots of unnecessary, very large files.
>
> I've been thinking about the structure of the files and my workflow, and it seems to me that a more space-efficient system would be a single file (or an SQL database) on which each tool can work. Most of what I do is remove adapter sequences, extract barcodes, trim by quality, map to the genome, and then process my hits by type (exon, intron, etc.). Since the clean-up tools in FASTX weren't written with my problem in mind, it takes several passes to get the sequences trimmed up before mapping.
>
> If I had a file with a format something like this (shown here as tab-delimited):
> Header  Seq     Phred   Start   Len     Barcode etc
> Each tool could read the Seq and Phred starting at Start and running Len nucleotides and work on that. The tool could then write a new Start and Len to reflect the trimming it has done[1]. For convenience, let me call this an HSPh format.
>
> So it would be a real pain, no doubt, to rewrite all the tools. From the little I can read of the tools, the way input is handled internally seems to vary quite a bit. But it seems to me (naively?) that it would be relatively easy to write a conversion tool that takes the HSPh format and turns it into fastq or fasta on the fly for the tools. Since most tools take fastq or fasta, it should be a write-once, use-many-times plugin. The harder (and slower) part would be mapping the fastq output back onto the HSPh format, but again, this should be a write-once, use-for-many-tools plugin. Both of the intermediate files would be deleted when done. Just as a quick test, I timed sed on a 1.35 GB fastq file, and it was so fast on my laptop (under 2 minutes) that it was done before I noticed.
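>
> As a rough Python sketch of the easy direction (HSPh in, fastq out; the column layout is the hypothetical one above, and this is just to show how little code the converter needs):
>
>     import sys
>
>     # Columns: Header, Seq, Phred, Start, Len (tab-delimited).
>     for line in sys.stdin:
>         header, seq, phred, start, length = line.rstrip("\n").split("\t")[:5]
>         start, length = int(start), int(length)
>         # Emit only the current trimmed window as a fastq record.
>         sys.stdout.write("@%s\n%s\n+\n%s\n" % (
>             header, seq[start:start + length], phred[start:start + length]))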
>
> Then as people are interested, the tools could be converted to take as input the new format.
>
> It may well be that in these days of $100 terabyte drives this is not useful, and that cycles, not drive space, are limiting. But I think that if the tools were rewritten to read and write an HSPh format, processing would be faster too. It seems like some effort has already been made to create the tab-delimited format, and maybe someone is already working on something like this (no doubt better designed).
>
> I may have a comp sci undergrad working in the lab this fall. With help, we (well, he) might manage some parts of this. He is apparently quite a talented and hard-working C++ programmer. Is it worthwhile?
>
> thanks
>
> [1] It could even do something like:
> Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len etc
> Tool is the tool name, Parameter a list of the parameters used, and Start and Len the latest trim positions. The last Start/Len pair would be the one used by default by the next tool, so this keeps an edit history without doubling the space needs with each processing cycle. I wouldn't need this, but it might be friendlier for users: an "undo" just means removing four columns. A format like this would probably be better as an SQL database.
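>
> As a concrete (hypothetical) illustration, a record that has been through one trimming tool might look like:
>
>     read_0001  ACGT...  IIII...  0  100  trimMyFq  q20  5  90
>
> i.e., the raw read was 100 bases, and after trimMyFq (parameter q20) the current window is the 90 bases starting at position 5; undoing that step means dropping the last four columns.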
