Read QC intermediate files account for most of the storage used on our Galaxy site, and it's a real problem that I must solve soon.
Hi Patrick,
the issue you are having is partly a consequence of Galaxy's design goal of
ensuring reproducible science by saving every intermediate step and output
file. For example, in your current workflow in Galaxy you can easily do
something else with each intermediate file - feed it to a different tool
just to check what the average read length is after filtering - even two
months after your run.
If you nevertheless insist on keeping disk usage low, don't want to start
programming (as the solutions you propose would require), and aren't too
afraid of the command line, you might want to start there.
The thing is, a lot of tools accept either an input file or an input
stream, and the same tools can usually write either to an output file or
to an output stream. This lets you "pipe" the tools together, e.g.

  trimMyFq -i rawinput.fq | removeBarcode -i - -n optionN | filterJunk -i - -o finalOutput.fq
I don't know which programs you actually use, but the principle is
probably the same (as long as the tools actually accept streams).
This example saves you disk space because, of the three tools run, only
one actually writes to disk. On the downside, it also means you don't
have an output file from removeBarcode that you can look at to check
whether everything went OK.
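If you do want to spot-check a step without keeping the whole intermediate
file, one (bash-specific) trick is to tee a small sample of the stream off
to the side. A minimal sketch, reusing the made-up tool and file names from
the example above and keeping only the first 1000 reads (4000 FASTQ lines)
after trimming:

  trimMyFq -i rawinput.fq \
    | tee >(awk 'NR <= 4000' > trim_check.fq) \
    | removeBarcode -i - -n optionN \
    | filterJunk -i - -o finalOutput.fq

The awk filter reads the whole stream (so the pipe never breaks) but writes
only the first 4000 lines to trim_check.fq, which you can then inspect.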
If you do want to program, or someone else is willing to do it for you, I
could imagine a tool that combines your iterative steps so they run as a
single step - you could even wrap your 'pipeline' in a script and add it
as a tool in your Galaxy instance and/or in the Tool Shed, along the lines
of the sketch below.
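A minimal sketch of what such a wrapper script might look like (the tool
names and options are still the made-up ones from the example above, and
the argument order is just an assumption):

  #!/bin/bash
  # clean_reads.sh - hypothetical wrapper that runs the whole clean-up
  # pipeline as one step; only the final output file is kept on disk.
  set -e -o pipefail

  INPUT="$1"     # raw FASTQ handed over by Galaxy
  BARCODE="$2"   # barcode/option passed through from the tool form
  OUTPUT="$3"    # final FASTQ that ends up in the history

  trimMyFq -i "$INPUT" \
    | removeBarcode -i - -n "$BARCODE" \
    | filterJunk -i - -o "$OUTPUT"

You would still need a small tool XML file to describe the inputs and
output to Galaxy, but the disk usage per run stays at a single FASTQ.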
Cheers,
Jelle
On Fri, Aug 19, 2011 at 6:29 PM, Patrick Page-McCaw
<ppagemccaw@gmail.com> wrote:
> I'm not a bioinformatician or programmer, so apologies if this is a silly question. I've been occasionally running Galaxy on my laptop and on the public server and I love it. The issue I have is that my workflow requires many steps (what I do is probably very unusual). Each step creates a new large fastq file as the sequences are iteratively trimmed of junk. This fills my laptop, and fills the public server, with lots of unnecessary very large files.
>
> I've been thinking about the structure of the files and my workflow, and it seems to me that a more space-efficient system would be a single file (or an SQL database) on which each tool can work. Most of what I do is remove adapter sequences, extract barcodes, trim by quality, map to the genome and then process my hits by type (exon, intron, etc.). Since the clean-up tools in FASTX aren't written with my problem in mind, it takes several passes to get the sequences trimmed up before mapping.
>
> If I had a file with a format something like this (here shown as tab-delimited):
> Header  Seq  Phred  Start  Len  Barcode  etc.
> Each tool could read the Seq and Phred starting at Start and running for Len nucleotides, and work on that. The tool would then write a new Start and Len to reflect the trimming it has done[1]. For convenience, let me call this the HSPh format.
>
> So it would be a real pain, no doubt, to rewrite all the tools. From the little I can read of the tools, the way input is handled internally seems to vary quite a bit. But it seems to me (naively?) that it would be relatively easy to write a conversion tool that takes the HSPh format and turns it into fastq or fasta on the fly for the tools. Since most tools take fastq or fasta, it should be a write-once, use-many-times plugin. The harder (and slower) part would be mapping the fastq output back onto the HSPh format. But again, this should be a write-once, use-for-many-tools plugin. Both of the intermediate files would be deleted when done. Just as a quick test I timed sed on a 1.35 GB fastq file, and it was so fast on my laptop - under 2 minutes - that it was done before I noticed.
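> Something like this awk one-liner is roughly the kind of shim I have in mind for the HSPh-to-fastq direction (assuming the columns are Header, Seq, Phred, Start, Len, in that order; file names are placeholders):
>
>   awk -F'\t' '{ print "@" $1; print substr($2, $4, $5); print "+"; print substr($3, $4, $5) }' reads.hsph > reads.fq
>
> Each HSPh row becomes one four-line fastq record, with the sequence and quality strings trimmed to the current Start/Len window.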
>
> Then, as people become interested, the tools could be converted to take the new format as input directly.
>
> It may well be true that in these days of $100 terabyte drives this is not useful - that cycles, not drive space, are limiting. But I think that if the tools were rewritten to read and write an HSPh format, processing would be faster too. It seems some effort has already been made to create the tab-delimited format, and maybe someone is already working on something like this (no doubt better designed).
>
> I may have a comp sci undergrad working in the lab this fall. With help, we (well, he) might manage some parts of this. He is apparently quite a talented and hard-working C++ programmer. Is it worthwhile?
>
> thanks
>
> [1] It could even do something like:
> Header Seq Phred Start Len Tool Parameter Start Len Tool Parameter Start Len etc
> Tool is the tool name, Parameter a list of the parameters used, and Start and Len would be the latest trim positions. The last Start/Len pair would be the one to use by default for the next tool, so this would keep an edit history without doubling the space needs with each processing cycle. I wouldn't need this, but it might be friendlier for users: an "undo" would just mean removing four columns. A format like this would probably be better as an SQL database.
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client. To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/