Dan and Peter,

Peter Cock wrote, On 03/29/2011 12:08 PM:
Why not do the Illumina to Sanger conversion as part of your pipeline that gets the data into Galaxy (and mark the files as fastqsanger)? As Glen said, with a C tool that isn't really so slow. That future proofs you for the pending Illumina CASAVA 1.8 release, and means you don't need to maintain divergent Bowtie wrappers for Galaxy.
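For reference, the conversion being suggested amounts to a fixed shift of each quality character. A minimal Python sketch, assuming the input is Illumina 1.3+ style (Phred scores with ASCII offset 64) and the target is Sanger (offset 33); the function name is made up for illustration, and older Solexa-scaled files would need a different mapping:

```python
ILLUMINA_OFFSET = 64  # Illumina 1.3+ encodes Phred score q as chr(q + 64)
SANGER_OFFSET = 33    # Sanger encodes Phred score q as chr(q + 33)


def illumina_to_sanger(qual):
    """Convert one quality line from Illumina 1.3+ to Sanger encoding.

    Hypothetical helper for illustration; not the Groomer's actual code.
    """
    shift = ILLUMINA_OFFSET - SANGER_OFFSET  # 31
    converted = []
    for char in qual:
        code = ord(char) - shift
        if code < SANGER_OFFSET:
            # A character below '@' cannot be a valid Illumina 1.3+ score.
            raise ValueError(f"not an Illumina 1.3+ quality character: {char!r}")
        converted.append(chr(code))
    return "".join(converted)


if __name__ == "__main__":
    print(illumina_to_sanger("hhhh@"))
```

Only the fourth line of each record changes; the title, sequence, and "+" lines are copied through as they are.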
I refuse to groom on general principle. The idea itself is unreasonable - all the tools support the Illumina scale natively. I'm not going to waste my disk space and users' time (and SGE time) by grooming. When I see CASAVA 1.8 running, I'll switch (as we are software people, we know that there's a gap between the planning document and the real software). Note that even that CASAVA 1.8 document mentions that the export files will still be in Illumina format, so it won't be completely gone.

Daniel Blankenberg wrote, On 03/29/2011 12:41 PM:
The Grooming step is currently very time consuming and can be quite wasteful in disk space if the source and target fastq files are the same.
It is wasteful in any case, not just if they are the same...
but I have seen many occasions where Grooming has 'saved the day' by e.g. detecting truncated files that may have gone undetected by downstream tools or by indicating to the user that the variant they had selected as the source was incorrect.
I would humbly guess that most of those truncated files are due to problematic HTTP uploads - so it saves the day from another problem, which should be avoided altogether.
However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.
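For concreteness, the naive check described in the quoted paragraph could look roughly like the Python sketch below. It assumes exactly four lines per read and ASCII scores, as stated; the function name, the variant table, and the allowed character ranges are illustrative assumptions, not the Groomer's actual code. It does not cover reusing the original dataset file as the output, which is the part that needs framework changes.

```python
import io
from itertools import islice

# Assumed (lowest, highest) ASCII codes allowed per variant, for illustration.
QUALITY_RANGES = {
    "sanger": (33, 126),
    "illumina": (64, 126),
}


def check_fastq(handle, variant="sanger"):
    """Return the number of reads, or raise ValueError on the first problem."""
    low, high = QUALITY_RANGES[variant]
    count = 0
    while True:
        # Naive parsing: every read is exactly four lines.
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not record:
            return count  # clean end of file
        if len(record) < 4:
            raise ValueError(f"truncated record after read {count}")
        title, seq, plus, qual = record
        if not title.startswith("@"):
            raise ValueError(f"read {count + 1}: title line does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"read {count + 1}: third line does not start with '+'")
        if len(seq) != len(qual):
            raise ValueError(f"read {count + 1}: sequence and quality lengths differ")
        if any(not low <= ord(char) <= high for char in qual):
            raise ValueError(f"read {count + 1}: quality outside the {variant} range")
        count += 1


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nhhhh\n"
    print(check_fastq(io.StringIO(sample), variant="illumina"))
```

A truncated upload shows up as a record with fewer than four lines, and a wrongly selected variant shows up as an out-of-range quality character, which are the two cases mentioned above. Nothing is written to disk; the cost is one pass over the file.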
I know you (the Galaxy team) try very hard to have everything in native Python (for easy deployment), but I still hold the opinion that these tools should not be done in Python. No matter how much you minimize the processing, it will not be as efficient as a good compiled program. Python (or Perl, I don't discriminate) can probably do this entire "check only mode" in just a few lines of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not practical. Bottom line - I wouldn't use a Python "checker" anyhow.

-gordon