The Grooming step is currently very time-consuming and can be quite wasteful of disk space if the source and target fastq variants are the same, but I have seen many occasions where Grooming has 'saved the day', e.g. by detecting truncated files that might have gone undetected by downstream tools, or by indicating to the user that the variant they had selected as the source was incorrect. That said, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and would not waste disk space, but it would require enhancements to the framework.

Thanks,
Dan

On Mar 29, 2011, at 11:46 AM, Assaf Gordon wrote:
Hi Dan,
Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:
When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).
I'm not using the "get data" tool; we have our own import tools (uploading huge files over HTTP is not stable enough for me). You're right that I should change the output format of this tool from 'fastq' to 'fastqillumina' (but the tool pre-dated all those built-in formats in Galaxy, so I never bothered to update it...).
The one time I tried to make the built-in Bowtie tool available, I got complaints along the lines of "why doesn't my FASTQ file appear in the input list" - because it was "fastq" and not "fastqsanger" after grooming - this is a silly technical step that should not be a concern to users - so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).
It should not be possible to have a data.ext=='fastq' after Grooming (unless manually changed by a user), please report the steps that lead to this.
Sorry, I didn't explain myself correctly: I forbid users from grooming anything (just joking, but I really really discourage its use) - so all the datasets are 'fastq' not 'fastqsanger'. There is no bug - the groomer is simply not used. As stated above, I should change the output format from 'fastq' to 'fastqillumina' (but up until this recent changeset it wouldn't have made any difference, because the bowtie tool would not have accepted fastqillumina).
-gordon
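
For reference, the 'check only' idea at the top of this message could look roughly like the sketch below. It is only an illustration of the naive parser (exactly 4 lines to a read, ascii scores, one declared variant), not the Groomer's actual code: the function name `check_fastq` and the `QUALITY_RANGES` table are made up for this example, and the ASCII ranges are the commonly cited ones for each variant rather than values taken from the Galaxy source. It does not cover the "reuse the original dataset file as the output" part, which is the piece that would need framework changes.

```python
import io

# Assumed ASCII ranges for quality characters (commonly cited values,
# not copied from Galaxy): Sanger is Phred+33, Illumina 1.3+ is Phred+64,
# Solexa is Solexa+64 with scores starting at -5.
QUALITY_RANGES = {
    "fastqsanger": (33, 126),
    "fastqillumina": (64, 126),
    "fastqsolexa": (59, 126),
}


def check_fastq(handle, variant="fastqsanger"):
    """Naively validate a FASTQ stream without rewriting it.

    Assumes exactly 4 lines per read and ASCII-encoded scores.
    Returns the number of reads, or raises ValueError at the first problem.
    """
    low, high = QUALITY_RANGES[variant]
    reads = 0
    while True:
        title = handle.readline()
        if not title:
            break  # clean end of file on a record boundary
        block = [title] + [handle.readline() for _ in range(3)]
        # readline() returns "" only at end of file, so an empty string
        # here means the file stopped in the middle of a record.
        if any(line == "" for line in block):
            raise ValueError("truncated record after read %d" % reads)
        title, seq, plus, qual = [line.rstrip("\r\n") for line in block]
        if not title.startswith("@"):
            raise ValueError("read %d: title line does not start with '@'" % (reads + 1))
        if not plus.startswith("+"):
            raise ValueError("read %d: third line does not start with '+'" % (reads + 1))
        if len(seq) != len(qual):
            raise ValueError("read %d: sequence and quality lengths differ" % (reads + 1))
        if any(not (low <= ord(ch) <= high) for ch in qual):
            raise ValueError("read %d: quality outside %s range" % (reads + 1, variant))
        reads += 1
    return reads


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+r2\n!!\n"
    print(check_fastq(io.StringIO(sample), "fastqsanger"))
```

Run against the same sample with `variant="fastqillumina"`, the check fails on the second read because '!' (ASCII 33) is below the assumed Illumina range, which is the "wrong source variant" case mentioned above. A file cut off mid-record raises the truncation error instead.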