Alternative bowtie tools

Assaf Gordon

29 Mar 2011

2:31 a.m.

Hello all,

We're developing alternative bowtie tools that more closely suit our needs, and we're happy to share them (and get comments).

The main differences are:
1. Separate tools for paired-end and single-end reads.
2. The tools accept FASTA and FASTQ in both Sanger and Illumina formats (no more need for grooming). Illumina is the default for newly uploaded FASTQ files.
3. Fewer visible options; only the most relevant ones are available (IMHO better organized).
4. Optional output of non-mappers and maxed-out mappers as FASTQ files (to use in downstream analysis or remapping attempts).
5. Allows users to request "quiet" output (only aligned reads are reported in the SAM file).
6. Allows users to specify custom options (free text field), for those who know what they are doing.
7. Outputs a sorted SAM file; sorting is done using a multi-threaded sort.

You can see the interface (but not run the tools) here:
http://cancan.cshl.edu/publicgalaxy/root?tool_id=cshl_bowtie_se1
http://cancan.cshl.edu/publicgalaxy/root?tool_id=cshl_bowtie_pe1

The tools (XML & shell scripts) are available here:
http://cancan.cshl.edu/labmembers/gordon/files/galaxy_bowtie_tools.tar.bz2

Note about multithreaded bowtie: currently the tools use 10 threads (hard-coded in the XML files) - easily changeable.

Note about the sorting: the "sam_sort.sh" script requires the latest sort from coreutils-8.10 (which does multi-threaded sort). The number of threads, the maximum amount of RAM allowed for sort, and the temporary storage directory are all hard-coded at the beginning of the script - easily changeable.

TopHat and Cufflinks tools will follow soon (if there's any interest).

Comments are of course welcome; these are very much work-in-progress.

-gordon
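For readers who want to picture the sorting step, here is a minimal sketch in Python of how a multi-threaded GNU sort call for a SAM file could be assembled. It is not the contents of "sam_sort.sh": the function names, the default values, and the choice of sort keys (reference name, then position) are all assumptions made for illustration. The options used (`--parallel`, `--buffer-size`, `--temporary-directory`, `--field-separator`, `--key`) are standard GNU sort options.

```python
def build_sam_sort_command(threads=10, memory="4G", tmp_dir="/tmp"):
    """Return an argv list for GNU sort that orders SAM alignment lines.

    All defaults are placeholders, not values taken from sam_sort.sh.
    """
    return [
        "sort",
        f"--parallel={threads}",             # number of sort threads
        f"--buffer-size={memory}",           # cap on RAM used while sorting
        f"--temporary-directory={tmp_dir}",  # where spill files are written
        "--field-separator=\t",              # SAM columns are tab-separated
        "--key=3,3",                         # column 3: reference name (assumed key)
        "--key=4,4n",                        # column 4: position, numeric (assumed key)
    ]


def split_header(lines):
    """Separate SAM header lines (starting with '@') from alignment lines.

    The header must stay on top, so only the alignment lines should be sorted.
    """
    header, body = [], []
    for line in lines:
        (header if line.startswith("@") else body).append(line)
    return header, body
```

The argv list can be handed to `subprocess.run` with the alignment lines on standard input; the header lines are written out first, unsorted.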

Peter Cock

29 Mar

11:39 a.m.

On Tue, Mar 29, 2011 at 1:31 AM, Assaf Gordon <gordon@cshl.edu> wrote:

...

Hello all,

We're developing alternative bowtie tools that more closely suit our needs, and we're happy to share them (and get comments).

The main differences are: 1. separate tools for paired-end and single-end

Sounds sensible to me.

...

2. The tools accept FASTA and FASTQ in both Sanger and Illumina formats (no more need for grooming). Illumina is the default for newly uploaded FASTQ files.

I think that's a bad idea. Use Sanger FASTQ as the default to be consistent with the rest of Galaxy; with CASAVA 1.8, Illumina machines will produce Sanger FASTQ too. See: http://seqanswers.com/forums/showthread.php?t=8895

Peter

Assaf Gordon

4:25 p.m.

Hi Peter,

Peter Cock wrote, On 03/29/2011 05:39 AM:

...

...
2. The tools accept FASTA and FASTQ in both Sanger and Illumina formats (no more need for grooming). Illumina is the default for newly uploaded FASTQ files.

I think that's a bad idea - use Sanger FASTQ as the default to be consistent with the rest of Galaxy, and also with CASAVA 1.8 Illumina machines will produce that too, see: http://seqanswers.com/forums/showthread.php?t=8895

Thanks for the link - very interesting read, I wasn't aware of it.

However, for our local Galaxy server, I'm sticking with the Illumina scale until I see real samples with phred-33 in the wild.

The defaults can be easily changed (in the XML file, simply assume a different scale when the extension is "fastq"), or you can refuse "fastq" altogether and force the user to change the format to either "fastqillumina" or "fastqsanger".

I'll explain my reasoning: we (at our lab) deal mostly with Illumina FASTQ files, with the Illumina scale. I'm trying to make life as easy as possible for our users. When they upload a FASTQ file, it is by default an Illumina FASTQ file, and I want them to be able to use a workflow on it immediately. All of our internal tools assume the Illumina scale.

The one time I tried to make the built-in Bowtie tool available, I got complaints about "why doesn't my FASTQ file appear in the input list" - because it was "fastq" and not "fastqsanger" after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).

When CASAVA 1.8 is ready (that is, when it is actually running in our sequencing center), then we'll have to deal with it. Ideally, Galaxy will have some metadata code that scans the first 1,000,000 lines and heuristically detects which scale it is. I'm not leaving this choice to the users, because they will make the wrong choice and then come crying back.

Just my two cents,
-gordon
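The heuristic detection wished for above could look roughly like the following Python sketch. It is not Galaxy code; the function name and thresholds are illustrative assumptions. It relies on the usual ASCII ranges: Sanger (phred+33) quality characters start at '!' (33), Illumina 1.3+ (phred+64) characters start at '@' (64), and older Solexa files can go as low as ';' (59). Anything in between is reported as ambiguous rather than guessed.

```python
from itertools import islice

SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64
SOLEXA_MIN_CHAR = 59  # ';' is the lowest character in old Solexa files


def guess_fastq_offset(lines, max_lines=1_000_000):
    """Guess the quality offset (33 or 64) from the first max_lines lines.

    Assumes exactly four lines per read. Returns None when the data is
    empty or the lowest quality character does not settle the question.
    """
    lowest = None
    for index, line in enumerate(islice(lines, max_lines)):
        if index % 4 != 3:  # only every fourth line holds quality scores
            continue
        quality = line.rstrip("\r\n")
        if not quality:
            continue
        low = min(map(ord, quality))
        lowest = low if lowest is None else min(lowest, low)

    if lowest is None:
        return None
    if lowest < SOLEXA_MIN_CHAR:
        return SANGER_OFFSET    # characters this low only occur with phred+33
    if lowest >= ILLUMINA_OFFSET:
        return ILLUMINA_OFFSET  # likely phred+64, though not strictly proven
    return None                 # 59..63: Solexa or Sanger, cannot tell
```

An open file handle can be passed directly as `lines`. Note that a result of 64 is only a strong hint: a Sanger file in which every base scores very high would look the same.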

Daniel Blankenberg

4:55 p.m.

Hi Assaf,

Just a quick note that the standard bowtie tool in Galaxy was enhanced in changeset 5157:7a9476924daf to work on 'fastqillumina' and 'fastqsolexa' variants in addition to the already possible 'fastqsanger'.

In general, it is not a good idea to have a tool accept dataset.ext=='fastq' unless it doesn't care about quality scores, or it determines the correct offset/scale itself, or the variant type is declared by the user in the tool interface. When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).

...

The one time I tried to make the built-in Bowtie tool available, I got complaints about "why doesn't my FASTQ file appear in the input list" - because it was "fastq" and not "fastqsanger" after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).

It should not be possible to have a data.ext=='fastq' after Grooming (unless manually changed by a user); please report the steps that lead to this.

Thanks,
Dan
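The point about not accepting a bare 'fastq' extension can be illustrated with a small Python sketch that maps a Galaxy datatype extension to a bowtie quality option and refuses anything ambiguous. This is not the code of the Galaxy wrapper or of the changeset mentioned above; the dictionary and function names are hypothetical. The three options shown (`--phred33-quals`, `--phred64-quals`, `--solexa-quals`) are bowtie command-line options.

```python
# Hypothetical mapping from Galaxy FASTQ variants to bowtie quality options.
BOWTIE_QUALITY_FLAGS = {
    "fastqsanger": "--phred33-quals",
    "fastqillumina": "--phred64-quals",
    "fastqsolexa": "--solexa-quals",
}


def quality_flag_for(ext):
    """Return the bowtie quality option for a dataset extension.

    A plain 'fastq' extension is rejected because it does not say which
    offset or scale the quality scores use.
    """
    try:
        return BOWTIE_QUALITY_FLAGS[ext]
    except KeyError:
        raise ValueError(
            f"cannot choose a quality scale for extension {ext!r}"
        ) from None
```

A wrapper that prefers a site-wide default, as discussed in this thread, would add a 'fastq' entry to the mapping instead of raising an error.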

Assaf Gordon

5:46 p.m.

Hi Dan,

Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:

...

When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).

I'm not using the "get data" tool, we have our own import tools (uploading huge files with HTTP is not stable enough for me). You're right that I should change the format of this tool from 'fastq' to 'fastqillumina' (but the tool pre-dated all those built-in formats in galaxy, so I never bothered to update it...).

...

...
The one time I tried to make the built-in Bowtie tool available, I got complaints about "why doesn't my FASTQ file appear in the input list" - because it was "fastq" and not "fastqsanger" after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).

...

It should not be possible to have a data.ext=='fastq' after Grooming (unless manually changed by a user), please report the steps that lead to this.

Sorry, I didn't explain myself correctly: I forbid users from grooming anything (just joking, but I really, really discourage its use), so all the datasets are 'fastq' and not 'fastqsanger'. There is no bug - the groomer is simply not used.

As stated above, I should change the output format from 'fastq' to 'fastqillumina' (but up until this recent changeset it wouldn't have made any difference, because the bowtie tool would not have accepted fastqillumina).

-gordon

Peter Cock

6:08 p.m.

On Tue, Mar 29, 2011 at 4:46 PM, Assaf Gordon <gordon@cshl.edu> wrote:

...

Hi Dan,

Daniel Blankenberg wrote, On 03/29/2011 10:55 AM:

...
When files are added to Galaxy, the datatype can be directly set to any of the fastq variants (e.g. fastqillumina), which removes the requirement of grooming (but should only be done when users know what they are doing).

I'm not using the "get data" tool, we have our own import tools (uploading huge files with HTTP is not stable enough for me). You're right that I should change the format of this tool from 'fastq' to 'fastqillumina' (but the tool pre-dated all those built-in formats in galaxy, so I never bothered to update it...).

Why not do the Illumina to Sanger conversion as part of your pipeline that gets the data into Galaxy (and mark the files as fastqsanger)? As Glen said, with a C tool that isn't really so slow. That future-proofs you for the pending Illumina CASAVA 1.8 release, and means you don't need to maintain divergent Bowtie wrappers for Galaxy.

Peter
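For reference, the Illumina to Sanger conversion suggested here is a fixed shift of each quality character, since Illumina 1.3+ (phred+64) and Sanger (phred+33) use the same PHRED scale with different ASCII offsets. The Python sketch below shows the logic only; it is neither the Galaxy groomer nor Glen's C tool, the function names are made up for illustration, and it assumes exactly four lines per read. Older Solexa-scaled files need a different formula and are not handled.

```python
ILLUMINA_OFFSET = 64
SANGER_OFFSET = 33


def illumina_to_sanger(quality):
    """Convert one phred+64 quality string to phred+33."""
    shift = ILLUMINA_OFFSET - SANGER_OFFSET  # 31
    converted = []
    for char in quality:
        code = ord(char)
        if not ILLUMINA_OFFSET <= code <= 126:
            raise ValueError(f"{char!r} is not a valid phred+64 character")
        converted.append(chr(code - shift))
    return "".join(converted)


def convert_records(lines):
    """Yield FASTQ lines (without newlines), converting every fourth line."""
    for index, line in enumerate(lines):
        text = line.rstrip("\r\n")
        yield illumina_to_sanger(text) if index % 4 == 3 else text
```

For example, `illumina_to_sanger("@h")` gives `"!I"`: '@' (64) becomes '!' (33) and 'h' (104) becomes 'I' (73).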

Daniel Blankenberg

6:41 p.m.

The Grooming step is currently very time consuming and can be quite wasteful in disk space if the source and target fastq files are the same, but I have seen many occasions where Grooming has 'saved the day', e.g. by detecting truncated files that may have gone undetected by downstream tools or by indicating to the user that the variant they had selected as the source was incorrect.

However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.

Thanks,
Dan
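The 'check only' idea described above can be sketched in a few lines of Python. This is not the Groomer and not a proposed implementation, only an illustration of what a naive parser under those assumptions (exactly four lines per read, ASCII scores, no rewriting of the file) might verify. The function name and error messages are invented for the example.

```python
def check_fastq(lines):
    """Validate a FASTQ stream without rewriting it; return the read count.

    Assumes exactly four lines per read. Raises ValueError on the first
    malformed or truncated record.
    """
    count = 0
    record = []
    for number, line in enumerate(lines, start=1):
        record.append(line.rstrip("\r\n"))
        if len(record) < 4:
            continue
        title, sequence, plus, quality = record
        start = number - 3  # line number where this record began
        if not title.startswith("@"):
            raise ValueError(f"line {start}: title does not start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"line {start + 2}: separator does not start with '+'")
        if len(sequence) != len(quality):
            raise ValueError(f"line {number}: sequence/quality length mismatch")
        if any(not 33 <= ord(char) <= 126 for char in quality):
            raise ValueError(f"line {number}: non-printable quality character")
        count += 1
        record = []
    if record:
        raise ValueError("file ends in the middle of a record (truncated?)")
    return count
```

A truncated upload, the failure case mentioned in the message, shows up here either as a length mismatch or as an incomplete final record.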

Assaf Gordon

7:48 p.m.

Dan and Peter,

Peter Cock wrote, On 03/29/2011 12:08 PM:

...

Why not do the Illumina to Sanger conversion as part of your pipeline that gets the data into Galaxy (and mark the files as fastqsanger)? As Glen said, with a C tool that isn't really so slow. That future proofs you for the pending Illumina CASAVA 1.8 release, and means you don't need to maintain divergent Bowtie wrappers for Galaxy.

I refuse to groom on general principle. The idea itself is unreasonable - all the tools support the Illumina scale natively. I'm not going to waste my disk space and users' time (and SGE time) by grooming. When I see CASAVA 1.8 running, I'll switch (as software people, we know that there's a gap between the planning document and the real software). Note that even in that CASAVA 1.8 document they mention that the export files will still be in Illumina format, so it won't be completely gone.

Daniel Blankenberg wrote, On 03/29/2011 12:41 PM:

...

The Grooming step is currently very time consuming and can be quite wasteful in disk space if the source and target fastq files are the same.

It is wasteful in any case, not just if they are the same...

...

but I have seen many occasions where Grooming has 'saved the day' by e.g. detecting truncated files that may have gone undetected by downstream tools or by indicating to the user that the variant they had selected as the source was incorrect.

I would humbly guess that most of those truncated files are due to problematic HTTP uploads - so it saves the day from another problem, which should be avoided altogether.

...

However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.

I know you (the Galaxy team) try very hard to have everything in native Python (for easy deployment), but I still hold the opinion that these tools should not be written in Python. No matter how much you minimize the processing, it will not be as efficient as a good compiled program. Python (or Perl, I don't discriminate) can probably do this entire "check only mode" in just a few lines of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not practical.

Bottom line - I wouldn't use a Python "checker" anyhow.

-gordon

James Taylor

30 Mar

1:12 a.m.

...

I would humbly guess that most of those truncated files are due to problematic HTTP uploads - so it saves the day from another problem, which should be avoided altogether.

Maybe most, but definitely not all. We see all kinds of strange corruption.

...

...
However, I have been thinking about adding a 'check only' option to the Groomer that would use a naive parser (assume exactly 4 lines to a read, ascii scores, require input variant==output variant, etc.) and reuse the underlying original dataset file as the output (without writing over the file). This would be significantly faster and not waste disk space, but it would require enhancements to the framework.

I know you (the Galaxy team) try very hard to have everything in native Python (for easy deployment), but I still hold the opinion that these tools should not be written in Python. No matter how much you minimize the processing, it will not be as efficient as a good compiled program. Python (or Perl, I don't discriminate) can probably do this entire "check only mode" in just a few lines of regexes - but try it on twenty 14GB FASTQ files and you'll realize it's not practical.

Bottom line - I wouldn't use a python "checker" anyhow.

We care more about easy deployment than language. If you have a nice C function that can do this, wrapping it in Cython and packaging it is trivial and adds minimal overhead.

Glen Beane

29 Mar

5:02 p.m.

On Mar 29, 2011, at 10:25 AM, Assaf Gordon wrote:

...

Hi Peter,

Peter Cock wrote, On 03/29/2011 05:39 AM:

...
...
2. The tools accept FASTA and FASTQ in both Sanger and Illumina formats (no more need for grooming). Illumina is the default for newly uploaded FASTQ files.

I think that's a bad idea - use Sanger FASTQ as the default to be consistent with the rest of Galaxy, and also with CASAVA 1.8 Illumina machines will produce that too, see: http://seqanswers.com/forums/showthread.php?t=8895

Thanks for the link - very interesting read, I wasn't aware of it.

However, for our local Galaxy server - I'm sticking with Illumina scale until I see real samples with phred-33 in the wild.

The defaults can be easily changed (in the XML file, simply assume a different scale when the extension is "fastq"), or don't accept "fastq" at all and force the user to change the format to either "fastqillumina" or "fastqsanger".

I'll explain my reasoning: We (at our lab) deal mostly with Illumina FASTQ files, with the Illumina scale. I'm trying to make life as easy as possible for our users. When they upload a FASTQ file, it is by default an Illumina FASTQ file, I want them to be able to use a workflow on it immediately. All of our internal tools assume Illumina scale.

The one time I tried to make the built-in Bowtie tool available, I got complaints about "why doesn't my FASTQ file appear in the input list" - because it was "fastq" and not "fastqsanger" after grooming. This is a silly technical step that should not be a concern to users, so I'm taking it out of the equation here (not to mention that grooming two 14GB FASTQ files for every lane is a huge waste of space and time).

We've gotten into the habit of grooming everything (all of our files are also Illumina FASTQ files), so I'm looking forward to the change. I definitely share the concern about the space wasted by essentially having two copies of the same data in Galaxy. We had looked into making Illumina the default for our local instance of Galaxy, but in the end we stuck with Sanger (although we have talked about "pregrooming" files coming off the sequencer).

The wasted time was annoying, so I wrote my own groomer in C that could groom one of our FASTQ files in about 5 minutes. If you ran several (6-12) of our custom grooming jobs at the same time on the same node, the run time would jump to ~50 minutes for a 17GB FASTQ file due to IO wait, but that was still much faster than the built-in groomer.

--
Glen L. Beane
Senior Software Engineer
The Jackson Laboratory
(207) 288-6153

Ryan Golhar

6:26 p.m.

...

Note about multithreaded bowtie: currently the tools use 10 threads (hard-coded in the XML files) - easily changeable.

If possible, have the user indicate as a parameter how many threads they wish to use.

Assaf Gordon

7:35 p.m.

Ryan Golhar wrote, On 03/29/2011 12:26 PM:

...

...
Note about multithreaded bowtie: currently the tools use 10 threads (hard-coded in the XML files) - easily changeable.

If possible, have the user indicate as a parameter how many threads they wish to use.

This will never happen, at least not in my tools: the nice user who will kindly use a single thread (or just two or three) has not yet been born :) All users will put the maximum value (or higher) and will choke up the system.

What's more, the underlying submission is using SGE and there's currently no way to dynamically change the number of slots a tool is using. So hard-coded 10 (or other value) it is.

If you do want to allow users to set the number of threads, simply change the first parameter in the XML file from the hard-coded 10 to a Galaxy <param>.

-gordon
