Tophat non Sanger input - galaxy-dev - lists.galaxyproject.org


Tophat non Sanger input


Stephen Taylor

31 Aug 2011

3:45 a.m.

Hi,

Are there any plans to enhance the tophat wrapper to accept non-Sanger fastqs, as was done for bowtie?

https://bitbucket.org/galaxy/galaxy-central/changeset/7a9476924daf

Kind regards and thanks,
Steve


Edward Kirton

7 Sep 2011

3:22 p.m.

Seems unnecessary, since Illumina has switched over to fastqsanger now.

http://www.illumina.com/truseq/quality_101/quality_scores.ilmn

___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/


Stephen Taylor

8 Sep 2011

3:47 a.m.

Eventually... Unfortunately, we still get a lot of fastqillumina :-(

Steve


Hans-Rudolf Hotz

9:17 a.m.


I might be missing your point... but why can't you use the FASTQ Groomer tool?

Regards,
Hans



Stephen Taylor

10:14 a.m.


- Duplication of data (disk space usage)
- Groomer is slow and puts more demands on CPU usage where it can be done easily on the fly by tophat
- Consistency (bowtie does it)

From the responses (or lack of :-)) we've been spurred on to change the wrapper. If there is interest we will commit it to the code base when done.

Cheers,
Steve
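For readers unfamiliar with what "on the fly" conversion involves: re-encoding Phred+64 qualities (Galaxy's fastqillumina) as Phred+33 (fastqsanger) is a fixed per-character shift. The following is only a minimal Python sketch of that shift, not the wrapper change discussed in this thread. The function names are made up for illustration, and it assumes plain four-line records with Phred-scaled scores (older Solexa-scaled data needs a different formula).

```python
ILLUMINA_OFFSET = 64  # Phred+64, i.e. "fastqillumina"
SANGER_OFFSET = 33    # Phred+33, i.e. "fastqsanger"


def illumina_to_sanger(quality):
    """Re-encode one Phred+64 quality string as Phred+33."""
    converted = []
    for char in quality:
        score = ord(char) - ILLUMINA_OFFSET
        # Phred+64 characters run from '@' (score 0) to '~' (score 62).
        if not 0 <= score <= 62:
            raise ValueError(f"{char!r} is outside the Phred+64 range")
        converted.append(chr(score + SANGER_OFFSET))
    return "".join(converted)


def convert_records(lines):
    """Yield FASTQ lines, re-encoding every fourth line (the qualities).

    Assumes each record occupies exactly four lines (hypothetical
    simplification; wrapped sequences are not handled).
    """
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        yield illumina_to_sanger(text) if index % 4 == 3 else text
```

Because the shift is this cheap, doing it while streaming reads avoids storing a second copy of the file, which is the disk-space point made above. What it does not do is validate the file.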


Anton Nekrutenko

10:30 a.m.

Dear Stephen (and others):

The sole reason for requiring fastq-sanger input to all of our wrappers was to force users to run their data through the groomer. It is slow, but it checks data consistency in a way that is more robust than just checking 'four lines per fastq block' and prevents a lot of problems downstream. Here on Galaxy @ Penn State we see a lot of fastq files edited in MS Word and other similar horrors, which are caught by the groomer, preventing users from running into problems later on (and so cutting down on the support overhead: investigating why the groomer has failed is a lot easier than researching why a particular set of polymorphisms derived from a Word-edited fastq file clusters Ukrainians with parasitic worms). In addition, even though Illumina did switch to Sanger encoding, there is still a lot of old data out there.

However, we are open to suggestions ... What we are thinking of lately is switching to unaligned BAM for everything. One of the benefits here is the ability to add readgroups from day 1, simplifying multisample analyses down the road.

a.

Anton Nekrutenko
http://galaxyproject.org
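To make the "more robust than four lines per block" point concrete, here is a rough Python sketch of the kind of per-record checks a validator can run. It is an illustration only and does not reproduce what the Galaxy groomer actually does. The function name, the messages, and the assumption of four-line records with printable-ASCII qualities are all hypothetical.

```python
def check_fastq(lines, offset=33):
    """Return a list of problems found in four-line FASTQ records.

    `offset` is the lowest valid quality character code
    (33 for Sanger, 64 for Illumina 1.3+).
    """
    rows = [line.rstrip("\r\n") for line in lines]
    problems = []
    if len(rows) % 4:
        problems.append("line count is not a multiple of four")
    # Only whole records are inspected; a trailing partial record
    # has already been reported above.
    for start in range(0, len(rows) - len(rows) % 4, 4):
        header, sequence, separator, quality = rows[start:start + 4]
        record = start // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {record}: header does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record}: third line does not start with '+'")
        if len(sequence) != len(quality):
            problems.append(f"record {record}: sequence and quality length differ")
        if any(not offset <= ord(char) <= 126 for char in quality):
            problems.append(f"record {record}: quality character out of range")
    return problems
```

A file mangled by a word processor (smart quotes, stray carriage returns inside a record, dropped lines) tends to trip at least one of these checks early, which is easier to diagnose than a strange result several tools downstream.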


Whyte, Jeffrey

11:33 a.m.

Anton,

If a user is running a multi-core machine, a simple method to speed up FASTQ-Groomer is to first split the original FASTQ file (e.g. into 10 smaller files if you've got a 12-core machine), run FASTQ-Groomer on each file concurrently, then join the 10 files back together. This allows for parallel processing of the FASTQ file, rather than having FASTQ-Groomer slug its way through with one processor while the other cores sit idle.

I wrote a simple bash script that uses "split" and "cat" to automate the process. A file that would take two hours for FASTQ-Groomer now takes just over 10 min. As a double check, I use the FastQC program written by Simon Andrews to verify that the input FASTQ-illumina and output FASTQ-Sanger files are identical.

Right now, I run the script from the command line before putting my file into the Galaxy pipeline. I'm sure there are more "refined" ways to handle this with python, but it gets the job done.

Thumbs up for FASTQ-groomer.

Jeff
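Jeff's bash script is not included in the thread. As a rough idea of the same split, process, and join pattern in Python, the sketch below may help. It is not his script: all names are invented for illustration, `worker` is a hypothetical stand-in for whatever grooming step is applied to each chunk, and records are assumed to be exactly four lines long so that chunks never cut a record in half.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

LINES_PER_RECORD = 4  # header, sequence, '+' separator, qualities


def chunk_records(lines, records_per_chunk):
    """Yield lists of lines, each holding whole FASTQ records only."""
    iterator = iter(lines)
    size = records_per_chunk * LINES_PER_RECORD
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        if len(chunk) % LINES_PER_RECORD:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


def process_in_parallel(lines, worker, records_per_chunk=250_000,
                        executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply worker(chunk) to each chunk concurrently and join the results.

    With the default process pool, `worker` must be a picklable
    top-level function that takes and returns a list of lines.
    """
    chunks = chunk_records(lines, records_per_chunk)
    with executor_cls(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so the joined
        # output has the same record order as a single sequential pass.
        results = pool.map(worker, chunks)
        return [line for chunk in results for line in chunk]
```

The important detail, in either bash or Python, is that the split size is a multiple of four lines. This version holds every chunk in memory, so for very large files writing each chunk to a temporary file (as the "split" and "cat" approach does) is the more practical route.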
