selecting reads at random from fastq file
Hi,
I am curious if anyone knows how to select random lines from a fastq file. There is a select random lines tool in text manipulation tools, but it does not treat fastq files specifically, so it will not group quality lines with sequence lines. And if I turn the fastq file to tabular form in order to select lines, I can no longer return it to fastq form. Anyone know a way to do this in galaxy? Otherwise, perhaps another program? Thanks.
Austin
On Tue, Nov 8, 2011 at 9:57 PM, Austin Paul <austinpa@usc.edu> wrote:
Hi,
I am curious if anyone knows how to select random lines from a fastq file. There is a select random lines tool in text manipulation tools, but it does not treat fastq files specifically, so it will not group quality lines with sequence lines. And if I turn the fastq file to tabular form in order to select lines, I can no longer return it to fastq form. Anyone know a way to do this in galaxy? Otherwise, perhaps another program? Thanks.
Austin
How big are your FASTQ files (can they be indexed in memory)? And are you willing to program? If you like Python, Biopython's Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions would let you do this easily. Have a look at the "Getting the raw data for a record" example in the tutorial, and please ask if you would like a little more help:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
Regards,
Peter
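As a rough illustration of the indexing idea above, here is a minimal standard-library sketch. It is not Biopython and not how Bio.SeqIO.index() is implemented; it only shows the general pattern of recording where each record starts and then copying a random subset. It assumes strict 4-line FASTQ records and unique read names, and the function names `build_offset_index` and `sample_reads` are made up for this example.

```python
import random


def build_offset_index(path):
    """Map each read name to the byte offset where its record starts.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines)
    and unique read names; duplicate names would overwrite each other.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header.strip():
                break  # end of file (or a trailing blank line)
            # Read name: header without the leading "@", up to first whitespace.
            name = header[1:].split()[0].decode("ascii")
            index[name] = offset
            for _ in range(3):  # skip sequence, "+" line, and quality line
                handle.readline()
    return index


def sample_reads(in_path, out_path, n, seed=None):
    """Write n randomly chosen records to out_path, in original file order."""
    index = build_offset_index(in_path)
    rng = random.Random(seed)
    chosen = rng.sample(sorted(index), n)
    offsets = sorted(index[name] for name in chosen)
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for offset in offsets:
            src.seek(offset)
            for _ in range(4):  # copy the raw record unchanged
                dst.write(src.readline())
    return len(offsets)
```

Copying the raw bytes keeps the quality line attached to its sequence line, which is the property the generic "select random lines" tool lacks. Note that this index lives entirely in memory, so it may not suit very large files.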
Hi Peter,
Thanks for the suggestion. For example, I have a fastq file with 50 million reads and I want to randomly select 5 million of them. It seems biopython would very easily select a single or a handful of reads with the Bio.SeqIO.index() function. Would it also be able to do the job I am interested in?
Austin
On Tue, Nov 8, 2011 at 2:07 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
How big are your FASTQ files (can they be indexed in memory)?
And are you willing to program? If you like Python, Biopython's Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions would let you do this easily. Have a look at the "Getting the raw data for a record" example in the tutorial, and please ask if you would like a little more help:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
Regards,
Peter
On Tue, Nov 8, 2011 at 10:26 PM, Austin Paul <austinpa@usc.edu> wrote:
Hi Peter,
Thanks for the suggestion. For example, I have a fastq file with 50 million reads and I want to randomly select 5 million of them. It seems biopython would very easily select a single or a handful of reads with the Bio.SeqIO.index() function. Would it also be able to do the job I am interested in?
Austin
I think so, but you'd have to use Bio.SeqIO.index_db(), which stores the index in an SQLite database on disk. Holding the index in memory, as Bio.SeqIO.index() does, isn't really viable here (unless you have a 64-bit machine with a lot of memory?). I don't think I've tried it with quite that many reads though...
Alternatively, if I understood her correctly, Jennifer pointed out you can do this in Galaxy but it will take a lot of IO:
1. Convert FASTQ to tabular (4 lines per record -> 1 line per record)
2. Randomly select lines (each line is now a record so safe)
3. Convert tabular back to FASTQ
It should work though, and requires no additional programming.
Peter
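The three Galaxy steps can be sketched outside Galaxy as follows. This is only an illustration of the idea, assuming strict 4-line records and no tab characters inside any line; the four-column layout and the function names are choices made for this example and are not necessarily what Galaxy's converters produce.

```python
import random


def fastq_to_tabular(fastq_lines):
    """Step 1: collapse each 4-line record into one tab-separated row."""
    stripped = [line.rstrip("\n") for line in fastq_lines]
    return ["\t".join(stripped[i:i + 4]) for i in range(0, len(stripped) - 3, 4)]


def select_random_rows(rows, n, seed=None):
    """Step 2: pick n rows at random; each row is a whole record."""
    return random.Random(seed).sample(rows, n)


def tabular_to_fastq(rows):
    """Step 3: expand each row back into its 4 FASTQ lines."""
    out = []
    for row in rows:
        out.extend(row.split("\t"))
    return out
```

Because the random selection in step 2 only ever sees whole records, the sequence and quality lines cannot be separated. This sketch holds everything in memory, so it is meant for small inputs rather than a 50 million read file.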
Hi Paul, Hi Peter
You might also wanna look at the 'FastqSampler' function in the Bioconductor 'ShortRead' package
http://bioconductor.org/packages/release/bioc/html/ShortRead.html
We are working (as part of our NGS pipeline redesign) on adding more Bioconductor functionalities to Galaxy. Unfortunately, it is very low on my pile of stuff to do, so it will take a while till it appears in the 'Tool Shed'.
Regards, Hans
On 11/08/2011 11:45 PM, Peter Cock wrote:
I think so, but you'd have to use Bio.SeqIO.index_db(), which stores the index in an SQLite database on disk. Holding the index in memory, as Bio.SeqIO.index() does, isn't really viable here (unless you have a 64-bit machine with a lot of memory?). I don't think I've tried it with quite that many reads though...
Alternatively, if I understood her correctly, Jennifer pointed out you can do this in Galaxy but it will take a lot of IO:
1. Convert FASTQ to tabular (4 lines per record -> 1 line per record)
2. Randomly select lines (each line is now a record so safe)
3. Convert tabular back to FASTQ
It should work though, and requires no additional programming.
Peter
___________________________________________________________ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Hi,
This may be a bit dumb or missing the point but just selecting the first 5 million is kind of random isn't it? I mean where the reads map and what they are from is not known to you and they were not collected by the sequencer in a manner that is influenced by the nature of the sample?
Best Wishes, David.
__________________________________
Dr David A. Matthews
Senior Lecturer in Virology
Room E49
Department of Cellular and Molecular Medicine, School of Medical Sciences
University Walk, University of Bristol
Bristol. BS8 1TD U.K.
Tel. +44 117 3312058
Fax. +44 117 3312091
D.A.Matthews@bristol.ac.uk
On 9 Nov 2011, at 09:44, Hans-Rudolf Hotz wrote:
Hi Paul, Hi Peter
You might also wanna look at the 'FastqSampler' function in the Bioconductor 'ShortRead' package http://bioconductor.org/packages/release/bioc/html/ShortRead.html
We are working (as part of our NGS pipeline redesign) on adding more Bioconductor functionalities to Galaxy. Unfortunately, it is very low on my pile of stuff to do, so it will take a while till it appears in the 'Tool Shed'.
Regards, Hans
David, in my experience with Illumina sequencing, it looks like the reads at the start of a file have a much higher sequencing error rate.
Bob H
On Nov 9, 2011, at 4:52 AM, David Matthews wrote:
Hi,
This may be a bit dumb or missing the point but just selecting the first 5 million is kind of random isn't it? I mean where the reads map and what they are from is not known to you and they were not collected by the sequencer in a manner that is influenced by the nature of the sample?
Best Wishes, David.
On Wed, Nov 9, 2011 at 12:03 PM, Bob Harris <rsharris@bx.psu.edu> wrote:
David, in my experience with Illumina sequencing, it looks like the reads at the start of a file have a much higher sequencing error rate. Bob H
Yes, reads at the start and the end of the file come from the edge of the Illumina slide, and tend to be of poorer quality than the reads from the middle. So depending on the purpose in mind, picking 5 million reads from the middle of the file might be fine (and much easier computationally).
Peter
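Taking a block of reads from the middle of the file needs no index and no random numbers, only a single pass. A minimal sketch, assuming strict 4-line FASTQ records; the function name `take_middle_records` and its parameters are hypothetical.

```python
from itertools import islice


def take_middle_records(in_path, out_path, skip, count):
    """Copy `count` records to out_path after skipping the first `skip` records.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    Returns the number of complete records written.
    """
    written_lines = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        # Each record is 4 lines, so record positions map to line positions.
        for line in islice(src, skip * 4, (skip + count) * 4):
            dst.write(line)
            written_lines += 1
    return written_lines // 4
```

For the file discussed in this thread (50 million reads, 5 million wanted), a centred block would correspond to `skip=22_500_000` and `count=5_000_000`. The result is a contiguous slice rather than a random sample, so whether it is acceptable depends on the purpose in mind.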
To the best of my knowledge, reads at the start of a SOLiD data set also have a higher error rate. I think it might also be due to an edge effect.
On 9 November 2011 21:49, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Yes, reads at the start and the end of the file come from the edge of the Illumina slide, and tend to be of poorer quality than the reads from the middle.
Hello Austin,
You have the correct method to do this all in Galaxy. Use the tool "NGS: QC and manipulation -> Tabular to FASTQ converter" to do the final step.
Hopefully this helps,
Jen
Galaxy team
On 11/8/11 1:57 PM, Austin Paul wrote:
Hi,
I am curious if anyone knows how to select random lines from a fastq file. There is a select random lines tool in text manipulation tools, but it does not treat fastq files specifically, so it will not group quality lines with sequence lines. And if I turn the fastq file to tabular form in order to select lines, I can no longer return it to fastq form. Anyone know a way to do this in galaxy? Otherwise, perhaps another program? Thanks.
Austin
-- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org/wiki/Support
participants (7)
- Austin Paul
- Bob Harris
- David Matthews
- Hans-Rudolf Hotz
- Jennifer Jackson
- Kevin Lam
- Peter Cock