Hi Peter,

Thanks for the suggestion.  As a concrete example, I have a FASTQ file with 50 million reads and I want to randomly select 5 million of them.  It seems Biopython's Bio.SeqIO.index() function makes it easy to pull out a single read or a handful of reads.  Would it also be able to handle a selection of this size?

Austin

On Tue, Nov 8, 2011 at 2:07 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Nov 8, 2011 at 9:57 PM, Austin Paul <austinpa@usc.edu> wrote:
> Hi,
>
> I am curious if anyone knows how to select random lines from a fastq file.
> There is a select random lines tool in text manipulation tools, but it does
> not treat fastq files specifically, so it will not group quality lines with
> sequence lines.  And if I turn the fastq file to tabular form in order to
> select lines, I can no longer return it to fastq form.  Anyone know a way to
> do this in galaxy?  Otherwise, perhaps another program?  Thanks.
>
> Austin

How big are your FASTQ files (can they be indexed in memory)?

And are you willing to program? If you like Python, Biopython's
Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions would
let you do this easily. Have a look at the "Getting the raw data
for a record" example in the tutorial, and please ask if you would
like a little more help:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
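
To give a feel for the shape of the task, here is a rough sketch in plain
Python (standard library only, so this is not the Biopython index approach
itself). It assumes every record is exactly four lines, i.e. the sequence and
quality strings are not line-wrapped; the function name sample_fastq and its
parameters are made up for illustration. It reads the file twice: once to
count records, once to copy the randomly chosen ones, so sequence and quality
lines always stay together.

```python
import random


def sample_fastq(in_path, out_path, k, seed=None):
    """Copy k randomly chosen FASTQ records from in_path to out_path.

    Assumption: each record is exactly four lines (no wrapped sequence
    or quality lines). Records are written in their original file order.
    """
    rng = random.Random(seed)

    # Pass 1: count lines to learn how many records the file holds.
    with open(in_path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    n_records = n_lines // 4
    if k > n_records:
        raise ValueError("asked for more records than the file contains")

    # Pick k distinct record numbers (sampling without replacement).
    keep = set(rng.sample(range(n_records), k))

    # Pass 2: read four lines at a time and copy only the chosen records.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for record_number in range(n_records):
            record = [src.readline() for _ in range(4)]
            if record_number in keep:
                dst.writelines(record)
    return k
```

With hypothetical file names, the 5 million from 50 million case would be
sample_fastq("reads.fastq", "subset.fastq", 5000000). The set of chosen
record numbers is held in memory, which should be manageable at that scale
but is worth keeping in mind. With a Bio.SeqIO.index(...) object the idea is
much the same: choose random keys, then write out get_raw(key) for each one.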

Regards,

Peter