Hi Peter,
Thanks for the suggestion. For example, I have a FASTQ file with 50 million reads and I want to randomly select 5 million of them. It looks like Biopython's Bio.SeqIO.index() function makes it easy to pull out a single read or a handful of reads. Would it also be able to handle a sample of this size?
Austin
How big are your FASTQ files (can they be indexed in memory)?

On Tue, Nov 8, 2011 at 9:57 PM, Austin Paul <austinpa@usc.edu> wrote:
> Hi,
>
> I am curious if anyone knows how to select random lines from a fastq file.
> There is a select random lines tool in text manipulation tools, but it does
> not treat fastq files specifically, so it will not group quality lines with
> sequence lines. And if I turn the fastq file to tabular form in order to
> select lines, I can no longer return it to fastq form. Anyone know a way to
> do this in galaxy? Otherwise, perhaps another program? Thanks.
>
> Austin
And are you willing to program? If you like Python, Biopython's
Bio.SeqIO.index(...) or Bio.SeqIO.index_db(...) functions would
let you do this easily. Have a look at the "Getting the raw data
for a record" example in the tutorial, and please ask if you would
like a little more help:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
Regards,
Peter
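
For the 5-million-out-of-50-million case raised in this thread, here is a minimal sketch of one possible approach that uses only the Python standard library. It is not the Bio.SeqIO.index() route described above but a swapped-in technique, selection sampling: count the reads in a first pass, then in a second pass keep each read with probability (reads still needed) / (reads still remaining). The function names sample_fastq and count_reads are hypothetical, and the code assumes strict four-line FASTQ records (sequence and quality lines not wrapped).

```python
import random


def count_reads(path):
    """Count records, assuming exactly four lines per FASTQ record."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def sample_fastq(in_path, out_path, k, seed=None):
    """Copy k randomly chosen reads from in_path to out_path.

    Assumption: strict four-line records (header, sequence, '+', quality).
    Reads keep their original order and only one record is held in
    memory at a time, at the cost of reading the input file twice.
    """
    rng = random.Random(seed)
    remaining = count_reads(in_path)
    if k > remaining:
        raise ValueError("asked for more reads than the file contains")
    needed = k
    with open(in_path) as src, open(out_path, "w") as dst:
        while needed > 0:
            # One record = the next four lines, kept together.
            record = [src.readline() for _ in range(4)]
            # Keep this read with probability needed / remaining.
            if rng.random() * remaining < needed:
                dst.writelines(record)
                needed -= 1
            remaining -= 1
    return k
```

A call might look like sample_fastq("reads.fastq", "subset.fastq", 5000000), where both file names are placeholders. With Biopython, a comparable result should be reachable by passing the keys of the index to random.sample() and writing out get_raw() for each chosen key; whether the index fits in memory for 50 million reads is the question asked above, and Bio.SeqIO.index_db(...) keeps its index on disk instead.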