On Tue, Nov 8, 2011 at 10:26 PM, Austin Paul <austinpa@usc.edu> wrote:
Hi Peter,
Thanks for the suggestion. For example, I have a FASTQ file with 50 million reads and I want to randomly select 5 million of them. It seems Biopython could very easily select a single read or a handful of reads with the Bio.SeqIO.index() function. Would it also be able to do the job I am interested in?
Austin
I think so, but you'd have to use Bio.SeqIO.index_db(), which stores the index in an SQLite database on disk, rather than Bio.SeqIO.index(), which holds it in memory. An in-memory index isn't really viable here (unless you have a 64-bit machine with lots of memory?). I don't think I've tried it with quite that many reads though...

Alternatively, if I understood her correctly, Jennifer pointed out you can do this in Galaxy, but it will take a lot of IO:

1. Convert FASTQ to tabular (4 lines per record -> 1 line per record)
2. Randomly select lines (each line is now a record, so this is safe)
3. Convert tabular back to FASTQ

It should work though, and requires no additional programming.

Peter
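
P.S. If a little scripting is acceptable, the idea behind the Galaxy route (treat each 4-line record as one unit, pick units at random, write them back out as FASTQ) can be sketched in plain Python. This is only a rough sketch, not the Bio.SeqIO.index_db() approach and not what Galaxy does internally: it uses the standard library only, makes two passes over the file, and picks records with selection sampling (Knuth's Algorithm S) so that neither the reads nor an index need to fit in memory. The function names read_fastq_records and sample_fastq are made up for illustration, and it assumes plain 4-line FASTQ records with no line wrapping.

```python
import random


def read_fastq_records(handle):
    """Yield each record as a list of its four lines.

    Assumption: every record is exactly four lines (no wrapped sequences).
    """
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():
            return  # end of file, or a trailing blank line
        yield lines


def sample_fastq(in_path, out_path, n_wanted, seed=None):
    """Copy n_wanted randomly chosen records from in_path to out_path.

    Records keep their original order. Returns the number written.
    """
    rng = random.Random(seed)

    # Pass 1: count the records so the selection probability is known.
    with open(in_path) as src:
        total = sum(1 for _ in read_fastq_records(src))
    if n_wanted > total:
        raise ValueError("asked for %d records, file has %d" % (n_wanted, total))

    # Pass 2: selection sampling. Each record is kept with probability
    # (records still needed) / (records still unseen), which yields
    # exactly n_wanted records, each subset being equally likely.
    needed = n_wanted
    with open(in_path) as src, open(out_path, "w") as dst:
        for seen, record in enumerate(read_fastq_records(src)):
            if rng.random() * (total - seen) < needed:
                dst.writelines(record)
                needed -= 1
    return n_wanted - needed
```

For the case above this would be called along the lines of sample_fastq("reads.fastq", "subset.fastq", 5000000), where both filenames are placeholders. It reads the input twice, so it is IO heavy in much the same way as the Galaxy steps.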