On Fri, Nov 30, 2012 at 3:55 PM, Perumal Vijayan <peruvijayan@gmail.com> wrote:
I have successfully uploaded a large FASTA file (2.5 million genomic sequence contigs) to a Galaxy server. I wish to extract a subset of sequences from this file, and I have a list of the FASTA headers for the sequences I want. Is there a way I can accomplish this on Galaxy?
Yes, if you are running your own Galaxy instance, you could use one of these two tools, available on the Galaxy Tool Shed:

http://toolshed.g2.bx.psu.edu/

'seq_filter_by_id' - returns a filtered version of the sequence file with only those entries on your list. This can output two files, those on the list and those not on the list, or just the sequences not on the list.

'seq_select_by_id' - like the above, but indexes the sequence file in order to extract the requested entries in the order given (rather than the order in the sequence file).

Both of these tools work on FASTA, FASTQ and SFF files.

Peter
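
For readers who want to see the difference between the two behaviours described above (file order versus list order), here is a minimal standalone Python sketch. It is not the code of either Tool Shed tool. The function names are hypothetical, it handles FASTA only, and it assumes that a record's ID is the first whitespace-delimited word of its header line. The list-order version keeps the matching records in memory rather than building an on-disk index, which may not suit very large selections.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence_lines) for each record in a FASTA handle."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def record_id(header):
    """Assumption: the ID is the first whitespace-delimited word of the header."""
    parts = header.split(None, 1)
    return parts[0] if parts else ""


def clean_ids(ids):
    """Normalise list entries: strip whitespace and any leading '>', keep order."""
    cleaned = (record_id(entry.strip().lstrip(">")) for entry in ids)
    return [i for i in cleaned if i]


def write_record(out, header, lines):
    out.write(">" + header + "\n")
    for seq_line in lines:
        out.write(seq_line + "\n")


def filter_fasta_by_id(fasta_in, ids, fasta_out):
    """Write the listed records in the order they appear in the FASTA file."""
    wanted = set(clean_ids(ids))
    kept = 0
    for header, lines in read_fasta(fasta_in):
        if record_id(header) in wanted:
            write_record(fasta_out, header, lines)
            kept += 1
    return kept


def select_fasta_by_id(fasta_in, ids, fasta_out):
    """Write the listed records in the order given by the ID list."""
    order = clean_ids(ids)
    wanted = set(order)
    # Only the requested records are held in memory, not the whole file.
    found = {}
    for header, lines in read_fasta(fasta_in):
        rid = record_id(header)
        if rid in wanted and rid not in found:
            found[rid] = (header, lines)
    kept = 0
    for rid in order:
        if rid in found:
            write_record(fasta_out, *found[rid])
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory demonstration with made-up contig names.
    demo = ">contig1 first\nACGT\n>contig2\nGGCC\n>contig3 third\nTTAA\n"
    out = io.StringIO()
    select_fasta_by_id(io.StringIO(demo), ["contig3", "contig1"], out)
    print(out.getvalue(), end="")
```

With real files, the in-memory handles would be replaced by ordinary open() calls on the FASTA file, the ID list, and the output file.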