Re: [galaxy-user] extracting a subset of sequences from a very large fasta file(1.5 million)

30 Nov 2012

      On Fri, Nov 30, 2012 at 3:55 PM, Perumal Vijayan <peruvijayan@gmail.com> wrote:
...
I have successfully uploaded a large fasta file (2.5 million genomic
sequence contigs) onto Galaxy server.  I wish to extract a subset of
sequences from this file.  I have a list of the fasta headers.  Is there a
way I can accomplish this on Galaxy?

Yes, if you are running your own Galaxy instance you could use
one of these two tools available on the Galaxy tool shed:
http://toolshed.g2.bx.psu.edu/

 'seq_filter_by_id' - returns a filtered version of the sequence file
with only those entries on your list. It can also output two files
(the entries on the list and the entries not on the list), or just
the entries not on the list.

 'seq_select_by_id' - like the above but indexes the sequence file
in order to extract the requested entries in the order given (rather
than the order in the sequence file).

Both of these tools work on FASTA, FASTQ and SFF files.
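
For anyone who wants to see the difference between the two approaches
outside Galaxy, here is a rough sketch in plain Python (standard library
only, FASTA only). It is not the code of either Tool Shed tool; the
function names are made up for illustration, and it assumes the
identifier is the first word after the '>' on each header line and that
the files are opened in binary mode.

```python
# Rough sketch only, NOT the implementation of the Tool Shed tools.
# Assumption: a record's identifier is the first whitespace-delimited
# word after '>' on its header line. All handles are binary-mode files.


def record_id(header_line):
    """Return the identifier from a FASTA header line given as bytes."""
    parts = header_line[1:].split(None, 1)
    return parts[0].decode() if parts else ""


def filter_by_id(bin_in, wanted_ids, bin_out):
    """Stream through the file once, keeping listed records in FILE order."""
    wanted = set(wanted_ids)
    keep = False
    for line in bin_in:
        if line.startswith(b">"):
            # A new record starts here; decide whether to keep it.
            keep = record_id(line) in wanted
        if keep:
            bin_out.write(line)


def index_offsets(bin_in):
    """Map each identifier to the (start, length) byte span of its record."""
    index = {}
    ident, start = None, 0
    while True:
        pos = bin_in.tell()
        line = bin_in.readline()
        if not line or line.startswith(b">"):
            # End of file or next header: close off the previous record.
            if ident is not None:
                index[ident] = (start, pos - start)
            if not line:
                break
            ident, start = record_id(line), pos
    return index


def select_by_id(bin_in, wanted_ids, bin_out):
    """Index the file, then write listed records in LIST order."""
    index = index_offsets(bin_in)
    for ident in wanted_ids:
        start, length = index[ident]  # raises KeyError for an unknown ID
        bin_in.seek(start)
        bin_out.write(bin_in.read(length))
```

The filtering version only needs the list of IDs in memory, so it suits
very large files; the selecting version also holds one index entry per
record, which is the price of getting the output in the requested order.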

Peter

Peter Cock