[galaxy-dev] Barcode/adapter/primer trimming/filtering of sequences/reads

3 Feb 2011

Hi all,

I'm currently working with some 454 data where the sample was
amplified with selective primers, and therefore the reads need a
little processing to remove the primer sequences before assembly
or mapping (something that sff_extract cleverly spots and warns
the user about when doing an SFF to FASTA/FASTQ conversion).

The actual processing I want to do is very similar to spotting
and removing barcodes or adapters - except that PCR primers
are often degenerate, i.e. have an N in them representing the
fact it is a pool of primers covering A, C, G and T at that point,
and primers may come in pairs.
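
To make the degenerate primer point concrete, here is a minimal
sketch (not my actual scripts) of one way to match a primer that
contains IUPAC ambiguity codes at the start of a read, by turning
it into a regular expression. The function names primer_to_regex
and trim_five_prime are hypothetical, and the sketch assumes exact
matching with no mismatches or indels and no ambiguity codes in
the read itself:

```python
import re

# IUPAC nucleotide ambiguity codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        # Unambiguous bases stay literal; ambiguous ones become a character class.
        parts.append(options if len(options) == 1 else "[%s]" % options)
    return re.compile("".join(parts))


def trim_five_prime(read, primer):
    """Return the read with a leading primer removed, or None if it is absent."""
    match = primer_to_regex(primer).match(read.upper())
    if match is None:
        return None
    # Slice the original read so its case is preserved.
    return read[match.end():]
```

For example, trim_five_prime("ACGTTTGGG", "ACNT") gives "TTGGG",
while a read that does not start with the primer gives None, which
is where the keep-or-discard choice for non-matching reads would
come in.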

Looking over the provided tools in Galaxy, the only relevant ones
I saw are as follows:

emboss_5/emboss_primersearch.xml - the text output does not
look helpful for trimming my sequences - nothing else in Galaxy
uses this format, does it?

fastx_toolkit/fastx_barcode_splitter.xml - copes with 5' or 3'
barcodes, but only handles fastqsolexa (discussed recently on the
mailing list - I guess it could handle fastqsanger and fastqillumina
as well), not FASTA or SFF. Also according to the FASTX docs for
fastx_barcode_splitter.pl it requires non-ambiguous barcodes
(i.e. ACGT only), so using it with ambiguous primers won't work:
http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

I did look on the tool shed and noticed Edward Kirton has done
some wrappers for the "Suite of Newbler tools", but his sfffile
wrapper does not (yet) include support for splitting SFF files using
Roche's MID barcodes.

Are there any other relevant tools I have overlooked?

In the meantime I've started Galaxy wrappers for my own Python
code to find and remove PCR primers, adapters, or barcodes
(basically any short sequences). These can also be used to filter
the reads (choose if non-matching reads are kept or not). However
this isn't ideal for barcodes where you'd have to run the tool once
for each barcode (or set of barcodes) to get them in a separate file.
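
For comparison, a single-pass barcode split could look roughly
like the sketch below. This is only an illustration of the idea,
not what my scripts do: the function name split_by_barcode, the
"unmatched" bin, and the barcode IDs and sequences in the usage
note are all made up, and it assumes exact 5' barcode matches
with unambiguous barcodes.

```python
from collections import defaultdict


def split_by_barcode(reads, barcodes):
    """Bin (name, sequence) reads by their 5' barcode in a single pass.

    reads is an iterable of (name, sequence) tuples and barcodes maps
    a barcode ID to its sequence. Matched reads have the barcode
    trimmed off; reads matching no barcode go in the "unmatched" bin.
    """
    bins = defaultdict(list)
    for name, seq in reads:
        for barcode_id, barcode in barcodes.items():
            if seq.upper().startswith(barcode.upper()):
                bins[barcode_id].append((name, seq[len(barcode):]))
                break
        else:
            # The inner loop finished without a break: no barcode matched.
            bins["unmatched"].append((name, seq))
    return dict(bins)
```

Calling split_by_barcode(reads, {"bc1": "ACGT", "bc2": "TTTT"})
would return one list of trimmed reads per barcode ID, which a
Galaxy wrapper would then have to write out as separate datasets.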

For specifying the barcode/primer/adapter sequences, are FASTA
or tabular files more commonly used?

Should I look at the Galaxy *.loc system to allow commonly used
things like the Roche MID barcodes to be predefined at the system level?

I am currently taking a FASTA input file for the primers. A simple
tabular file with ID and sequence in the first two columns would
be easy to add - that is what fastx_toolkit/fastx_barcode_splitter.xml
expects. EMBOSS primersearch wants a three column tabular file
with ID, forward primer, reverse primer sequence. However, so far
I am only looking at single primer analysis (our primers were pooled
so I can remove the forward and reverse primers in two steps).
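
Supporting both input styles for single primers need not be much
work. As a rough sketch (the function name load_primers is
hypothetical, and this is not the code in my scripts), the loader
could accept either FASTA or a tabular file with the ID and
sequence in the first two columns, guessing the format from the
first non-blank line:

```python
def load_primers(lines):
    """Load primers from FASTA or two-column tabular text.

    lines is any iterable of strings, e.g. an open file handle.
    Returns a dict mapping primer ID to upper case sequence.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    primers = {}
    if lines and lines[0].startswith(">"):
        # FASTA: the ID is the first word after ">", and the
        # sequence may be wrapped over several lines.
        name = None
        for line in lines:
            if line.startswith(">"):
                name = line[1:].split()[0]
                primers[name] = ""
            else:
                primers[name] += line.strip().upper()
    else:
        # Tabular: ID in column one, sequence in column two;
        # any further columns are ignored.
        for line in lines:
            if line.startswith("#"):
                continue
            fields = line.split("\t")
            primers[fields[0]] = fields[1].strip().upper()
    return primers
```

Either way the rest of the tool would just see a dictionary of
IDs and sequences, so adding the tabular option should not affect
the matching code.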

Currently I have used three separate scripts, with three separate
XML files - one for each supported file type (FASTA, FASTQ, SFF).
They all have the same interface, so could be done as a single
XML wrapper. The only potential downside to that is that as written
the SFF script requires Biopython, while the FASTA and FASTQ
scripts currently use the Galaxy libraries instead. This external
dependency may be an issue if Galaxy were interested in
including this tool in the main distribution - or if I bundled this
as a single tool or tool-suite on the Tool Shed.

Any thoughts?

Regards,

Peter

Peter Cock