Barcode/adapter/primer trimming/filtering of sequences/reads
Hi all,
I'm currently working with some 454 data where the sample was amplified with selective primers, and therefore the reads need a little processing to remove the primer sequences before assembly or mapping (something that sff_extract cleverly spots and warns the user about when doing an SFF to FASTA/FASTQ conversion).
The actual processing I want to do is very similar to spotting and removing barcodes or adapters - except that PCR primers are often degenerate, i.e. have an N in them representing the fact it is a pool of primers covering A, C, G and T at that point, and primers may come in pairs.
Looking over the provided tools in Galaxy, the only relevant ones I saw are as follows:
emboss_5/emboss_primersearch.xml - the text output does not look helpful for trimming my sequences - nothing else in Galaxy uses this format, does it?
fastx_toolkit/fastx_barcode_splitter.xml - copes with 5' or 3' barcodes, but only handles fastqsolexa (discussed recently on the mailing list - I guess it could handle fastqsanger and fastqillumina as well), not FASTA or SFF. Also, according to the FASTX docs for fastx_barcode_splitter.pl it requires non-ambiguous barcodes (i.e. ACGT only), so using it with ambiguous primers won't work: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
I did look on the tool shed and noticed Edward Kirton has done some wrappers for the "Suite of Newbler tools", but his sfffile wrapper does not (yet) include support for splitting SFF files using Roche's MID barcodes.
Are there any other relevant tools I have overlooked?
In the meantime I've started Galaxy wrappers for my own Python code to find and remove PCR primers, adapters, or barcodes (basically any short sequences). These can also be used to filter the reads (choose if non-matching reads are kept or not). However this isn't ideal for barcodes where you'd have to run the tool once for each barcode (or set of barcodes) to get them in a separate file.
For specifying the barcode/primer/adapter sequences, are FASTA or tabular files more commonly used? Should I look at the Galaxy *.loc system to allow commonly used things like the Roche MID barcodes to be predefined at the system level?
I am currently taking a FASTA input file for the primers. A simple tabular file with ID and sequence in the first two columns would be easy to add - that is what fastx_toolkit/fastx_barcode_splitter.xml expects. EMBOSS primersearch wants a three column tabular file with ID, forward primer, reverse primer sequence. However, so far I am only looking at single primer analysis (our primers were pooled so I can remove the forward and reverse primers in two steps).
Currently I have used three separate scripts, with three separate XML files - one for each supported file type (FASTA, FASTQ, SFF). They all have the same interface, so could be done as a single XML wrapper. The only potential downside to that is that as written the SFF script requires Biopython, while the FASTA and FASTQ scripts currently use the Galaxy libraries instead. This external dependency may be an issue if Galaxy were interested in including this tool in the main distribution - or if I bundled this as a single tool or tool-suite on the Tool Shed.
Any thoughts?
Regards,
Peter
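To make the FASTA versus tabular question concrete, here is a minimal sketch of reading primers from either layout into the same ID-to-sequence mapping. It is only an illustration, not the code behind the wrappers described above; the function names read_primers_fasta and read_primers_tabular and the example primer sequences are hypothetical.

```python
from io import StringIO


def read_primers_fasta(handle):
    """Return {id: sequence} from a FASTA handle, joining wrapped sequence lines."""
    primers = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""  # ID is the first word of the header
            primers[name] = ""
        elif name is not None:
            primers[name] += line.upper()
    return primers


def read_primers_tabular(handle):
    """Return {id: sequence} from tab-separated lines (ID in column 1, sequence in column 2)."""
    primers = {}
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        primers[fields[0]] = fields[1].upper()
    return primers


# Made-up primers, written once in each layout.
fasta_text = ">fwd\nACGTNACG\n>rev\nTTGACN\n"
tab_text = "fwd\tACGTNACG\nrev\tTTGACN\n"

fasta_primers = read_primers_fasta(StringIO(fasta_text))
tabular_primers = read_primers_tabular(StringIO(tab_text))
print(fasta_primers == tabular_primers)
```

Since both readers produce the same dictionary, a tool could plausibly accept either format and share everything downstream of the parsing step.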
On Thu, Feb 3, 2011 at 11:54 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Are there any other relevant tools I have overlooked?
I forgot to mention fastx_toolkit/fastx_clipper.xml aka "Clip" which does handle FASTA and FASTQ files, but apparently only deals with 3' adapters (although perhaps the poorly documented -d switch is relevant for a 5' adapter?), and appears to only handle one adapter sequence at a time. The documentation doesn't mention what happens if you want to use an ambiguous adapter sequence (e.g. with an N in it).
Peter
Hello Peter,
If these are standard length PCR primers, then UCSC's In-Silico PCR tool would be an option. It is a variant of BLAT and the source is available from Kent Informatics. Here is a UCSC link to the online version (send Jim Kent an email for a copy):
http://genome.ucsc.edu/cgi-bin/hgPcr?command=start
A wrapper could be made for your own instance, or you could just use it on the command line before loading data. If this is not what you had in mind, please let us know.
Best,
Jen
Galaxy team
On 2/3/11 4:14 AM, Peter Cock wrote:
I forgot to mention fastx_toolkit/fastx_clipper.xml aka "Clip" which does handle FASTA and FASTQ files, but apparently only deals with 3' adapters (although perhaps the poorly documented -d switch is relevant for a 5' adapter?), and appears to only handle one adapter sequence at a time. The documentation doesn't mention what happens if you want to use an ambiguous adapter sequence (e.g. with an N in it).
Peter
_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev
-- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org
On Tue, Feb 8, 2011 at 3:46 AM, Jennifer Jackson <jen@bx.psu.edu> wrote:
Hello Peter,
If these are standard length PCR primers, then UCSC's In-Silico PCR tool would be an option. It is a variant of BLAT and the source is available from Kent Informatics. Here is a UCSC link to the online version (send Jim Kent an email for a copy):
http://genome.ucsc.edu/cgi-bin/hgPcr?command=start
A wrapper could be made for your own instance or just use it command-line before loading data. If this is not what you had in mind, please let us know,
Best,
Jen
Thanks Jen,
If I've understood the purpose of UCSC's In-Silico PCR tool, it is for predicting the PCR product of a given primer pair and genome sequence. That's not quite what I have in mind - more like EMBOSS' primersearch - but I'm sure it would be useful at some point.
Peter
Hi Peter,
Sorry about that, I did a double check and you are right, the tool doesn't "screen" sequences. Maybe try BLAT itself to identify then clip? It depends on how long your primers are - shorter than 20 will need some tuning. Ask UCSC directly (Galt) about how to configure for this type of match: genome@ucsc.edu.
Best,
Jen
Galaxy team
On 2/8/11 2:23 AM, Peter Cock wrote:
Thanks Jen,
If I've understood the purpose of UCSC's In-Silico PCR tool, it is for predicting the PCR product of a given primer pair and genome sequence. That's not quite what I have in mind - more like EMBOSS' primersearch - but I'm sure it would be useful at some point.
Peter
On Wed, Feb 9, 2011 at 3:32 PM, Jennifer Jackson <jen@bx.psu.edu> wrote:
Hi Peter,
Sorry about that, I did a double check and you are right, the tool doesn't "screen" sequences. Maybe try BLAT itself to identify then clip? It depends on how long your primers are - shorter than 20 will need some tuning. Ask UCSC directly (Galt) about how to configure for this type of match: genome@ucsc.edu.
Best,
Jen Galaxy team
Hi Jen,
Yes, my primer sequences are short - up to 22bp, so I don't think BLAST is a good solution. I've been using regular expressions and it seems to work nicely on my current 454 data (it would need testing on some large datasets before I was happy it could be used on, say, a full run of Illumina).
For the work in progress, see: https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
At the time of writing I have three tools, for FASTA, FASTQ and SFF files. As per my recent email I am considering merging them into one single tool: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004294.html
Peter
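For readers wondering how regular expressions cope with degenerate primers, the following is a minimal sketch of one possible approach, not the code in the repository linked above. It assumes the primer sits at the very start of the read and that ambiguity codes in the primer follow the standard IUPAC nucleotide alphabet; the names primer_to_regex and trim_5prime_primer and the example sequences are made up for illustration.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Compile a (possibly degenerate) primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def trim_5prime_primer(read, pattern):
    """Return the read with a leading primer removed, or None if the read
    does not start with the primer (assumption: primer at position 0)."""
    match = pattern.match(read.upper())
    if match is None:
        return None
    return read[match.end():]


pattern = primer_to_regex("ACGTNAC")  # hypothetical degenerate primer
print(trim_5prime_primer("ACGTGACTTTGGG", pattern))  # prints TTTGGG
print(trim_5prime_primer("TTTTACGTGAC", pattern))    # prints None
```

Returning None for non-matching reads leaves the keep-or-discard decision to the caller, which mirrors the filtering option mentioned earlier in the thread. Note that in this sketch an N in the read itself does not match an N in the primer; whether it should is a design choice.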
Hi Peter -
Regular expressions are a great, simple, and often fast solution to cleaning up seqs - glad that it is working for you. I am sure others will be interested in your tools, too, once you have them ready for the Tool Shed.
Thanks for all of your contributions to Galaxy!
Jen
On 2/9/11 7:42 AM, Peter Cock wrote:
Hi Jen,
Yes, my primer sequences are short - up to 22bp, so I don't think BLAST is a good solution. I've been using regular expressions and it seems to work nicely on my current 454 data (it would need testing on some large datasets before I was happy it could be used on say a full run of Illumina).
For the work in progress, see: https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/
At the time of writing I have three tools, for FASTA, FASTQ and SFF files. As per my recent email I am considering merging them into one single tool: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-February/004294.html
Peter
Have you looked at http://code.google.com/p/cutadapt/
Features
- Gapped alignment with mismatches and indels, that is, errors in the adapter are tolerated
- Finds adapters both in the 5' and 3' ends of reads
- Accepts FASTQ, FASTA or .csfasta and .qual files (for AB SOLiD data)
- Any input or output file can be gzip-compressed
- Outputs FASTA or FASTQ
- Trims color space reads correctly
- Optionally removes primer base in color space data
- Can produce MAQ- or BWA-compatible output
I've only had the chance to play around with this for a while, but it looks promising!
On Thu, Feb 3, 2011 at 7:54 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Looking over the provided tools in Galaxy, the only relevant ones I saw are as follows:
emboss_5/emboss_primersearch.xml - the text output does not look helpful for trimming my sequences - nothing else in Galaxy uses this format, does it?
On Tue, Feb 8, 2011 at 8:15 AM, Kevin Lam <aboulia@gmail.com> wrote:
Have you looked at http://code.google.com/p/cutadapt/
Features
- Gapped alignment with mismatches and indels, that is, errors in the adapter are tolerated
- Finds adapters both in the 5' and 3' ends of reads
- Accepts FASTQ, FASTA or .csfasta and .qual files (for AB SOLiD data)
- Any input or output file can be gzip-compressed
- Outputs FASTA or FASTQ
- Trims color space reads correctly
- Optionally removes primer base in color space data
- Can produce MAQ- or BWA-compatible output
only had the chance to play around with this for a while. but looks promising!
Hi Kevin,
Thanks for the link - I think I skimmed over all the source code (it's in Python + C, which is nice from my personal perspective), and I'm pretty sure it does NOT handle ambiguous IUPAC codes (either in the adapter/barcode/primer or the read sequences).
For the particular task I'm working on I do have degenerate PCR primers, i.e. they have N's in them representing the fact it is a pool of primers covering A, C, G and T at that point. This would be pretty strange for an adapter or barcode!
Peter
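For a tool that only accepts unambiguous A, C, G, T sequences, one conceivable workaround is to expand a degenerate primer into the explicit pool of sequences it stands for and supply each one separately. The sketch below illustrates that idea only; it is not something proposed in the thread, the name expand_degenerate and the example primer are hypothetical, and the pool size multiplies with every ambiguous position (each N contributes a factor of four).

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def expand_degenerate(primer):
    """List every unambiguous sequence covered by a degenerate primer."""
    choices = [IUPAC_BASES[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


pool = expand_degenerate("ACNGR")  # hypothetical primer: N (4 options) x R (2 options)
print(len(pool))  # prints 8
print(pool[0])    # prints ACAGA
```

For primers with several N positions the pool quickly becomes large, which is one reason matching the ambiguity codes directly may be the more practical route.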
participants (3)
- Jennifer Jackson
- Kevin Lam
- Peter Cock