Nucleotide analysis - GC percentage
Hi all,
Are there any built-in Galaxy tools that I have missed to do with GC percentage (or indeed, AT percentage)?
I'm thinking of a tool to calculate the GC percentage (and perhaps related statistics like counts/percentages of A, C, G, T), and perhaps a related tool to filter on GC. Possible use cases include filtering NGS reads to remove high/low GC reads from a contaminant.
Slightly more complicated, right now I want to calculate the GC (or in fact AT) percentage from the first and last ~20 (configurable) bases. In this case I am looking for (and filtering on) AT-rich ends of contigs, which may be indicative of viral sequences. A very similar task would be looking for (and filtering on) poly-A tails of mRNA or, if sequenced from the reverse strand, a poly-T start.
Peter
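As a rough illustration of the calculation described above (not an existing Galaxy tool), a minimal Python sketch might look like the following. The function names, the 20-base default and the 30% GC cutoff are hypothetical choices for illustration only, and ambiguous bases such as N are simply ignored.

```python
def gc_percent(seq):
    """Return the GC percentage of seq, counting only A, C, G and T (case-insensitive)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # An empty sequence, or one with no A/C/G/T at all, is reported as 0.0.
    return 100.0 * gc / total if total else 0.0


def end_gc_percent(seq, n=20):
    """Return (GC% of the first n bases, GC% of the last n bases)."""
    if n <= 0:
        raise ValueError("n must be a positive number of bases")
    return gc_percent(seq[:n]), gc_percent(seq[-n:])


def has_at_rich_ends(seq, n=20, max_gc=30.0):
    """True if both ends have a GC percentage at or below max_gc (assumed cutoff)."""
    head, tail = end_gc_percent(seq, n)
    return head <= max_gc and tail <= max_gc
```

For example, `end_gc_percent(contig, n=20)` gives the two end values to filter on, and the AT percentage of the counted bases is simply 100 minus each value.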
Hi Peter,
There isn't a built-in Galaxy tool to compute GC% yet. You could perhaps use UCSC's hgGcPercent binary, which lets you compute GC% for BED intervals. You can find it here: http://genome.ucsc.edu/FAQ/FAQdownloads#download27
Thanks,
Guru.
On Thu, Apr 14, 2011 at 9:11 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
> Are there any built in Galaxy tools that I have missed to do with GC percentage (or indeed, AT percentage)?
___________________________________________________________
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
-- Graduate student, Bioinformatics and Genomics Makova lab/Galaxy team 505 Wartik lab University Park PA 16802 guru@psu.edu
On Thu, Apr 14, 2011 at 4:15 PM, Guru Ananda <guru@psu.edu> wrote:
> Hi Peter, There isn't a built-in Galaxy tool to compute GC%, yet.
Thanks Guru.
> You could perhaps use UCSC's hgGcPercent binary, which lets you compute GC% for BED intervals. You can find the same here: http://genome.ucsc.edu/FAQ/FAQdownloads#download27
I'll be working with simple sequence files (FASTA, or even FASTQ, SFF, etc.) rather than BED files, but I'll keep that in mind.
Peter
Peter and Guru;
[Computing GC]
> I'll be working with simple sequence files (FASTA, or even FASTQ, SFF, etc) rather than BED files, but I'll keep that in mind.
EMBOSS has some utilities that do this: infoseq and geecee. There are also programs for exploring CpG islands: http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/nucleic_cpg_islan...
Brad
Thanks for pointing this out, Brad. Both geecee and infoseq are in fact available on Galaxy under the EMBOSS section.
Guru.
Please remove me from the mailing list - thanks.
On 14-04-2011, at 9:26 AM, Guru Ananda wrote:
> Thanks for pointing this out, Brad. Both geecee and infoseq are in fact available on Galaxy under EMBOSS section.
> Guru.
On Thu, Apr 14, 2011 at 5:13 PM, Brad Chapman <chapmanb@50mail.com> wrote:
> Emboss has some utilities that do this. infoseq and geecee, and there are also programs for exploring CpG islands:
> http://emboss.sourceforge.net/apps/release/6.3/emboss/apps/nucleic_cpg_islan...
Good idea Brad :)
Now why does a tool search on the public Galaxy instance for GC not suggest this tool?
Name: geecee
Description: Calculates fractional GC content of nucleic acid sequences
Does this mean the description isn't searched? It would seem like a sensible idea to me to include that...
Searching for "geecee" works, but unless you're familiar with this EMBOSS tool no-one will think of that.
Peter
> Now why does a tool search on the public Galaxy instance for GC not suggest this tool?
Peter,
The tool search doesn't start until you type in three characters, so typing 'GC' does not initiate a search. Typing 'gc<space>' or 'gc content' works. Perhaps a tooltip or help text is needed.
J.
On Thu, Apr 14, 2011 at 6:25 PM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
> The tool search doesn't start until you type in three characters, so typing 'GC' does not initiate a search. Typing 'gc<space>' or 'gc content' works. Perhaps a tooltip or help text is needed.
I see that now, and yes, perhaps a caption on the search box would help... Also typing G, C, enter doesn't work - that does surprise me.
There is still something amiss with the search apparently not using the tool description line; for instance, none of "acid", "nucleic" or "fractional" show the EMBOSS geecee tool.
If the search is indexing on the tool's main help text, then for the EMBOSS tools it would help to have an executive summary with key words in it, rather than just a link to the EMBOSS webpage for each tool.
Peter
Dear All,
I have combined the H3K4me3 pattern in a specific region (Info: UCSC Main on Human: wgEncodeBroadHistoneGm12878CtcfStdPk (genome)) with RefSeq genes in that region (UCSC Main on Human: refGene (genome)) and got this PDF histogram. I was wondering if someone could help me with its interpretation.
Best regards,
Moein M. Farshchian
Ph.D Candidate of Cell & Molecular Biology, Department of Biology, Faculty of Sciences, Ferdowsi University of Mashhad. Mashhad, Iran. P.O.Box: 9177948974
________________________________
From: Peter Cock <p.j.a.cock@googlemail.com>
To: Jeremy Goecks <jeremy.goecks@emory.edu>
Cc: galaxy-user@lists.bx.psu.edu
Sent: Fri, April 15, 2011 12:55:49 PM
Subject: Re: [galaxy-user] Nucleotide analysis - GC percentage
> I see that now, and yes, perhaps a caption on the search box would help...
Hi Brad,
These tools are also in Galaxy under the EMBOSS section. "geecee" will tell you the GC content of FASTA sequences (as a fraction). It basically outputs the sequence name and then the GC content, as below:
#Sequence GC content
Sequence1 0.44
Hope this helps!
Tychele
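Given geecee output in the two-column form shown above, filtering sequences on GC content could be sketched in Python as follows. This assumes whitespace-separated columns and a header line starting with "#", based only on the example in this thread. The function names and thresholds are hypothetical.

```python
def parse_geecee(lines):
    """Parse geecee-style lines into a {sequence name: GC fraction} dict.

    Assumes each data line ends with the GC fraction, separated from the
    sequence name by whitespace, and that header lines start with '#'.
    """
    gc_by_name = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.rsplit(None, 1)  # split off the last column only
        if len(parts) != 2:
            continue  # skip lines that do not match the assumed layout
        name, value = parts
        gc_by_name[name] = float(value)
    return gc_by_name


def filter_by_gc(gc_by_name, low=0.0, high=1.0):
    """Return names whose GC fraction lies within [low, high]."""
    return [name for name, gc in gc_by_name.items() if low <= gc <= high]
```

The resulting list of names could then be used to pull the matching records out of the original FASTA file.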
The kent program hgGcPercent will measure what you want to measure from your sequences.
--Hiram
hgGcPercent - Calculate GC Percentage in 20kb windows
usage:
   hgGcPercent [options] database nibDir
      nibDir can be a .2bit file, a directory that contains a
      database.2bit file, or a directory that contains *.nib files.
      Loads gcPercent table with counts from sequence.
options:
   -win=<size> - change windows size (default 20000)
   -noLoad - do not load mysql table - create bed file
   -file=<filename> - output to <filename> (stdout OK) (implies -noLoad)
   -chr=<chrN> - process only chrN from the nibDir
   -noRandom - ignore randome chromosomes from the nibDir
   -noDots - do not display ... progress during processing
   -doGaps - process gaps correctly (default: gaps are not counted as GC)
   -wigOut - output wiggle ascii data ready to pipe to wigEncode
   -overlap=N - overlap windows by N bases (default 0)
   -verbose=N - display details to stderr during processing
   -bedRegionIn=input.bed Read in a bed file for GC content in specific regions and write to bedRegionsOut
   -bedRegionOut=output.bed Write a bed file of GC content in specific regions from bedRegionIn
example:
   calculate GC percent in 5 base windows using a 2bit assembly (dp2):
      hgGcPercent -wigOut -doGaps -win=5 -file=stdout -verbose=0 \
         dp2 /cluster/data/dp2 \
         | wigEncode stdin gc5Base.wig gc5Base.wib
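To show what a fixed-window GC calculation computes in principle, here is a minimal Python sketch. It is not hgGcPercent's implementation: the `win` and `overlap` parameters only mirror the -win and -overlap options named in the usage text, and the handling of non-ACGT bases (ignored) and of a short final window (kept) are assumptions made for this example.

```python
def windowed_gc(seq, win=20000, overlap=0):
    """Return a list of (start, end, GC percent) tuples for consecutive windows.

    Windows are win bases long and advance by win - overlap bases.
    Bases other than A, C, G, T are ignored when computing the percentage.
    """
    if win <= 0 or not 0 <= overlap < win:
        raise ValueError("need win > 0 and 0 <= overlap < win")
    step = win - overlap
    seq = seq.upper()
    results = []
    for start in range(0, len(seq), step):
        window = seq[start:start + win]
        gc = window.count("G") + window.count("C")
        acgt = gc + window.count("A") + window.count("T")
        pct = 100.0 * gc / acgt if acgt else 0.0
        results.append((start, start + len(window), pct))
    return results
```

Coordinates here are zero-based with an exclusive end, as in BED intervals.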
participants (8)
- Brad Chapman
- Douglas Allan
- Guru Ananda
- Hiram Clawson
- Jeremy Goecks
- Moein Farshchian
- Peter Cock
- Tychele