Re: [galaxy-user] finding open reading frames and corresponding genes

12 Jul 2011

Hello,

Assuming that your data is DNA (and not RNA), the tools under
"NGS: SAM Tools" and "Operate on Genomic Intervals" can generate
coordinates for the mapped reads, which can then be correlated with known
genes/ORFs from your bacterial genome (or from related genomes, if you can
obtain those mapped to the same reference genome to use as
predictions).

General analysis path:

Starting with your SAM file, use these tools first to obtain an interval 
file representing your read coverage:
1 - SAM-to-BAM
2 - Generate pileup
3 - Pileup-to-Interval
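
To illustrate what step 3 produces, here is a rough Python sketch that
merges consecutive covered positions from a pileup into intervals. It is
not the code Galaxy runs. It assumes a simple tab-separated pileup with
the read depth in the fourth column (other pileup variants place it
elsewhere, hence the hypothetical depth_col parameter), and it writes
0-based, half-open coordinates.

```python
def pileup_to_intervals(lines, min_depth=1, depth_col=3):
    """Merge adjacent covered positions into (chrom, start, end) tuples.

    Assumes tab-separated pileup lines: chrom, 1-based position, ..., with
    the depth in column index `depth_col` (an assumption; adjust as needed).
    Output coordinates are 0-based, half-open.
    """
    intervals = []
    current = None  # [chrom, start, end] of the interval being extended
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        chrom, pos, depth = fields[0], int(fields[1]), int(fields[depth_col])
        if depth < min_depth:
            continue  # uncovered position: leaves a gap between intervals
        start = pos - 1
        if current and current[0] == chrom and current[2] == start:
            current[2] = pos  # extend the open interval by one base
        else:
            if current:
                intervals.append(tuple(current))
            current = [chrom, start, pos]
    if current:
        intervals.append(tuple(current))
    return intervals


example = [
    "chr1\t10\tA\t3\t...\tIII",
    "chr1\t11\tC\t2\t..\tII",
    "chr1\t12\tG\t0\t*\t*",
    "chr1\t20\tT\t1\t.\tI",
]
print(pileup_to_intervals(example))
```

Raising min_depth would restrict the intervals to better-supported
regions.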

Next, import a reference known gene set. Sources may include UCSC, NCBI,
or others. Download from that source (if not directly available via a
"Get Data" source) and upload the file into Galaxy using FTP ("Get Data
-> Upload"). If this data is in GFF/GTF format, you can convert it to
Interval using "Convert Formats -> GFF-to-BED". BED is a stricter form
of Interval; use the pencil icon in the dataset's box to Edit Attributes
and change the data type to Interval.
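
For reference, the main coordinate change in a GFF-to-BED conversion can
be sketched in Python as below. This is only an illustration, not the
Galaxy tool's code: GFF uses 1-based, inclusive starts, while BED uses
0-based, half-open coordinates. Using the feature type as the BED name
and writing 0 for a missing score are assumptions made for this example.

```python
def gff_line_to_bed(line):
    """Convert one tab-separated GFF line to a 6-column BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    chrom, _source, feature, start, end, score, strand = fields[:7]
    bed_start = int(start) - 1          # 1-based inclusive -> 0-based
    bed_score = "0" if score == "." else score  # assumption: 0 if missing
    return "\t".join([chrom, str(bed_start), end, feature, bed_score, strand])


gff = "chr1\tRefSeq\tgene\t100\t200\t.\t+\t.\tID=gene1"
print(gff_line_to_bed(gff))
```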

Then, with the output from "Pileup-to-Interval" and a reference known
gene/transcript dataset mapped to the same reference genome, use the
tools in "Operate on Genomic Intervals" to perform comparisons based on
genomic coordinates. Each tool has a description on the main tool form,
and there are also screencasts explaining the functions under "3.
Interval Operation Tutorial" here:
http://wiki.g2.bx.psu.edu/Learn/Screencasts
Also see "Regional Variation -> Feature coverage" for localized
comparisons.
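
As an idea of what a coordinate-based intersection does, here is a
minimal Python sketch. It is not how the Galaxy tools are implemented,
and the interval values and gene names are made up for the example.
Coordinates are assumed to be 0-based, half-open, and the nested loop is
only suitable for small inputs.

```python
def intersect(reads, genes):
    """Report overlaps between read intervals and named gene intervals.

    reads: iterable of (chrom, start, end)
    genes: iterable of (chrom, start, end, name)
    Returns a list of (chrom, overlap_start, overlap_end, name).
    """
    hits = []
    for chrom, r_start, r_end in reads:
        for g_chrom, g_start, g_end, name in genes:
            if chrom != g_chrom:
                continue
            start = max(r_start, g_start)
            end = min(r_end, g_end)
            if start < end:  # non-empty overlap
                hits.append((chrom, start, end, name))
    return hits


reads = [("chr1", 9, 11), ("chr1", 19, 20)]
genes = [("chr1", 0, 10, "geneA"), ("chr1", 15, 30, "geneB")]
print(intersect(reads, genes))
```

Carrying the gene name through the overlap is one way to keep
identifiers attached to the covered regions.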

Once an intersection of coordinates is complete, you may need to use the
tools in "Text Manipulation", "Filter and Sort", or "Join, Subtract and
Group" to merge in gene identifiers. The exact order in which to use
these tools depends greatly on the format of the input reference
gene/transcript dataset.

If you are doing transcript predictions, the tool "EMBOSS -> getorf"
("Finds and extracts open reading frames (ORFs)") may be helpful. This
tool requires sequence as input. Once predicted transcript coordinates
are obtained, extract sequence from the reference genome using "Fetch
Sequences -> Extract Genomic DNA" to use as input.
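
To show the general idea of ORF finding on an extracted sequence, here
is a simplified Python sketch. It is not getorf's algorithm and does not
reproduce its options: it scans only the forward strand, in three
frames, for ATG followed by an in-frame stop codon. The min_codons
parameter is hypothetical and counts the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) pairs, 0-based half-open, for ATG..stop
    ORFs on the forward strand. The end includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


sequence = "CCATGAAATAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

For a real bacterial genome you would also need the reverse strand and
alternative start codons, which the actual tool can handle through its
own settings.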

Hopefully this helps to get you started. Please let us know if we can 
help again,

Best,

Jen
Galaxy team

On 7/12/11 7:02 AM, Joanne Rampersad wrote:
...
Hi
I am sequencing a bacterial genome and have assembled my Illumina
reads (40 bp single) using bowtie with a reference genome. This
generated a sam file.
I would like to obtain a listing of the open reading frames from the
bacterial genome and the corresponding genes that they are most
similar to.
Can you please give the tools/steps necessary to do this?
many thanks
Jo
-- 
Jennifer Jackson
http://usegalaxy.org/
http://galaxyproject.org/