Hi everyone, I have the genome sequence and gene annotation file. Is there a tool on Galaxy to extract the 5,000 bp upstream, 5,000 bp downstream and genome sequences of the genes (including exons and introns) from the genome sequence? Any suggestions are highly appreciated! Thanks! Yan
Hi Yan, do you know the extractfeat tool from the EMBOSS suite (it's in the Tool Shed)? I don't know offhand whether it can work in batch mode, but it's possible to add that feature. Cheers, Bjoern
___________________________________________________________ The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
-- Björn Grüning Albert-Ludwigs-Universität Freiburg Institute of Pharmaceutical Sciences Pharmaceutical Bioinformatics Hermann-Herder-Strasse 9 D-79104 Freiburg i. Br. Tel.: +49 761 203-4872 Fax.: +49 761 203-97769 E-Mail: bjoern.gruening@pharmazie.uni-freiburg.de Web: http://www.pharmaceutical-bioinformatics.org/
Yan,

One way to do this is to create an interval file with the new coordinates (+/- 5 kb) and then use the Fetch Sequences > Extract Genomic DNA tool.

To create the new coordinates file, input your annotation file into the Text Manipulation > Compute tool, using expressions like "c3 = c3-5000" to get your new coordinates. After computing both the new start and the new end you'll have 2 new columns in the final output file; then use the Text Manipulation > Cut tool to extract the columns you need to create an interval file.

Hope this helps.

Cheers, Graham

Dr. Graham Etherington
Bioinformatics Support Officer, The Sainsbury Laboratory, Norwich Research Park, Norwich NR4 7UH. UK
Tel: +44 (0)1603 450601

(In reply to Björn Grüning's message of 24/09/2012 09:02.)
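The Compute-then-Cut idea above can be sketched outside Galaxy as plain arithmetic. This is only an illustration, not what the Galaxy tools run internally: the three-column layout (chrom, start, end), the function name `extend_intervals`, and the example rows are all assumptions made for the sketch.

```python
import csv
import io

FLANK = 5000  # flank length asked for in the thread


def extend_intervals(rows, flank=FLANK):
    """Yield (chrom, start - flank, end + flank) for each (chrom, start, end) row.

    Mirrors two Compute expressions (start - 5000 and end + 5000) followed by
    Cut. No clamping is done here, so starts near a scaffold edge go negative.
    """
    for chrom, start, end in rows:
        yield chrom, int(start) - flank, int(end) + flank


# Hypothetical tab-delimited interval input: chrom, start, end.
example = "chr1\t12000\t15000\nchr1\t300\t900\n"
rows = csv.reader(io.StringIO(example), delimiter="\t")
extended = list(extend_intervals(rows))
for chrom, start, end in extended:
    print(chrom, start, end, sep="\t")
```

The second row shows the catch with a plain subtraction: a feature closer than 5 kb to the start of its scaffold ends up with a negative start coordinate.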
Hi Yan,

Both of the other suggestions are good. I'll also give you another way to build coordinates before using the "Fetch Sequences -> Extract Genomic DNA" tool to obtain the fasta sequence.

Start with your input in BED/Interval format (convert from GFF/GTF if necessary, using the tool "Convert Formats -> GFF-to-BED"), or with the first 6 columns if it is BED12 (use "Cut" as needed), then run the "Operate on Genomic Intervals -> Get flanks" tool with these settings:

    "Region:" Whole feature
    "Location of the flanking region/s:" Both
    "Offset" 0
    "Length of the flanking region(s):" 5000

Your question is similar to this one (the first part, but I thought you might be interested in how to just get the flanks, too).
http://user.list.galaxyproject.org/Get-flanks-version-1-0-0-td4604849.html

Good luck with your project!

Jen
Galaxy team

ps. To search prior questions, please see: http://galaxy.psu.edu/search/mailinglists/

(In reply to Yan He's message of 9/23/12 7:00 PM.)
-- Jennifer Jackson http://galaxyproject.org
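To make the "Get flanks" settings above concrete, here is a minimal sketch of the interval arithmetic they describe, assuming 0-based, half-open BED coordinates. The function `get_flanks` is hypothetical and is not the Galaxy tool's code; clamping the left flank at 0 is an assumption of the sketch.

```python
def get_flanks(start, end, length=5000, offset=0):
    """Return (left_flank, right_flank) for a 0-based, half-open interval.

    Each flank is a (start, end) pair that lies outside the feature itself.
    The left flank is clamped at 0 (an assumption made for this sketch).
    """
    left = (max(0, start - offset - length), max(0, start - offset))
    right = (end + offset, end + offset + length)
    return left, right


# A feature well inside a scaffold gets two full 5 kb flanks.
print(get_flanks(12000, 15000))

# A feature at BED 34..385 (GFF 35..385) has almost no room on the left.
print(get_flanks(34, 385))
```

Note that neither flank includes the feature itself, so flanks alone give the upstream and downstream regions as separate intervals rather than one upstream+gene+downstream interval.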
Hi Jen,

Thanks very much for your help! It is very helpful. However, following your suggestion, what I got is not what I want. Take one sequence for example. The annotation for one scaffold is

C16582 GLEAN mRNA 35 385 0.555898 - . ID=OYG_GLEAN_10000001;
C16582 GLEAN CDS 35 385 . - 0 Parent=OYG_GLEAN_10000001;

What I got for this scaffold is

>?_C16582_385_5385_-
GCAAACAAGC
>?_C16582_385_5385_-
GCAAACAAGC

I understand that it is trying to get the sequence downstream of the gene, from 385 to 5385, but the scaffold is short, so I only get what the scaffold has. I would like to have the upstream+gene+downstream sequence at the same time, not only the upstream or downstream. How can I do this using a Galaxy tool? Thanks!

Yan
(In reply to the message from jen@bx.psu.edu of Mon, 24 Sep 2012 12:26:03 -0700; Subject: Re: [galaxy-user] extract genome sequence)
Hello Yan,

Unfortunately, the tool can only extract sequence that is provided as the mapping target. This will be a limitation with any of the methods. The 'Get flanks' tool does avoid generating negative coordinates (which would cause a problem with the 'Extract' tool), but it is not quite giving you what you want either, assuming that a partially extended sequence, based on the available data, would be acceptable.

Using the 'Compute' tool may be the best option for your case, now that the data is clearer. "End" coordinates that extend past the edge of the chromosome are not a problem, but the "Start" coordinate will need to be set to 1 (if using GFF3 as an interval directly) or 0 (if you converted to BED, which doesn't appear to be the case). The expression below will either subtract 5000 from a "Start" coordinate or change it to 1, depending on how close it is to the leading edge of the scaffold (modify for BED to be 0-based as needed):

    (c2 - 5000) if (c2 > 5000) else (1)

Then add 5000 to the end, 'Cut' columns, and extract as Graham recommended.

I am not going to address the GFF3 format except to say that if you have gene rows in your data, use those if your target genome has spliced transcripts. If the data is transcript-based rather than gene-based and is split between rows (multi-exon), then the processing becomes more complicated. One potential solution is the 'Extract' tool: it does not only extract fasta sequence, it can also be used to combine records for some GFF/GTF datasets, so you could try this and output "Interval" data instead of "Fasta". This creates a new GTF file with global coordinates (but the sequence output will be spliced). Check that the result is correct, run the 'Compute' tool to do the extensions, 'Cut' columns, and do a final 'Extract' run to obtain the extended, global sequence. All of this would have to be tested with your data; much depends on the attributes in your file.

Hopefully one of these solutions will work out for you,

Jen
Galaxy team

(In reply to Yan He's message of 9/25/12 12:41 AM.)
-- Jennifer Jackson http://galaxyproject.org
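The clamped Compute expression from the reply above can be checked against Yan's example with a few lines of Python. This is a sketch under stated assumptions: coordinates are 1-based as in GFF3, the function name `extend_gff_interval` is hypothetical, and the column holding the start (written `c2` in the expression) depends on the layout of the actual file.

```python
def extend_gff_interval(start, end, flank=5000):
    """Extend a 1-based interval by `flank` on both sides.

    The start uses the same rule as the Compute expression
    (c2 - 5000) if (c2 > 5000) else (1); the end is extended without
    clamping, since ends past the scaffold edge are said not to be a problem.
    """
    new_start = (start - flank) if (start > flank) else 1
    new_end = end + flank
    return new_start, new_end


# Yan's mRNA row on scaffold C16582 spans 35..385.
print(extend_gff_interval(35, 385))

# A feature far from the scaffold edge is extended by the full 5 kb.
print(extend_gff_interval(12000, 15000))
```

For the 35..385 feature this gives a single interval starting at 1 that covers the available upstream sequence, the gene, and the downstream extension together, which is the single combined interval asked for earlier in the thread.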
participants (4)
- Björn Grüning
- graham etherington (TSL)
- Jennifer Jackson
- Yan He