Hi Jen,
Thank you very much for your reply.
The file contains more than 5000 transcripts, so I am not pulling out
data per transcript.
I did as you suggested and checked the format. I filtered the GFF file
to get a new file containing only exon information (I was wrong yesterday
because I used the raw GTF file, as I described in my previous mail), then
converted the GTF to BED. After that I could use
(Extract Features)->Gene BED To Exon/Intron/Codon BED to get a BED file
containing introns, like this:
1 | 9162341 | 9162884 | CUFF.1911.1 | 0 | - |
1 | 22819814 | 22826251 | CUFF.5109.1 | 0 | + |
1 | 25887852 | 25895755 | CUFF.5509.1 | 0 | - |
1 | 25895822 | 25902258 | CUFF.5509.1 | 0 | - |
1 | 39783161 | 39786032 | CUFF.8086.1 | 0 | + |
Then I ran into another problem: I got an empty file when I used
Extract Genomic DNA to fetch the sequences, whether the input was in GTF
format or not. The tool returned a correct result when I used a BED file
downloaded from UCSC Main. I checked the format but could not find
anything wrong.
The data downloaded from UCSC Main looks like this:
chr1 | 133903980 | 133904133 | NM_214429_exon_0_0_chr1_133903981_f | 0 | + |
chr1 | 133914112 | 133914267 | NM_214429_exon_1_0_chr1_133914113_f | 0 | + |
chr1 | 133917280 | 133917449 | NM_214429_exon_2_0_chr1_133917281_f | 0 | + |
Then I suddenly found the problem while I was trying to explain it. The
input file for the tool (Extract Genomic DNA) requires the chromosome
names to be, for example, 'chr1' rather than '1'.
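
For anyone hitting the same mismatch outside Galaxy, the renaming can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `add_chr_prefix` is made up for this example, the input is assumed to be tab-separated BED, and a plain prefix is assumed to be enough (some identifiers, such as a mitochondrial "MT" versus "chrM", would need an explicit mapping instead).

```python
def add_chr_prefix(bed_line, prefix="chr"):
    """Return one BED line with `prefix` added to the chromosome name if missing.

    Hypothetical helper: header lines (track/browser/#) are passed through unchanged.
    """
    line = bed_line.rstrip("\n")
    if not line or line.startswith(("#", "track", "browser")):
        return line
    fields = line.split("\t")
    if not fields[0].startswith(prefix):
        fields[0] = prefix + fields[0]
    return "\t".join(fields)


# Example with the first intron row from the BED file above.
row = "1\t9162341\t9162884\tCUFF.1911.1\t0\t-"
print(add_chr_prefix(row))
```

Lines that already start with the prefix are left alone, so running it twice on the same file should not produce names like 'chrchr1'.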
I have struggled with this all day. It is really inefficient when there
is nobody to give instructions face to face.
Best,
John
Sent: Wednesday, August 21, 2013 6:50 PM
Subject: Re: [galaxy-user] Question about Extract intron sequences
from [gtf file] + [genome FASTA file]
Hi Jen,
Thank you very much for your reply.
The file contains more than 5000 transcripts, so I am not pulling out
data per transcript.
I did as you suggested and checked the format. I filtered the GFF file
to get a new file containing only exon information (I was wrong yesterday
because I used the raw GTF file, as I described in my previous mail), then
converted the GTF to BED. After that I could use
(Extract Features)->Gene BED To Exon/Intron/Codon BED to get a BED file
containing introns, like this:
1 | 9162341 | 9162884 | CUFF.1911.1 | 0 | - |
1 | 22819814 | 22826251 | CUFF.5109.1 | 0 | + |
1 | 25887852 | 25895755 | CUFF.5509.1 | 0 | - |
1 | 25895822 | 25902258 | CUFF.5509.1 | 0 | - |
1 | 39783161 | 39786032 | CUFF.8086.1 | 0 | + |
Then I ran into another problem: I got an empty file when I used Extract
Genomic DNA to fetch the sequences, whether the input was in GTF format or
not. The tool returned a correct result when I used a BED file downloaded
from UCSC Main. I checked the format but could not find anything wrong.
The data downloaded from UCSC Main looks like this:
chr1 | 133903980 | 133904133 | NM_214429_exon_0_0_chr1_133903981_f | 0 | + |
chr1 | 133914112 | 133914267 | NM_214429_exon_1_0_chr1_133914113_f | 0 | + |
chr1 | 133917280 | 133917449 | NM_214429_exon_2_0_chr1_133917281_f | 0 | + |
I have struggled with this all day. It is really inefficient when there
is nobody to give instructions face to face, so I would appreciate any
tips you can give.
Best,
John
Sent: Wednesday, August 21, 2013 1:45 AM
Subject: Re: [galaxy-user] Question about Extract intron sequences
from [gtf file] + [genome FASTA file]
Hello,
There appears to be something odd with the formatting of the GTF file -
the exon counts are off in the second transcript's first exon. The
exon_number "1" should be "2" (remember to count in reverse, since it is
on the negative strand). But that is a side issue. There are other things
that do not quite make sense, but the entire dataset was not shared.
Run this again, but do the following:
1 - Make sure the files are in interval format and that the column
assignments are correct (click on the pencil icon).
2 - Use strand assignment or, better, separate the (+) and (-) stranded
transcripts into two files at the start and run the query in two
workflows from there. Some GOPS tools work best this way.
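
Outside Galaxy, the strand separation in step 2 could be sketched roughly as below. This is a minimal illustration, not a Galaxy tool: the function name `split_by_strand` is hypothetical, and it assumes tab-separated BED6 lines with the strand in the sixth column.

```python
def split_by_strand(bed_lines, strand_column=5):
    """Split BED lines into (+) and (-) lists based on the strand column.

    Hypothetical helper: lines without a '+' or '-' in that column are skipped.
    """
    plus, minus = [], []
    for line in bed_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        if len(fields) <= strand_column:
            continue  # too few columns to carry a strand
        if fields[strand_column] == "+":
            plus.append(line)
        elif fields[strand_column] == "-":
            minus.append(line)
    return plus, minus


rows = [
    "1\t9162341\t9162884\tCUFF.1911.1\t0\t-",
    "1\t22819814\t22826251\tCUFF.5109.1\t0\t+",
]
plus_rows, minus_rows = split_by_strand(rows)
print(len(plus_rows), len(minus_rows))
```

Each list can then be written to its own file and run through a separate workflow.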
Also, be aware
that some of these transcripts will not have intron output. For example, the
first transcript in your example is a single exon transcript. Also, if you have
genes with overlapping variant transcripts, these will interfere with the query
(you will lose introns or fractions of introns), but I don't know how large of a
dataset you are working with. If you want to pull out data per transcript, the
tools in the group "Filter and Sort" can be used to subset GFF/GTF
files.
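
To illustrate why a single-exon transcript gives no intron output, here is a small Python sketch that derives introns per transcript directly from GTF exon lines. It is an assumption-laden example rather than what any Galaxy tool does internally: the function name `introns_from_gtf` is made up, the input is assumed to be tab-separated GTF, and each transcript is handled on its own, so overlapping variant transcripts do not interfere with each other.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def introns_from_gtf(gtf_lines):
    """Return BED6-style intron tuples, one per gap between consecutive exons."""
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # skip comments, transcript records, malformed lines
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        key = (fields[0], fields[6], match.group(1))  # chrom, strand, transcript
        exons[key].append((int(fields[3]), int(fields[4])))

    introns = []
    for (chrom, strand, transcript), spans in exons.items():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start - 1 > prev_end:
                # GTF is 1-based inclusive, BED is 0-based half-open:
                # intron bases prev_end+1 .. next_start-1 become [prev_end, next_start-1).
                introns.append((chrom, prev_end, next_start - 1, transcript, 0, strand))
    return introns


sample = [
    '1\tCufflinks\texon\t3\t22\t1000\t+\t.\tgene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";',
    '1\tCufflinks\texon\t10\t15\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";',
    '1\tCufflinks\texon\t30\t40\t1000\t-\t.\tgene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";',
]
print(introns_from_gtf(sample))
```

With the example data from this thread, CUFF.26.1 contributes nothing because it has only one exon, while CUFF.204.1 yields a single intron between its two exons.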
The last query that you ran is the ideal way to run to obtain this
information in Galaxy, but the GFF to BED converter creates a BED6, not a BED12
file, and this is why the tool produced no output (see the tool form for
required input). Having this tool accept GTF formatted input might be something
to consider as an enhancement - I will run it by our development team and open a
Trello ticket as appropriate.
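
To make the BED6 versus BED12 difference concrete: a BED12 line carries the exon structure of a whole transcript in its last three columns (blockCount, blockSizes, blockStarts), which a BED6 line does not have. The sketch below builds one BED12 line from a transcript's exons. The function name `exons_to_bed12` is hypothetical, the thickStart/thickEnd and itemRgb values are placeholders, and whether a given tool accepts the result is not something this example can promise.

```python
def exons_to_bed12(chrom, name, strand, exons, score=0):
    """Build one BED12 line from a list of (start, end) exons in GTF coordinates.

    GTF coordinates are 1-based inclusive; BED coordinates are 0-based half-open.
    """
    spans = sorted(exons)
    chrom_start = spans[0][0] - 1
    chrom_end = spans[-1][1]
    block_sizes = [end - start + 1 for start, end in spans]
    block_starts = [start - 1 - chrom_start for start, _ in spans]
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart/thickEnd: placeholder values
        0,                       # itemRgb: placeholder value
        len(spans),
        ",".join(str(size) for size in block_sizes),
        ",".join(str(offset) for offset in block_starts),
    ]
    return "\t".join(str(field) for field in fields)


# The two exons of CUFF.204.1 from the example GTF in this thread.
print(exons_to_bed12("1", "CUFF.204.1", "-", [(10, 15), (30, 40)]))
```

For the two exons 10-15 and 30-40, the blocks come out as sizes 6,11 at offsets 0,20 from the transcript start.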
Another method, which may not be available to you (judging from the
chromosome identifiers, which are not UCSC chrom IDs) but could help in
the future, or help others now, is to use the UCSC Table Browser. It goes
something like this:
1 - Click on "display at UCSC Main" for a GTF dataset. This loads the
data as a custom track, displayed by default in the assembly viewer.
2 - Once in UCSC, at the top bar, pick Tools -> Table Browser.
3 - In the Table Browser, change the track group to "Custom Tracks"; the
user track you just loaded will be there.
4 - Change region = genome, then output = bed, make sure "Send output to
Galaxy" is checked, and submit.
5 - On the next form, you will be given a list of regions to output in
the BED6 output. Introns are one of them.
Best,
Jen
Galaxy team
On 8/20/13 9:29 AM, 师云 wrote:
Dear Jen,
I am not much of a Galaxy user yet. A few days ago I learned about
Galaxy and found it a really wonderful tool. I am confused by a simple
question: how to extract intron sequences from a [gtf file].
Here is a sample of the gtf file:
1 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1";
1 Cufflinks exon 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";
1 Cufflinks transcript 10 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1";
1 Cufflinks exon 10 15 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";
1 Cufflinks exon 30 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";
I want to extract the introns from the [gtf] file. I found 2 approaches
that might solve this, but neither works:
1. I use (Filter and Sort) -> Filter to split the [gtf] file into 2
files, as follows:
File A (contains transcripts):
1 Cufflinks transcript 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1";
1 Cufflinks transcript 10 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1";
File B (contains exons):
1 Cufflinks exon 3 22 1000 + . gene_id "CUFF.26"; transcript_id "CUFF.26.1"; exon_number "1";
1 Cufflinks exon 10 15 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";
1 Cufflinks exon 30 40 1000 - . gene_id "CUFF.204"; transcript_id "CUFF.204.1"; exon_number "1";
Then I use (Operate on Genomic Intervals)->Subtract to subtract File B
from File A, with "Return Non-overlapping pieces of intervals". I thought
it would return a file containing the introns, but the result is an empty
file.
2. I convert the [gtf] file to a [Bed] file and use
(Extract Features)->Gene BED To Exon/Intron/Codon BED, and it returns the
same result: an empty file.
I think there must be something wrong with my approach, so I really need
your help.
Thank you very much.
Sincerely yours,
John
___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org. Please keep all replies on the list by
using "reply all" in your mail client. For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists,
please use the interface at:
http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at:
http://galaxyproject.org/search/mailinglists/
--
Jennifer Hillman-Jackson
http://galaxyproject.org