Dear Team,

I've been having a problem using Cufflinks with GFF files. I searched the mailing list first but could not find an answer. Could you help me look into this?

I downloaded my genome annotation GFF file from NCBI for my bacterial RNA-seq data analysis (I soon realized the NCBI format may be a problem). My GFF file looks like the following:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_011420.2 1 4355543
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=414684
NC_011420.2	RefSeq	region	1	4355543	.	+	.	ID=id0;Dbxref=taxon:414684;Is_circular=true;culture-collection=ATCC:51521;gb-synonym=Rhodocista centenaria SW;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=SW%3B ATCC 51521
NC_011420.2	RefSeq	gene	11	3343	.	+	.	ID=gene0;Name=RC1_0011;Dbxref=GeneID:7008893;gbkey=Gene;locus_tag=RC1_0011
NC_011420.2	RefSeq	CDS	11	3343	.	+	0	ID=cds0;Name=YP_002296275.1;Parent=gene0;Note=Contains a type I secretion target ggxgxdxxx repeat %282 copies%29 domain%3B Contains a Cadherin domain%3B identified by match to protein family HMM PF02789;Dbxref=Genbank:YP_002296275.1,GeneID:7008893;gbkey=CDS;product=hypothetical protein;protein_id=YP_002296275.1;transl_table=11

I used this file with Cufflinks, but all the FPKM values were 0. I checked this page: http://cufflinks.cbcb.umd.edu/gff.html and thought the problem might be that my GFF file has no mRNA features. Since I am working with a bacterial genome, I assumed no exon/intron or UTR information is needed. I therefore modified my GFF file, renaming the gene feature to mRNA and pointing the CDS Parent attribute at it, as follows:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_011420.2 1 4355543
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=414684
NC_011420.2	RefSeq	region	1	4355543	.	+	.	ID=id0;Dbxref=taxon:414684;Is_circular=true;culture-collection=ATCC:51521;gb-synonym=Rhodocista centenaria SW;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=SW%3B ATCC 51521
NC_011420.2	RefSeq	mRNA	11	3343	.	+	.	ID=mRNA0;Name=RC1_0011;Dbxref=GeneID:7008893;gbkey=Gene;locus_tag=RC1_0011
NC_011420.2	RefSeq	CDS	11	3343	.	+	0	ID=cds0;Name=YP_002296275.1;Parent=mRNA0;Note=Contains a type I secretion target ggxgxdxxx repeat %282 copies%29 domain%3B Contains a Cadherin domain%3B identified by match to protein family HMM PF02789;Dbxref=Genbank:YP_002296275.1,GeneID:7008893;gbkey=CDS;product=hypothetical protein;protein_id=YP_002296275.1;transl_table=11
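
In case it helps to see exactly what I changed, the edit can be sketched in Python roughly as below. This is only a minimal sketch, assuming every gene ID follows the geneN pattern seen in my file; the function name gene_to_mrna is made up for illustration, and I am not claiming this is the right way to prepare a GFF file for Cufflinks.

```python
import re


def gene_to_mrna(gff_text):
    """Rename 'gene' features to 'mRNA' and repoint Parent= attributes.

    Hypothetical helper: assumes IDs look like 'gene0', 'gene1', ...
    as in the NCBI excerpt; other ID schemes are left untouched.
    """
    out = []
    for line in gff_text.splitlines():
        # Pass comments, directives and blank lines through unchanged.
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        cols = line.split("\t")
        # A GFF3 feature line has exactly 9 tab-separated columns.
        if len(cols) != 9:
            out.append(line)
            continue
        if cols[2] == "gene":
            cols[2] = "mRNA"
            cols[8] = re.sub(r"(^|;)ID=gene(\d+)", r"\1ID=mRNA\2", cols[8])
        else:
            # Children (e.g. CDS) should point at the renamed feature.
            cols[8] = re.sub(r"(^|;)Parent=gene(\d+)", r"\1Parent=mRNA\2", cols[8])
        out.append("\t".join(cols))
    return "\n".join(out)
```

Only the feature type in column 3 and the ID/Parent attributes in column 9 are touched; everything else, including the GeneID entries in Dbxref, is copied as-is.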

I re-ran Cufflinks, but this time it reported an error. All I can tell from the report is that there was a segmentation fault, with no further details. The report is as follows:

Error running cufflinks.
return code = 139
Command line:
cufflinks -q --no-update-check -I 100 -F 0.100000 -j 0.150000 -p 4 -G /galaxy/test_pool/pool5/files/000/327/dataset_327777.dat /galaxy/test_database/files/000/325/dataset_325086.dat 
[19:41:41] Loading reference annotation.
Segmentation fault

cp: cannot stat `/galaxy/test_pool/pool3/tmp/job_working_directory/000/170/170197/global_model.txt': No such file or directory
cp: cannot stat `/galaxy/test_pool/pool3/tmp/job_working_directory/000/170/170197/isoforms.fpkm_tracking': No such file or directory
cp: cannot stat `/galaxy/test_pool/pool3/tmp/job_working_directory/000/170/170197/genes.fpkm_tracking': No such file or directory

My questions are:

1. Is there any way to modify an NCBI bacterial genome annotation GFF file to make it usable with Cufflinks? Our genome annotation is only available from NCBI, not Ensembl or UCSC, so this is pretty much my only choice.

2. Should I proceed with modifying the GFF file, or should I convert it to GTF and use the GTF with Cufflinks instead?

I am a biochemist and really new to computing, so any advice will help!

Thanks a lot,

Qian
--
Qian Dong
Bauer Lab, MCBD
Simon Hall: 313-317
212 S. Hawthorne Dr.
Bloomington, IN 47405
Email:dong3@indiana.edu
Lab Phone:812-855-8443