Hello Galaxy:

I am trying overall to convert a .gff3 file to 12-column .bed file.

I first tried GFF-to-BED converter, but it gave a 6-column .bed file.

Then, I tried BED-to-bigBed converter by inputting the 6-column .bed file. I get an error "Unspecified genome build, click the pencil icon in the history item to set the genome build".

So, I click the pencil icon, and see 4 tabs at the top. I set the "Attributes" tab as in the attached image (Attributes.png).

But then, when I select "Convert Format", I am only seeing an option that outputs .bed12 file as "Convert Genomic Intervals to Strict BED12". I am a bit confused about this because I specified the input file as a .bed file (and not genomic intervals, unless I am misunderstanding something).

In any case, when I select "Convert Genomic Intervals to Strict BED12", I do get a .bed file with 12 columns. But I would like to ask if I may have lost information going from the .gff3 to .bed(6) to .bed(12)?

(I feel that scores were all set to "0" from .gff3 to .bed(6), and columns 10, 11, 12 (block counts, sizes, and starting positions) were all set to zero going from .bed(6) to .bed(12)).

If I am correct that there is information loss, is there a system in Galaxy to prevent this, and transfer as much information as possible from .gff3 to .bed(12)?

Thank you.
L. Rutter

** Below is a head of my three files (the species is P. dominula):

.gff3 file

##gff-version 3
##date Mon Nov 4 14:54:42 2013
##source gbrowse gbgff gff3 dumper
PdomScaf0001 maker gene 15 1963 . - . Name=PdomGene00025;ID=1;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274
PdomScaf0001 maker mRNA 15 1963 . - . Name=PdomMRNA00025.1;Parent=1;ID=2;_QI=216%7C0%7C0.2%7C0.6%7C0.5%7C0.6%7C5%7C0%7C98;_eAED=0.43;_AED=0.43;Dbxref=MAKER:maker-PdomScaf0001-snap-gene-0.274-mRNA-1
PdomScaf0001 maker exon 15 100 -0.094 - . Parent=2;ID=3
PdomScaf0001 maker CDS 15 100 . - 2 Parent=2;ID=4
PdomScaf0001 maker exon 223 300 21.8 - . Parent=2;ID=5
PdomScaf0001 maker CDS 223 300 . - 2 Parent=2;ID=6
PdomScaf0001 maker exon 717 765 22.4 - . Parent=2;ID=7

.bed(6) file

PdomScaf0001 14 1963 gene 0 -
PdomScaf0001 14 1963 mRNA 0 -
PdomScaf0001 14 100 exon 0 -
PdomScaf0001 14 100 CDS 0 -
PdomScaf0001 222 300 exon 0 -
PdomScaf0001 222 300 CDS 0 -
PdomScaf0001 716 765 exon 0 -
PdomScaf0001 716 765 CDS 0 -
PdomScaf0001 906 947 exon 0 -
PdomScaf0001 906 947 CDS 0 -

.bed(12) file

PdomScaf0001 14 1963 gene 0 - 14 1963 0 0 , ,
PdomScaf0001 14 1963 mRNA 0 - 14 1963 0 0 , ,
PdomScaf0001 14 100 exon 0 - 14 100 0 0 , ,
PdomScaf0001 14 100 CDS 0 - 14 100 0 0 , ,
PdomScaf0001 222 300 exon 0 - 222 300 0 0 , ,
PdomScaf0001 222 300 CDS 0 - 222 300 0 0 , ,
PdomScaf0001 716 765 exon 0 - 716 765 0 0 , ,
PdomScaf0001 716 765 CDS 0 - 716 765 0 0 , ,
PdomScaf0001 906 947 exon 0 - 906 947 0 0 , ,
PdomScaf0001 906 947 CDS 0 - 906 947 0 0 , ,
In reply to lrutter@iastate.edu (message of 11/9/13 3:29 PM):

Hello,

There are no tools directly on the public Galaxy site to transform a GFF3 dataset into a BED12 dataset. However, the Tool Shed has a repository called 'fml_gff3togtf' that includes a tool for this purpose, for use in a local install. The description is a bit bothersome in that it has a slightly incorrect datatype statement, so be sure to test out the results. (The word "wiggle" has no place in this statement: "gff3_to_bed_converter.py: This tool converts gene transcript annotation from GFF3 format to UCSC wiggle 12 column BED format.")
http://getgalaxy.org
http://usegalaxy.org/toolshed

I see your post at Biostar, and it might be helpful to let you know what a BED12 file represents (plus I'll post this there, may help others):
http://www.biostars.org/p/85869/

A BED12 file describes the complete, often spliced, alignment of a sequence to a reference genome. This does not include minor base variation; it is a macro alignment. You can think of each of the blocks as being "exons", although there is no magic here: if the sequence or genome had quality problems, or significant variation (large insertion or deletion), that could cause the alignment to fragment as well. Here is the data description:
http://wiki.galaxyproject.org/Learn/Datatypes#Bed

To see examples at UCSC (genome.ucsc.edu), EST or mRNA tracks will have this as the primary table format. All gene tracks can also be in BED12 format, or in a related one, genePred:
http://genome.ucsc.edu/FAQ/FAQformat.html#format9

UCSC also has command-line utilities to convert between the formats; pre-compiled versions are here:
http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads

Either way, you can convert the data, then load it into the public Galaxy (usegalaxy.org) and proceed with your analysis. BEDTools works well with BED12 files. There is definitely information loss attempting to transform BED6 -> BED12, as the global alignment (which blocks belong to which transcript) is lost: the BED6 file has one independent line per feature. And adjusting attributes such as score or name is often a matter of preference, so you can alter these however you want, as long as the attribute formatting rules for the columns are followed.

Hopefully this helps,

Jen
Galaxy team
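To make the block structure described above concrete, here is a minimal Python sketch of grouping GFF3 exon records under their parent mRNA and writing one BED12 line per transcript. It is not the fml_gff3togtf tool or any UCSC utility, only an illustration under stated assumptions: the input is tab-separated, every exon carries a Parent attribute pointing at an mRNA ID, the score is written as 0, and thickStart/thickEnd are simply set to the transcript span rather than derived from the CDS records. The function names and the shortened toy transcript (ending at 765, because the posted head of the file stops there) are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn 'Name=x;ID=2;Parent=1' into {'Name': 'x', 'ID': '2', 'Parent': '1'}."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def gff3_to_bed12(lines):
    """Return one BED12 line per mRNA, using its exon children as blocks."""
    transcripts = {}           # mRNA ID -> (chrom, start, end, strand, name), 1-based
    exons = defaultdict(list)  # mRNA ID -> [(start, end), ...], 1-based inclusive
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and ## directives
        (chrom, _source, ftype, start, end,
         _score, strand, _phase, attr_field) = line.split("\t")
        attrs = parse_attributes(attr_field)
        if ftype == "mRNA":
            name = attrs.get("Name", attrs["ID"])
            transcripts[attrs["ID"]] = (chrom, int(start), int(end), strand, name)
        elif ftype == "exon":
            for parent in attrs["Parent"].split(","):
                exons[parent].append((int(start), int(end)))

    bed_lines = []
    for tid, (chrom, start, end, strand, name) in transcripts.items():
        # Fall back to a single block if no exons were listed for this mRNA.
        blocks = sorted(exons.get(tid, [(start, end)]))
        chrom_start = start - 1  # GFF3 is 1-based inclusive, BED is 0-based half-open
        block_sizes = ",".join(str(e - s + 1) for s, e in blocks)
        block_starts = ",".join(str(s - 1 - chrom_start) for s, e in blocks)
        fields = [chrom, chrom_start, end, name, 0, strand,
                  chrom_start, end, 0, len(blocks), block_sizes, block_starts]
        bed_lines.append("\t".join(str(f) for f in fields))
    return bed_lines


# Toy input built from the exon coordinates in the post; the mRNA end is
# shortened to 765 (an assumption) so that the last block reaches chromEnd.
SAMPLE_LINES = [
    "##gff-version 3",
    "\t".join(["PdomScaf0001", "maker", "mRNA", "15", "765", ".", "-", ".",
               "Name=PdomMRNA00025.1;Parent=1;ID=2"]),
    "\t".join(["PdomScaf0001", "maker", "exon", "15", "100", "-0.094", "-", ".",
               "Parent=2;ID=3"]),
    "\t".join(["PdomScaf0001", "maker", "exon", "223", "300", "21.8", "-", ".",
               "Parent=2;ID=5"]),
    "\t".join(["PdomScaf0001", "maker", "exon", "717", "765", "22.4", "-", ".",
               "Parent=2;ID=7"]),
]

for bed_line in gff3_to_bed12(SAMPLE_LINES):
    print(bed_line)
```

For the toy transcript this gives a single line with blockCount 3, blockSizes 86,78,49 and blockStarts 0,208,702, in contrast to the empty block columns produced by the BED6 -> BED12 route. The GFF3 exon scores (-0.094, 21.8, 22.4) are not carried over here because the BED score column expects an integer from 0 to 1000; how to rescale them, if at all, is a matter of preference as noted above.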
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
-- Jennifer Hillman-Jackson http://galaxyproject.org
In reply to Jennifer Jackson <jen@bx.psu.edu> (message of Mon, Nov 11, 2013 at 12:15 PM):

Hi,

I am the author of the "fml_gff3togtf" tool package, which is currently merged into our instance at http://galaxy.cbio.mskcc.org. The tool can be accessed at the following link:
https://galaxy.cbio.mskcc.org/tool_runner?tool_id=fml_gff2bed

--/Vipin
Sloan-Kettering Institute <http://galaxy.cbio.mskcc.org>
In reply to Vipin TS <vipin.ts@gmail.com> (message of Mon, Nov 11, 2013 at 11:33 AM):

Vipin and Jennifer:

Thank you both very much for your helpful information here!

Sincerely,
Lindsay
participants (4)
- Jennifer Jackson
- Lindsay Rutter
- lrutter@iastate.edu
- Vipin TS