Extract Genomic DNA insisting on build for GFF3 file


Peter Cock

12 Aug 2011

1:23 p.m.

Hi all,

Is this a bug, or have I misunderstood something?

1. Go to http://usegalaxy.org (or a local Galaxy running galaxy-dist)

2. Import this genomic FASTA file, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.ffn

3. Import this GFF3 file, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff

4. Go to "Fetch Sequences", "Extract Genomic DNA"

5. Choose fetch sequence for intervals in NC_005213.gff, interpret yes, source history, NC_005213.ffn, output FASTA

6. Click execute.

Expected result: Tool runs

Actual result: Red error against the gff file, Unspecified genome build, click the pencil icon in the history item to set the genome build

The fact I'm using a FASTA file from my history should mean the genome build is irrelevant as that only applies to "locally cached genomes" (right?).

Thanks,

Peter


Jeremy Goecks

12 Aug

1:25 p.m.


...

Actual result: Red error against the gff file, Unspecified genome build, click the pencil icon in the history item to set the genome build

The fact I'm using a FASTA file from my history should mean the genome build is irrelevant as that only applies to "locally cached genomes" (right?).

Correct. Fixed in galaxy-central changeset 07de40a5a0b9

Thanks, J.

Peter Cock

1:45 p.m.


On Fri, Aug 12, 2011 at 2:25 PM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

...

...
Actual result: Red error against the gff file, Unspecified genome build, click the pencil icon in the history item to set the genome build

The fact I'm using a FASTA file from my history should mean the genome build is irrelevant as that only applies to "locally cached genomes" (right?).

Correct. Fixed in galaxy-central changeset 07de40a5a0b9

Thanks, J.

Great. Could you try that example on the latest galaxy-central please? I have revision 06f0bca6de24 and get the following error using the steps given earlier:

Traceback (most recent call last):
  File "/home/pjcock/repositories/galaxy-central/tools/extract/extract_genomic_dna.py", line 261, in <module>
    if __name__ == "__main__": __main__()
  File "/home/pjcock/repositories/galaxy-central/tools/extract/extract_genomic_dna.py", line 128, in __main__
    chrom = feature.chrom
AttributeError: 'Header' object has no attribute 'chrom'

Thanks,

Peter

P.S. Perhaps you could apply this fix before tackling the above issue?

http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-August/006366.html

That would save me having to merge it after the trunk changes.

Peter Cock

2:21 p.m.


I think I've found another problem in exploring possible workarounds,

1. Go to http://usegalaxy.org (or a local Galaxy running galaxy-dist or galaxy-central)

2. Import this GFF3 file, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff

3. Click on the pencil for this dataset to "Edit Attributes"

4. Under "Convert to new format", "Convert GFF to Interval Index", convert

Expected result: New interval file

Actual result:

Traceback (most recent call last):
  File "/home/pjcock/repositories/galaxy-central/lib/galaxy/datatypes/converters/gff_to_interval_index_converter.py", line 39, in <module>
    main()
  File "/home/pjcock/repositories/galaxy-central/lib/galaxy/datatypes/converters/gff_to_interval_index_converter.py", line 26, in main
    for feature in list( reader_wrapper ):
  File "/home/pjcock/repositories/galaxy-central/lib/galaxy/datatypes/util/gff_util.py", line 213, in next
    group = interval.attributes.get( 'group', None )
AttributeError: 'Comment' object has no attribute 'attributes'

On the bright side, convert to BED seems to work.

Let me know if you want me to file bugs for these issues since that seems to be the new policy - email first, then file bug. Personally I fear that risks issues being forgotten about and never being filed, but we'll see.

Peter
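Both AttributeErrors reported in this thread ('Header' and 'Comment' objects lacking feature fields) suggest that non-feature lines are reaching code that expects a feature. The Python sketch below shows one general way to guard against that, by skipping comment, directive and malformed lines before any feature-only field is read. It is not the actual Galaxy fix; the function name iter_gff_features and the example record (coordinates, strand, attribute) are hypothetical.

```python
import io


def iter_gff_features(handle):
    """Yield (seqid, start, end, strand, attributes) for feature lines only."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines, '##' directives and '#' comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A GFF feature line has exactly nine tab-separated columns.
        if len(fields) != 9:
            continue
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = fields
        yield seqid, int(start), int(end), strand, attrs


# Hypothetical three-line file: one directive, one comment, one feature.
example = io.StringIO(
    "##gff-version 3\n"
    "# a free-text comment\n"
    "NC_005213.1\texample\tgene\t100\t200\t.\t+\t.\tlocus_tag=demo\n"
)
features = list(iter_gff_features(example))
```

Filtering inside the iterator means callers never see a header or comment object, so they do not each need their own type check.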

Peter Cock

2:36 p.m.


On Fri, Aug 12, 2011 at 3:21 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On the bright side, convert to BED seems to work.

Well, sort of. After converting that GFF3 file into BED, the strand column isn't set in the metadata. That seems important!

Also a point of clarification, I had the wrong URL for the FASTA file. This is the whole genome, although to actually proceed with this example, the sequence must be renamed to just NC_005213.1 rather than the NCBI's gi|38349555|ref|NC_005213.1| in order to match the naming in the GFF file.

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.fna

The file I linked to in the earlier email is what I am trying to reproduce,

ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.ffn

Peter
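The renaming step mentioned above (gi|38349555|ref|NC_005213.1| to NC_005213.1) could be scripted rather than done by hand. This is only a sketch: short_ncbi_id and rename_fasta_lines are hypothetical helpers, they assume the gi|number|db|accession| layout shown in this thread, and they drop any description text after the identifier.

```python
def short_ncbi_id(header):
    """Return the accession.version from an NCBI-style FASTA header line.

    Headers that do not follow the gi|...|db|accession| layout are
    returned as their first whitespace-delimited word.
    """
    words = header.lstrip(">").split(None, 1)
    name = words[0] if words else ""
    parts = name.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        return parts[3]
    return name


def rename_fasta_lines(lines):
    """Rewrite header lines to the short identifier; pass sequence lines through."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + short_ncbi_id(line) + "\n"
        else:
            yield line
```

For example, short_ncbi_id(">gi|38349555|ref|NC_005213.1| some description") gives NC_005213.1, which matches the sequence name used in the GFF file.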

Jeremy Goecks

14 Aug

11:52 p.m.


...

Well, sort of. After converting that GFF3 file into BED, the strand column isn't set in the metadata. That seems important!

We'll look into this.

...

Also a point of clarification, I had the wrong URL for the FASTA file.

This is the whole genome, although to actually proceed with this example, the sequence must be renamed to just NC_005213.1 rather than the NCBI's gi|38349555|ref|NC_005213.1| in order to match the naming in the GFF file. ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.fna

Correct.

J.

Peter Cock

15 Aug

10:43 a.m.


Hi Jeremy,

Things do indeed look much better after your commit last night, thanks: https://bitbucket.org/galaxy/galaxy-central/changeset/3c7416baa157

On Mon, Aug 15, 2011 at 12:52 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

...

...
Well, sort of. After converting that GFF3 file into BED, the strand column isn't set in the metadata. That seems important!

We'll look into this.

Thanks. If I manually set the BED strand to column 5, then the extract tool can be used with both the original NCBI GFF3 file and the BED conversion. I have filtered these on gene features, and noticed a discrepancy.

GFF uses one-based numbering, e.g. the gene NEQ003 is 883 to 2691.

For BED the start coordinate is zero-indexed and the end coordinate is one-indexed (just like Python slicing), thus the gene NEQ003 is 882 to 2691 (and Galaxy converts this correctly).

Using the extract tool with the gene features correctly gets the nucleotide sequence of NEQ003 running from ATG...TAA regardless of whether I use the genes in GFF3 format or in BED format (good).

However, the FASTA output uses different names because it embeds the start/end co-ordinates as is. Thus using GFF3 features, the sequence name includes _883_2691_ while using BED features the same sequence has instead _882_2691_ for its name.

I propose this be harmonised by always using one-based counting in the FASTA names (as done in GFF files but also GenBank, EMBL, etc) rather than the convention used in BED files (and Python) which is confusing to most non-programmers.

Peter
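The two conventions differ only in the start coordinate, which a few lines of Python make concrete. This is an illustrative sketch, not code from Galaxy; the function names are made up, and the toy sequence is not real data.

```python
def gff_to_bed(start, end):
    """Convert 1-based inclusive (GFF) coordinates to 0-based half-open (BED)."""
    return start - 1, end


def bed_to_gff(start, end):
    """Convert 0-based half-open (BED) coordinates to 1-based inclusive (GFF)."""
    return start + 1, end


def extract(sequence, bed_start, bed_end):
    """BED-style coordinates map directly onto a Python slice."""
    return sequence[bed_start:bed_end]


# NEQ003 from this thread: GFF 883..2691 becomes BED 882..2691.
neq003_bed = gff_to_bed(883, 2691)

# Toy sequence (not real data): GFF 3..5 and BED 2..5 select the same bases.
toy = "AACCGGTT"
same_bases = extract(toy, *gff_to_bed(3, 5))
```

Either pair of numbers describes the same 1809 bases for NEQ003 (2691 - 882), so the extracted sequence is identical and only the name embedded in the FASTA header differs.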

Peter Cock

11:30 a.m.


Jeremy,

Sorry - more questions - could you explain what the "Interpret features when possible" setting in the "Extract Genomic DNA" tool is meant to do? The tool's help text doesn't say anything (other than it is only for GTF/GFF files).

In the NC_005213.1 example turning "Interpret features when possible" on seems to massively collapse down the number of features. I'm not sure what is happening but suspect this is in part down to the NCBI GFF3 file being broken with regards to the lack of any ID tags? See also: http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken....

And on another issue, looking at the code for extract_genomic_dna.py there appears to be no attempt to support circular genomes with features wrapping the origin. The gene NEQ001 would be a great example of this, except the NCBI's GFF3 file doesn't do this properly (as noted in my blog post). I'm not sure if the BED format attempts to cover features wrapping the origin of a circular reference genome: http://genome.ucsc.edu/FAQ/FAQformat#format1

Regards,

Peter

Jeremy Goecks

16 Aug

12:06 a.m.


...

Sorry - more questions - could you explain what the "Interpret features when possible" setting in the "Extract Genomic DNA" tool is meant to do? The tool's help text doesn't say anything (other than it is only for GTF/GFF files).

In the NC_005213.1 example turning "Interpret features when possible" on seems to massively collapse down the number of features. I'm not sure what is happening but suspect this is in part down to the NCBI GFF3 file being broken with regards to the lack of any ID tags? See also: http://blastedbio.blogspot.com/2011/08/why-are-ncbi-gff3-files-still-broken....

Peter,

Interpret features means that the GFF parser aggregates features for a single parent feature (i.e. does multi-line parsing), extracts the sequences for each child feature, and creates the sequence for the parent by concatenating the child features' sequences. Yes, this needs to be better documented.
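As a rough illustration of that aggregation (a sketch only, not the Galaxy parser), the Python below groups child features by a parent identifier, extracts each child's sequence, and joins them in coordinate order. The function name, the tuple layout and the toy genome are all assumptions, and it handles the forward strand only; reverse-strand features would also need reverse-complementing.

```python
def concat_by_parent(features, genome):
    """Build one sequence per parent from its child features.

    features: iterable of (parent_id, start, end) with GFF-style
    1-based inclusive coordinates. genome: the reference as a string.
    """
    grouped = {}
    for parent_id, start, end in features:
        grouped.setdefault(parent_id, []).append((start, end))
    result = {}
    for parent_id, spans in grouped.items():
        spans.sort()  # children in ascending coordinate order
        result[parent_id] = "".join(genome[s - 1:e] for s, e in spans)
    return result


# Toy data: two children for tx1, one for tx2.
toy_genome = "AAACCCGGGTTT"
toy_features = [("tx1", 1, 3), ("tx1", 7, 9), ("tx2", 4, 6)]
sequences = concat_by_parent(toy_features, toy_genome)
```

Three input lines yield two output sequences here, which is the same kind of collapse in feature count described in the question above, although a file without usable ID or parent tags may group quite differently.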

...

And on another issue, looking at the code for extract_genomic_dna.py there appears to be no attempt to support circular genomes with features wrapping the origin.

Correct, this is not implemented. J.
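For reference, supporting a feature that wraps the origin need not be complicated. The Python sketch below assumes the convention, as I understand it from the GFF3 specification, that such a feature keeps start <= end by adding the length of the circular sequence to the end coordinate. The function name is hypothetical and this is not code from extract_genomic_dna.py.

```python
def extract_circular(genome, start, end):
    """Extract a 1-based inclusive region from a circular sequence.

    An end beyond len(genome) is taken to mean the feature wraps the origin.
    """
    length = len(genome)
    if not 1 <= start <= length:
        raise ValueError("start must lie within the sequence")
    if end < start or end - start + 1 > length:
        raise ValueError("region must be between 1 and len(genome) bases long")
    if end <= length:
        return genome[start - 1:end]
    # Wrapping case: tail of the sequence followed by its head.
    return genome[start - 1:] + genome[:end - length]


# Toy circular sequence of length 10 (not real data).
toy = "ABCDEFGHIJ"
wrapped = extract_circular(toy, 9, 12)   # positions 9, 10, 1, 2
linear = extract_circular(toy, 2, 4)     # ordinary, non-wrapping region
```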

Peter Cock

15 Aug

2:01 p.m.


On Mon, Aug 15, 2011 at 11:43 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

I propose this be harmonised by always using one-based counting in the FASTA names (as done in GFF files but also GenBank, EMBL, etc) rather than the convention used in BED files (and Python) which is confusing to most non-programmers.

i.e. I suggest this change (with new tests to enforce it),

https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1

This is currently the one and only commit on this new branch,

https://bitbucket.org/peterjc/galaxy-central/src/extract_region2

Peter

Peter Cock

4:19 p.m.


On Mon, Aug 15, 2011 at 3:01 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

i.e. I suggest this change (with new tests to enforce it),

https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1

This is currently the one and only commit on this new branch,

https://bitbucket.org/peterjc/galaxy-central/src/extract_region2

Second commit to use the newly added BED file in the converter's tests as well:

https://bitbucket.org/peterjc/galaxy-central/changeset/e227e463bea0

Peter

Jeremy Goecks

16 Aug

12:03 a.m.


...

...
However, the FASTA output uses different names because it embeds the start/end co-ordinates as is. Thus using GFF3 features, the sequence name includes _883_2691_ while using BED features the same sequence has instead _882_2691_ for its name.

I propose this be harmonised by always using one-based counting in the FASTA names (as done in GFF files but also GenBank, EMBL, etc) rather than the convention used in BED files (and Python) which is confusing to most non-programmers.

i.e. I suggest this change (with new tests to enforce it),

https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1

Peter,

I have concerns about this change.

IMO, the goal of embedding the start/end coords in the fasta name is to (a) embed important information from the input file into the fasta name and (b) make it simple for users to connect a fasta sequence to an entry in the interval file. These goals are achieved with the current code _relative to the input file_.

This connection between the input and output files is key. However, in the case of a user using a mix of BED and GFF files for a single genome, your concern becomes an issue. In practice, I don't think we've seen users encounter this issue yet, which leads me to think that the current code is fine.

One idea to address both of these issues is to embed the original format in the fasta name so that it's clear whether the coords are BED or GFF (e.g. >hg17_BED_chr1_147962192_147962580).

Best, J.
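A name of that shape is easy to build. The Python sketch below is only an illustration of the idea of tagging the coordinate convention in the name; fasta_name and its parameters are hypothetical, and the optional trailing strand is an assumption about what follows the coordinates in the current names.

```python
def fasta_name(build, fmt, chrom, start, end, strand=None):
    """Build a FASTA header that records which format the coords came from.

    start and end are written exactly as they appear in the input file,
    so the header can be matched back to the original BED or GFF line.
    """
    parts = [build, fmt, chrom, str(start), str(end)]
    if strand is not None:
        parts.append(strand)
    return ">" + "_".join(parts)


# The example name from this message.
bed_name = fasta_name("hg17", "BED", "chr1", 147962192, 147962580)
```

With the format tag present, 882 to 2691 from a BED file and 883 to 2691 from a GFF file remain distinguishable as the same region described in two conventions.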

Peter Cock

9:15 a.m.


On Tue, Aug 16, 2011 at 1:03 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

...

...
...
However, the FASTA output uses different names because it embeds the start/end co-ordinates as is. Thus using GFF3 features, the sequence name includes _883_2691_ while using BED features the same sequence has instead _882_2691_ for its name.

I propose this be harmonised by always using one-based counting in the FASTA names (as done in GFF files but also GenBank, EMBL, etc) rather than the convention used in BED files (and Python) which is confusing to most non-programmers.

i.e. I suggest this change (with new tests to enforce it),

https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1

Peter,

I have concerns about this change.

IMO, the goal of embedding the start/end coords in the fasta name is to (a) embed important information from the input file into the fasta name and (b) make it simple for users to connect a fasta sequence to an entry in the interval file. These goals are achieved with the current code _relative to the input file_.

It's awkward that two mainstream tabular annotation formats (BED and the GFF family) use different co-ordinates.

...

This connection between the input and output files is key. However, in the case of a user using a mix of BED and GFF files for a single genome, your concern becomes an issue. In practice, I don't think we've seen users encounter this issue yet, which leads me to think that the current code is fine.

One idea to address both of these issues is to embed the original format in the fasta name so that it's clear whether the coords are BED or GFF (e.g. > hg17_BED_chr1_147962192_147962580).

Or hg17_gtf_chr1_147962192_147962580 etc.

That certainly seems better than the current situation.

However, my preferred solution is to take the FASTA ID from the annotation file. In GFF3 this would be the ID tag in column nine (if present), perhaps with an option to use another custom tag like locus_tag or transcript_id if preferred.

For BED I had initially thought this would be the optional column 4, name. This made me wonder what Galaxy is doing in converting GFF3 to BED, since column 4 was populated with generic feature types (gene, CDS, etc from GFF3 column 3). Shouldn't this be using the feature's ID tag (if present)? I see code which looks for the tag transcript_id which looks like how I'd handle the GFF3 ID (for batching multi-location features together).

Peter

Jeremy Goecks

17 Aug

12:41 a.m.


...

...
One idea to address both of these issues is to embed the original format in the fasta name so that it's clear whether the coords are BED or GFF (e.g. > hg17_BED_chr1_147962192_147962580).

Or hg17_gtf_chr1_147962192_147962580 etc.

That certainly seems better than the current situation.

However, my preferred solution is to take the FASTA ID from the annotation file. In GFF3 this would be the ID tag in column nine (if present), perhaps with an option to use another custom tag like locus_tag or transcript_id if preferred.

Hi Peter,

This seems reasonable. Of course, the implementation needs to be done with care to (a) ensure the default choice is somewhat similar to what is done now and (b) support all flavors of GFF. If you choose to implement this, you'll also need to update all the existing test output files.

...

For BED I had initially thought this would be the optional column 4, name. This made me wonder what Galaxy is doing in converting GFF3 to BED, since column 4 was populated with generic feature types (gene, CDS, etc from GFF3 column 3). Shouldn't this be using the feature's ID tag (if present)?

Yes, I'd say that's correct. The GFF-to-BED converter was written before we had GFF parsing support, and at the time it wasn't possible to extract the name from the attributes.

Finally, note that all changes made to any GFF code must work for GFF, GFF3, and GTF formats.

Thanks, J.

Peter Cock

9:02 a.m.


On Wed, Aug 17, 2011 at 1:41 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

...

...
...
One idea to address both of these issues is to embed the original format in the fasta name so that it's clear whether the coords are BED or GFF (e.g. > hg17_BED_chr1_147962192_147962580).

Or hg17_gtf_chr1_147962192_147962580 etc.

That certainly seems better than the current situation.

However, my preferred solution is to take the FASTA ID from the annotation file. In GFF3 this would be the ID tag in column nine (if present), perhaps with an option to use another custom tag like locus_tag or transcript_id if preferred.

Hi Peter,

This seems reasonable. Of course, the implementation needs to be done with care to (a) ensure the default choice is somewhat similar to what is done now and (b) support all flavors of GFF. If you choose to implement this, you'll also need to update all the existing test output files.

While in BED the name is usually there in column 4 (although this and the later columns are optional), in GFF3 the ID tag is optional, while in GTF v2.2 there can be a gene_id or transcript_id value, etc.

I'm picturing a select parameter for FASTA output,

Name features using:
* build, reference, co-ordinates and strand (default)
* name from annotation file (if present)
* reference name (useful if working on gene/proteins)

If name is selected, then a conditional text parameter for GFF type files would be shown to ask which tag(s) to use as the name - a comma-separated list might work well:

http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-August/006432.html

This could default to ID for GFF3, and transcript_id,gene_id for GTF, and whatever else is sensible for GFF2. Or a single default suitable for all: ID,transcript_id,gene_id

Maybe we don't need the tag setting to be optional, just hard code it to something like ID,transcript_id,gene_id?
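To make the tag-list idea concrete, here is a small Python sketch of picking a feature name from column nine by trying ID, transcript_id and gene_id in order. It is not Galaxy code: parse_attributes and pick_name are hypothetical, and the parser is deliberately simplified (it assumes no ';' inside quoted GTF values and no '=' inside GTF keys).

```python
def parse_attributes(column9):
    """Parse GFF3 (key=value;...) or GTF (key "value"; ...) attributes."""
    attrs = {}
    for item in column9.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep or " " in key:
            # No '=' before the first space: treat as GTF 'key "value"'.
            key, _, value = item.partition(" ")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs


def pick_name(column9, tags=("ID", "transcript_id", "gene_id")):
    """Return the first non-empty tag value, or None if no tag is present."""
    attrs = parse_attributes(column9)
    for tag in tags:
        if attrs.get(tag):
            return attrs[tag]
    return None
```

Returning None when no tag matches leaves the caller free to fall back to the default build, reference, co-ordinates and strand name, which matters for files such as the NCBI GFF3 discussed here that lack ID tags.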

...

...
For BED I had initially thought this would be the optional column 4, name. This made me wonder what Galaxy is doing in converting GFF3 to BED, since column 4 was populated with generic feature types (gene, CDS, etc from GFF3 column 3). Shouldn't this be using the feature's ID tag (if present)?

Yes, I'd say that's correct. The GFF-to-BED converter was written before we had GFF parsing support, and at the time it wasn't possible to extract the name from the attributes.

Is it acceptable for the file format conversion tools in Galaxy to have parameters? In this case, a list of tags to use as the feature name, e.g. ID, transcript_id, gene_id

...

Finally, note that all changes made to any GFF code must work for GFF, GFF3, and GTF formats.

That makes life interesting... what are the major sources of legacy GFF files within Galaxy (anything not GFF3)?

Peter

Jeremy Goecks

18 Aug

2:04 a.m.


...

I'm picturing a select parameter for FASTA output,

Name features using:
* build, reference, co-ordinates and strand (default)
* name from annotation file (if present)
* reference name (useful if working on gene/proteins)

Agreed.

...

If name is selected, then a conditional text parameter for GFF type files would be shown to ask which tag(s) to use as the name - a comma-separated list might work well: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-August/006432.html

Yes, this is a limitation of the current Galaxy framework but should be able to be implemented without too much trouble.

...

This could default to ID for GFF3, and transcript_id,gene_id for GTF, and whatever else is sensible for GFF2. Or a single default suitable for all: ID,transcript_id,gene_id

Maybe we don't need the tag setting to be optional, just hard code it to something like ID,transcript_id,gene_id?

As a first step, hardcoding is fine.

...

Is it acceptable for the file format conversion tools in Galaxy to have parameters? In this case, a list of tags to use as the feature name, e.g. ID, transcript_id, gene_id

Not that I know of because Galaxy assumes conversions can be done automatically as needed.

...

...
Finally, note that all changes made to any GFF code must work for GFF, GFF3, and GTF formats.

That makes life interesting... what are the major sources of legacy GFF files within Galaxy (anything not GFF3)?

Perhaps I spoke a bit too strongly here. I think that GTFs are the primary flavor of GFF files used in Galaxy, and these are acquired from UCSC and Ensembl. GFF3 files also seem to be used quite frequently, especially for folks working with bacteria and other simple organisms. GFF 2.2 and earlier aren't seen much as best I know.

So let me rephrase and say that any changes need to be compatible with GTF and GFF3.

Best, J.

Peter Cock

19 Aug

10:54 a.m.


On Thursday, August 18, 2011, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

...

...
I'm picturing a select parameter for FASTA output,

Name features using:
* build, reference, co-ordinates and strand (default)
* name from annotation file (if present)
* reference name (useful if working on gene/proteins)

Agreed.

Great - I may not be able to work on this immediately though...

...

...
If name is selected, then a conditional text parameter for GFF type files would be shown to ask which tag(s) to use as the name - a comma-separated list might work well: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-August/006432.html

Yes, this is a limitation of the current Galaxy framework but should be able to be implemented without too much trouble.

...
This could default to ID for GFF3, and transcript_id,gene_id for GTF, and whatever else is sensible for GFF2. Or a single default suitable for all: ID,transcript_id,gene_id

Maybe we don't need the tag setting to be optional, just hard code it to something like ID,transcript_id,gene_id?

As a first step, hardcoding is fine.

Agreed.

...

...
Is it acceptable for the file format conversion tools in Galaxy to have parameters? In this case, a list of tags to use as the feature name, e.g. ID, transcript_id, gene_id

Not that I know of because Galaxy assumes conversions can be done automatically as needed.

OK.

...

...
...
Finally, note that all changes made to any GFF code must work for GFF, GFF3, and GTF formats.

That makes life interesting... what are the major sources of legacy GFF files within Galaxy (anything not GFF3)?

Perhaps I spoke a bit too strongly here. I think that GTFs are the primary flavor of GFF files used in Galaxy, and these are acquired from UCSC and Ensembl. GFF3 files also seem to be used quite frequently, especially for folks working with bacteria and other simple organisms. GFF 2.2 and earlier aren't seen much as best I know.

Interesting - thank you!

...

So let me rephrase and say that any changes need to be compatible with GTF and GFF3.

OK - I'll focus on them. Meanwhile back to the song-devel mailing list for more GFF3 spec discussions on things like best practice for protein annotation using GFF3...

Thanks,

Peter

Jeremy Goecks

14 Aug

11:50 p.m.


...

I think I've found another problem in exploring possible workarounds,

1. Go to http://usegalaxy.org (or a local Galaxy running galaxy-dist or galaxy-central)

2. Import this GFF3 file, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gff

3. Click on the pencil for this dataset to "Edit Attributes"

4. Under "Convert to new format", "Convert GFF to Interval Index", convert

Expected result: New interval file

Actual result:

Traceback (most recent call last):
  File "/home/pjcock/repositories/galaxy-central/lib/galaxy/datatypes/converters/gff_to_interval_index_converter.py", line 39, in <module>
    main()
  File "/home/pjcock/repositories/galaxy-central/lib/galaxy/datatypes/converters/gff_to_interval_index_converter.py", line 26, in main
    for feature in list( reader_wrapper ):
  File "/home/pjcock/repositories/galaxy-central/lib/galaxy/datatypes/util/gff_util.py", line 213, in next
    group = interval.attributes.get( 'group', None )
AttributeError: 'Comment' object has no attribute 'attributes'

Fixed in galaxy-central changeset 3c7416baa157

Best, J.

Jeremy Goecks

11:50 p.m.


...

Great. Could you try that example on the latest galaxy-central please? I have revision 06f0bca6de24 and get the following error using the steps given earlier:

Traceback (most recent call last):
  File "/home/pjcock/repositories/galaxy-central/tools/extract/extract_genomic_dna.py", line 261, in <module>
    if __name__ == "__main__": __main__()
  File "/home/pjcock/repositories/galaxy-central/tools/extract/extract_genomic_dna.py", line 128, in __main__
    chrom = feature.chrom
AttributeError: 'Header' object has no attribute 'chrom'

Fixed in galaxy-central changeset 3c7416baa157

Best, J.
