Re: [galaxy-dev] Extract Genomic DNA insisting on build for GFF3 file

16 Aug 2011

On Tue, Aug 16, 2011 at 1:03 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
...
...
...
However, the FASTA output uses different names because it embeds
the start/end co-ordinates as is. Thus using GFF3 features, the
sequence name includes _883_2691_ while using BED features the
same sequence has instead _882_2691_ for its name.
I propose this be harmonised by always using one-based counting
in the FASTA names (as done in GFF files but also GenBank, EMBL,
etc) rather than the convention used in BED files (and Python) which
is confusing to most non-programmers.
i.e. I suggest this change (with new tests to enforce it),
https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1
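
To make the off-by-one concrete: BED intervals are zero-based with an
exclusive end, while GFF3 intervals are one-based with an inclusive end,
so only the start value differs between the two. The following is a
minimal Python sketch of that conversion. The function names are
hypothetical, and this is not the code from the changeset above.

```python
def bed_to_one_based(start, end):
    """Convert a BED interval (zero-based, end exclusive) to
    one-based, end-inclusive co-ordinates as used in GFF3."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a one-based, end-inclusive interval (GFF3 style)
    to BED co-ordinates (zero-based, end exclusive)."""
    return start - 1, end


# The same feature described in BED and in GFF3 terms:
bed_interval = (882, 2691)
gff_interval = bed_to_one_based(*bed_interval)
print(gff_interval)
```

With one-based naming, both inputs would give _883_2691_ in the FASTA
name.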
Peter,
I have concerns about this change.
IMO, the goal of embedding the start/end coords in the fasta
name is to (a) embed important information from the input file
into the fasta name and (b) make it simple for users to connect
a fasta sequence to an entry in the interval file. These goals
are achieved with the current code _relative to the input file_.
It's awkward that two mainstream tabular annotation formats
(BED and the GFF family) use different co-ordinates.
...
This connection between the input and output files is key.
However, in the case of a user using a mix of BED and GFF
files for a single genome, your concern becomes an issue.
In practice, I don't think we've seen users encounter this
issue yet, which leads me to think that the current code is
fine.
One idea to address both of these issues is to embed the
original format in the fasta name so that it's clear whether
the coords are BED or GFF (e.g.
>hg17_BED_chr1_147962192_147962580).
Or hg17_gtf_chr1_147962192_147962580 etc.

That certainly seems better than the current situation.
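
As a rough illustration of the format-tagged names suggested above, here
is a minimal Python sketch. The function name and the exact field order
are assumptions based on the examples in this thread, not the tool's
actual naming code.

```python
def fasta_name_with_format(build, fmt, chrom, start, end):
    """Build a FASTA name that records which interval format the
    co-ordinates came from, copying start and end unchanged."""
    return "_".join([build, fmt, chrom, str(start), str(end)])


print(">" + fasta_name_with_format("hg17", "BED", "chr1", 147962192, 147962580))
print(">" + fasta_name_with_format("hg17", "gtf", "chr1", 147962192, 147962580))
```

The co-ordinates stay exactly as they appear in the input file, and the
format tag tells the reader which counting convention applies.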

However, my preferred solution is to take the FASTA ID from
the annotation file. In GFF3 this would be the ID tag in column
nine (if present), perhaps with an option to use another
custom tag like locus_tag or transcript_id if preferred.
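
A minimal Python sketch of this idea, assuming a well-formed
nine-column GFF3 line whose last column holds semicolon-separated
tag=value pairs with URL-style escaping. The function names and the
example line are hypothetical; this is not existing Galaxy code.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split GFF3 column nine into a dict of tag -> value."""
    attributes = {}
    for field in column9.strip().split(";"):
        tag, sep, value = field.partition("=")
        if sep:  # skip empty or malformed fields with no '='
            attributes[unquote(tag.strip())] = unquote(value.strip())
    return attributes


def fasta_id_from_gff3(line, preferred_tag="ID"):
    """Return the value of preferred_tag from a GFF3 feature line,
    or None if the tag is absent."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    return parse_gff3_attributes(columns[8]).get(preferred_tag)


# Hypothetical feature line for illustration only.
example = "chr1\t.\tgene\t883\t2691\t.\t+\t.\tID=gene0001;locus_tag=ABC_0001"
print(fasta_id_from_gff3(example))
print(fasta_id_from_gff3(example, preferred_tag="locus_tag"))
```

When the tag is missing the function returns None, so a caller could
fall back to a co-ordinate-based name.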

For BED I had initially thought this would be the optional
column 4, name. This made me wonder what Galaxy
is doing in converting GFF3 to BED, since column 4 was
populated with generic feature types (gene, CDS, etc
from GFF3 column 3). Shouldn't this be using the feature's
ID tag (if present)? I see code which looks for the tag
transcript_id which looks like how I'd handle the GFF3
ID (for batching multi-location features together).

Peter

Peter Cock