Re: [galaxy-dev] Extract Genomic DNA insisting on build for GFF3 file

17 Aug 2011

On Wed, Aug 17, 2011 at 1:41 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
...
...
...
One idea to address both of these issues is to embed the
original format in the fasta name so that it's clear whether
the coords are BED or GFF (e.g.
>hg17_BED_chr1_147962192_147962580).
Or hg17_gtf_chr1_147962192_147962580 etc.
That certainly seems better than the current situation.
However, my preferred solution is to take the FASTA ID from
the annotation file. In GFF3 this would be the ID tag in column
nine (if present), perhaps with an option to use another
custom tag like locus_tag or transcript_id if preferred.
Hi Peter,
This seems reasonable. Of course, the implementation needs
to be done with care to (a) ensure the default choice is
somewhat similar to what is done now and (b) support all
flavors of GFF. If you choose to implement this, you'll also
need to update all the existing test output files.
While in BED the name is usually there in column 4 (although
this and the later columns are optional), in GFF3 the ID tag
is optional, while in GTF v2.2 there can be a gene_id or
transcript_id value, etc.
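
To make the format difference concrete, here is a minimal sketch of reading the attribute column (column nine) in either style. The function name parse_attributes is hypothetical and is not Galaxy's GFF parsing code. The splitting is deliberately naive: it assumes no value contains a ";", and it does not decode GFF3 percent-escapes.

```python
def parse_attributes(column9):
    """Return a dict of tag -> value from a GFF3 or GTF attribute column.

    Hypothetical helper for illustration only.
    GFF3 fields look like:  ID=gene0001;Name=EDEN
    GTF fields look like:   gene_id "g1"; transcript_id "g1.t1";
    """
    attrs = {}
    for field in column9.split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after a trailing ';'
        if "=" in field:
            # GFF3 style: tag=value
            key, value = field.split("=", 1)
        else:
            # GTF style: tag "value"
            key, _, value = field.partition(" ")
        attrs[key.strip()] = value.strip().strip('"')
    return attrs
```

With this, a GFF3 line yields an ID entry when the tag is present, while a GTF line yields gene_id and transcript_id entries instead.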

I'm picturing a select parameter for FASTA output,

Name features using:
* build, reference, co-ordinates and strand (default)
* name from annotation file (if present)
* reference name (useful if working on gene/proteins)
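
The three options above could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name fasta_name, the mode strings, the exact layout of the default name, and the choice to fall back to the coordinate-based name when the annotation file gives no name.

```python
def fasta_name(mode, build, chrom, start, end, strand, feature_name=None):
    """Build a FASTA record name for one extracted feature.

    mode is one of (hypothetical values):
      "coords"     - build, reference, co-ordinates and strand
      "annotation" - name from the annotation file, if present
      "reference"  - the reference (sequence) name only
    """
    coords = f"{build}_{chrom}_{start}_{end}_{strand}"
    if mode == "coords":
        return coords
    if mode == "annotation":
        # Assumed behaviour: fall back to the coordinate name when
        # the annotation file gave no usable name.
        return feature_name if feature_name else coords
    if mode == "reference":
        return chrom
    raise ValueError(f"unknown naming mode: {mode!r}")
```

For example, fasta_name("annotation", "hg17", "chr1", 147962192, 147962580, "+", "gene0001") would give gene0001, and the same call without a feature name would give the coordinate-based name.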

If name is selected, then a conditional text parameter for
GFF type files would be shown to ask which tag(s) to use
as the name - a comma separated list might work well:
http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-August/006432.html

This could default to ID for GFF3, and transcript_id,gene_id
for GTF, and whatever else is sensible for GFF2. Or a single
default suitable for all: ID,transcript_id,gene_id

Maybe we don't need the tag setting to be optional, just
hard code it to something like ID,transcript_id,gene_id?
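
A minimal sketch of how such a tag list might be applied, assuming the attributes have already been parsed into a dict: try each tag in order and use the first one that has a value. The function name pick_name is hypothetical.

```python
def pick_name(attrs, tags="ID,transcript_id,gene_id"):
    """Return the value of the first tag in the comma separated
    list that is present (and non-empty) in attrs, else None.

    Hypothetical helper; attrs is a plain dict of tag -> value.
    """
    for tag in tags.split(","):
        value = attrs.get(tag.strip())
        if value:
            return value
    return None
```

With the single default list, a GFF3 feature with an ID tag is named by its ID, while a GTF feature (which has no ID tag) falls through to transcript_id and then gene_id. A custom tag such as locus_tag would be passed as tags="locus_tag".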
...
...
For BED I had initially thought this would be the optional
column 4, name. This made me wonder what Galaxy
is doing in converting GFF3 to BED, since column 4 was
populated with generic feature types (gene, CDS, etc.
from GFF3 column 3). Shouldn't this be using the feature's
ID tag (if present)?
Yes, I'd say that's correct. The GFF-to-BED converter was
written before we had GFF parsing support, and at the time
it wasn't possible to extract the name from the attributes.
Is it acceptable for the file format conversion tools in Galaxy
to have parameters? In this case, a list of tags to use as the
feature name, e.g. ID, transcript_id, gene_id
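
As a rough illustration of a converter taking such a tag list, here is a sketch that turns one GFF3 line into a six-column BED line. It is not Galaxy's GFF-to-BED converter; the function name gff3_line_to_bed and the tags parameter are hypothetical, only GFF3 style tag=value attributes are handled, and falling back to the feature type when no tag matches is an assumption. The one firm point is the coordinate shift: GFF start positions are 1-based while BED starts are 0-based.

```python
def gff3_line_to_bed(line, tags=("ID",)):
    """Convert one GFF3 feature line to a 6-column BED line.

    Hypothetical sketch: BED column 4 (name) is taken from the
    first tag in `tags` found in the attributes, falling back to
    the GFF3 feature type if none is present.
    """
    cols = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, score, strand = cols[:7]
    # Naive GFF3-only attribute parsing (tag=value pairs).
    attrs = dict(
        field.split("=", 1) for field in cols[8].split(";") if "=" in field
    )
    name = next((attrs[t] for t in tags if attrs.get(t)), ftype)
    # GFF uses "." for a missing score; use 0 in the BED score column.
    bed_score = "0" if score == "." else score
    # GFF start is 1-based inclusive, BED start is 0-based.
    return "\t".join([seqid, str(int(start) - 1), end, name, bed_score, strand])
```

A feature with ID=gene0001 would then get gene0001 in column 4 rather than the generic type gene.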
...
Finally, note that all changes made to any GFF code must
work for GFF, GFF3, and GTF formats.
That makes life interesting... what are the major sources of
legacy GFF files within Galaxy (anything not GFF3)?

Peter

Peter Cock