
On Wed, Aug 17, 2011 at 1:41 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
One idea to address both of these issues is to embed the original format in the fasta name so that it's clear whether the coords are BED or GFF (e.g. > hg17_BED_chr1_147962192_147962580).
Or hg17_gtf_chr1_147962192_147962580 etc.
That certainly seems better than the current situation.
However, my preferred solution is to take the FASTA ID from the annotation file. In GFF3 this would be the ID tag in column nine (if present), perhaps with an option to use another custom tag like locus_tag or transcript_id if preferred.
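To make the idea concrete, here is a minimal sketch of pulling a FASTA ID out of column nine of a GFF3 line. The function names (`gff3_attributes`, `fasta_id_from_gff3`) are made up for illustration and are not existing Galaxy code; the sketch assumes tab-separated GFF3 with `tag=value` pairs separated by semicolons.

```python
from urllib.parse import unquote


def gff3_attributes(column9):
    """Parse a GFF3 column-nine string such as 'ID=gene1;Name=abc' into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            tag, value = field.split("=", 1)
            # GFF3 percent-encodes reserved characters in attribute values.
            attributes[tag.strip()] = unquote(value.strip())
    return attributes


def fasta_id_from_gff3(line, tag="ID"):
    """Return the value of `tag` for one GFF3 feature line, or None if absent."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return None
    return gff3_attributes(columns[8]).get(tag)
```

Passing `tag="locus_tag"` (or `transcript_id`) would cover the "custom tag" option; a `None` result is the cue to fall back to the coordinate-based name.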
Hi Peter,
This seems reasonable. Of course, the implementation needs to be done with care to (a) ensure the default choice is somewhat similar to what is done now and (b) support all flavors of GFF. If you choose to implement this, you'll also need to update all the existing test output files.
While in BED the name is usually there in column 4 (although this and the later columns are optional), in GFF3 the ID tag is optional, while in GTF v2.2 there can be a gene_id or transcript_id value, etc. I'm picturing a select parameter for FASTA output, "Name features using":

* build, reference, co-ordinates and strand (default)
* name from annotation file (if present)
* reference name (useful if working on gene/proteins)

If name is selected, then a conditional text parameter for GFF type files would be shown to ask which tag(s) to use as the name - a comma separated list might work well:

http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-August/006432.html

This could default to ID for GFF3, and transcript_id,gene_id for GTF, and whatever else is sensible for GFF2. Or a single default suitable for all: ID,transcript_id,gene_id

Maybe we don't need the tag setting to be optional, just hard code it to something like ID,transcript_id,gene_id?
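A rough sketch of how a comma separated tag list could be applied across both attribute styles, taking the first tag that is present. The names `parse_attributes` and `choose_name` are hypothetical, and the parser is deliberately naive: it assumes GFF3 uses `tag=value`, GTF/GFF2 uses `tag "value"`, and that values contain no semicolons.

```python
def parse_attributes(column9):
    """Parse column nine in either GFF3 (tag=value) or GTF (tag "value") style."""
    attributes = {}
    for field in column9.split(";"):
        field = field.strip()
        if not field:
            continue
        head = field.split(None, 1)[0]
        if "=" in head:
            # GFF3 style, e.g. ID=gene0001
            tag, value = field.split("=", 1)
        else:
            # GTF / GFF2 style, e.g. gene_id "g1"
            parts = field.split(None, 1)
            if len(parts) != 2:
                continue
            tag, value = parts
        attributes[tag.strip()] = value.strip().strip('"')
    return attributes


def choose_name(column9, tags="ID,transcript_id,gene_id"):
    """Return the value of the first listed tag that is present, else None."""
    attributes = parse_attributes(column9)
    for tag in tags.split(","):
        value = attributes.get(tag.strip())
        if value:
            return value
    return None
```

With the single default above, a GFF3 feature would be named by its ID, while a GTF feature with no ID would fall through to transcript_id and then gene_id. A `None` result would mean falling back to the coordinate-based name.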
For BED I had initially thought this would be the optional column 4, name. This made me wonder what Galaxy is doing in converting GFF3 to BED, since column 4 was populated with generic feature types (gene, CDS, etc. from GFF3 column 3). Shouldn't this be using the feature's ID tag (if present)?
Yes, I'd say that's correct. The GFF-to-BED converter was written before we had GFF parsing support, and at the time it wasn't possible to extract the name from the attributes.
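For illustration only, a sketch of what a GFF3-to-BED line conversion might look like if it used the ID tag for BED column 4 and fell back to the feature type when no ID is present. `gff3_line_to_bed` is a hypothetical name, not the existing Galaxy converter, and the score handling is a simplification.

```python
def gff3_line_to_bed(line):
    """Convert one GFF3 feature line to a six-column BED line."""
    columns = line.rstrip("\n").split("\t")
    seqid, _source, feature_type, start, end, score, strand = columns[:7]
    attributes = columns[8] if len(columns) > 8 else ""

    # Prefer the ID tag; otherwise keep the generic feature type as the name.
    name = feature_type
    for field in attributes.split(";"):
        field = field.strip()
        if field.startswith("ID="):
            name = field[len("ID="):]
            break

    # GFF is 1-based and end-inclusive; BED is 0-based and end-exclusive,
    # so only the start coordinate shifts.
    bed_start = int(start) - 1
    # Simplification: missing scores become 0, other values pass through as-is.
    bed_score = "0" if score == "." else score
    bed_strand = strand if strand in ("+", "-") else "."
    return "\t".join([seqid, str(bed_start), end, name, bed_score, bed_strand])
```

The only coordinate change is the start, so a GFF3 feature at 147962193..147962580 becomes BED 147962192..147962580.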
Is it acceptable for the file format conversion tools in Galaxy to have parameters? In this case, a list of tags to use as the feature name, e.g. ID, transcript_id, gene_id
Finally, note that all changes made to any GFF code must work for GFF, GFF3, and GTF formats.
That makes life interesting... what are the major sources of legacy GFF files within Galaxy (anything not GFF3)?

Peter