On Tue, Aug 16, 2011 at 1:03 AM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:
However, the FASTA output uses different names because it embeds the start/end co-ordinates as is. Thus using GFF3 features, the sequence name includes _883_2691_ while using BED features the same sequence has instead _882_2691_ for its name.
I propose this be harmonised by always using one-based counting in the FASTA names (as done in GFF files but also GenBank, EMBL, etc) rather than the convention used in BED files (and Python) which is confusing to most non-programmers.
i.e. I suggest this change (with new tests to enforce it),
https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1
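To make the off-by-one concrete, here is a minimal Python sketch of the conversion being discussed. It is not the code from the changeset above; the function names and the name layout (build, chromosome, start, end joined by underscores) are assumptions for illustration only. BED intervals are zero-based and half-open, GFF intervals are one-based and inclusive, so only the start value differs.

```python
def bed_to_one_based(start, end):
    """Convert a BED interval (0-based, half-open) to 1-based inclusive."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based inclusive interval (GFF style) to BED style."""
    return start - 1, end


def fasta_name(build, chrom, start, end, fmt):
    """Build a FASTA name that always uses 1-based coordinates.

    Hypothetical helper: 'fmt' is "bed" or "gff" and says which
    convention the incoming start/end values follow.
    """
    if fmt == "bed":
        start, end = bed_to_one_based(start, end)
    return "%s_%s_%d_%d" % (build, chrom, start, end)


# The same feature described in BED and in GFF3 now gets the same name.
print(fasta_name("hg17", "chr1", 882, 2691, "bed"))
print(fasta_name("hg17", "chr1", 883, 2691, "gff"))
```

With this approach both input formats would yield a name containing _883_2691_, at the cost of the BED-derived name no longer matching the start column of the BED file literally.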
Peter,
I have concerns about this change.
IMO, the goal of embedding the start/end coords in the fasta name is to (a) embed important information from the input file into the fasta name and (b) make it simple for users to connect a fasta sequence to an entry in the interval file. These goals are achieved with the current code _relative to the input file_.
It's awkward that two mainstream tabular annotation formats (BED and the GFF family) use different co-ordinates.
This connection between the input and output files is key. However, in the case of a user using a mix of BED and GFF files for a single genome, your concern becomes an issue. In practice, I don't think we've seen users encounter this issue yet, which leads me to think that the current code is fine.
One idea to address both of these issues is to embed the original format in the fasta name so that it's clear whether the coords are BED or GFF (e.g. > hg17_BED_chr1_147962192_147962580).
Or hg17_gtf_chr1_147962192_147962580 etc. That certainly seems better than the current situation. However, my preferred solution is to take the FASTA ID from the annotation file. In GFF3 this would be the ID tag in column nine (if present), perhaps with an option to use another custom tag like locus_tag or transcript_id if preferred. For BED I had initially thought this would be the optional column 4, name.

This made me wonder what Galaxy is doing in converting GFF3 to BED, since column 4 was populated with generic feature types (gene, CDS, etc from GFF3 column 3). Shouldn't this be using the feature's ID tag (if present)? I see code which looks for the tag transcript_id, which looks like how I'd handle the GFF3 ID (for batching multi-location features together).

Peter
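
For illustration, the ID-based naming suggested above could be sketched in Python roughly as follows. This is not Galaxy's code: the function names are hypothetical, and the sketch assumes tab-separated lines, a GFF3 column nine made of semicolon-separated tag=value pairs, and an optional BED name in column 4. When no ID is available it returns None, so a caller could fall back to a coordinate-based name.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column into a dict of tag -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        key, value = field.split("=", 1)
        # GFF3 escapes reserved characters with URL-style percent encoding.
        attrs[unquote(key.strip())] = unquote(value.strip())
    return attrs


def fasta_id_from_gff3(line, tag="ID"):
    """Return the chosen tag (default ID) from a GFF3 line, or None."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 9:
        return None
    return parse_gff3_attributes(cols[8]).get(tag)


def fasta_id_from_bed(line):
    """Return the optional name (column 4) from a BED line, or None."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) > 3 and cols[3]:
        return cols[3]
    return None


gff_line = "chr1\texample\tgene\t883\t2691\t.\t+\t.\tID=gene0001;locus_tag=ABC_0001\n"
print(fasta_id_from_gff3(gff_line))                   # uses the ID tag
print(fasta_id_from_gff3(gff_line, tag="locus_tag"))  # uses a custom tag
print(fasta_id_from_bed("chr1\t882\t2691\tgene0001\n"))
```

The tag argument corresponds to the option of choosing locus_tag or transcript_id instead of ID.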