On Mon, Aug 15, 2011 at 3:01 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Mon, Aug 15, 2011 at 11:43 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Thanks. If I manually set the BED strand to column 5, then the extract tool can be used with both the original NCBI GFF3 file and the BED conversion. I have filtered these on gene features, and noticed a discrepancy.
GFF uses one-based numbering, e.g. the gene NEQ003 is 883 to 2691.
For BED the start coordinate is zero-indexed and the end coordinate is one-indexed (just like Python slicing), thus the gene NEQ003 is 882 to 2691 (and Galaxy converts this correctly).
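To make the off-by-one relationship concrete, here is a minimal sketch of the conversion. The helper names gff_to_bed and bed_to_gff are hypothetical and only for illustration (they are not taken from the Galaxy code base), and the sequence is a placeholder rather than real data:

```python
def gff_to_bed(start, end):
    """Convert a one-based, inclusive GFF interval to BED coordinates.

    BED uses a zero-based start and an exclusive end, so only the
    start value changes.
    """
    return start - 1, end


def bed_to_gff(start, end):
    """Convert BED coordinates back to a one-based, inclusive GFF interval."""
    return start + 1, end


# NEQ003 from the example above: GFF 883..2691 becomes BED 882..2691.
bed_start, bed_end = gff_to_bed(883, 2691)

# BED coordinates can be used directly as a Python slice.
placeholder_seq = "N" * 3000  # stand-in sequence, not real data
gene_length = len(placeholder_seq[bed_start:bed_end])
```

Both representations describe the same 1809 bases; only the way the start is written down differs.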
Using the extract tool with the gene features correctly gets the nucleotide sequence of NEQ003, running from ATG...TAA, regardless of whether I use the genes in GFF3 format or in BED format (good).
However, the FASTA output uses different names because it embeds the start/end co-ordinates as is. Thus using GFF3 features, the sequence name includes _883_2691_ while using BED features the same sequence has instead _882_2691_ for its name.
I propose this be harmonised by always using one-based counting in the FASTA names (as done in GFF files but also GenBank, EMBL, etc) rather than the convention used in BED files (and Python) which is confusing to most non-programmers.
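As a rough sketch of what that harmonisation could look like (this is not the actual change in the commit linked below; the function name one_based_label and the format strings it accepts are assumptions made for illustration):

```python
def one_based_label(start, end, fmt):
    """Return the '_start_end_' name fragment using one-based counting.

    start and end are the coordinates exactly as they appear in the
    input file; fmt says which convention they follow.
    """
    fmt = fmt.lower()
    if fmt == "bed":
        # BED start is zero-based, so shift it to one-based for naming.
        start += 1
    elif fmt not in ("gff", "gff3"):
        raise ValueError("Unsupported interval format: %r" % fmt)
    return "_%d_%d_" % (start, end)


# NEQ003 gets the same name fragment from either input format.
gff_label = one_based_label(883, 2691, "gff3")
bed_label = one_based_label(882, 2691, "bed")
```

With this, the name no longer depends on which file format the features came from.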
i.e. I suggest this change (with new tests to enforce it),
https://bitbucket.org/peterjc/galaxy-central/changeset/e7393df0fbc1
This is currently the one and only commit on this new branch,
https://bitbucket.org/peterjc/galaxy-central/src/extract_region2
Second commit to use the newly added BED file in the converter's tests as well:
https://bitbucket.org/peterjc/galaxy-central/changeset/e227e463bea0
Peter