BED files not recognized correctly
Hi,

I have noticed for a while now that BED files are not recognized correctly, or at least not parsed correctly. Invariably, the summary for a (9-column) BED file states that there is 1 region and X comments, where X + 1 is the actual number of regions in the file.

[image: Capture.JPG]

Here are a few lines from the file:

1 38076950 38077349 utr3:RSPO1 1 - 38077349 38077349 0,0,255
1 38077420 38078426 utr3:RSPO1 1 - 38078426 38078426 0,0,255
1 38078426 38078593 cds:RSPO1 1 - 38078426 38078593 255,0,0
1 38079375 38079564 cds:RSPO1 1 - 38079375 38079564 255,0,0
1 38079855 38080005 cds:RSPO1 1 - 38079855 38080005 255,0,0
1 38082155 38082347 cds:RSPO1 1 - 38082155 38082347 255,0,0
1 38095239 38095333 cds:RSPO1 1 - 38095239 38095333 255,0,0
1 38095333 38095621 utr5:RSPO1 1 - 38095621 38095621 0,0,255

Any ideas why it thinks there are comments in the file, and why only one region?

The file is a regular txt file without the LF and is not DOS format or anything...

It also does not parse out the name, score and strand info. Once I correct that manually it works, but it is a pain to have to do that every time...

Thanks,
Thon
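For readers unfamiliar with the layout, the nine columns of the sample lines above are, in standard BED order: chrom, chromStart, chromEnd, name, score, strand, thickStart, thickEnd and itemRgb. The following is a minimal sketch of reading one such line. The names `Bed9` and `parse_bed9` are made up for this illustration, and splitting on any whitespace is an assumption; this is not how Galaxy itself reads the file.

```python
from typing import NamedTuple


class Bed9(NamedTuple):
    """One record of a 9-column BED file (standard BED column order)."""
    chrom: str
    start: int
    end: int
    name: str
    score: int
    strand: str
    thick_start: int
    thick_end: int
    item_rgb: str


def parse_bed9(line: str) -> Bed9:
    """Split one whitespace-delimited BED line into its nine typed fields."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    chrom, start, end, name, score, strand, thick_start, thick_end, rgb = fields
    return Bed9(chrom, int(start), int(end), name, int(score), strand,
                int(thick_start), int(thick_end), rgb)


# One of the sample lines from the message above.
record = parse_bed9("1 38078426 38078593 cds:RSPO1 1 - 38078426 38078593 255,0,0")
print(record.name, record.score, record.strand)
```

The name, score and strand fields that had to be set by hand are columns 4, 5 and 6.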
Hi Thon

We see the same with our BED files, and I can reproduce it with your example. I tend to ignore it since (so far) it has no influence on functionality when I use the BED file in any other tool. I only run into trouble when I present working with BED files in our introductory courses...

Sorry, no solution.

Regards, Hans-Rudolf

On 01/22/2013 10:35 PM, Anthonius deBoer wrote:
> I have noticed for a while now that BED files are not recognized correctly or at least not parsed out correctly. [...]
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Are you seeing this with any BED file, or just those where the chrom column is "1" rather than "chr1"? Column auto-detection for interval files looks for a name like chr, contig, scaffold, ...

-- James Taylor, Assistant Professor, Biology/CS, Emory University

On Tue, Jan 22, 2013 at 4:35 PM, Anthonius deBoer <thondeboer@me.com> wrote:
> I have noticed for a while now that BED files are not recognized correctly or at least not parsed out correctly. [...]
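To illustrate why a name-based check would miss a chrom column containing a bare "1", here is a minimal sketch of that kind of heuristic. It is not Galaxy's actual detection code: the prefix list is only the three examples named in the reply above, and the function names are hypothetical.

```python
# Hypothetical prefix list, limited to the examples named in the reply.
CHROM_PREFIXES = ("chr", "contig", "scaffold")


def looks_like_chrom(value: str) -> bool:
    """Return True if the value starts with one of the known name prefixes."""
    return value.lower().startswith(CHROM_PREFIXES)


def guess_chrom_column(fields):
    """Return the index of the first field that looks like a chromosome name,
    or None if no field matches."""
    for index, value in enumerate(fields):
        if looks_like_chrom(value):
            return index
    return None


ucsc_style = "chr1 38078426 38078593 cds:RSPO1 1 - 38078426 38078593 255,0,0".split()
plain_style = "1 38078426 38078593 cds:RSPO1 1 - 38078426 38078593 255,0,0".split()

ucsc_guess = guess_chrom_column(ucsc_style)    # finds column 0
plain_guess = guess_chrom_column(plain_style)  # finds nothing
print(ucsc_guess, plain_guess)
```

Under this assumption, a file that names its chromosomes "1", "2", ... gives a prefix-based check nothing to match, which would be consistent with the columns having to be assigned by hand.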
Well... I guess it should not be forcing us to use the chr1 notation, especially since the decoy hg19 genome I am using does not follow that convention. This is the age-old problem of the two ways of referring to a position on the genome: what I call the UCSC way, and the "other" way, such as the one the Broad uses for its decoy genome.

Is there any way we can make the interval parser a little "wiser" about other ways to refer to a contig/chromosome?

Thanks,
Thon

On Jan 24, 2013, at 07:31 AM, James Taylor <james@jamestaylor.org> wrote:
> Are you seeing this with any BED file, or just those where the chrom column is "1" rather than "chr1"? Column auto-detection for interval files looks for a name like chr, contig, scaffold, ...
One way we dealt with this discrepancy in chromosome nomenclature in the commercial software Avadis NGS (full disclosure: I was a Product Manager for it some time ago) was to introduce what we called aliases for the chromosomes. A build could have multiple ways to refer to a chromosome, and we simply consulted an alias table for the build we were using. This made it easy to know that "1", "chr1" and "chr1.fa" (thanks, Illumina) all refer to the same chromosome, and it made life SO much easier when dealing with all these different ways of analysis.

We already have to make the build.txt file (and, as I noticed all of a sudden, produce the .len files as well), so we should be able to use aliases as I described pretty easily.

Thon
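The alias idea can be sketched roughly as follows. This is only an illustration of a per-build alias table, not Avadis NGS or Galaxy code. The table contents, the choice of the bare number as the canonical name, and the names `CHROM_ALIASES`, `build_alias_lookup` and `canonical_chrom` are all assumptions made for the example.

```python
# Hypothetical alias table for one build: canonical name -> other spellings.
CHROM_ALIASES = {
    "1": ["chr1", "chr1.fa"],
    "2": ["chr2", "chr2.fa"],
    "X": ["chrX", "chrX.fa"],
}


def build_alias_lookup(aliases):
    """Invert the alias table into a flat spelling -> canonical-name mapping."""
    lookup = {}
    for canonical, spellings in aliases.items():
        for spelling in [canonical, *spellings]:
            # Refuse tables in which one spelling maps to two chromosomes.
            if lookup.get(spelling, canonical) != canonical:
                raise ValueError(f"ambiguous alias: {spelling}")
            lookup[spelling] = canonical
    return lookup


def canonical_chrom(name, lookup):
    """Return the canonical name, leaving unknown names unchanged."""
    return lookup.get(name, name)


LOOKUP = build_alias_lookup(CHROM_ALIASES)
print(canonical_chrom("chr1.fa", LOOKUP), canonical_chrom("1", LOOKUP))
```

Leaving unknown names unchanged means that contigs missing from the table pass through as they are rather than being rejected.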
participants (3)
- Anthonius deBoer
- Hans-Rudolf Hotz
- James Taylor