Hi,

I want to share a few issues with the performance of the tools for working with MAF alignments in the "Fetch Alignments" section in Galaxy.

*The first issue* is with the "Stitch MAF blocks given a set of genomic intervals" tool. For a given set of genomic intervals, I use this tool to obtain, in my case, the human and the corresponding mouse sequences from the 17-way MULTIZ alignment. The problem is that when there is an insertion in the mouse sequence, the human sequence, which is the reference (the intervals are for the human genome), does not contain any gaps. Gaps are only present in the mouse sequence. So whenever there is an insertion in mouse, that portion of the alignment is simply not fetched.

*The second issue* is related to the first. To work around the problem described above, I used the "Extract MAF blocks given a set of genomic intervals" tool. Here I get a set of MAF blocks that overlap my intervals, and because one interval can overlap more than one block, the number of resulting blocks is of course higher than the initial number of intervals. Now I can use the "Join MAF blocks by Species" tool to join some of these smaller blocks to get the full overlap with my intervals. However, this tool is sensitive to the order of the MAF blocks: blocks like the ones below will be joined into one block 28 nt long, but if their order is reversed, they won't be joined.

s hg18.chr10 101141372 14 + 135374737 CTGCCTTCCCTTCC
s mm8.chr19 43548455 10 + 61321190 CGGCCCTTCA----

s hg18.chr10 101141386 14 + 135374737 ATCTCTTCACCCCT
s mm8.chr19 43548465 12 + 61321190 --CCCTTCACCCCT

This also depends on the strand: the blocks have to be in ascending order for the '+' strand and in descending order for the '-' strand.

*The third issue* is with the "Reverse Complement a MAF file" tool. The reverse-complemented sequence I get is correct, but the coordinates change: the tool starts counting from the end of the chromosome, not from the start. So if I want to relate my genomic intervals to the MAF block I reverse-complemented, I can't, because the block's coordinates have changed. To do that, I need to take the chromosome length, subtract the start position, and then subtract the alignment length; only then do I get the same coordinate as in my set of intervals. I think it would be better if the tool kept the original coordinates.

Marcin
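For the second issue, one way to see why the two example blocks can be joined in the order shown is to check that each species' row in the second block starts exactly where its row in the first block ends. The sketch below only illustrates that check on the two blocks from this message. The function names `parse_s_line` and `is_contiguous` are made up for the example, and this is not a description of how the "Join MAF blocks by Species" tool is implemented.

```python
def parse_s_line(line):
    """Split a MAF 's' line into (src, start, size, strand, src_size, text)."""
    _, src, start, size, strand, src_size, text = line.split()
    return src, int(start), int(size), strand, int(src_size), text


def is_contiguous(first_block, second_block):
    """True if every row of second_block starts where the matching row
    (same source, same strand) of first_block ends."""
    ends = {}
    for line in first_block:
        src, start, size, strand, _, _ = parse_s_line(line)
        ends[(src, strand)] = start + size
    for line in second_block:
        src, start, _, strand, _, _ = parse_s_line(line)
        if ends.get((src, strand)) != start:
            return False
    return True


# The two blocks quoted in the message, in their original order.
block_a = [
    "s hg18.chr10 101141372 14 + 135374737 CTGCCTTCCCTTCC",
    "s mm8.chr19 43548455 10 + 61321190 CGGCCCTTCA----",
]
block_b = [
    "s hg18.chr10 101141386 14 + 135374737 ATCTCTTCACCCCT",
    "s mm8.chr19 43548465 12 + 61321190 --CCCTTCACCCCT",
]

print(is_contiguous(block_a, block_b))  # True
print(is_contiguous(block_b, block_a))  # False
```

In the original order both rows are contiguous (101141372 + 14 = 101141386 for human, 43548455 + 10 = 43548465 for mouse). In the reversed order they are not, which matches the order sensitivity described above.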
(I am not part of the Galaxy team)

On Oct 14, 2011, at 8:33 AM, Marcin Jakalski wrote:
> The third issue is with the "Reverse Complement a MAF file" tool. ... the coordinates change, meaning, tool starts counting from the end position of a chromosome, not from the start position. ... I think it would be better if the tool would keep the original coordinates.
I can't comment on your first two issues. But for the third one, FWIW, keeping the original coordinates would produce a file that is no longer correct according to the definition of the MAF format, which can be found here: http://genome.ucsc.edu/FAQ/FAQformat#format5
> So to do that I need to take the length of chromosome and subtract the start position from it, then subtract length and only then I get the same coordinate as in my set of intervals.
This is the reason why the MAF format carries the chromosome length on every line. Someone could make a tool that does it the way you suggest. Hopefully they wouldn't call the result MAF.

Bob H
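The arithmetic discussed in this thread fits in a few lines. Per the UCSC MAF definition, the start on a '-' strand line is counted on the reverse-complemented source, so the source length on the same line is enough to get back to forward-strand coordinates. The helper name `to_forward_interval` is invented for this illustration and is not part of Galaxy or any MAF library. The example uses the hg18 row from the first message, with its minus-strand start worked out by hand.

```python
def to_forward_interval(start, size, strand, src_size):
    """Return (start, end) on the forward strand, zero-based and half-open.

    For a '-' strand row, MAF counts `start` from the end of the source,
    so it is flipped back using the source length.
    """
    if strand == "-":
        start = src_size - start - size
    return start, start + size


# hg18.chr10 row from the thread: forward start 101141372, size 14,
# chromosome length 135374737. After reverse complementing, its start
# becomes 135374737 - 101141372 - 14 = 34233351.
print(to_forward_interval(34233351, 14, "-", 135374737))   # (101141372, 101141386)
print(to_forward_interval(101141372, 14, "+", 135374737))  # (101141372, 101141386)
```

Both calls give the same forward-strand interval, which can then be compared directly with a set of genomic intervals without changing what the MAF file itself stores.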
participants (2)
- Bob Harris
- Marcin Jakalski