Hi, I want to report a few issues with the tools for working with MAF alignments in the "Fetch Alignments" section of Galaxy.
The first issue is with the "Stitch MAF blocks given a set of genomic intervals" tool. For a given set of genomic intervals, I use this tool
to obtain, in my case, the human and the corresponding mouse sequences from the 17-way MULTIZ alignment. The problem is that the human
sequence, which is the reference (the intervals are on the human genome), never contains any gaps, even where the mouse sequence has an
insertion. Gaps appear only in the mouse sequence. So wherever there is an insertion in mouse, that portion of the alignment is simply
not fetched.
The second issue is related to the first. To work around the problem described above, I used the "Extract MAF blocks given a set of genomic intervals" tool.
It gives me the set of MAF blocks that overlap my intervals, and because one interval can overlap more than one block, the number of resulting blocks is of course
higher than the initial number of intervals. I can then use the "Join MAF blocks by Species" tool to join some of these smaller blocks and recover the full overlap with my intervals.
However, this tool is sensitive to the order of the MAF blocks: blocks like the ones below will be joined into one block 28 nt long, but if we reverse their
order, they won't be joined.
s hg18.chr10 101141372 14 + 135374737 CTGCCTTCCCTTCC
s mm8.chr19 43548455 10 + 61321190 CGGCCCTTCA----

s hg18.chr10 101141386 14 + 135374737 ATCTCTTCACCCCT
s mm8.chr19 43548465 12 + 61321190 --CCCTTCACCCCT
The required order also depends on the strand: the blocks have to be in ascending order for the '+' strand and in descending order for the '-' strand.
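As a workaround, the blocks can be put into the required order before joining. Below is a minimal Python sketch of that idea, using the two blocks above. The helper names (parse_s_line, sort_blocks, join_pair) are made up for illustration and are not part of Galaxy; the sketch assumes the reference (human) row comes first in every block, and join_pair only covers the simple ascending '+' case.

```python
def parse_s_line(line):
    """Split one MAF 's' line into its fields."""
    _, src, start, size, strand, src_size, text = line.split()
    return {"src": src, "start": int(start), "size": int(size),
            "strand": strand, "src_size": int(src_size), "text": text}


def sort_blocks(blocks):
    """Order blocks by the start of their first (reference) row.

    Ascending for '+', descending for '-', which is the order the
    join tool appears to need (an observation, not documented behavior).
    """
    descending = blocks[0][0]["strand"] == "-"
    return sorted(blocks, key=lambda block: block[0]["start"], reverse=descending)


def join_pair(first, second):
    """Concatenate two blocks if every row continues where the previous one ended."""
    if len(first) != len(second):
        return None
    joined = []
    for a, b in zip(first, second):
        if a["src"] != b["src"] or a["start"] + a["size"] != b["start"]:
            return None  # not adjacent in this species, so keep the blocks separate
        joined.append({**a, "size": a["size"] + b["size"], "text": a["text"] + b["text"]})
    return joined


# The two example blocks, deliberately given in the "wrong" order.
second = [parse_s_line("s hg18.chr10 101141386 14 + 135374737 ATCTCTTCACCCCT"),
          parse_s_line("s mm8.chr19 43548465 12 + 61321190 --CCCTTCACCCCT")]
first = [parse_s_line("s hg18.chr10 101141372 14 + 135374737 CTGCCTTCCCTTCC"),
         parse_s_line("s mm8.chr19 43548455 10 + 61321190 CGGCCCTTCA----")]

ordered = sort_blocks([second, first])
joined = join_pair(ordered[0], ordered[1])
print(len(joined[0]["text"]))  # 28
```

Without the sorting step, join_pair(second, first) returns None, which mirrors what I see in the tool when the blocks arrive in reverse order.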
The third issue is with the "Reverse Complement a MAF file" tool. The reverse-complemented sequence that I get is correct,
but the coordinates change: the tool starts counting from the end of the chromosome, not from the start.
So if I want to relate my genomic intervals to a MAF block that I have reverse-complemented, I can't do it directly, because the coordinates of
the block have changed. To do that, I need to take the length of the chromosome, subtract the start position from it, and then subtract the block length; only then do I get the same
coordinate as in my set of intervals. I think it would be better if the tool kept the original coordinates.
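The conversion I do by hand can be sketched in Python as follows. As far as I understand the MAF format, the start of a '-' strand row is counted from the end of the source sequence, which is why this arithmetic is needed. The function name to_forward_coords is made up for illustration and is not a Galaxy function.

```python
def to_forward_coords(start, size, src_size, strand):
    """Return (start, end) on the forward strand, 0-based and half-open.

    For '-' rows the MAF start is counted from the end of the chromosome,
    so it has to be flipped using the chromosome length (src_size).
    """
    if strand == "+":
        return start, start + size
    forward_start = src_size - start - size
    return forward_start, forward_start + size


# A 14 nt human block on hg18.chr10 after reverse complementing:
# its start becomes 135374737 - 101141372 - 14 = 34233351.
print(to_forward_coords(34233351, 14, 135374737, "-"))  # (101141372, 101141386)
```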
Marcin