filtering lastz output to *best* match only
Apologies in advance if this isn't the best place to ask the following: I've RTFM-d for lastz-1.02.00 but may have missed it - is there currently any way to filter output from a lastz run such that _only_ the best match is returned (i'm currently doing this externally, but that seems inefficient if there's an option to do so). thanks in advance, b < * ) (_ \\ _ ||
Brant: Forwarding Bob Harris, the creator of lastz... a. On Mar 25, 2010, at 3:26 PM, Brant Faircloth wrote:
Apologies in advance if this isn't the best place to ask the following:
I've RTFM-d for lastz-1.02.00 but may have missed it - is there currently any way to filter output from a lastz run such that _only_ the best match is returned (i'm currently doing this externally, but that seems inefficient if there's an option to do so).
thanks in advance, b
< * ) (_ \\ _ ||
_______________________________________________ galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user
Anton Nekrutenko http://nekrut.bx.psu.edu http://usegalaxy.org
Howdy, Brant, I'm at a conference this week and my internet connectivity is spotty. Short answer is "no", lastz doesn't provide a best-only filter. I am not sure what your input files are (one-to-one, many-to-many, one-to-many?). For one-to-one you could use sort and head to solve this (at command line-- not sure specifically how you would do this in galaxy). For many-to-one, such as mapping a large number of reads to a genome, it's a little more difficult because you want to sort to occur separately per read. It isn't an overly difficult thing to do though. ... laptop battery going down. Gotta sign off. Bob H
On Mar 25, 2010, at 12:28 PM, Anton Nekrutenko wrote:
Forwarding Bob Harris, the creator of lastz
Thanks - apparently, i didn't RTFM enough to see the lastz mailing list info right at the top of the README ; ).
best, b
< * ) (_ \\ _ ||
_______________________________________________ galaxy-user mailing list galaxy-user@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-user
On Mar 25, 2010, at 1:09 PM, rsharris@bx.psu.edu wrote:
Howdy, Brant,
hi bob,
I'm at a conference this week and my internet connectivity is spotty.
no problem, thanks for the fast response.
Short answer is "no", lastz doesn't provide a best-only filter. I am not sure what your input files are (one-to-one, many-to-many, one-to-many?).
it's basically many to many right now, as long my RAM holds and i don't go one-to-many.
For one-to-one you could use sort and head to solve this (at command line-- not sure specifically how you would do this in galaxy). For many-to-one, such as mapping a large number of reads to a genome, it's a little more difficult because you want to sort to occur separately per read.
i've basically been parsing the lastz output to build a dictionary like so: { read1: { score1:match1, score2:match2 }, read2: { score1:match1, } } where score1…scoreN is an integer, then: for read in matches: m = matches[read] bestMatch = m[max(m.keys())] … do stuff … this seems fast enough with a few tens of thousands of lines returned from lastz.
It isn't an overly difficult thing to do though. ... laptop battery going down. Gotta sign off.
i basically just wanted to see if i had missed anything and maybe give the old +1 for --bestmatch as a potential filter parameter in the future ; ). (given enough time, i'll poke around in the source and see what i can do with my meager C skills) thanks again for the extremely rapid response! enjoy your conference... best, b < * ) (_ \\ _ ||
Howdy, Brant, The method you describe below is reasonable. You could potentially run into memory problems if you have too many reads, or if you have too many hits for a particular read (python dictionaries, worst case, contain three times as many hash-bins as keys). You could avoid this by only keeping a dictionary of read-name to best-hit. Depending on your scoring scheme, highest score may or may not be the right choice for "best". For example, we have used the number of matches as the score for such comparisons. Also, depending on your application, you might want the "best" hit only in those cases where it is significantly better than the second best hit. Bob H On Mar 25, 2010, at 4:42 PM, Brant Faircloth wrote:
On Mar 25, 2010, at 1:09 PM, rsharris@bx.psu.edu wrote:
Howdy, Brant,
hi bob,
I'm at a conference this week and my internet connectivity is spotty.
no problem, thanks for the fast response.
Short answer is "no", lastz doesn't provide a best-only filter. I am not sure what your input files are (one-to-one, many-to-many, one-to- many?).
it's basically many to many right now, as long my RAM holds and i don't go one-to-many.
For one-to-one you could use sort and head to solve this (at command line-- not sure specifically how you would do this in galaxy). For many-to-one, such as mapping a large number of reads to a genome, it's a little more difficult because you want to sort to occur separately per read.
i've basically been parsing the lastz output to build a dictionary like so:
{ read1: { score1:match1, score2:match2 },
read2: { score1:match1, } }
where score1…scoreN is an integer, then:
for read in matches: m = matches[read] bestMatch = m[max(m.keys())]
… do stuff …
this seems fast enough with a few tens of thousands of lines returned from lastz.
It isn't an overly difficult thing to do though. ... laptop battery going down. Gotta sign off.
i basically just wanted to see if i had missed anything and maybe give the old +1 for --bestmatch as a potential filter parameter in the future ; ). (given enough time, i'll poke around in the source and see what i can do with my meager C skills)
thanks again for the extremely rapid response! enjoy your conference...
best, b
< * ) (_ \\ _ ||
participants (4)
-
Anton Nekrutenko
-
Bob Harris
-
Brant Faircloth
-
rsharris@bx.psu.edu