Select first/last N rows from grouped tabular files (e.g. top BLAST hits)
Hi all,

I'm wondering if the following task can be done in Galaxy with the standard tools. The specific example is selecting the top (e.g. 3) match sequences for each BLAST query, but I see this problem as much more general than a "Select top BLAST hits" tool.

I want to select the first few (e.g. 1) rows of each group in a tabular file, where the grouping criterion is having certain columns equal (e.g. the first 2).

For example, tabular BLAST output has columns of query ID, match ID, etc.:

queryA match1 ...
queryA match2 ...
queryA match2 ...
queryA match3 ...
queryA match4 ...
queryA match4 ...
queryA match4 ...
queryB match5 ...
queryB match5 ...
queryC match6 ...
queryC match7 ...

In this example, some of my queries have more than one HSP per match (more than one line with the same first two columns). If I group on the first two columns, the groups are:

------------------------
queryA match1 ...
------------------------
queryA match2 ...
queryA match2 ...
------------------------
queryA match3 ...
------------------------
queryA match4 ...
queryA match4 ...
queryA match4 ...
------------------------
queryB match5 ...
queryB match5 ...
------------------------
queryC match6 ...
------------------------
queryC match7 ...
------------------------

If I then take the first row in each group, that gives me just the first HSP for each query+match combination:

queryA match1 ...
queryA match2 ...
queryA match3 ...
queryA match4 ...
queryB match5 ...
queryC match6 ...
queryC match7 ...

If, for example, I wanted only the top 3 matches for each query, I could run the proposed tool one more time with different settings, this time grouping on the first column only:

queryA match1 ...
queryA match2 ...
queryA match3 ...
queryB match5 ...
queryC match6 ...
queryC match7 ...

I hope I've conveyed the idea here. The existing tools "Select first lines from a dataset" and "Select last lines from a dataset" are related, but do this at the file level.

Does this make sense? Does it seem like a useful tool to write if there isn't anything like this already present? Or might it be simpler to just write a "Select top BLAST hits" tool?

Peter
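To make the two-pass idea above concrete, here is a minimal Python sketch. It is not an existing Galaxy tool: the function name `first_n_per_group` and its parameters are made up for illustration, and it assumes the rows are already sorted so that each group is contiguous (as in BLAST tabular output). Only the first two columns of the example are reproduced; the elided columns are left out.

```python
from itertools import groupby, islice


def first_n_per_group(rows, key_columns, n):
    """Yield the first n rows of each run of consecutive rows whose
    key columns are equal. Rows are sequences of fields; key_columns
    are 0-based indices. (Hypothetical helper, for illustration only.)"""
    def key(row):
        return tuple(row[i] for i in key_columns)

    # groupby only merges adjacent rows, so the input must already be
    # ordered with each group contiguous.
    for _, group in groupby(rows, key=key):
        yield from islice(group, n)


# The example from the message, reduced to (query ID, match ID).
rows = [
    ("queryA", "match1"),
    ("queryA", "match2"), ("queryA", "match2"),
    ("queryA", "match3"),
    ("queryA", "match4"), ("queryA", "match4"), ("queryA", "match4"),
    ("queryB", "match5"), ("queryB", "match5"),
    ("queryC", "match6"),
    ("queryC", "match7"),
]

# Pass 1: first HSP for each query+match combination.
one_hsp = list(first_n_per_group(rows, key_columns=(0, 1), n=1))

# Pass 2: top 3 matches for each query.
top3 = list(first_n_per_group(one_hsp, key_columns=(0,), n=3))

for row in top3:
    print("\t".join(row))
```

Selecting the last N rows of each group would need each group to be buffered (for example with `collections.deque(group, maxlen=n)`), since the group size is not known in advance.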
On Tue, May 17, 2011 at 5:30 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi all,
I'm wondering if the following task can be done in Galaxy with the standard tools. The specific example is selecting the top (e.g. 3) match sequences for each blast query, but I see this problem as much more general than a "Select top BLAST hits" tool.
...
Does this make sense? Does it seem like a useful tool to write if there isn't anything like this already present? Or might it be simpler to just write a "Select top BLAST hits" tool?
While I still think the above task could be useful in general, I am now considering a general "BLAST filter" tool to offer this and some other commonly used filters like a minimum coverage threshold (which is possible with a filter on the extended tabular output, but not trivial).

Peter
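As a rough illustration of what a minimum coverage filter might look like, here is a Python sketch. Everything specific in it is an assumption rather than something stated above: the function name `filter_min_coverage` is hypothetical, the column positions (query start, query end, and query length at 0-based indices 6, 7 and 12) depend entirely on which columns the extended tabular output was configured with, and coverage is computed per HSP rather than combined across all HSPs of a match.

```python
def filter_min_coverage(rows, min_coverage,
                        qstart_col=6, qend_col=7, qlen_col=12):
    """Yield rows whose HSP covers at least min_coverage (a fraction
    between 0 and 1) of the query. Column indices are 0-based and are
    assumptions about the table layout, not a fixed standard."""
    for row in rows:
        qstart = int(row[qstart_col])
        qend = int(row[qend_col])
        qlen = int(row[qlen_col])
        # Coordinates are inclusive; abs() guards against reversed order.
        coverage = (abs(qend - qstart) + 1) / qlen
        if coverage >= min_coverage:
            yield row


# Two made-up rows with 13 fields each; only the three columns used
# by the filter carry meaningful values here.
long_hsp = ["queryA", "match1"] + ["-"] * 4 + ["1", "90"] + ["-"] * 4 + ["100"]
short_hsp = ["queryA", "match2"] + ["-"] * 4 + ["11", "40"] + ["-"] * 4 + ["100"]

kept = list(filter_min_coverage([long_hsp, short_hsp], min_coverage=0.5))
```

A combined per-match coverage, merging overlapping HSPs before dividing by the query length, would be closer to what a general "BLAST filter" tool needs, and is part of why this is not trivial with a simple row filter.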