On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
On Feb 16, 2012, at 12:24 PM, Peter wrote:
I also need to look at merging multiple BLAST XML outputs, but this is looking promising.
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…)
Well, given the NCBI's historic practice of producing 'XML' output which was the concatenation of several XML files, some tools will tolerate this out of practicality - the Biopython BLAST XML parser for example.
But yes, some care is needed over the header/footer to ensure a valid XML output is created by the merge. This may also require renumbering queries... I will check.
Basic BLAST XML merging implemented and apparently working:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebf65c0b1e26

This does not currently attempt to remap the iteration numbers or automatically assigned query names, e.g. you can have this kind of thing in the middle of the XML at a merge point:

<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-ID>Query_1</Iteration_query-ID>

That isn't a problem for some tools, e.g. my code in Galaxy to convert BLAST XML to tabular, but I suspect it could cause trouble elsewhere. If anyone has specific suggestions for what to test, that would be great. If this is an issue, then the merge code needs a little more work to edit these values.

I think the FASTA split code could be reviewed for inclusion though. Dan - do you want to look at that? Would a clean branch help?

Peter
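
For illustration only, here is a minimal sketch of how the remapping discussed above might look. It is not the code in the linked changeset. The function name merge_blast_xml and its renumber flag are hypothetical, and the sketch assumes each input is a complete, well-formed BLAST XML document whose root contains a <BlastOutput_iterations> element, and that automatically assigned query names follow the Query_N pattern shown above. The header is taken from the first file only, and the iterations from later files are appended and renumbered.

```python
import re
import xml.etree.ElementTree as ET


def merge_blast_xml(xml_texts, renumber=True):
    """Merge BLAST XML documents (given as strings) into one XML string.

    Hypothetical helper: keeps the header of the first document and
    appends the <Iteration> elements of all later documents to it.
    """
    merged_root = None
    merged_iterations = None
    count = 0
    for text in xml_texts:
        root = ET.fromstring(text)
        iterations = root.find("BlastOutput_iterations")
        if iterations is None:
            raise ValueError("no <BlastOutput_iterations> element found")
        batch = list(iterations)
        if merged_root is None:
            # First document: reuse its root, header and iterations as-is.
            merged_root, merged_iterations = root, iterations
        else:
            # Later documents: keep only their iterations.
            merged_iterations.extend(batch)
        for iteration in batch:
            count += 1
            if not renumber:
                continue
            num = iteration.find("Iteration_iter-num")
            if num is not None:
                num.text = str(count)
            # Assumption: only auto-assigned names look like Query_N;
            # any other query ID is left untouched.
            qid = iteration.find("Iteration_query-ID")
            if qid is not None and qid.text and re.fullmatch(r"Query_\d+", qid.text):
                qid.text = "Query_%d" % count
    if merged_root is None:
        raise ValueError("no input documents given")
    return ET.tostring(merged_root, encoding="unicode")
```

Note that ElementTree does not preserve the XML declaration or DOCTYPE line when serialising, so a real merge would need to write those back out, and parsing whole files into memory may not suit very large BLAST outputs.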