On Thu, Feb 16, 2012 at 6:42 PM, Fields, Christopher J <cjfields@illinois.edu> wrote:
On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
I've checked in my FASTA splitting, which now seems to be working OK with my BLAST tests.
(If this was unclear, I mean checked into my branch - I don't have commit privileges to the main repository. When/if this is ready I'll ask for it to be merged in though.)
So far this only does splitting into chunks of the requested number of sequences, rather than the option to split the whole file into a given number of pieces.
https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
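For readers following along, splitting into chunks of a requested number of sequences can be sketched roughly as below. This is only an illustration and not the code in the changeset linked above: the function name `split_fasta` and its arguments are hypothetical, and it assumes a plain FASTA file where each record starts with a ">" line.

```python
import io


def split_fasta(handle, chunk_size):
    """Yield lists of FASTA records (as strings), at most chunk_size per list.

    Hypothetical helper for illustration only; not Galaxy's splitter.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = []   # complete records collected for the current chunk
    record = []  # lines of the record currently being read
    for line in handle:
        if not record and not line.startswith(">"):
            # Ignore anything before the first ">" header line.
            continue
        if line.startswith(">") and record:
            # A new header closes the previous record.
            chunk.append("".join(record))
            record = []
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        record.append(line)
    if record:
        chunk.append("".join(record))
    if chunk:
        # The last chunk may hold fewer than chunk_size records.
        yield chunk


if __name__ == "__main__":
    example = io.StringIO(">a\nAC\nGT\n>b\nTT\n>c\nGG\n")
    for number, records in enumerate(split_fasta(example, 2), start=1):
        print("chunk", number, "has", len(records), "record(s)")
```

Each yielded list could then be written out as its own file and handed to a separate sub-job.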
Cool! Seems like a perfectly fine start. I guess you could grab the number of sequences from the dataset (I'm guessing that gets set upon import into Galaxy).
Yes, I should be able to get that from Galaxy's metadata if known, much like how the FASTQ splitter works. It only needs to be an estimate anyway (which is what I think Galaxy does for large files): if we get it wrong then, rather than using the suggested n sub-jobs, we might use n+1 or n-1.
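The arithmetic behind the n+1 or n-1 remark can be sketched as follows. The function names are hypothetical and the rounding rule (ceiling division) is an assumption; the thread does not say how Galaxy or the splitter would actually round.

```python
import math


def chunk_size_for(estimated_count, n_jobs):
    """Sequences per chunk so that estimated_count splits into about n_jobs pieces.

    Hypothetical helper; ceiling division is an assumed rounding rule.
    """
    return max(1, math.ceil(estimated_count / n_jobs))


def actual_jobs(true_count, chunk_size):
    """Number of chunks actually produced for the real sequence count."""
    return math.ceil(true_count / chunk_size)


if __name__ == "__main__":
    size = chunk_size_for(1000, 4)   # estimate of 1000 sequences, 4 sub-jobs requested
    print(size)                      # 250 sequences per chunk
    print(actual_jobs(1000, size))   # estimate was exact: 4 sub-jobs
    print(actual_jobs(1001, size))   # slight underestimate: 5 sub-jobs (n+1)
    print(actual_jobs(750, size))    # overestimate: 3 sub-jobs (n-1)
```

So a wrong estimate only changes how many sub-jobs run, not whether every sequence is processed.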
I also need to look at merging multiple BLAST XML outputs, but this is looking promising.
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…)
Well, given the NCBI's historic practice of producing 'XML' output which was the concatenation of several XML files, some tools will tolerate this out of practicality - the Biopython BLAST XML parser for example. But yes, some care is needed over the header/footer to ensure a valid XML output is created by the merge. This may also require renumbering queries... I will check.

Peter
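One way such a merge might look, as a rough sketch only: keep the header of the first file, append the `Iteration` elements from the others, and renumber `Iteration_iter-num`. The function name `merge_blast_xml` is hypothetical, and whether renumbering is actually required is still the open question above, so that step is an assumption. Other per-query fields such as `Iteration_query-ID` are left untouched here, and parsing with `xml.etree.ElementTree` drops the DOCTYPE line, which a real merge may want to preserve.

```python
import xml.etree.ElementTree as ET


def merge_blast_xml(xml_strings):
    """Merge several BLAST XML documents (given as strings) into one root element.

    Hypothetical helper for illustration; not the merge code discussed in the thread.
    """
    roots = [ET.fromstring(text) for text in xml_strings]
    if not roots:
        raise ValueError("nothing to merge")
    merged = roots[0]  # header fields are taken from the first file only
    target = merged.find("BlastOutput_iterations")
    if target is None:
        raise ValueError("first file has no BlastOutput_iterations element")
    for other in roots[1:]:
        source = other.find("BlastOutput_iterations")
        if source is not None:
            target.extend(source.findall("Iteration"))
    # Assumption: iteration numbers should run 1..N across the merged output.
    for number, iteration in enumerate(target.findall("Iteration"), start=1):
        iter_num = iteration.find("Iteration_iter-num")
        if iter_num is not None:
            iter_num.text = str(number)
    return merged
```

The result can be written out with `ET.ElementTree(merged).write(path)`, giving a single well-formed document rather than a concatenation.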