On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
Cool! Seems like a perfectly fine start. I guess you could grab the number of sequences from the dataset somehow (I'm guessing that gets set as metadata when the file is imported into Galaxy).
Yes, I should be able to get that from Galaxy's metadata, if known - much like how the FASTQ splitter works. It only needs to be an estimate anyway, which I think is all Galaxy records for large files. If the estimate is wrong, then rather than using n sub-jobs as requested we might end up with n+1 or n-1, which is harmless.
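To make that concrete, the planning step boils down to ceiling division on the (possibly estimated) count - a rough sketch in Python with a made-up helper name, not Galaxy's actual code:

    def plan_subjobs(estimated_count, batch_size=500):
        # Integer ceiling division gives the number of sub-jobs.
        # An imperfect estimate just shifts this by one either way
        # (n+1 or n-1 sub-jobs), which doesn't matter here.
        return max(1, (estimated_count + batch_size - 1) // batch_size)

e.g. plan_subjobs(1999) gives 4 sub-jobs, while plan_subjobs(2001) gives 5.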
Done, and it seems to be working nicely now. If we don't know the sequence count, I divide the file based on the total size in bytes - which avoids any extra IO. https://bitbucket.org/peterjc/galaxy-central/changeset/26a0c0aa776d Taking advantage of this I have switched the BLAST tools from saying split the query into batches of 500 sequences (which worked fine but only gave benefits if doing genome scale queries) to just split the query into four parts (which will be done based on the sequence count if known, or the file size if not). This way any multi-query BLAST will get divided and run in parallel, not just the larger jobs. This gives a nice improvement (over yesterday's progress) with small tasks like 10 query sequences against a big database like NR or NT. https://bitbucket.org/peterjc/galaxy-central/changeset/1fb89ae798be Peter
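P.S. In case it helps anyone follow along, the byte-based splitting amounts to something like this (a simplified sketch with made-up names, not the committed code): pick offsets at roughly size/parts intervals, then scan forward to the next ">" header line so every chunk starts on a FASTA record boundary:

    import os

    def fasta_split_offsets(path, parts):
        # Choose split points by file size alone, avoiding a full
        # pass over the data just to count the sequences.
        size = os.path.getsize(path)
        offsets = [0]
        with open(path, "rb") as handle:
            for i in range(1, parts):
                handle.seek(i * size // parts)
                handle.readline()  # move to the next line boundary
                while True:
                    pos = handle.tell()
                    line = handle.readline()
                    if not line or line.startswith(b">"):
                        break
                if not line:
                    break  # fewer records left than requested parts
                offsets.append(pos)
        offsets.append(size)
        return offsets

Each chunk is then the byte range offsets[i] to offsets[i+1]; if the file holds fewer records than the requested number of parts, you simply get fewer chunks.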