On Thu, Feb 16, 2012 at 9:02 PM, Peter wrote:
On Thu, Feb 16, 2012 at 6:42 PM, Chris wrote:
Cool! Seems like a perfectly fine start. I guess you could grab the number of sequences from the dataset somehow (I'm guessing that gets set as metadata when the file is imported into Galaxy).
Yes, I should be able to get that from Galaxy's metadata, if known - much like how the FASTQ splitter works. It only needs to be an estimate anyway, which I think is all Galaxy records for large files. If the estimate is wrong, then rather than using n sub-jobs as requested we might end up with n+1 or n-1, which is harmless.
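To make that concrete, the planning step boils down to ceiling division on the (possibly estimated) count - a rough sketch in Python with a made-up helper name, not Galaxy's actual code:

    def plan_subjobs(estimated_count, batch_size=500):
        # Integer ceiling division gives the number of sub-jobs.
        # An imperfect estimate just shifts this by one either way
        # (n+1 or n-1 sub-jobs), which doesn't matter here.
        return max(1, (estimated_count + batch_size - 1) // batch_size)

e.g. plan_subjobs(1999) gives 4 sub-jobs, while plan_subjobs(2001) gives 5.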
Done, and it seems to be working nicely now. If we don't know the sequence count, I divide the file based on the total size in bytes - which avoids any extra IO. https://bitbucket.org/peterjc/galaxy-central/changeset/26a0c0aa776d Taking advantage of this I have switched the BLAST tools from saying split the query into batches of 500 sequences (which worked fine but only gave benefits if doing genome scale queries) to just split the query into four parts (which will be done based on the sequence count if known, or the file size if not). This way any multi-query BLAST will get divided and run in parallel, not just the larger jobs. This gives a nice improvement (over yesterday's progress) with small tasks like 10 query sequences against a big database like NR or NT. https://bitbucket.org/peterjc/galaxy-central/changeset/1fb89ae798be Peter
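P.S. In case it helps anyone follow along, the byte-based splitting amounts to something like this (a simplified sketch with made-up names, not the committed code): pick offsets at roughly size/parts intervals, then scan forward to the next ">" header line so every chunk starts on a FASTA record boundary:

    import os

    def fasta_split_offsets(path, parts):
        # Choose split points by file size alone, avoiding a full
        # pass over the data just to count the sequences.
        size = os.path.getsize(path)
        offsets = [0]
        with open(path, "rb") as handle:
            for i in range(1, parts):
                handle.seek(i * size // parts)
                handle.readline()  # move to the next line boundary
                while True:
                    pos = handle.tell()
                    line = handle.readline()
                    if not line or line.startswith(b">"):
                        break
                if not line:
                    break  # fewer records left than requested parts
                offsets.append(pos)
        offsets.append(size)
        return offsets

Each chunk is then the byte range offsets[i] to offsets[i+1]; if the file holds fewer records than the requested number of parts, you simply get fewer chunks.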