The BLAST+ binaries support multi-threaded operation, which is handled via the $GALAXY_SLOTS environment variable. This should be set automatically by Galaxy via your job runner settings, which allows you to (for example) allocate four cores to each BLAST job.

In addition, the BLAST+ wrappers also support high-level parallelism via task splitting, if "use_tasked_jobs = True" is enabled in your "universe_wsgi.ini" configuration file. Essentially, the input FASTA query files are broken up into batches of 1000 sequences, a separate BLAST child job is run for each chunk, and then the BLAST output files are merged (in order). This is transparent to the end user. Each tool enables this via its XML file, e.g.

    <parallelism method="multi" split_inputs="query" split_mode="to_size" split_size="1000" merge_outputs="output1"></parallelism>

This requires splitting support in the FASTA input datatypes, and merging support in the selected output datatype (e.g. BLAST XML, tabular, etc.). This is done by methods in the Python datatype classes (see the sketch below).

It would be interesting to see if any of John's work on collections of files of the same type might fit nicely with this approach (and thus avoid the disk I/O overhead of the merge step?).
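To give a rough feel for the merge step, here is a simplified sketch (not the actual Galaxy code, just the idea) of what the merge method for a tabular output datatype boils down to - the task-splitting framework hands it the per-chunk output files in order:

    import shutil

    def merge(split_files, output_file):
        """Concatenate per-chunk BLAST tabular outputs, preserving chunk order.

        Illustration only - the real datatype classes define merge() and
        split() methods with more error handling, and BLAST XML output
        cannot simply be concatenated.
        """
        with open(output_file, "wb") as out_handle:
            for chunk in split_files:
                with open(chunk, "rb") as in_handle:
                    shutil.copyfileobj(in_handle, out_handle)

BLAST XML is more work, since each chunk carries its own XML header and footer which have to be dealt with rather than just concatenating the files.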
Peter

On Mon, Feb 10, 2014 at 1:56 AM, Ketan Maheshwari <ketancmaheshwari@gmail.com> wrote:

Thanks Dannon for the reference. I checked out the tool and installed it from the Tool Shed on my local Galaxy instance. I also checked out the related paper, which mentions that the BLAST executables run in parallel by partitioning the input files into fragments and running the batches in parallel. That sounds cool. I browsed the code but could not find the exact mechanism. Is the parallelism at the workflow level (i.e. branch parallelism), or is it at the tool level, i.e. does the tool itself invoke parallel code?
Thanks, Ketan
On Thu, Feb 6, 2014 at 9:42 AM, Dannon Baker <dannon.baker@gmail.com> wrote:
Ketan,
Have you taken a look at Galaxy's built-in parallelism framework? For a great current example of a tool using this, look at Peter's NCBI BLAST+ wrappers: https://github.com/peterjc/galaxy_blast
-Dannon