On Thu, Aug 8, 2013 at 2:38 AM, John Chilton <chilton@msi.umn.edu> wrote:
On Wed, Aug 7, 2013 at 3:33 PM, Ganote, Carrie L <cganote@iu.edu> wrote:
Hi John,
That was it. I feel silly. I still have a lot of teeth-cutting to do on Python!
I saw the parallel tags in the BLAST tool and was very intrigued, but couldn't find any reference to them in the Read the Docs pages or on the Galaxy wiki. Perhaps there is some documentation of this that I missed?
No, I don't think there is documentation, unless you count the code base and the mailing list archive. I think setting use_tasked_jobs to True in universe_wsgi.ini might be all you need to do to start splitting such BLAST inputs. I think the parallelism tag in the tool file describes how to split the inputs.
Yes, two basic forms - into chunks of a set size (which is what the BLAST tools and my other wrappers use; for FASTA files this is a given number of sequences) or into a target number of parts.
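For reference, the declaration in the BLAST+ tool XML looks roughly like this (the attribute values are the ones the wrappers shipped with around this time - double-check against your copy of the wrappers):

    <parallelism method="multi" split_inputs="query"
                 split_mode="to_size" split_size="1000"
                 merge_outputs="output1"></parallelism>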
# This enables splitting of jobs into tasks, if specified by the particular tool config.
# This is a new feature and not recommended for production servers yet.
#use_tasked_jobs = False
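i.e. to turn the feature on, uncomment that setting and flip the value:

    use_tasked_jobs = True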
I don't use this functionality (at least not in this fashion) so I don't have a lot of advice. Otherwise, if you have an AMQP thing working you should probably just stick with that - it sounds like a perfectly good way to go.
-John
In our case, the Python splitting program is doing this:

* Take the BLAST query
* Split the sequences up
* For each sequence, submit the query and the command to a queue on a RabbitMQ server (consumers are set up to listen for queries and then run the jobs)
* Write each result to a temp file
* When all of the sequence jobs are finished, concatenate the files back in the correct order and write to the output file Galaxy expects
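Roughly, the split-and-submit step looks like the sketch below (illustrative only - the broker host, the queue name "blast_jobs", and the message format are placeholders, and it assumes the pika library for talking to RabbitMQ):

    import json
    import pika

    def split_fasta(path):
        """Yield (header, sequence) records from a FASTA file."""
        header, seq = None, []
        with open(path) as handle:
            for line in handle:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line, []
                else:
                    seq.append(line)
            if header is not None:
                yield header, "".join(seq)

    def submit_queries(fasta_path, blast_command):
        """Publish one message per query sequence to RabbitMQ."""
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host="localhost"))
        channel = connection.channel()
        channel.queue_declare(queue="blast_jobs", durable=True)
        for i, (header, seq) in enumerate(split_fasta(fasta_path)):
            # The index lets the consumers' result files be
            # concatenated back in the original order.
            body = json.dumps({"index": i,
                               "query": header + "\n" + seq,
                               "command": blast_command})
            channel.basic_publish(exchange="",
                                  routing_key="blast_jobs",
                                  body=body)
        connection.close()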
That's pretty much what the BLAST+ wrappers do already via Galaxy's parallel / task splitting. When your cluster is not under full load, this gives faster processing for individual jobs. The downside is more I/O, making the cluster as a whole less productive (if it was normally under high usage). We use use_tasked_jobs = True on our Galaxy instance.
I made a wrapper for this splitter and it works fine on its own. Now I'm trying to add this functionality (run on AMQP) as a user-available option on the BLAST tool. So for my dynamic runner, I need to know whether to send the job to DRMAA or to this AMQP Python script. Hopefully that makes more sense...
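For what it's worth, a dynamic rule along those lines might look like the sketch below. The tool parameter "use_amqp" and the destination ids "amqp_splitter" / "drmaa_default" are made-up names that would have to match your tool XML and job configuration, and the exact rule-function signature has varied between Galaxy releases:

    # Dynamic job destination rule (e.g. in lib/galaxy/jobs/rules/).
    # All names here are illustrative, not mandated by Galaxy.
    def blast_destination(app, job):
        param_dict = job.get_param_values(app)
        # Route to the AMQP splitter if the (hypothetical) tool option
        # "use_amqp" was checked; otherwise fall back to DRMAA.
        if str(param_dict.get("use_amqp", "false")).lower() in ("true", "1"):
            return "amqp_splitter"
        return "drmaa_default"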
If using the parallel / task splitting as it is doesn't work, I would suggest trying to re-use the Galaxy datatype definition classes and their split/merge methods (which in the case of many formats are non-trivial). For instance, merging XML files needs a bit more care, and this work is already done for BLAST XML. But ideally, could you integrate AMQP as an alternative cluster backend which can be called instead of DRMAA etc.?

Regards,

Peter
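As a rough sketch, re-using that merge logic directly might look like this (run with Galaxy's lib/ directory on PYTHONPATH; the import path is from the code base of this era and may differ in your checkout, and the part file names are placeholders):

    # Merge per-chunk BLAST XML outputs into one well-formed XML file,
    # letting Galaxy's datatype class handle the header/footer handling.
    from galaxy.datatypes.xml import BlastXml  # path may differ in your checkout

    parts = ["chunk1.xml", "chunk2.xml", "chunk3.xml"]  # placeholder names
    BlastXml.merge(parts, "blast_output.xml")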