Hi James,
    We have made some progress in understanding the workflow-specific job crashes.

It seems that 'parallel' workflows submit several jobs simultaneously, and the TORQUE server rejects some of the resulting connection requests.

We get this error:
10/18/2012 10:06:18;0080;PBS_Server;Req;req_reject;Reject reply code=15058(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @


There is a thread here:
http://osdir.com/ml/galaxy-source-control/2011-08/msg00136.html

which is very similar to what we are experiencing.

In the post linked above, the author indicates he found a fix (pasted below). Would you recommend we make the same change?

Thanks!
Todd

"To deal with this I modified the lib/galaxy/jobs/runners/pbs.py script to make multiple attempts at submitting in the following way:
@@ -286,6 +286,12 @@ class PBSJobRunner( BaseJobRunner ):
         log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
         log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )
         job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
+        ##Modified to give ten tries for qsubbing a job
+        num_try=0
+        while(not job_id and num_try<10):
+            job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
+            num_try+=1
+
         pbs.pbs_disconnect(c)
 
         # check to see if it submitted"
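
For discussion, here is a minimal standalone sketch of the same bounded-retry idea, in case it helps to reason about the change outside of pbs.py. The names submit_with_retry, submit_fn, max_tries and delay are made up for illustration and are not part of Galaxy or pbs_python; in the runner, submit_fn would presumably wrap the pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None) call. Unlike the quoted patch (one initial attempt plus up to ten retries), this version counts max_tries as the total number of attempts, and the optional pause between attempts is my own assumption, not something the patch does.

```python
import time


def submit_with_retry(submit_fn, max_tries=10, delay=0.0):
    """Call submit_fn() until it returns a truthy job id or max_tries is hit.

    submit_fn is a hypothetical zero-argument callable standing in for the
    real submission call. Returns a (job_id, attempts) tuple; job_id is
    whatever the last call returned (falsy if every attempt failed).
    """
    job_id = None
    attempts = 0
    while not job_id and attempts < max_tries:
        # Pause only between attempts, never before the first one.
        if attempts > 0 and delay > 0:
            time.sleep(delay)
        job_id = submit_fn()
        attempts += 1
    return job_id, attempts


if __name__ == "__main__":
    # Fake submitter that fails twice and then succeeds, for illustration.
    outcomes = iter([None, None, "123.example-server"])
    job_id, attempts = submit_with_retry(lambda: next(outcomes))
    print("job id: %s after %d attempt(s)" % (job_id, attempts))
```

Returning the attempt count alongside the job id would make it easy to log how often the first submission is rejected, which might help confirm whether simultaneous submissions are really the trigger.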

On 10/17/2012 9:40 AM, James Taylor wrote:
Todd, this is definitely unusual. Can you post (or send directly)
relevant sections from the Galaxy log?

-- jt


On Tue, Oct 16, 2012 at 8:15 PM, Todd Oakley
<todd.oakley@lifesci.ucsb.edu> wrote:
Hello,
    We just did a few tweaks to improve Galaxy performance, and a new issue
popped up that I would like advice on troubleshooting.

    When we run workflows, we see that tools later in the workflow run and
crash before the results they depend on have completed running.

    We can re-run the crashed jobs later and they work fine, suggesting that
they are only failing in the context of running workflows.

    I'd appreciate any advice on how to start troubleshooting this problem.

Thanks much!
Todd


--

***************************************
Todd Oakley, Professor
Ecology Evolution and Marine Biology
University of California, Santa Barbara
Santa Barbara, CA 93106 USA
***************************************

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/

Twitter: @UCSB_OakleyLab

Recent Papers: