I believe the latest stable update of Galaxy included changes to drmaa.py that allow a job to be rechecked indefinitely in the case of scheduler communication errors. So perhaps your "Cluster could not complete job" errors are due to a filesystem race condition: the cluster node completes the job, but the inode metadata updates haven't fully propagated, so the output files appear to be missing to the job runner, which is on a different server.  In that case, the config variable you want to increase is the new "retry_job_output_collection", also part of the last update to stable.
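As a rough sketch (assuming you're on the latest stable release and using the standard universe_wsgi.ini; the exact section name and the value shown here are just illustrative), it would look something like:

    # universe_wsgi.ini
    [app:main]
    # Number of times the job runner re-checks for a job's output files
    # before declaring the job failed, to ride out slow filesystem
    # metadata propagation between the cluster node and the Galaxy server.
    retry_job_output_collection = 5

Check the sample universe_wsgi.ini shipped with the update for the documented default and placement.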

On Wed, Feb 22, 2012 at 5:52 AM, Aurélien Bernard <aurelien.bernard@univ-montp2.fr> wrote:
Hello everybody :)


Today, I have a question related to timeout management in Galaxy.

More specifically, I'm looking for a way to set (in a configuration file if possible) all timeouts related to DRMAA and to the communication between Galaxy and SGE.


My goal is to increase the current timeouts so that successful jobs don't fail with the "Cluster could not complete job" error when job status checking is temporarily disrupted (due to heavy write load on the disk or whatever).


Is this possible?


Thank you in advance,

Have a nice day

A. Bernard

--
Aurélien Bernard
IE Bioprogrammeur - CNRS
Université des sciences Montpellier II
Institut des Sciences de l'Evolution
France
