Nate Coraor wrote:
Sonali Amonkar wrote:
I am currently facing another issue. When I run my workflow, I see the following error in the server log. The error is intermittent: it does not occur on every run.
galaxy.jobs INFO 2011-02-03 05:17:03,522 job 151 dispatched
galaxy.jobs.runners.pbs DEBUG 2011-02-03 05:17:09,755 (150/69156.<primaryserver>) PBS job has left queue
galaxy.jobs.runners.pbs DEBUG 2011-02-03 05:17:09,879 (151) submitting file galaxy-dist/database/pbs/151.sh
galaxy.jobs.runners.pbs DEBUG 2011-02-03 05:17:09,880 (151) command is: java -cp galaxy-dist/tools/my_tools/jars/PreRef1.jar RefFilterModule galaxy-dist/database/files/000/dataset_192.dat galaxy-dist/database/files/000/dataset_194.dat galaxy-dist/database/files/000/dataset_195.dat
galaxy.jobs.runners.pbs DEBUG 2011-02-03 05:17:09,880 (151) pbs_submit failed, PBS error 15031: Protocol (ASN.1) error
galaxy.jobs DEBUG 2011-02-03 05:17:13,363 job 150 ended
galaxy.jobs ERROR 2011-02-03 05:17:15,816 Unable to cleanup job 152
I am fairly sure this problem is specific to your TORQUE setup. I see you also posted it to the torquedev list, but unfortunately received no response.
I am not sure what the cause is, but you may want to try adjusting the 'tcp_timeout' server setting (qmgr -c 'set server tcp_timeout = X').
Also, you may want to see if PBS is making an attempt to queue this job on a particular node, and if so, check the mom_logs for that node.
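Because the failure is intermittent, it may help to pull the exact timestamps of each failed submission out of the Galaxy server log so they can be compared against the mom_logs on the node in question. The following is only a minimal sketch: the record layout is inferred from the log excerpt quoted above, and the names `RECORD`, `FAILURE`, and `find_submit_failures` are hypothetical helpers, not part of Galaxy or TORQUE.

```python
import re

# Assumed record layout, inferred from the excerpt in this thread:
# "<logger> <LEVEL> <YYYY-MM-DD HH:MM:SS,mmm> <message>"
RECORD = re.compile(
    r"^(?P<logger>\S+) (?P<level>[A-Z]+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) (?P<msg>.*)$"
)

# Message text seen in the excerpt when a PBS submission fails.
FAILURE = re.compile(
    r"^\((?P<job>\d+)\) pbs_submit failed, PBS error (?P<code>\d+): (?P<text>.*)$"
)


def find_submit_failures(lines):
    """Return (timestamp, job_id, error_code, error_text) for each failed pbs_submit."""
    failures = []
    for line in lines:
        rec = RECORD.match(line.strip())
        if rec is None:
            continue  # skip lines that do not look like log records
        fail = FAILURE.match(rec.group("msg"))
        if fail:
            failures.append(
                (
                    rec.group("ts"),
                    int(fail.group("job")),
                    int(fail.group("code")),
                    fail.group("text"),
                )
            )
    return failures


# A few records copied from the excerpt above, used as sample input.
sample = [
    "galaxy.jobs INFO 2011-02-03 05:17:03,522 job 151 dispatched",
    "galaxy.jobs.runners.pbs DEBUG 2011-02-03 05:17:09,879 (151) submitting file galaxy-dist/database/pbs/151.sh",
    "galaxy.jobs.runners.pbs DEBUG 2011-02-03 05:17:09,880 (151) pbs_submit failed, PBS error 15031: Protocol (ASN.1) error",
    "galaxy.jobs DEBUG 2011-02-03 05:17:13,363 job 150 ended",
]

failures = find_submit_failures(sample)
for ts, job, code, text in failures:
    print(f"{ts} job {job}: PBS error {code} ({text})")
```

For a real log, an open file object can be passed in place of `sample`, since the function only iterates over lines. If the failures cluster at particular times or around particular job IDs, that narrows down which mom_logs entries are worth reading.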
Any help or pointers on this issue would be appreciated. Thank you very much for your time, Nate.