On Fri, Jan 30, 2015 at 9:24 AM, Shrum, Donald C <DCShrum@admin.fsu.edu> wrote:
Hi all,
I’ve configured one of our tools to submit jobs to our condor cluster. I can see the job is routed to the condor runner:
==> handler4.log <==
galaxy.jobs.handler DEBUG 2015-01-30 09:14:58,092 (508) Dispatching to condor runner
galaxy.jobs DEBUG 2015-01-30 09:14:58,204 (508) Persisting job destination (destination id: condor)
I can see that the job is indeed submitted to the condor cluster:
[root@galaxy galaxy-dist]# condor_q
-- Submitter: galaxy.local : <10.177.61.90:55265> : galaxy.local
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
 21.0    galaxy         1/30 09:15   0+00:00:02 R  0   0.0  galaxy_509.sh
The job begins to run:
==> handler4.log <==
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:03,827 (508/20) job is now running
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:05,183 (508/20) job has completed
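Since condor is reported as running and then completed barely a second and a half apart, it may be worth asking condor itself what became of the job. A minimal sketch, assuming the cluster id 21 shown by condor_q above and that condor_history is on the PATH:

    import subprocess

    # Ask condor for the final classad of the finished job; JobStatus,
    # ExitCode and RemoteWallClockTime show whether the job really ran to
    # completion or was removed/evicted almost immediately.
    cluster_id = "21"  # taken from the condor_q output above
    history = subprocess.check_output(["condor_history", "-l", cluster_id]).decode()
    for line in history.splitlines():
        if line.startswith(("JobStatus", "ExitCode", "ExitBySignal", "RemoteWallClockTime")):
            print(line)

A JobStatus of 4 with a sensible ExitCode and wall clock time would mean the job really finished; a JobStatus of 3 (removed) or a near-zero wall clock time would point elsewhere.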
Galaxy removes the job working directory almost immediately afterwards.
Here is a snippet of the errors:
==> handler4.log <==
galaxy.jobs.runners DEBUG 2015-01-30 09:15:06,372 (508/20) Unable to cleanup /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec: [Errno 2] No such file or directory: '/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec'
galaxy.jobs DEBUG 2015-01-30 09:15:06,816 setting dataset state to ERROR
galaxy.datatypes.metadata DEBUG 2015-01-30 09:15:06,996 Failed to cleanup MetadataTempFile temp files from /panfs/storage.local/galaxy-data/job_working_directory/000/508/metadata_out_HistoryDatasetAssociation_717_zrzVqh: No JSON object could be decoded
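To pin down whether the cleanup really happens before the job has had a chance to write anything, one option is to watch the paths from the errors above from a separate shell while the job runs. A rough sketch (the paths are copied from the log lines above; adjust the job id for the next test run):

    import os
    import time

    # Hypothetical watcher for the two paths from the errors above: record
    # when the exit-code file and the job working directory appear/disappear.
    paths = [
        "/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec",
        "/panfs/storage.local/galaxy-data/job_working_directory/000/508",
    ]
    state = dict((p, None) for p in paths)
    for _ in range(120):                      # watch for roughly two minutes
        for p in paths:
            exists = os.path.exists(p)
            if exists != state[p]:
                print("%s  %s  %s" % (time.strftime("%H:%M:%S"),
                                      "present" if exists else "missing", p))
                state[p] = exists
        time.sleep(1)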
Is it possible Galaxy is querying condor to see whether the job is still running, not finding it, and then deciding the job is no longer running and bailing out?
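If that is the suspicion, the queue can be watched independently of Galaxy to see how long the job actually stays visible and in which state. A small sketch, assuming the cluster id is read off condor_q right after submission:

    import subprocess
    import time

    # JobStatus codes in the condor classad: 1=idle, 2=running, 3=removed,
    # 4=completed, 5=held.
    cluster_id = "21"  # replace with the cluster id of the next test job
    while True:
        out = subprocess.check_output(
            ["condor_q", cluster_id, "-format", "%d\n", "JobStatus"]).decode().strip()
        if not out:
            print("job %s is no longer in the queue" % cluster_id)
            break
        print("JobStatus = %s" % out)
        time.sleep(2)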
I’ve reconstructed the process step by step from the logs, but I have not been able to find where the condor_submit command itself is logged, so that I could try submitting the same job manually.
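Until the actual submit call shows up in the logs, one workaround is to resubmit the generated wrapper script by hand with a throwaway submit description. A sketch, assuming the galaxy_509.sh script shown by condor_q still exists in the same database/pbs/ directory as the .ec file above; the description Galaxy really generates will differ:

    import subprocess

    # Hypothetical hand-rolled submit description for the wrapper script that
    # condor_q showed; this only approximates whatever Galaxy submits.
    job_script = "/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_509.sh"
    submit_description = """\
    universe   = vanilla
    executable = %s
    output     = /tmp/galaxy_509.out
    error      = /tmp/galaxy_509.err
    log        = /tmp/galaxy_509.log
    queue
    """ % job_script

    with open("/tmp/galaxy_509.condor", "w") as handle:
        handle.write(submit_description)

    # condor_submit prints something like "1 job(s) submitted to cluster 22."
    print(subprocess.check_output(["condor_submit", "/tmp/galaxy_509.condor"]).decode())

If the job runs and produces sensible output this way, the problem is more likely in how Galaxy tracks the job than in the job itself.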
Does anyone have a suggestion for debugging this?
Thanks,
Don
Florida State University
Research Computing Center

The two most relevant files would be:
https://bitbucket.org/galaxy/galaxy-central/src/tip/lib/galaxy/jobs/runners/...
https://bitbucket.org/galaxy/galaxy-central/src/tip/lib/galaxy/jobs/runners/...

The condor job logic strikes me as pretty brittle: it has always worked for me when I have tested it, but given how it reads the logs, I would imagine very small changes to condor might cause it to fail. So one thing to check is summarize_condor_log in lib/galaxy/jobs/runners/util/condor/__init__.py, to see whether that logic matches the way your condor produces its log files.

Given the symptoms, it might also be worth just sleeping for 5 seconds in condor_submit in lib/galaxy/jobs/runners/util/condor/__init__.py and then verifying that Galaxy is actually parsing the correct external id. Here is an untested diff that adds the sleep and some more log statements that might help: https://gist.github.com/jmchilton/d0afd7242370642d5b43

If you are able to fix the problem, please let us know how so we can fix it upstream.

-John
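For reference, the condor user log that such parsing would read is an event log with lines like "001 (021.000.000) 01/30 09:15:03 Job executing on host: <...>". A standalone scan along those lines (not Galaxy's actual code) can be compared against what the handler log claims:

    import re
    import sys

    # Hypothetical standalone scan of a condor user (event) log.  Event codes:
    # 000=submitted, 001=executing, 004=evicted, 005=terminated, 009=aborted,
    # 012=held.
    EVENTS = {"000": "submitted", "001": "executing", "004": "evicted",
              "005": "terminated", "009": "aborted", "012": "held"}
    EVENT_LINE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+)")

    def scan(path):
        with open(path) as handle:
            for line in handle:
                match = EVENT_LINE.match(line)
                if match:
                    code, cluster, proc, stamp = match.groups()
                    print("%s  job %s.%s  %s" % (stamp, cluster, proc,
                                                 EVENTS.get(code, "event " + code)))

    if __name__ == "__main__":
        scan(sys.argv[1])

Running it over whichever user log the generated submit file points condor at, and comparing the timestamps with handler4.log, should show whether the "job has completed" message corresponds to a real terminate event or to something else (an abort, a hold, or a parsing mismatch).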