Hi all,
I’ve configured one of our tools to submit jobs to our condor cluster. I can see the job is routed to the condor runner:
==> handler4.log <==
galaxy.jobs.handler DEBUG 2015-01-30 09:14:58,092 (508) Dispatching to condor runner
galaxy.jobs DEBUG 2015-01-30 09:14:58,204 (508) Persisting job destination (destination id: condor)
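For context, the wiring for this is the standard condor runner in job_conf.xml. A minimal version looks roughly like this (the destination id matches the log above; the plugin load path is from memory, so check it against lib/galaxy/jobs/runners/condor.py in your tree):

<plugins>
    <plugin id="condor" type="runner" load="galaxy.jobs.runners.condor:CondorJobRunner"/>
</plugins>
<destinations>
    <destination id="condor" runner="condor"/>
</destinations>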
I can see that the job is indeed submitted to the condor cluster:
[root@galaxy galaxy-dist]# condor_q
-- Submitter: galaxy.local : <10.177.61.90:55265> : galaxy.local
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
21.0 galaxy 1/30 09:15 0+00:00:02 R 0 0.0 galaxy_509.sh
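To see what condor itself records for the job, something like this should work (21.0 is the cluster id from the condor_q output above; the attribute names are standard condor ClassAds):

condor_q -long 21.0 | egrep 'JobStatus|HoldReason'          # while it is still in the queue
condor_history -long 21.0 | egrep 'JobStatus|ExitCode|ExitBySignal|RemoveReason'   # after it leaves the queue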
The job begins to run:
==> handler4.log <==
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:03,827 (508/20) job is now running
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:05,183 (508/20) job has completed
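If I understand the runner correctly, those "running"/"completed" messages come from Galaxy scanning the condor user log for the job rather than from polling condor_q, so the user log itself might show why the state flipped to "completed" after only two seconds. Something like this, though the exact log filename is a guess based on the .sh/.ec naming:

cat /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.log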
Galaxy then almost immediately removes the job working directory. Here is a snippet of the errors:
==> handler4.log <==
galaxy.jobs.runners DEBUG 2015-01-30 09:15:06,372 (508/20) Unable to cleanup /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec: [Errno 2] No such file or directory: '/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec'
galaxy.jobs DEBUG 2015-01-30 09:15:06,816 setting dataset state to ERROR
galaxy.datatypes.metadata DEBUG 2015-01-30 09:15:06,996 Failed to cleanup MetadataTempFile temp files from /panfs/storage.local/galaxy-data/job_working_directory/000/508/metadata_out_HistoryDatasetAssociation_717_zrzVqh: No JSON object could be decoded
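Since the cleanup step complains about galaxy_508.ec, one way to see whether that exit-code file is ever created (and when it disappears) is to watch the directory while the job is in flight, e.g. (path taken from the error above; the job id changes per run):

watch -n 1 'ls -l /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.*'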
Is it possible that Galaxy queries condor to check whether the job is still running, finds nothing, concludes the job is no longer running, and bails out?
I’ve reconstructed the process step by step from the logs, but I cannot find anywhere that the actual condor_submit command (or the generated submit file) is logged, so I haven’t been able to submit the same job manually.
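If nothing else, I may try hand-writing a submit description around the wrapper script condor_q showed and running it through condor_submit myself. A minimal sketch (this is guesswork at what Galaxy generates, not something taken from the logs; the executable path assumes the wrapper scripts live alongside the .ec files in database/pbs):

universe   = vanilla
executable = /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_509.sh
getenv     = true
output     = galaxy_509.o
error      = galaxy_509.e
log        = galaxy_509.log
queue

and then:

condor_submit galaxy_509.condor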
Does anyone have a suggestion for debugging this?
Thanks,
Don
Florida State University
Research Computing Center