Hi all,
I’ve configured one of our tools to submit jobs to our condor cluster. I can see the job is routed to the condor runner:
==> handler4.log <==
galaxy.jobs.handler DEBUG 2015-01-30 09:14:58,092 (508) Dispatching to condor runner
galaxy.jobs DEBUG 2015-01-30 09:14:58,204 (508) Persisting job destination (destination id: condor)
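For context, the wiring for this is the standard condor runner in job_conf.xml. A minimal version looks roughly like this (the destination id matches the log above; the plugin load path is from memory, so check it against lib/galaxy/jobs/runners/condor.py in your tree):

<plugins>
    <plugin id="condor" type="runner" load="galaxy.jobs.runners.condor:CondorJobRunner"/>
</plugins>
<destinations>
    <destination id="condor" runner="condor"/>
</destinations>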
I can see that the job is indeed submitted to the condor cluster:
[root@galaxy galaxy-dist]# condor_q
-- Submitter: galaxy.local : <10.177.61.90:55265> : galaxy.local
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
21.0 galaxy 1/30 09:15 0+00:00:02 R 0 0.0 galaxy_509.sh
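To see what condor itself records for the job, something like this should work (21.0 is the cluster id from the condor_q output above; the attribute names are standard condor ClassAds):

condor_q -long 21.0 | egrep 'JobStatus|HoldReason'          # while it is still in the queue
condor_history -long 21.0 | egrep 'JobStatus|ExitCode|ExitBySignal|RemoveReason'   # after it leaves the queue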
The job begins to run:
==> handler4.log <==
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:03,827 (508/20) job is now running
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:05,183 (508/20) job has completed
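If I understand the runner correctly, those "running"/"completed" messages come from Galaxy scanning the condor user log for the job rather than from polling condor_q, so the user log itself might show why the state flipped to "completed" after only two seconds. Something like this, though the exact log filename is a guess based on the .sh/.ec naming:

cat /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.log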
Galaxy then almost immediately removes the job working directory. Here is a snippet of the errors:
==> handler4.log <==
galaxy.jobs.runners DEBUG 2015-01-30 09:15:06,372 (508/20) Unable to cleanup /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec: [Errno 2] No such file or directory: '/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec'
galaxy.jobs DEBUG 2015-01-30 09:15:06,816 setting dataset state to ERROR
galaxy.datatypes.metadata DEBUG 2015-01-30 09:15:06,996 Failed to cleanup MetadataTempFile temp files from /panfs/storage.local/galaxy-data/job_working_directory/000/508/metadata_out_HistoryDatasetAssociation_717_zrzVqh: No JSON object could be decoded
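Since the cleanup step complains about galaxy_508.ec, one way to see whether that exit-code file is ever created (and when it disappears) is to watch the directory while the job is in flight, e.g. (path taken from the error above; the job id changes per run):

watch -n 1 'ls -l /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.*'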
Is it possible that Galaxy queries condor to check whether the job is still running, finds nothing, concludes the job is no longer running, and bails out?
I’ve reconstructed the process step by step from the logs, but I cannot find anywhere that the actual condor_submit command (or the generated submit file) is logged, so I haven’t been able to submit the same job manually.
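If nothing else, I may try hand-writing a submit description around the wrapper script condor_q showed and running it through condor_submit myself. A minimal sketch (this is guesswork at what Galaxy generates, not something taken from the logs; the executable path assumes the wrapper scripts live alongside the .ec files in database/pbs):

universe   = vanilla
executable = /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_509.sh
getenv     = true
output     = galaxy_509.o
error      = galaxy_509.e
log        = galaxy_509.log
queue

and then:

condor_submit galaxy_509.condor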
Does anyone have a suggestion for debugging this?
Thanks,
Don
Florida State University
Research Computing Center