Error cleaning up Condor jobs

7 May 2013

Hi All,

I've been working to configure a new Galaxy instance to run jobs 
under Condor.  Things are 99% working at this point, but after the 
Condor job finishes, Galaxy tries to clean up a cluster file that isn't 
there, namely the .ec (exit code) file.  
Relevant log info:

galaxy.jobs DEBUG 2013-05-07 15:02:49,364 (1985) Working directory for 
job is: /home/GLBRCORG/galaxy/database/job_working_directory/001/1985
galaxy.jobs.handler DEBUG 2013-05-07 15:02:49,387 (1985) Dispatching to 
condor runner
galaxy.jobs DEBUG 2013-05-07 15:02:49,720 (1985) Persisting job 
destination (destination id: condor)
galaxy.jobs.handler INFO 2013-05-07 15:02:49,761 (1985) Job dispatched
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,368 (1985) 
submitting file /home/GLBRCORG/galaxy/database/condor/galaxy_1985.sh
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,369 (1985) command 
is: python 
/home/GLBRCORG/galaxy/galaxy-central/tools/fastq/fastq_to_fasta.py 
'/home/GLBRCORG/galaxy/database/files/000/dataset_3.dat' 
'/home/GLBRCORG/galaxy/database/files/002/dataset_2842.dat' ''; cd 
/home/GLBRCORG/galaxy/galaxy-central; 
/home/GLBRCORG/galaxy/galaxy-central/set_metadata.sh 
/home/GLBRCORG/galaxy/database/files 
/home/GLBRCORG/galaxy/database/job_working_directory/001/1985 . 
/home/GLBRCORG/galaxy/galaxy-central/universe_wsgi.ini 
/home/GLBRCORG/galaxy/database/tmp/tmpGe1JZJ 
/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/galaxy.json /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_in_HistoryDatasetAssociation_3161_are5Bg,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_kwds_HistoryDatasetAssociation_3161_p73Yus,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_out_HistoryDatasetAssociation_3161_tLqep6,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_results_HistoryDatasetAssociation_3161_3QSW5X,,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_override_HistoryDatasetAssociation_3161_JUFvmk
galaxy.jobs.runners.condor INFO 2013-05-07 15:02:58,960 (1985) queued as 15
galaxy.jobs DEBUG 2013-05-07 15:02:59,110 (1985) Persisting job 
destination (destination id: condor)
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:59,536 (1985/15) job 
is now running
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:16,966 (1985/15) job 
is now running
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:17,279 (1985/15) job 
has completed
galaxy.jobs.runners DEBUG 2013-05-07 15:07:17,417 (1985/15) Unable to 
cleanup /home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec: [Errno 2] 
No such file or directory: 
'/home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec'
galaxy.jobs DEBUG 2013-05-07 15:07:17,560 setting dataset state to ERROR
galaxy.jobs DEBUG 2013-05-07 15:07:17,961 job 1985 ended
galaxy.datatypes.metadata DEBUG 2013-05-07 15:07:17,961 Cleaning up 
external metadata files

I've run watch on the Condor job directory, and as far as I can tell 
galaxy_1985.ec never gets created.  From a cursory look at 
lib/galaxy/jobs/runners/__init__.py and condor.py, it looks like the 
cleanup happens in the AsynchronousJobState.cleanup method, which 
iterates over the cleanup_file_attributes list.  I naively tried to 
override cleanup_file_attributes in CondorJobState to exclude 
'exit_code_file', to no avail.
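
As a rough illustration of the cleanup loop described above, here is a 
minimal sketch of a job-state object that walks a 
cleanup_file_attributes list and tolerates files that were never 
created.  This is not Galaxy's actual code: the class name 
SketchJobState, the constructor, and every attribute name other than 
exit_code_file are assumptions made for the example.

```python
import errno
import os


class SketchJobState(object):
    """Hypothetical stand-in for a runner job state; not Galaxy's real class."""

    # Assumed attribute names; only 'exit_code_file' appears in the post.
    cleanup_file_attributes = ['job_file', 'output_file',
                               'error_file', 'exit_code_file']

    def __init__(self, **paths):
        # Store whichever paths the caller supplies as plain attributes.
        for attr, path in paths.items():
            setattr(self, attr, path)

    def cleanup(self):
        """Remove each listed file; return (removed, missing) path lists."""
        removed, missing = [], []
        for attr in self.cleanup_file_attributes:
            path = getattr(self, attr, None)
            if path is None:
                continue  # attribute was never set for this job
            try:
                os.unlink(path)
                removed.append(path)
            except OSError as e:
                if e.errno != errno.ENOENT:
                    raise  # a real problem, such as a permissions error
                missing.append(path)  # the file was never created
        return removed, missing
```

Called on a job whose .ec file was never written, cleanup() in this 
sketch would report that path under missing rather than raising, which 
keeps "nothing to delete" separate from genuine I/O failures.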

I'm hoping somebody can spot where the hiccup is here.  Another question 
on my mind: should a failure to clean up cluster files set the 
dataset state to ERROR?  Inspecting the output file from my job 
leads me to believe it finished just fine, and indicating failure to the 
user because Galaxy couldn't clean up a 1-byte exit code file seems a 
little extreme to me.
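
To make the question concrete, here is a small sketch of one way the 
two concerns could be kept apart: the final state is derived only from 
the exit code, and cleanup failures are passed along as warnings.  The 
function names, the state strings, and the fallback behaviour are all 
hypothetical; this is not how Galaxy is known to decide job state.

```python
def read_exit_code(ec_path, default=None):
    """Return the integer stored in ec_path, or default if unavailable."""
    try:
        with open(ec_path) as fh:
            return int(fh.read().strip())
    except (IOError, OSError, ValueError):
        # Missing, unreadable, or non-numeric exit code file.
        return default


def final_state(ec_path, cleanup_failures):
    """Pick a state from the exit code alone; cleanup issues are not fatal."""
    code = read_exit_code(ec_path)
    if code is None:
        state = 'unknown'  # no exit code recorded; assumed label
    elif code == 0:
        state = 'ok'
    else:
        state = 'error'
    # Cleanup failures are returned for logging, not used to set the state.
    return state, list(cleanup_failures)
```

Under this sketch, a missing .ec file would leave the state undetermined 
instead of marking the dataset as failed, so the decision could fall 
back to other evidence such as the tool's output.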

Thanks!

--
Branden Timm
btimm@energy.wisc.edu
