Hi All,
I've been working on configuring a new Galaxy instance to run jobs under Condor. Things are 99% working at this point, but what seems to be happening is that after the Condor job finishes, Galaxy tries to clean up a cluster file that isn't there, namely the .ec (exit code) file. Relevant log info:
galaxy.jobs DEBUG 2013-05-07 15:02:49,364 (1985) Working directory for job is: /home/GLBRCORG/galaxy/database/job_working_directory/001/1985
galaxy.jobs.handler DEBUG 2013-05-07 15:02:49,387 (1985) Dispatching to condor runner
galaxy.jobs DEBUG 2013-05-07 15:02:49,720 (1985) Persisting job destination (destination id: condor)
galaxy.jobs.handler INFO 2013-05-07 15:02:49,761 (1985) Job dispatched
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,368 (1985) submitting file /home/GLBRCORG/galaxy/database/condor/galaxy_1985.sh
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:56,369 (1985) command is: python /home/GLBRCORG/galaxy/galaxy-central/tools/fastq/fastq_to_fasta.py '/home/GLBRCORG/galaxy/database/files/000/dataset_3.dat' '/home/GLBRCORG/galaxy/database/files/002/dataset_2842.dat' ''; cd /home/GLBRCORG/galaxy/galaxy-central; /home/GLBRCORG/galaxy/galaxy-central/set_metadata.sh /home/GLBRCORG/galaxy/database/files /home/GLBRCORG/galaxy/database/job_working_directory/001/1985 . /home/GLBRCORG/galaxy/galaxy-central/universe_wsgi.ini /home/GLBRCORG/galaxy/database/tmp/tmpGe1JZJ /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/galaxy.json /home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_in_HistoryDatasetAssociation_3161_are5Bg,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_kwds_HistoryDatasetAssociation_3161_p73Yus,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_out_HistoryDatasetAssociation_3161_tLqep6,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_results_HistoryDatasetAssociation_3161_3QSW5X,,/home/GLBRCORG/galaxy/database/job_working_directory/001/1985/metadata_override_HistoryDatasetAssociation_3161_JUFvmk
galaxy.jobs.runners.condor INFO 2013-05-07 15:02:58,960 (1985) queued as 15
galaxy.jobs DEBUG 2013-05-07 15:02:59,110 (1985) Persisting job destination (destination id: condor)
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:02:59,536 (1985/15) job is now running
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:16,966 (1985/15) job is now running
galaxy.jobs.runners.condor DEBUG 2013-05-07 15:07:17,279 (1985/15) job has completed
galaxy.jobs.runners DEBUG 2013-05-07 15:07:17,417 (1985/15) Unable to cleanup /home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec: [Errno 2] No such file or directory: '/home/GLBRCORG/galaxy/database/condor/galaxy_1985.ec'
galaxy.jobs DEBUG 2013-05-07 15:07:17,560 setting dataset state to ERROR
galaxy.jobs DEBUG 2013-05-07 15:07:17,961 job 1985 ended
galaxy.datatypes.metadata DEBUG 2013-05-07 15:07:17,961 Cleaning up external metadata files
I've watched the condor job directory, and as far as I can tell galaxy_1985.ec never gets created. From a cursory look at lib/galaxy/jobs/runners/__init__.py and condor.py, it looks like the cleanup happens in the AsynchronousJobState.cleanup method, which iterates over the cleanup_file_attributes list. I naively tried overriding cleanup_file_attributes in CondorJobState to exclude 'exit_code_file' (sketched below), to no avail.
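For reference, this is roughly the shape of what I tried. It's a sketch rather than the exact diff, and I'm guessing at the constructor signature (the real one is whatever condor.py already defines); only the filtering of cleanup_file_attributes is my change:

from galaxy.jobs.runners import AsynchronousJobState

class CondorJobState(AsynchronousJobState):
    def __init__(self, **kwargs):
        super(CondorJobState, self).__init__(**kwargs)
        # Drop the exit code file from the list that
        # AsynchronousJobState.cleanup() iterates over, so that the
        # never-created galaxy_<job_id>.ec file is skipped at cleanup time.
        self.cleanup_file_attributes = [attr for attr in self.cleanup_file_attributes
                                        if attr != 'exit_code_file']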
I'm hoping somebody can spot where the hiccup is. Another question on my mind: should a failure to clean up cluster files set the dataset state to ERROR? Inspecting the output file from my job leads me to believe it finished just fine, and reporting a failure to the user because Galaxy couldn't clean up a one-byte exit code file seems a little extreme to me.
Thanks!
--
Branden Timm
btimm@energy.wisc.edu