jobs submitted to a cluster
I've set up Galaxy to submit jobs to our HPC cluster as the logged-in user, using the drmaa Python module to submit the jobs to our Moab server. It appears that the working directory for a submitted job is being removed by Galaxy prior to the job completing on the cluster.

I can see a working directory is created in the logs:

galaxy.jobs DEBUG 2014-06-12 08:21:03,786 (15) Working directory for job is: /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15

I've confirmed the directory is created by watching the file system; within about two seconds of the folder being created, it is deleted.

[root@admin 000]# watch -d ls -lR
Every 2.0s: ls -lR    Thu Jun 12 08:21:06 2014
total 64
drwxrwxrwx 2 dcshrum dcshrum 4096 Jun 12 08:21 15

I see the job sent via DRMAA:

galaxy.jobs.handler DEBUG 2014-06-12 08:21:03,795 (15) Dispatching to drmaa runner
galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,566 (15) submitting file /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh
galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,566 (15) native specification is: -N galaxyjob -l nodes=1,walltime=2:00 -q genacc_q
galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,892 (15) submitting with credentials: dcshrum [uid: 232706]
galaxy.jobs.runners.drmaa INFO 2014-06-12 08:21:06,196 (15) queued as 7570705.moab.local

The job fails:

galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:06,698 (15/7570705.moab.local) state change: job finished, but failed
galaxy.jobs.runners DEBUG 2014-06-12 08:21:07,124 (15/7570705.moab.local) Unable to cleanup /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh: [Errno 2] No such file or directory: '/panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh'

I can see the same error in my Moab log:

*** error from copy
/bin/cp: cannot create regular file `/panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.o': No such file or directory
*** end error output

Any idea as to why Galaxy removes the working directory? Is there a setting in job_conf.xml that would resolve this?

Thanks for any pointers.

Donny
FSU Research Computing Center
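[For context: submission through the drmaa Python module — the library Galaxy's drmaa runner builds on — looks roughly like the sketch below. This is illustrative only: it assumes a working libdrmaa pointed at a live Moab/TORQUE server, and the script name is a stand-in for whatever Galaxy generates. The native specification mirrors the one in the logs above.]

```python
import drmaa

# Open a DRMAA session against whatever scheduler libdrmaa is configured for
s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = '/bin/sh'
jt.args = ['galaxy_15.sh']  # hypothetical job script name, for illustration
# Same shape as the "native specification" line in the Galaxy logs above
jt.nativeSpecification = '-N galaxyjob -l nodes=1,walltime=2:00 -q genacc_q'

job_id = s.runJob(jt)
print('queued as', job_id)

s.deleteJobTemplate(jt)
s.exit()
```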
Hey Donny,

What is the value of keep_completed on your queue (from qmgr -c 'p s')? Could it be that your spool is flushing completed jobs immediately? I ran into issues the other day with libdrmaa requiring at least keep_completed = 60 seconds to properly detect completed jobs and clean up after itself.

Cheers,
-E

-Evan Bollig Research Associate | Application Developer | User Support Consultant Minnesota Supercomputing Institute 599 Walter Library 612 624 1447 evan@msi.umn.edu boll0107@umn.edu

On Thu, Jun 12, 2014 at 7:36 AM, Shrum, Donald C <DCShrum@admin.fsu.edu> wrote:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
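[For anyone hitting this later: the keep_completed value Evan mentions can be inspected and adjusted with qmgr, roughly as below. These commands require TORQUE/Moab admin access on a live server; the queue name genacc_q is taken from the logs above, and queue-level support for keep_completed depends on your TORQUE version.]

```
# Dump the server and queue configuration, then look for the retention setting
qmgr -c 'print server' | grep keep_completed

# Retain completed jobs for 60 seconds server-wide (Evan's suggested minimum)
qmgr -c 'set server keep_completed = 60'

# Or set it on a single queue, e.g. the genacc_q queue from the logs above
qmgr -c 'set queue genacc_q keep_completed = 60'
```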
It's set to 600 seconds, so I don't think that is the issue... Is there some sort of wait time to set in job_conf.xml?

-----Original Message-----
From: Evan Bollig [mailto:boll0107@umn.edu]
Sent: Thursday, June 12, 2014 9:27 AM
To: Shrum, Donald C
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] jobs submitted to a cluster
job_conf.xml is outside my knowledge. Better to wait and see what the others can tell us.

-E

-Evan Bollig Research Associate | Application Developer | User Support Consultant Minnesota Supercomputing Institute 599 Walter Library 612 624 1447 evan@msi.umn.edu boll0107@umn.edu

On Thu, Jun 12, 2014 at 8:31 AM, Shrum, Donald C <DCShrum@admin.fsu.edu> wrote:
My guess is Galaxy is deleting the directory because it believes the job is in error, due to some communication problem while polling your DRM via DRMAA - Galaxy thinks the job has failed before it has even run. You can set cleanup_job = never in the app:main section of universe_wsgi.ini to instruct Galaxy not to delete the working directory. I suspect this will allow the DRM to finish running your job - but Galaxy is still going to fail it, since it cannot properly detect its status. Can you confirm?

-John

On Thu, Jun 12, 2014 at 8:36 AM, Evan Bollig <boll0107@umn.edu> wrote:
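[The setting John describes would look something like the fragment below in universe_wsgi.ini - a sketch only; other entries in your [app:main] section are unaffected.]

```ini
[app:main]
# Never delete job working directories, even for jobs Galaxy marks as failed.
# Other accepted values are "always" and "onsuccess".
cleanup_job = never
```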
Hi John,

That did the trick. I have some other problem, but I don't think it's Galaxy from here.

Thanks again for the reply.

Donny

-----Original Message-----
From: John Chilton [mailto:jmchilton@gmail.com]
Sent: Thursday, June 12, 2014 10:53 AM
To: Evan Bollig
Cc: Shrum, Donald C; galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] jobs submitted to a cluster
participants (3)
- Evan Bollig
- John Chilton
- Shrum, Donald C