jobs submitted to a cluster
I've set up Galaxy to submit jobs to our HPC cluster as the logged-in user, using the drmaa Python module to submit the jobs to our Moab server. It appears that the working directory for a submitted job is being removed by Galaxy prior to the job completing on the cluster.

I can see a working directory is created in the logs:

galaxy.jobs DEBUG 2014-06-12 08:21:03,786 (15) Working directory for job is: /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15

I've confirmed the directory is created by watching the file system; within about two seconds of the folder being created, it is deleted.

[root@admin 000]# watch -d ls -lR
Every 2.0s: ls -lR    Thu Jun 12 08:21:06 2014
total 64
drwxrwxrwx 2 dcshrum dcshrum 4096 Jun 12 08:21 15

I see the job sent via DRMAA:

galaxy.jobs.handler DEBUG 2014-06-12 08:21:03,795 (15) Dispatching to drmaa runner
galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,566 (15) submitting file /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh
galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,566 (15) native specification is: -N galaxyjob -l nodes=1,walltime=2:00 -q genacc_q
galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,892 (15) submitting with credentials: dcshrum [uid: 232706]
galaxy.jobs.runners.drmaa INFO 2014-06-12 08:21:06,196 (15) queued as 7570705.moab.local

The job fails:

galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:06,698 (15/7570705.moab.local) state change: job finished, but failed
galaxy.jobs.runners DEBUG 2014-06-12 08:21:07,124 (15/7570705.moab.local) Unable to cleanup /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh: [Errno 2] No such file or directory: '/panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh'

I can see the same error in my Moab log:

*** error from copy
/bin/cp: cannot create regular file `/panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.o': No such file or directory
*** end error output

Any idea as to why Galaxy removes the working directory? Is there a setting in job_conf.xml that would resolve this?

Thanks for any pointers.

Donny
FSU Research Computing Center
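[For context: submission through the drmaa Python module — the library Galaxy's drmaa runner builds on — looks roughly like the sketch below. This is illustrative only: it assumes a working libdrmaa pointed at a live Moab/TORQUE server, and the script name is a stand-in for whatever Galaxy generates. The native specification mirrors the one in the logs above.]

```python
import drmaa

# Open a DRMAA session against whatever scheduler libdrmaa is configured for
s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = '/bin/sh'
jt.args = ['galaxy_15.sh']  # hypothetical job script name, for illustration
# Same shape as the "native specification" line in the Galaxy logs above
jt.nativeSpecification = '-N galaxyjob -l nodes=1,walltime=2:00 -q genacc_q'

job_id = s.runJob(jt)
print('queued as', job_id)

s.deleteJobTemplate(jt)
s.exit()
```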
Hey Donny,

What is the value of keep_completed on your queue (from qmgr -c 'p s')? Could it be that your spool is flushing completed jobs immediately? I ran into issues the other day with libdrmaa requiring at least keep_completed = 60 seconds to properly detect completed jobs and clean up after itself.

Cheers,
-E

-Evan Bollig Research Associate | Application Developer | User Support Consultant Minnesota Supercomputing Institute 599 Walter Library 612 624 1447 evan@msi.umn.edu boll0107@umn.edu

On Thu, Jun 12, 2014 at 7:36 AM, Shrum, Donald C <DCShrum@admin.fsu.edu> wrote:
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
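[For anyone hitting this later: the keep_completed value Evan mentions can be inspected and adjusted with qmgr, roughly as below. These commands require TORQUE/Moab admin access on a live server; the queue name genacc_q is taken from the logs above, and queue-level support for keep_completed depends on your TORQUE version.]

```
# Dump the server and queue configuration, then look for the retention setting
qmgr -c 'print server' | grep keep_completed

# Retain completed jobs for 60 seconds server-wide (Evan's suggested minimum)
qmgr -c 'set server keep_completed = 60'

# Or set it on a single queue, e.g. the genacc_q queue from the logs above
qmgr -c 'set queue genacc_q keep_completed = 60'
```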
It's set to 600 seconds, so I don't think that is the issue... Is there some sort of wait time to set in job_conf.xml?

-----Original Message-----
From: Evan Bollig [mailto:boll0107@umn.edu]
Sent: Thursday, June 12, 2014 9:27 AM
To: Shrum, Donald C
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] jobs submitted to a cluster
job_conf.xml is outside my knowledge. Better to wait and see what the others can tell us.

-E

-Evan Bollig Research Associate | Application Developer | User Support Consultant Minnesota Supercomputing Institute 599 Walter Library 612 624 1447 evan@msi.umn.edu boll0107@umn.edu

On Thu, Jun 12, 2014 at 8:31 AM, Shrum, Donald C <DCShrum@admin.fsu.edu> wrote:
My guess is Galaxy is deleting the directory because it believes the job is in error, due to some communication problem while polling your DRM via DRMAA - Galaxy thinks the job has failed before it has even run. You can set cleanup_job = never in the app:main section of universe_wsgi.ini to instruct Galaxy not to delete the working directory. I suspect this will allow the DRM to finish running your job - but Galaxy is still going to fail it, since it cannot properly detect its status. Can you confirm?

-John

On Thu, Jun 12, 2014 at 8:36 AM, Evan Bollig <boll0107@umn.edu> wrote:
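[The setting John describes would look something like the fragment below in universe_wsgi.ini - a sketch only; other entries in your [app:main] section are unaffected.]

```ini
[app:main]
# Never delete job working directories, even for jobs Galaxy marks as failed.
# Other accepted values are "always" and "onsuccess".
cleanup_job = never
```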
Hi John,

That did the trick. I have some other problem, but I don't think it's Galaxy from here.

Thanks again for the reply.

Donny

-----Original Message-----
From: John Chilton [mailto:jmchilton@gmail.com]
Sent: Thursday, June 12, 2014 10:53 AM
To: Evan Bollig
Cc: Shrum, Donald C; galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] jobs submitted to a cluster
participants (3)
- Evan Bollig
- John Chilton
- Shrum, Donald C