A galaxy of issues: job output not returned from cluster, DRMAA getting the wrong LSF job ID, job .o files not being found, and files being moved before upload completes
This is a stand-alone instance of Galaxy, built Nov 28 on a small research cluster that uses LSF and runs RHEL 6.2.

galaxy.tools.actions.upload_common INFO 2013-12-05 14:33:01,106 tool upload1 created job id 10
Working directory for job is: /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10
galaxy.jobs.handler DEBUG 2013-12-05 14:33:01,290 (10) Dispatching to drmaa runner
galaxy.jobs DEBUG 2013-12-05 14:33:01,441 (10) Persisting job destination (destination id: drmaa)
galaxy.jobs.handler INFO 2013-12-05 14:33:01,471 (10) Job dispatched
galaxy.tools.deps DEBUG 2013-12-05 14:33:01,693 Building dependency shell command for dependency 'samtools'
galaxy.tools.deps WARNING 2013-12-05 14:33:01,693 Failed to resolve dependency on 'samtools', ignoring
galaxy.jobs.runners.drmaa DEBUG 2013-12-05 14:33:02,352 (10) submitting file /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10/galaxy_10.sh
galaxy.jobs.runners.drmaa DEBUG 2013-12-05 14:33:02,352 (10) command is: python ...
galaxy.jobs DEBUG 2013-12-05 14:33:02,379 (10) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10 galaxy 50982
galaxy.jobs.runners.drmaa DEBUG 2013-12-05 14:33:02,528 (10) submitting with credentials: galaxy [uid: 50981]
galaxy.jobs.runners.drmaa DEBUG 2013-12-05 14:33:02,574 (10) Job script for external submission is: /depot/shared/app/Galaxy/galaxy-dist/database/lsf/10.jt_json
galaxy.jobs.runners.drmaa INFO 2013-12-05 14:33:02,772 (10) queued as Job <28526> is submitted to default queue <medium_priority>. 28526
galaxy.jobs DEBUG 2013-12-05 14:33:02,837 (10) Persisting job destination (destination id: drmaa)
galaxy.jobs.runners.drmaa INFO 2013-12-05 14:33:03,237 (10/Job <28526> is submitted to default queue <medium_priority>. 28526) job left DRM queue with following message: code 18: invalid LSF job id: Job <28526> is submitted to default queue <medium_priority>. 28526
galaxy.jobs DEBUG 2013-12-05 14:33:03,303 (10) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10 galaxy 50982
128.23.163.166 - - [05/Dec/2013:14:33:05 -0400] "GET /api/histories/50a7a2e81473b416/contents HTTP/1.1" 200 - "http://hpcc3.musc.edu:8089/root" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:25.0) Gecko/20100101 Firefox/25.0"
galaxy.jobs.runners ERROR 2013-12-05 14:33:08,661 (10/Job <28526> is submitted to default queue <medium_priority>. 28526) Job output not returned from cluster: [Errno 2] No such file or directory: '/depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10/galaxy_10.o'
galaxy.jobs DEBUG 2013-12-05 14:33:08,701 (10) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10 galaxy 50982
galaxy.jobs DEBUG 2013-12-05 14:33:09,049 finish(): Moved /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10/galaxy_dataset_16.dat to /depot/shared/app/Galaxy/galaxy-dist/database/files/000/dataset_16.dat
galaxy.jobs DEBUG 2013-12-05 14:33:09,049 finish(): Moved /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10/galaxy_dataset_15.dat to /depot/shared/app/Galaxy/galaxy-dist/database/files/000/dataset_15.dat
galaxy.jobs DEBUG 2013-12-05 14:33:09,105 setting dataset state to ERROR
galaxy.jobs DEBUG 2013-12-05 14:33:09,126 setting dataset state to ERROR
galaxy.jobs DEBUG 2013-12-05 14:33:09,214 job 10 ended

"Job output not returned from cluster" appears in the History, BUT in fact files are being written to the directories indicated in my universe_wsgi.ini.

I am having no success finding a solution to this error: "job left DRM queue with following message: code 18: invalid LSF job id".

The file named in "No such file or directory: '/depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10/galaxy_10.o'" exists AFTER the job finishes. So at 2013-12-05 14:33:03 the file had not been written, but it does appear later.

This operation is completing before the file can be uploaded, and so an empty file is moved:

Moved /depot/shared/app/Galaxy/galaxy-dist/database/job_working_directory/000/10/galaxy_dataset_15.dat to /depot/shared/app/Galaxy/galaxy-dist/database/files/000/dataset_15.dat
From my universe_wsgi.ini:
file_path = database/files
new_file_path = database/job_working_directory
job_working_directory = database/job_working_directory
cluster_files_directory = database/job_working_directory

Any troubleshooting suggestions appreciated.

Starr
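For readers following the log: the string reported as the job ID is the whole LSF submission banner followed by the number ("Job <28526> is submitted to default queue <medium_priority>. 28526"), which appears to be why LSF answers with "invalid LSF job id". As an illustration only, here is a minimal Python sketch of pulling the numeric ID out of such a string. The function name `extract_lsf_job_id` is hypothetical and is not part of Galaxy or of any DRMAA library; the banner format is assumed from the log lines in this post.

```python
import re

# Pattern for the banner seen in the log: "Job <28526> is submitted to ..."
# (format assumed from this thread, not taken from LSF documentation).
_BANNER_JOB_ID = re.compile(r"Job <(\d+)>")


def extract_lsf_job_id(submit_output):
    """Return the numeric job ID found in submission output, or None.

    Hypothetical helper: accepts either the banner text or a bare number.
    """
    match = _BANNER_JOB_ID.search(submit_output)
    if match:
        return match.group(1)
    # Fall back to output that is already just the numeric ID.
    stripped = submit_output.strip()
    return stripped if stripped.isdigit() else None


if __name__ == "__main__":
    banner = "Job <28526> is submitted to default queue <medium_priority>. 28526"
    print(extract_lsf_job_id(banner))
```

Whether the banner should be stripped in the submission script or suppressed on the LSF side is a separate question; the sketch only shows that the numeric ID is recoverable from the string in the log.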
Hi Starr,

I'd suggest setting 'retry_job_output_collection' in universe_wsgi.ini to some value > 0 (e.g. 5). You may also want to temporarily remount the filesystem on which your working directories are located with the `-noac` option, just to rule out that the problem is related to attribute caching. The `-noac` option is not a good idea in production because of the performance penalty of disabling attribute caching, but it would be useful for debugging the problem.

--nate
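The idea behind retrying output collection can be sketched in a few lines of Python: instead of failing the first time the job's output file is missing, check again a few times with a short pause, which gives a shared filesystem time to make the file visible. This is only an illustration of the general technique, not Galaxy's actual implementation; the function name `wait_for_output` and its `retries` and `delay` parameters are hypothetical.

```python
import os
import time


def wait_for_output(path, retries=5, delay=1.0):
    """Return True as soon as `path` exists, False if it never shows up.

    Hypothetical helper: checks once, then up to `retries` more times,
    sleeping `delay` seconds between checks.
    """
    for attempt in range(retries + 1):
        if os.path.exists(path):
            return True
        if attempt < retries:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Path taken from the log in this thread; adjust for your own instance.
    output_file = (
        "/depot/shared/app/Galaxy/galaxy-dist/database/"
        "job_working_directory/000/10/galaxy_10.o"
    )
    print(wait_for_output(output_file, retries=5, delay=1.0))
```

If the file reliably appears after a retry or two, that points toward a visibility delay on the shared filesystem, which is what the attribute-caching test is meant to confirm.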
participants (2)
- Hazard, E. Starr
- Nate Coraor