Re: [galaxy-dev] Rocks cluster; jobs run but Galaxy can't find jobid when submitted via drmaa

20 Jan 2016

OK, that is quite odd. Perhaps a Galaxy guru can give you a better answer,
but I remember having this kind of issue a while ago. In the meantime, you
could take a look at these parameters:

$ grep retry galaxy.ini.sample
# these instances, you can choose to retry setting it internally or leave it in
# a failed state (since retrying internally may cause the Galaxy process to be
# option to retry externally, or set metadata manually (when possible).
#retry_metadata_internally = True
# the job's stdout and stderr files when it completes, you can retry reading
# these files.  The job runner will retry the number of times specified below,
#retry_job_output_collection = 0
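
To see which values a given ini file actually sets for these two options, a small sketch like the following may help. It assumes the options live in an [app:main] section; the function name and the value 5 in the sample are made up for illustration, and whether raising retry_job_output_collection helps in this case is not established in this thread.

```python
import configparser


def read_retry_options(ini_text, section="app:main"):
    """Return the two retry-related options from the text of a Galaxy ini file."""
    # interpolation=None: values in the ini file may contain literal '%' characters.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        # Commented-out options fall back to the defaults shown in galaxy.ini.sample.
        "retry_metadata_internally": parser.getboolean(
            section, "retry_metadata_internally", fallback=True
        ),
        "retry_job_output_collection": parser.getint(
            section, "retry_job_output_collection", fallback=0
        ),
    }


# Hypothetical excerpt: one option left commented out, one set explicitly.
sample = """
[app:main]
#retry_metadata_internally = True
retry_job_output_collection = 5
"""

print(read_retry_options(sample))
```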

Best,

Remy

2016-01-20 21:12 GMT+01:00 Eric Shell <eshell@soe.ucsc.edu>:
...
Thanks, Remy.  I went through the cluster documentation and our Rocks
environment seems to be configured properly, after all.
It appears that my issue may be related to the UCSC Main table browser.
The jobs that Galaxy reports have failed are leaving the
job_working_directory behind, with galaxy_#.e error files that contain "The
remote data source application has not sent back a URL parameter in the
request."
[root@campusrocks2 7045]# pwd
/campusdata/galaxy/galaxy/database/job_working_directory/007/7045
[root@campusrocks2 7045]# cat galaxy_7045.e
The remote data source application has not sent back a URL parameter in the request.
These errors correspond with empty dataset_#.dat files
in /campusdata/galaxy/galaxy/database/files/011/:
[root@campusrocks2 7045]# ll /campusdata/galaxy/galaxy/database/files/011/
-rw-rw-r-- 1 galaxy galaxy        0 Jan 20 11:54 dataset_11387.dat
The job failures are intermittent.  Sometimes, a job requesting the exact
same dataset will succeed moments before or after a failed job.  Is there
perhaps a way to tell the table browser to retry when it fails to get the
dataset it is requesting?  Is that even what's going on?
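
To gauge how often this happens, the two symptoms described above (galaxy_#.e files containing the error text, and zero-byte dataset_#.dat files) can be collected with a short script. This is only a diagnostic sketch: the function names are invented here, the paths are the ones quoted in this message, and it does not retry anything.

```python
import os

# Fragment of the message found in the galaxy_#.e files quoted above.
ERROR_TEXT = "has not sent back a URL parameter"


def find_url_param_failures(job_root):
    """Return sorted paths of galaxy_<id>.e files under job_root containing ERROR_TEXT."""
    hits = []
    for dirpath, _dirs, filenames in os.walk(job_root):
        for name in filenames:
            if name.startswith("galaxy_") and name.endswith(".e"):
                path = os.path.join(dirpath, name)
                with open(path, errors="replace") as handle:
                    # Collapse whitespace so a message split across lines still matches.
                    text = " ".join(handle.read().split())
                if ERROR_TEXT in text:
                    hits.append(path)
    return sorted(hits)


def find_empty_datasets(files_root):
    """Return sorted paths of zero-byte dataset_<id>.dat files under files_root."""
    empty = []
    for dirpath, _dirs, filenames in os.walk(files_root):
        for name in filenames:
            if name.startswith("dataset_") and name.endswith(".dat"):
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) == 0:
                    empty.append(path)
    return sorted(empty)


# Paths as quoted in this thread; adjust for another installation.
database = "/campusdata/galaxy/galaxy/database"
for path in find_url_param_failures(os.path.join(database, "job_working_directory")):
    print("URL parameter error:", path)
for path in find_empty_datasets(os.path.join(database, "files")):
    print("empty dataset:", path)
```

Note that an empty dataset file may also belong to a job that is still running, so the second list is only a hint.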
On Wed, Jan 20, 2016 at 7:05 AM, Rémy Dernat <remy.d1@gmail.com> wrote:
...
I forgot to mention that the relevant folders need to be shared between
your systems, and that you should check that the galaxy user has the same
UID/GID on all of them (and that this user has access to SGE).
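
One way to compare the account across machines is to run the same small script on the Galaxy host and on a compute node and compare the printed lines. This is a sketch only; the function name is made up, and "galaxy" is assumed to be the account name.

```python
import grp
import pwd


def account_summary(username):
    """Return a one-line description of a local account's UID and primary GID."""
    entry = pwd.getpwnam(username)  # raises KeyError if the account is unknown
    try:
        group = grp.getgrgid(entry.pw_gid).gr_name
    except KeyError:
        group = "?"  # the primary group has no entry in the group database
    return f"{entry.pw_name} uid={entry.pw_uid} gid={entry.pw_gid}({group})"


# "galaxy" is an assumed account name; "root" is included as a sanity check.
for name in ("galaxy", "root"):
    try:
        print(account_summary(name))
    except KeyError:
        print(f"{name}: no such user on this host")
```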
Remy
2016-01-20 16:00 GMT+01:00 Rémy Dernat <remy.d1@gmail.com>:
...
Hi Eric,
We use both Galaxy and a Rocks cluster here. In Galaxy, you have to define
how your jobs are run in "config/job_conf.xml", and you should probably
source an environment file (search for "environment" in your galaxy.ini)
before jobs are submitted. In particular, you may have to set
DRMAA_LIBRARY_PATH so that your drmaa library can be loaded; see
https://wiki.galaxyproject.org/Admin/Config/Performance/Cluster#DRMAA
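
A quick way to confirm that the variable is set in the environment Galaxy runs under, and that the library it points to can be loaded, is sketched below. The function name is made up, and a successful load does not prove that job submission will work; it only rules out the most basic problem.

```python
import ctypes
import os


def check_drmaa_library(environ=os.environ):
    """Return (ok, message) describing whether DRMAA_LIBRARY_PATH looks usable."""
    path = environ.get("DRMAA_LIBRARY_PATH")
    if not path:
        return False, "DRMAA_LIBRARY_PATH is not set"
    if not os.path.isfile(path):
        return False, f"DRMAA_LIBRARY_PATH points to a missing file: {path}"
    try:
        # Loading the shared object catches wrong architectures and missing dependencies.
        ctypes.CDLL(path)
    except OSError as exc:
        return False, f"could not load {path}: {exc}"
    return True, f"loaded {path}"


ok, message = check_drmaa_library()
print("OK:" if ok else "PROBLEM:", message)
```

Run it as the same user, and after sourcing the same environment file, that Galaxy itself uses.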
Best,
Remy
2016-01-19 20:28 GMT+01:00 Eric Shell <eshell@soe.ucsc.edu>:
...
I am trying to get a Galaxy instance running on a Rocks cluster.  I am
able to run jobs with the local runner at this point, but I am having an
issue with the drmaa runner that I haven't been able to fix.  When I submit
a job in Galaxy it is successfully submitted to the cluster and runs to
completion according to qacct, but Galaxy just reports "failure running
job".
Here's what is written to paster.log when I submit a job:
69.181.235.240 - - [19/Jan/2016:11:24:31 -0700] "GET
...
/api/histories/fb86c918c0d3d33b/contents?dataset_details=bae154fe2294752e%2C6fe732485990d2ac%2C604c4e6e60e997bc%2Cf015f1cb819ec50e%2C9f6f4b3cb6cf43eb%2C3d13d598882b6eb8%2C551006fddcb290ae%2C10b9bbc646c48387%2C7670dfdf35146bc5%2Ce0ec2cf59f1fc79e%2Cee30922e5e4854db%2C9e7a0ba216194210
HTTP/1.1" 200 - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows
NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/47.0.2526.111 Safari/537.36"
69.181.235.240 - - [19/Jan/2016:11:24:38 -0700] "GET
/tool_runner/data_source_redirect?tool_id=ucsc_table_direct1 HTTP/1.1" 302
- "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0;
Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
Safari/537.36"
galaxy.tools.actions.__init__ INFO 2016-01-19 11:24:42,801 Handled output (327.778 ms)
galaxy.tools.actions.__init__ INFO 2016-01-19 11:24:43,236 Verified access to datasets (0.023 ms)
galaxy.tools.execute DEBUG 2016-01-19 11:24:43,343 Tool [ucsc_table_direct1] created job [7019] (919.481 ms)
69.181.235.240 - - [19/Jan/2016:11:24:42 -0700] "POST /tool_runner
HTTP/1.1" 200 - "https://genome.ucsc.edu/cgi-bin/hgTables"
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/47.0.2526.111 Safari/537.36"
galaxy.jobs DEBUG 2016-01-19 11:24:44,056 (7019) Working directory for job is: /campusdata/galaxy/galaxy/database/job_working_directory/007/7019
galaxy.jobs.handler DEBUG 2016-01-19 11:24:44,070 (7019) Dispatching to sge runner
galaxy.jobs DEBUG 2016-01-19 11:24:44,378 (7019) Persisting job destination (destination id: sge_default)
galaxy.jobs.runners DEBUG 2016-01-19 11:24:44,403 Job [7019] queued (332.423 ms)
galaxy.jobs.handler INFO 2016-01-19 11:24:44,444 (7019) Job dispatched
69.181.235.240 - - [19/Jan/2016:11:24:44 -0700] "GET /api/genomes
HTTP/1.1" 200 - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows
NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/47.0.2526.111 Safari/537.36"
69.181.235.240 - - [19/Jan/2016:11:24:44 -0700] "GET
/api/datatypes?extension_only=False& HTTP/1.1" 200 - "
https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
Safari/537.36"
69.181.235.240 - - [19/Jan/2016:11:24:44 -0700] "GET
/history/current_history_json HTTP/1.1" 200 - "
https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
Safari/537.36"
galaxy.jobs.runners DEBUG 2016-01-19 11:24:46,399 (7019) command is:
python /campusdata/galaxy/galaxy/tools/data_source/data_source.py
/campusdata/galaxy/galaxy/database/files/011/dataset_11361.dat 0;
return_code=$?; python
"/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/set_metadata_IaPURP.py"
"/campusdata/galaxy/galaxy/database/tmp/tmp9Qt0cv"
"/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/galaxy.json"
"/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_in_HistoryDatasetAssociation_13512_oucw5s,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_kwds_HistoryDatasetAssociation_13512_ZrUbrF,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_out_HistoryDatasetAssociation_13512_twCvq7,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_results_HistoryDatasetAssociation_13512_FO1cy9,/campusdata/galaxy/galaxy/database/files/011/dataset_11361.dat,/campusdata/galaxy/galaxy/database/job_working_directory/007/7019/metadata_override_HistoryDatasetAssociation_13512_Z_cUTF"
5242880; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:46,787 (7019) submitting file /campusdata/galaxy/galaxy/database/job_working_directory/007/7019/galaxy_7019.sh
galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:46,808 (7019) native specification is: -R y -pe mpi 8 -q small.q
galaxy.jobs DEBUG 2016-01-19 11:24:46,828 (7019) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /campusdata/galaxy/galaxy/database/job_working_directory/007/7019 eshell 100000
galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:47,020 (7019) submitting with credentials: eshell [uid: 38559]
galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:47,129 (7019) Job script for external submission is: /campusdata/galaxy/galaxy/database/pbs/7019.jt_json
galaxy.jobs.runners.drmaa INFO 2016-01-19 11:24:47,130 Running command ['/usr/bin/sudo', '-E', 'scripts/drmaa_external_runner.py', '38559', '/campusdata/galaxy/galaxy/database/pbs/7019.jt_json']
galaxy.jobs.runners.drmaa INFO 2016-01-19 11:24:47,981 (7019) queued as 116563
galaxy.jobs DEBUG 2016-01-19 11:24:48,198 (7019) Persisting job destination (destination id: sge_default)
galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:48,823 (7019/116563) state change: job is queued and active
69.181.235.240 - - [19/Jan/2016:11:24:45 -0700] "GET
/api/histories/fb86c918c0d3d33b/contents?dataset_details=bae154fe2294752e%2C6fe732485990d2ac%2C604c4e6e60e997bc%2Cf015f1cb819ec50e%2C9f6f4b3cb6cf43eb%2C3d13d598882b6eb8%2C551006fddcb290ae%2C10b9bbc646c48387%2C7670dfdf35146bc5%2Ce0ec2cf59f1fc79e%2Cee30922e5e4854db%2C9e7a0ba216194210
HTTP/1.1" 200 - "https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows
NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/47.0.2526.111 Safari/537.36"
galaxy.jobs.runners.drmaa DEBUG 2016-01-19 11:24:53,532 (7019/116563) state change: job is running
galaxy.jobs WARNING 2016-01-19 11:24:53,922 (7019) Ignoring state change from 'error' to 'running' for job that is already terminal
69.181.235.240 - - [19/Jan/2016:11:24:54 -0700] "GET
/api/histories/fb86c918c0d3d33b/contents HTTP/1.1" 200 - "
https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
Safari/537.36"
69.181.235.240 - - [19/Jan/2016:11:24:55 -0700] "GET
/api/histories/fb86c918c0d3d33b HTTP/1.1" 200 - "
https://galaxy.soe.ucsc.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64;
x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111
Safari/537.36"
galaxy.jobs.runners.drmaa INFO 2016-01-19 11:25:06,240 (7019/116563) job left DRM queue with following message: code 18: The job specified by the 'jobid' does not exist.
galaxy.jobs DEBUG 2016-01-19 11:25:06,412 (7019) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /campusdata/galaxy/galaxy/database/job_working_directory/007/7019 galaxy 59997
galaxy.jobs DEBUG 2016-01-19 11:25:06,622 (7019) Changing ownership of working directory with: /usr/bin/sudo -E scripts/external_chown_script.py /campusdata/galaxy/galaxy/database/job_working_directory/007/7019 galaxy 59997
'qacct -j -o eshell' shows that job 7019 completed, though:
qname        all.q
...
hostname     campusrocks2-0-4.local
group        users
owner        eshell
project      NONE
department   defaultdepartment
jobname      g7019_ucsc_table_direct1_eshell_ucsc_edu
jobnumber    116563
taskid       undefined
account      sge
priority     0
qsub_time    Tue Jan 19 11:24:47 2016
start_time   Tue Jan 19 11:24:53 2016
end_time     Tue Jan 19 11:25:05 2016
granted_pe   mpi
slots        8
failed       0
exit_status  0
ru_wallclock 12
ru_utime     9.285
ru_stime     0.908
ru_maxrss    98384
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    81778
ru_majflt    2
ru_nswap     0
ru_inblock   6728
ru_oublock   184
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     13952
ru_nivcsw    301
cpu          10.192
mem          2.326
io           0.150
iow          0.000
maxvmem      448.820M
arid         undefined
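
For checking many jobs at once, the qacct record above can be turned into a dictionary and tested for success on the SGE side. This is a sketch with an invented function name; the sample is a subset of the fields shown above, and it says nothing about why Galaxy marks the job as failed.

```python
def parse_qacct(text):
    """Parse 'key   value' lines of a single qacct -j record into a dict."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Lines without a value (separators, elisions) are skipped.
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def succeeded(record):
    """True if both 'failed' and 'exit_status' start with 0."""
    # 'failed' may carry an explanation after the code, so compare the first token only.
    return (
        record.get("failed", "").split()[:1] == ["0"]
        and record.get("exit_status", "").split()[:1] == ["0"]
    )


# Subset of the record quoted above.
sample = """\
jobname      g7019_ucsc_table_direct1_eshell_ucsc_edu
jobnumber    116563
end_time     Tue Jan 19 11:25:05 2016
failed       0
exit_status  0
"""

record = parse_qacct(sample)
state = "succeeded" if succeeded(record) else "failed"
print(record["jobnumber"], state, "on the SGE side; ended", record["end_time"])
```

In this record the end_time of 11:25:05 is about one second before the "does not exist" message in paster.log (11:25:06), assuming the clocks on both hosts agree.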
Why does Galaxy not see the job after it has been submitted to the
cluster?
Thanks in advance for your help!
--
Eric Shell
UNIX Software & Google Apps Administrator
Baskin School of Engineering
UC Santa Cruz
831 459 4919