Job output not returned from cluster

Joseph Hargitai

13 Oct 2011

11:18 p.m.

Hi,

I was browsing through the list and found many entries for this issue but no definite answer. We are actually running into this error for simple file uploads from the internal filesystem.

thanks,
joe


Nate Coraor

21 Oct 2011

2:26 p.m.


Hi Joe,

This error occurs when the job's standard output and error files are not found where Galaxy expects them, namely:

<cluster_files_directory>/<job_id>.o
<cluster_files_directory>/<job_id>.e

Please check your queueing system to make sure it can correctly deliver these back from the execution hosts to the specified filesystem.

--nate
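For anyone debugging this, the check described above can be sketched in Python. This is only an illustration under stated assumptions, not Galaxy's own code: the function name `missing_job_outputs` is hypothetical, and the directory and job id are whatever your site uses for `<cluster_files_directory>` and `<job_id>`.

```python
import os
import tempfile


def missing_job_outputs(cluster_files_directory, job_id):
    """Return the expected .o/.e paths that are not visible on this host."""
    expected = [
        os.path.join(cluster_files_directory, f"{job_id}.o"),
        os.path.join(cluster_files_directory, f"{job_id}.e"),
    ]
    return [path for path in expected if not os.path.exists(path)]


if __name__ == "__main__":
    # Self-contained demo in a scratch directory (hypothetical job id "42").
    with tempfile.TemporaryDirectory() as scratch:
        # Create only the stdout file, so the stderr file is reported missing.
        open(os.path.join(scratch, "42.o"), "w").close()
        for path in missing_job_outputs(scratch, "42"):
            print("not found:", path)
```

Running the same check on the Galaxy server right after a job leaves the queue shows whether the files are missing entirely or merely not yet visible there.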


___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/

Joseph Hargitai

3:04 p.m.

Nate,

this error is intermittent. If you resubmit the same job two or three times, it eventually works. Once we are past the midterm exams, which use Galaxy, we will try switching the filesystem from autofs to a hard mount. We suspect this to be the issue.

Could we suppress the e and o files, SGE style, to resolve this issue, or does Galaxy need the o file?

Do you have any idea about the URL build for the Galaxy - UCSC page return when the URL is :8080/galaxy and not just /galaxy?

thanks,
joe

Nate Coraor

24 Oct 2011

5:24 p.m.

Joseph Hargitai wrote:

Nate,

this error is intermittent. You resubmit the same job two or three times and then it works. Once we are over the midterm exams - which use galaxy - we will try to switch the filesystem from autofs to hard mount. We suspect this to be the issue.

Ah, I suspect this is attribute caching in NFS. Try mounting with the option 'noac' and see if it solves the problem.

Could we suppress e and o SGE style to resolve this issue, or Galaxy wants the o?

The filename is unimportant, but I doubt it's the cause.

Do you have an idea about the url build for galaxy - ucsc page return when the url is :8080/galaxy and not just /galaxy?

Not off the top of my head. I have this message marked, and I'll take a look as soon as I have time.

--nate
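To see whether a mount currently carries the 'noac' option suggested above, one approach is to look at the mount table. The sketch below assumes a Linux host, where /proc/mounts lists one mount per line as device, mount point, filesystem type, options, and two numeric fields. The function name `mount_options` and the sample line (server name, export path, mount point) are hypothetical.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for mount_point, or None if absent.

    mounts_text is expected in the /proc/mounts layout:
    device mount_point fstype options dump pass
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last match: later mounts shadow earlier ones.
            found = set(fields[3].split(","))
    return found


if __name__ == "__main__":
    # Hypothetical sample; on a real host, read the text from /proc/mounts.
    sample = "fileserver:/export/galaxy /home/galaxy nfs rw,noac,vers=3 0 0\n"
    options = mount_options(sample, "/home/galaxy")
    print("noac enabled:", options is not None and "noac" in options)
```

With autofs the mount may only appear in the table after the path has been accessed, so it is worth touching the directory first.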


Edward Kirton

28 Nov 2011

9:15 p.m.

Hi,

we've had this issue too. In short, the cluster nodes finish writing the output files to disk, but the filesystem (inode metadata) isn't yet updated on the Galaxy server when Galaxy checks for the files. Turning metadata caching off (as recommended on the Galaxy wiki) isn't an option for me (and the performance hit would be significant), so I added some loops around the file checking (5 sec sleep and retry up to 6 times). There were a couple of places where this probably should be done (not just the .[eo]* log files but also the outfiles).

I am testing these hacks now, but due to the intermittent nature of these errors it'll be a few days before I know if this is working as expected. Once vetted, I will put these minor edits in a clone of galaxy-central so the changes can be picked up.

ed
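The retry workaround Ed describes (sleep 5 seconds, retry up to 6 times) might look roughly like the sketch below. This is not his actual patch, which is not shown in the thread. The name `wait_for_file` and its parameters are hypothetical, and "retry up to 6 times" is read here as six retries after the first check.

```python
import os
import tempfile
import time


def wait_for_file(path, retries=6, delay=5.0, sleep=time.sleep):
    """Return True as soon as path is visible, False if it never appears.

    The file is checked once, then re-checked up to `retries` more times,
    pausing `delay` seconds before each retry.
    """
    for attempt in range(retries + 1):
        if os.path.exists(path):
            return True
        if attempt < retries:
            sleep(delay)  # give cached filesystem metadata time to refresh
    return False


if __name__ == "__main__":
    # Demo with a file that already exists, so no sleeping takes place.
    with tempfile.NamedTemporaryFile(suffix=".o") as handle:
        print("visible:", wait_for_file(handle.name))
```

The `sleep` argument is injectable only so the loop can be tested without waiting. With the defaults, a job whose files never arrive is reported as failed after about 30 seconds instead of immediately.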

Joseph Hargitai

9:28 p.m.

Ed,

we had the classic goof on our cluster with this. Four nodes could not see the /home/galaxy folder due to a missing entry in /etc/fstab. When jobs landed on those nodes (which explains the randomness), we got the error message.

What was bothersome was the lack of good logs to go on. The error message was too generic. However, I discovered that Galaxy was depositing the error and out messages in the /pbs folder, and you could briefly read them before they got deleted. There the message was the classic SGE input/output message: /home/galaxy.... file not found.

Hence my follow-up question: how can I have Galaxy NOT delete these SGE error and out files?

best,
joe
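A quick way to catch the kind of misconfiguration Joe describes is to run a small visibility check on every execution host, for example through the scheduler itself. The sketch below is one possible version; the function name `check_shared_dir` is hypothetical, and /home/galaxy is simply the path mentioned in this thread.

```python
import os
import socket


def check_shared_dir(path):
    """Return a list of problems with path on this host (empty if none)."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} is missing or not a directory")
    elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"{path} is not fully accessible to this user")
    return problems


if __name__ == "__main__":
    # Prints nothing on a healthy node; prefixes any problem with the hostname.
    for problem in check_shared_dir("/home/galaxy"):
        print(f"{socket.gethostname()}: {problem}")
```

Because the output includes the hostname, submitting this as a job to each node points directly at the hosts with the missing mount.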

Peter Cock

29 Nov 2011

9:13 a.m.

On Monday, November 28, 2011, Joseph Hargitai <joseph.hargitai@einstein.yu.edu> wrote:

Hence my follow up question - how can I have galaxy NOT to delete these SGE error and out files?

Better yet, Galaxy should read the SGE o and e files and record their contents as it would for a directly executed tool's stdout and stderr.

Peter

Fields, Christopher J

30 Nov 2011

2:22 a.m.

On Nov 29, 2011, at 3:13 AM, Peter Cock wrote:

Better yet, Galaxy should read the SGE o and e files and record their contents as it would for a directly executed tool's stdout and stderr.

...or at least have the option to do so, maybe with a level of verbosity. I have been bitten by the lack of stderr output myself, where having it might have saved some manual debugging.

chris

Nate Coraor

1 Dec 2011

6:01 p.m.

On Nov 29, 2011, at 9:22 PM, Fields, Christopher J wrote:

On Nov 29, 2011, at 3:13 AM, Peter Cock wrote:

Better yet, Galaxy should read the SGE o and e files and record their contents as it would for a directly executed tool's stdout and stderr.

...or at least have the option to do so, maybe a level of verbosity. I have been bitten by lack of stderr output myself, where having it might have saved some manual debugging.

Unless I'm misunderstanding, this is what Galaxy already does. stdout/stderr up to 32K are read from .o and .e and stored in job.stdout/job.stderr. We do need to just store them as files and make them accessible for each tool run; this will hopefully happen sometime soonish.

--nate
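The capped read Nate mentions (at most 32K taken from the .o and .e files) can be illustrated as follows. This is a sketch of the general technique rather than Galaxy's implementation; the name `read_capped` and the UTF-8 decoding are assumptions.

```python
import os
import tempfile


def read_capped(path, limit=32 * 1024):
    """Return at most `limit` bytes from the start of path, decoded as text."""
    with open(path, "rb") as handle:
        data = handle.read(limit)
    # Replace undecodable bytes so a cut-off multi-byte character cannot fail.
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Demo: a 40000-byte stand-in for a verbose stderr file.
    with tempfile.NamedTemporaryFile(suffix=".e", delete=False) as handle:
        handle.write(b"x" * 40000)
    print(len(read_capped(handle.name)), "characters kept")
    os.remove(handle.name)
```

Reading with a fixed limit keeps very verbose tools from loading an arbitrarily large file into memory, at the cost of dropping everything past the cap.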

Edward Kirton

2 Dec 2011

1:51 a.m.

Yes, I think Galaxy already grabs these files. I seem to recall this process would get stuck if the output was too large (I was running something with a --debug/verbose option and Galaxy would not finish the job even though it was off the cluster; I had to redirect the output to a log file).

So I guess others aren't having the same problems I had, which is good news.
