Problem related to a job that "failed"
On May 11, 2012, at 3:39 PM, Kelkar, Hemant wrote:
All,
This question pertains to a local Galaxy install where jobs are submitted to a cluster running LSF.

I periodically get an error from a Galaxy job ("Job output not returned from cluster") even though the job completes properly on the cluster. In researching this issue, our systems administrator discovered that the error seems to be caused by NFS caching: the job finishes on the compute node, but the Galaxy server doesn't see the finished output file because of delays in the NFS cache update. The only solution he found was turning caching off on the file system. In our case that is not possible, because it would cause a performance hit that would not be acceptable on the shared cluster. Most other file systems on the cluster are also NFS, so moving the database/pbs folder to another file system is not an option.
Hi Hemant,

You wouldn't need to change attribute caching on the entire cluster, just on the filesystem where Galaxy is stored, on the server that runs the Galaxy application. If that's not possible either, see the 'retry_job_output_collection' option in universe_wsgi.ini.
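For reference, a minimal sketch of what that setting might look like in universe_wsgi.ini. The section name and the value of 5 below are illustrative assumptions rather than values from this thread; tune the retry count to how far your NFS attribute cache tends to lag.

    # universe_wsgi.ini (sketch; section and value are illustrative assumptions)
    [app:main]
    # Number of extra attempts Galaxy makes to find a job's output files
    # before reporting "Job output not returned from cluster".
    retry_job_output_collection = 5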
I can generally work around the NFS cache issue, but now it is leading to a second problem. Since the job appears to Galaxy to be in a failed state (it shows up as red in the history), I can't use the output file to move to the next step, even though the file is there and I can view it with the "eye" icon. The file attributes are set correctly.

I assume a possible solution would be to reset the "failed" flag on the history item. Would this need to be done in the database? Downloading and then re-uploading the result file (a 25+ GB SAM file in this case) might be a workaround, but it is not very practical.
Yes, this would be something to update in the database. The process is something like:

1a. Collect the encoded history_dataset_association ID from the errored dataset's display URL.
1b. Use galaxy-dist/scripts/helper.py -d <encoded_hda_id> to decode the ID.
1c. In the database, 'select dataset_id from history_dataset_association where id=<decoded_hda_id>'

OR

1. Collect the output dataset ID from the command line that was executed and logged in the Galaxy server log.

2. In the database, 'update dataset set state='ok' where id=<dataset_id>'

--nate
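Spelled out, the database half of that process might look roughly like the SQL below. This is only a sketch: the table and column names come from the steps above, 42 stands in for whatever decoded ID you obtained, and <dataset_id> is the value returned by the first query.

    -- Sketch: find the underlying dataset for the errored history item
    -- (42 is a placeholder for your decoded history_dataset_association id).
    select dataset_id from history_dataset_association where id = 42;

    -- Then mark that dataset as OK, substituting the dataset_id returned above:
    update dataset set state = 'ok' where id = <dataset_id>;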
Any ideas/suggestions?
Thanks,
Hemant
On May 11, 2012, at 4:05 PM, Kelkar, Hemant wrote:
Thanks Nate for your prompt response.
Here is what I tried.
Got the ID (e.g. 02b1cfb59f716d61) from the file download link (could not find the "display URL" as you described).
Then tried to use the helper script but got the following error:
python ./helper.py -d 02b1cfb59f716d61
Traceback (most recent call last):
  File "/path_to/galaxy-prod/galaxy-dist/scripts/helper.py", line 40, in ?
    from galaxy.web import security
  File "lib/galaxy/web/__init__.py", line 5, in ?
    from framework import expose, json, json_pretty, require_login, require_admin, url_for, error, form, FormBuilder, expose_api
  File "lib/galaxy/web/framework/__init__.py", line 8, in ?
    pkg_resources.require( "Cheetah" )
  File "lib/galaxy/eggs/__init__.py", line 415, in require
    raise EggNotFetchable( str( [ egg.name for egg in e.eggs ] ) )
galaxy.eggs.EggNotFetchable: ['Cheetah']
That's pretty odd; you're executing this on the Galaxy server? What happens if you run fetch_eggs.py in the same directory?

--nate
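For anyone hitting the same EggNotFetchable error, the check Nate suggests would presumably look something like the following, run from the same scripts directory helper.py was run from (a sketch; the path is the placeholder from the traceback above):

    cd /path_to/galaxy-prod/galaxy-dist/scripts
    # Fetch Galaxy's bundled Python egg dependencies (Cheetah among them),
    # then retry decoding the encoded ID:
    python ./fetch_eggs.py
    python ./helper.py -d 02b1cfb59f716d61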
On May 11, 2012, at 4:18 PM, Kelkar, Hemant wrote:
No, I did not execute it on the Galaxy server; I do not have shell access to the blade that actually runs it. So I would need to get my sysadmin to run this, then?
Not necessarily, as long as you are running it as the same user and have access to the database from wherever you're running it.

--nate
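In practice, "access to the database" means being able to reach whatever is configured as database_connection in Galaxy's universe_wsgi.ini from the machine you run the SQL on, for example (host, user, password, and database names below are made-up placeholders):

    # universe_wsgi.ini (sketch; all names below are made-up placeholders)
    database_connection = postgres://galaxy_user:secret@dbhost.example.org/galaxy_db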
No, I was not running it as user "galaxy" (I was running it as myself). I suppose I will have to ask our systems administrator to run the command, so in general this would not be an easily end-user-accessible solution for us. We will first try changing the 'retry_job_output_collection' setting, as you suggested, to see if we can avoid the "failed" job problem altogether. Thanks again.
Increasing the 'retry_job_output_collection' value in universe_wsgi.ini fixed the "failed" job problem. Thanks for the suggestion, Nate.

--Hemant
participants (2)
- Kelkar, Hemant
- Nate Coraor