troubleshooting Galaxy with LSF
Hello,

This is largely a repost from the Biostars forum, following the suggestion there to post here. I'm taking my first steps in setting up a Galaxy server with an LSF job scheduler. Recently LSF started supporting DRMAA again, so I decided to give it a go.

I have two setups. The one that works is a standalone server (OpenSuse 12.1, Python 2.7.2, LSF 9.1.2). By "works" I mean that when I log in to Galaxy using a browser and upload a file, a job gets submitted and run, and everything seems fine.

The second setup does not work (RHEL 6.4, Python 2.6.6, LSF 9.1.2). It's a server running Galaxy which is meant to submit jobs to an LSF cluster. When I similarly pick and upload a file, I get:

Job <72266> is submitted to queue <short>.
./run.sh: line 79: 99087 Segmentation fault      python ./scripts/paster.py serve universe_wsgi.ini $@

For the moment I'm not bothered with the full server setup; I'm just testing whether Galaxy works with LSF, and therefore run ./run.sh as a user. The job configuration job_conf.xml is identical in both cases:

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="lsf" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner">
            <param id="drmaa_library_path">/opt/gridware/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/libdrmaa.so</param>
        </plugin>
    </plugins>
    <handlers>
        <handler id="main"/>
    </handlers>
    <destinations default="lsf_default">
        <destination id="lsf_default" runner="lsf">
            <param id="nativeSpecification">-W 24:00</param>
        </destination>
    </destinations>
</job_conf>

run.sh is only changed to allow remote access. Most recently I tried replacing Python with 2.7.5, to no avail; still the same kind of error. I also updated Galaxy.

Any hints would be much appreciated. Thank you.
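A minimal python-drmaa check, independent of Galaxy, can help show whether the crash is in the DRMAA stack rather than in Galaxy itself. The following is only a sketch: it assumes the drmaa Python package Galaxy uses is importable, reuses the library path from the job_conf.xml above, and the queue name and wall time are illustrative.

    import os

    # python-drmaa locates libdrmaa via this variable; set it before importing drmaa
    os.environ["DRMAA_LIBRARY_PATH"] = \
        "/opt/gridware/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/libdrmaa.so"

    import drmaa

    s = drmaa.Session()
    s.initialize()
    jt = s.createJobTemplate()
    jt.remoteCommand = "/bin/hostname"           # trivial payload
    jt.nativeSpecification = "-W 0:05 -q short"  # illustrative; match your site
    job_id = s.runJob(jt)                        # the call Galaxy makes at submission
    print "submitted job %s" % job_id
    print "status: %s" % s.jobStatus(job_id)
    s.deleteJobTemplate(jt)
    s.exit()

If this script also crashes, Galaxy is exonerated and the problem lies in python-drmaa or libdrmaa.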
This is just a guess, which may help you troubleshoot. It could be that Python is reaching a stack limit: run ulimit -s and set it to a higher value if required. I'm completely guessing here, but is it possible that the DRMAA library is missing a linked dependency on the Red Hat system? Check with ldd.

Regards,

Iyad Kandalaft
Microbial Biodiversity Bioinformatics
Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
Iyad.Kandalaft@agr.gc.ca | 613-759-1228
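The same check can be made, and the limit raised, from inside Python before Galaxy starts. A sketch using only the standard library:

    import resource

    # Report the current stack limits, the programmatic equivalent of "ulimit -s"
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print "stack soft limit: %s" % soft   # -1 means RLIM_INFINITY
    print "stack hard limit: %s" % hard

    # Raise the soft limit as far as the hard limit allows; this must run in
    # the process (or shell) that will start Galaxy, before libdrmaa is loaded
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))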
Thank you, Iyad. Indeed, setting ulimit -s to unlimited helped to advance this further. I can see now that a job gets generated and submitted. However, Galaxy crashes immediately after that:

Job <108038> is submitted to queue <short>.
*** glibc detected *** python: free(): invalid pointer: 0x00007fff79f10b64 ***
======= Backtrace: =========
< further output is omitted >

Tracking the job through the scheduler reveals that it finished successfully. The command in the job script is something like this:

python /galaxy-dist/tools/data_source/upload.py /galaxy-dist /galaxy-dist/database/tmp/tmpGY5_lI /galaxy-dist/database/tmp/tmpr7VKGy 1:/galaxy-dist/database/job_working_directory/000/1/dataset_1_files:/galaxy-dist/database/files/000/dataset_1.dat
usage: upload.py <root> <datatypes_conf> <json paramfile> <output spec> ...

I cannot re-run it because only the first file in the tmp folder is there; the second (the JSON paramfile, tmpr7VKGy) is gone. I presume dataset_1.dat is the output, and it's there. The second half of the job script is the execution of set_metadata.sh, which I can execute without issues (is this a DB update?).

One significant difference between the setup that works and the one that doesn't is that the working setup sits on local disk, whereas the non-working one is on Lustre. Could that be relevant?

By the way, is there a method for removing the pending job? When I re-run Galaxy, it promptly crashes again due to the stuck job.

When Galaxy starts, the only error that I see is this:

IOError: [Errno 2] No such file or directory: './tools/mutation/visualize.xml'

While it might be a good question why the mutation directory is not there, this error is very likely not relevant to the issue. So I'm open to further suggestions as to how to understand what's going on. Thank you.
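On removing the pending job: Galaxy records job state in its database, so one approach is to mark the stuck jobs as deleted before restarting. The sketch below assumes the default SQLite database at database/universe.sqlite and Galaxy's job table with its state column; back the file up first and adjust the path to your install.

    import sqlite3

    DB = "/galaxy-dist/database/universe.sqlite"  # default path; adjust to your install

    conn = sqlite3.connect(DB)
    cur = conn.cursor()
    cur.execute("SELECT id, state, tool_id FROM job "
                "WHERE state IN ('new', 'queued', 'running')")
    for job_id, state, tool_id in cur.fetchall():
        print "resetting job %s (state=%s, tool=%s)" % (job_id, state, tool_id)
    # Mark the stuck jobs deleted so Galaxy does not try to re-dispatch them
    cur.execute("UPDATE job SET state = 'deleted' "
                "WHERE state IN ('new', 'queued', 'running')")
    conn.commit()
    conn.close()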
Hi Kozin,

Are you using a Python environment specifically for Galaxy? If not, then jobs running on the compute nodes will be using the wrong Python environment. I set up Galaxy (via a universe_wsgi.ini option) to source the Python environment for Galaxy before every job.

Galaxy is coded to work only if it is shared across the cluster under the same path for all the nodes. Is this the case for the install sitting on Lustre? That is, is /home/galaxy/ mounted on every compute node in the cluster from your Lustre filesystem?

I would be interested in the omitted output (assuming it is relevant).

Regards,

Iyad Kandalaft
Microbial Biodiversity Bioinformatics
Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
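Both assumptions (the same Python and the same shared path on every node) can be verified with a small probe script run once on the head node and once through LSF, e.g. "bsub -q short python probe.py", comparing the output. A sketch; the Galaxy root is illustrative:

    import os
    import socket
    import sys

    GALAXY_ROOT = "/galaxy-dist"  # illustrative; use your Galaxy install path

    print "host:        %s" % socket.gethostname()
    print "python:      %s (%s)" % (sys.executable, sys.version.split()[0])
    print "PYTHONPATH:  %s" % os.environ.get("PYTHONPATH", "<unset>")
    print "galaxy root: %s (exists: %s)" % (GALAXY_ROOT, os.path.isdir(GALAXY_ROOT))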
The problem seems to be with DRMAA for Python. While it works fine on the OpenSuse 12.1 box, I'm getting a segfault on RHEL 6.4. Surprisingly, the job nevertheless gets submitted and runs successfully.
Just out of curiosity, are you running a Python environment isolated for Galaxy, or just using a system-wide Python environment?

Regards,

Iyad Kandalaft
Microbial Biodiversity Bioinformatics
Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
I'm not sure what you mean. I'm not using a Python virtualenv, Anaconda, etc., but I do set all the variables required to run Galaxy, such as the DRMAA path and PYTHONPATH. In addition to the default Python 2.6, I also tried Python 2.7 built from source.
If you are running Galaxy on a cluster, you can do one of two things:

1. Install Python on every node in your cluster and ensure the installs are identical. (I've had problems with this method.)
2. Use pyenv to create a Python environment that gets sourced before you start Galaxy and before any Galaxy jobs are run. I usually activate it by putting it in .bashrc.

Iyad Kandalaft
Bioinformatics Programmer
Microbial Biodiversity Bioinformatics
Science & Technology Branch
Agriculture & Agri-Food Canada
Iyad.Kandalaft@agr.gc.ca | (613) 759-1228
We are doing option 1 on our clusters. Like I said elsewhere, the problem appears to be in the interaction between Python DRMAA and LSF DRMAA, not Galaxy.
Hi,

The DRMAA library seems suspect here. I am not sure what the current state of the Platform/IBM-supplied libdrmaa is, but the vendor versions have given me problems in the past. Could you try FedStage DRMAA for LSF?

http://sourceforge.net/projects/lsf-drmaa/files/lsf_drmaa/

It does look a bit older than FedStage/PSNC's other DRMAA implementations, but those have been known to work well in the past, and this library can also have debugging output enabled (./configure --enable-debug) if you get segfaults with it, which would be useful for producing cores.

--nate
Hi Nate,

The "new" LSF DRMAA is the old FedStage source code. However, the latest release (v1.1.1) is claimed to be compatible with LSF 9.1.2, which we are using, and it does indeed work. I've rebuilt the lib with --enable-debug and confirmed that the segfault happens on returning from drmaa_run_job back to Python. I'm talking to Dan Blanchard, the Python DRMAA maintainer, so there is hope this can get resolved. I'm doing my debugging on the latest Python DRMAA release, 0.7.6; Galaxy is still on 0.6, so it will need updating at some point.

Best
Igor
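Since the segfault appears exactly on the return from drmaa_run_job, one way to pin down whether the corruption happens in the python-drmaa binding or in libdrmaa itself is to call the DRMAA 1.0 C API directly through ctypes, bypassing the binding. A sketch under those assumptions; the library path is the one from job_conf.xml and should be adjusted for your site:

    import ctypes

    ERRLEN = 1024  # DRMAA error buffer size
    JOBLEN = 64    # job id buffer size

    lib = ctypes.CDLL(
        "/opt/gridware/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/libdrmaa.so")
    err = ctypes.create_string_buffer(ERRLEN)

    def check(rc, what):
        # DRMAA_ERRNO_SUCCESS is 0 in the DRMAA 1.0 C binding
        if rc != 0:
            raise RuntimeError("%s failed (%d): %s" % (what, rc, err.value))

    check(lib.drmaa_init(None, err, ERRLEN), "drmaa_init")

    jt = ctypes.c_void_p()
    check(lib.drmaa_allocate_job_template(ctypes.byref(jt), err, ERRLEN),
          "drmaa_allocate_job_template")
    check(lib.drmaa_set_attribute(jt, "drmaa_remote_command", "/bin/hostname",
                                  err, ERRLEN), "drmaa_set_attribute")

    job_id = ctypes.create_string_buffer(JOBLEN)
    # This is the call after which python-drmaa segfaults; if it returns
    # cleanly here, the C library is likely fine and the binding is suspect
    check(lib.drmaa_run_job(job_id, JOBLEN, jt, err, ERRLEN), "drmaa_run_job")
    print "submitted: %s" % job_id.value

    lib.drmaa_delete_job_template(jt, err, ERRLEN)
    lib.drmaa_exit(err, ERRLEN)

If the ctypes version submits cleanly while the drmaa package crashes, the fault is in the binding's buffer handling rather than in the C library.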
participants (4)
- I Kozin
- INKozin
- Kandalaft, Iyad
- Nate Coraor