Galaxy integration with LSF: seg fault
Hello
I've set up Galaxy to use LSF. My first job failed because Galaxy submitted it to the default queue, which was wrong in my case. However, Galaxy gracefully survived the failure; I was able to get the job number from the console output and figure out what went wrong.
Next time I ran Galaxy with the LSB_DEFAULTQUEUE environment variable set like this:
LSB_DEFAULTQUEUE=test DRMAA_LIBRARY_PATH=/usr/local/lsf/7.0/linux2.6-glibc2.3-x86_64/lib/libdrmaa.so.1.0.4 PATH=/usr/bin:/software/solexa/bin:$PATH sh run.sh
The job is submitted to the correct queue, and at this point Galaxy fails with this error:
run.sh: line 46: 6506 Segmentation fault python ./scripts/paster.py serve universe_wsgi.ini $@
The job itself completes successfully in its own time.
When I try to run Galaxy again I get the following:
galaxy.jobs DEBUG 2011-02-02 16:27:32,565 dispatching job 36 to drmaa runner
galaxy.jobs INFO 2011-02-02 16:27:32,675 job 36 dispatched
galaxy.jobs.runners.drmaa DEBUG 2011-02-02 16:27:33,192 (36) submitting file /nfs/users/nfs_m/mg8/mygalaxy/galaxy-dist/database/pbs/galaxy_36.sh
galaxy.jobs.runners.drmaa DEBUG 2011-02-02 16:27:33,192 (36) command is: java -jar /nfs/users/nfs_m/mg8/mygalaxy/galaxy-dist/tool-data/shared/jars/SamToFastq.jar VALIDATION_STRINGENCY=SILENT QUIET=true INPUT=/lustre/scratch103/sanger/mg8/galaxy/datasets/000/dataset_16.dat FASTQ=/lustre/scratch103/sanger/mg8/galaxy/datasets/000/dataset_51.dat SECOND_END_FASTQ=/lustre/scratch103/sanger/mg8/galaxy/datasets/000/dataset_52.dat
Job <855341> is submitted to queue <test>.
run.sh: line 46: 6506 Segmentation fault python ./scripts/paster.py serve universe_wsgi.ini $@
i.e. it looks like Galaxy is trying to pick up where it left off and fails again.
I configured my job runners like this:
start_job_runners = drmaa
default_cluster_job_runner = drmaa:///
Any suggestions?
Regards
Marina
-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
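A possible first debugging step for the segmentation fault reported above, not something that was tried in this thread: Python's standard faulthandler module (in the standard library from Python 3.3; older Python 2 installs would need the separately packaged backport) can write the Python-level traceback of every thread when the interpreter is killed by SIGSEGV. That would show which Python call was active when the native library crashed. The sketch below is a minimal illustration; the helper name and the log file name are hypothetical, and where to call it during server start-up is left open.

```python
import faulthandler
import os
import tempfile


def enable_crash_traceback(log_path):
    """Write Python tracebacks to log_path if the process hits a fatal signal.

    The returned file object must stay open for the life of the process,
    because faulthandler writes to its file descriptor at crash time.
    """
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical log location, chosen only for this illustration.
crash_log_path = os.path.join(tempfile.gettempdir(), "galaxy_segfault_traceback.log")
crash_log = enable_crash_traceback(crash_log_path)
```

The traceback only identifies the Python frame that called into native code; it does not replace a core dump or a debugger session on the library itself.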
It looks like Galaxy is OK with an LSF queue that submits to Debian etch nodes; it segfaults with jobs running on Debian lenny nodes. The server itself is running on a lenny node. Investigating further...
Marina
This has now been narrowed down to a segfault in the drmaa libraries immediately after submitting a job when an LSF queue is set explicitly with the LSB_DEFAULTQUEUE environment variable.
Marina
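One way to confirm this kind of finding outside Galaxy is to run a small job-submission command in a child process, once with LSB_DEFAULTQUEUE set and once without, and check whether the child was killed by a signal. The sketch below is only an illustration under stated assumptions: run_isolated is a hypothetical helper, and the command passed to it would be whatever minimal script submits one job through the DRMAA library (no such script is shown in this thread).

```python
import os
import signal
import subprocess


def run_isolated(command, env_overrides):
    """Run command in a child process and describe how it ended.

    env_overrides maps variable names to values; a value of None removes
    the variable from the child's environment.
    """
    env = dict(os.environ)
    for name, value in env_overrides.items():
        if value is None:
            env.pop(name, None)
        else:
            env[name] = value
    result = subprocess.run(command, env=env)
    if result.returncode < 0:
        # On POSIX a negative return code means the child died from a signal.
        return "killed by " + signal.Signals(-result.returncode).name
    return "exited with status %d" % result.returncode
```

Calling it twice with the same command, first with {"LSB_DEFAULTQUEUE": "test"} and then with {"LSB_DEFAULTQUEUE": None}, gives a direct comparison, and a crash in the child does not take down the process doing the comparison.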
Hi Marina,
Thanks for posting updates and information... we've moved from SGE to LSF and are going to implement Galaxy LSF integration next month, so what you have posted is very interesting.
-Leandro
On Mon, Feb 7, 2011 at 6:22 PM, Marina Gourtovaia <mg8@sanger.ac.uk> wrote:
This has now been narrowed down to a segfault in the drmaa libraries immediately after submitting a job when an LSF queue is set explicitly with the LSB_DEFAULTQUEUE environment variable.
Marina
My thanks as well, since we don't really have a way to debug LSF here. Platform only granted us a one-month license for development, so I'd have to get another license to debug it.
--nate
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev
participants (3)
- Leandro Hermida
- Marina Gourtovaia
- Nate Coraor