Hi all,

Replying to my own post... We have nailed it down, well, almost. The problem comes from the fact that we are using the 'run job as real user' feature. When we switched back to the normal 'all jobs run as galaxy' mode, everything worked fine.

When running as the real user, the error we get from LSF is "can't open lsf.sudoers" (you need to hack the drmaa C code to get this logged). We traced the runner operations with strace, and the process really does try to read this file; it fails because LSF requires the file to be owned by root with 600 permissions.

The strange thing is that when we disable the 'run as real user' feature, this goes away. It might be that the setuid() executed during the 'run as real user' procedure somehow forces the process to access this file (i.e. imposed by LSF in some way), but we are lost. So we just run as galaxy now...

Comments welcome! Does anybody run LSF with the "run as real user" feature on?

Thx
C

On 18 Dec 2012, at 10:47, Charles Girardot wrote:
Hi all,
We are currently switching our cluster management from PBSPro to LSF (LSF 7 Update 6). We have a running Galaxy using drmaa with PBSPro (with the "jobs are submitted as real users" option). We expected an easy transition to LSF, i.e. simply changing the drmaa implementation, but of course life is not that simple. So basically it is not working. We have tried with drmaa 1.0.4 and 1.0.3 (downloaded from http://sourceforge.net/projects/lsf-drmaa/ ).
Before getting to the symptoms: does anybody successfully run Galaxy with drmaa and LSF 7 Update 6?
Now the symptoms:

- First we had an error saying something like "queued as Job <5160> is submitted to default queue <medium_priority>" is not an id. We traced this in the drmaa C code and added a regex to actually extract the job id (if you are successfully running Galaxy with drmaa and LSF 7 Update 6, did you also have to do this?).

But then a new error came:

- Jobs are successfully sent to the LSF queue and submitted to a node.
- After a few ms we get an error:

galaxy.jobs.runners.drmaa DEBUG 2012-12-17 11:14:29,227 (1699) submitting with credentials: sauer [uid: 8483]
galaxy.jobs.runners.drmaa DEBUG 2012-12-17 11:14:29,229 (1699) Job script for external submission is: /g/galaxy/galaxy-dev_data/pbs/1699.jt_json
galaxy.jobs.runners.drmaa INFO 2012-12-17 11:14:29,464 (1699) queued as Job <5160> is submitted to default queue <medium_priority>.
E #2bae [ 0.00] * call to lsb_openjobinfo returned with error 1:No matching job found mapped to 1040:Job does not exist in DRMs queue.
galaxy.jobs.runners.drmaa DEBUG 2012-12-17 11:14:30,275 (1699/Job <5160> is submitted to default queue <medium_priority>. 5160) job left DRM queue with following message: code 18: lsb_openjobinfo: XDR operation error
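For anyone hitting the first symptom, here is a rough illustration of the job-id extraction. This is only a Python sketch of the idea, not our actual patch (that one lives in the drmaa C code). The function name and the pattern are made up for this example, and the pattern assumes the submission message always looks like the one in the log above.

```python
import re

# Assumption: the submission message has the shape seen in the log above,
# e.g. "Job <5160> is submitted to default queue <medium_priority>."
# JOB_ID_RE and extract_lsf_job_id are hypothetical names for this sketch.
JOB_ID_RE = re.compile(r"Job <(\d+)> is submitted to (?:default )?queue <[^>]+>")


def extract_lsf_job_id(message):
    """Return the numeric LSF job id found in `message`, or None.

    A message that is already a bare numeric id is returned unchanged.
    """
    text = message.strip()
    if text.isdigit():
        return text
    match = JOB_ID_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    msg = "Job <5160> is submitted to default queue <medium_priority>."
    print(extract_lsf_job_id(msg))
```

The point is simply that whatever comes back as the "job id" gets reduced to the number between the first pair of angle brackets before it is used for any status lookup.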
We are lost and the PBSPro license runs out on January 1 so we badly need to fix this...
PS: Note that if we simply switch back to PBSPro, it is all working fine; which tells us that the Galaxy setup is ok.
Thx for your help
bw
Charles
===================================== Charles Girardot European Molecular Biology Laboratory E. Furlong Group http://furlonglab.embl.de Tel: +49 6221 387 -8585 (V205) or 8433 (V320) Fax: +49-(0)6221-387-8166 Email: charles.girardot@embl.de Room V205/V320 Meyerhofstraße 1, 69117 Heidelberg, Germany =====================================