Hi again,

Ok. Here’s the top of our slurm.conf file. One thing I notice when comparing to the configure_slurm template is that GKS tries to set SlurmUser to ‘galaxyuser’, while we have it set to ‘slurm’.

ControlMachine=exahead1
#BackupController=exahead2
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
DisableRootJobs=YES
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
GresTypes=gpu
GroupUpdateForce=1
GroupUpdateTime=300
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=pmi2
#MpiParams=ports=#-#
#PluginDir=/root/sw/slurm/14.11.7/lib/slurm
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
RebootProgram=/sbin/reboot
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0

Thanks!
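
A difference like the SlurmUser one is easier to spot when only the active (uncommented) settings of two slurm.conf files are compared. The following is a minimal Python sketch of that idea, not part of GalaxyKickStart or Slurm. The helper names `parse_slurm_conf` and `diff_settings` are made up for illustration, and the template text in the example is hypothetical apart from `SlurmUser=galaxyuser`, which is what the message above reports. It assumes one `Key=Value` setting per line, as in the excerpt above, so it would not handle multi-key lines such as node or partition definitions.

```python
def parse_slurm_conf(text):
    """Return the active Key=Value settings found in slurm.conf text."""
    settings = {}
    for raw_line in text.splitlines():
        # Drop everything after '#', which also skips commented-out settings.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(ours, theirs):
    """Map each key whose value differs to an (ours, theirs) tuple."""
    keys = sorted(set(ours) | set(theirs))
    return {k: (ours.get(k), theirs.get(k))
            for k in keys if ours.get(k) != theirs.get(k)}


# Three lines taken from the excerpt above.
ours = "SlurmUser=slurm\n#SlurmdUser=root\nSlurmctldPort=6817\n"
# Hypothetical template content; only the SlurmUser value is from the thread.
template = "SlurmUser=galaxyuser\nSlurmctldPort=6817\n"

print(diff_settings(parse_slurm_conf(ours), parse_slurm_conf(template)))
# prints {'SlurmUser': ('slurm', 'galaxyuser')}
```

In practice the two strings would be read from the cluster's slurm.conf and from the rendered template.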
From: galaxy-dev <galaxy-dev-bounces@lists.galaxyproject.org> on behalf of John Letaw <letaw@ohsu.edu>
Date: Wednesday, November 8, 2017 at 11:46 AM
To: Christophe Antoniewski <drosofff@gmail.com>
Cc: galaxy-dev <galaxy-dev@lists.galaxyproject.org>
Subject: Re: [galaxy-dev] Fwd: Job Script Integrity with GalaxyKickStart (galaxy-dev Digest, Vol 137, Issue 5)

Hi Chris,

I am changing my mind; I no longer think it is a slurm config problem. I can submit jobs from this VM as the ‘galaxyuser’ user and there is no problem. In the logs, I can see a line that says the script is being submitted, then another line that echoes the native specification. After that, I don’t see anything else unless I stop the job. If I do that, it spits back a message saying it can’t find the job in the scheduler, since it never actually made it there. So, there must be a problem with Galaxy communicating with the scheduler.

From the ansible playbook code, I can see there is a step that links the slurm.conf and munge.key files to the galaxy path. This is something I am currently doing manually, since I am not trying to configure a new cluster but instead use an existing one. Maybe there is some other simple step I am overlooking that would cause this behavior?

Thanks,
John

From: Christophe Antoniewski <drosofff@gmail.com>
Date: Wednesday, November 8, 2017 at 12:34 AM
To: John Letaw <letaw@ohsu.edu>
Cc: Marius van den Beek <m.vandenbeek@gmail.com>, galaxy-dev <galaxy-dev@lists.galaxyproject.org>
Subject: Re: [galaxy-dev] Fwd: Job Script Integrity with GalaxyKickStart (galaxy-dev Digest, Vol 137, Issue 5)

Hi John and Marius,

> So, I am assuming I have some problem with my slurm configuration, does that sound accurate?

Maybe it would help to see that.
I have had a couple of complicated experiences with slurm configuration, but so far only on Ubuntu 16.04 Xenial.

Best - Chris

Christophe Antoniewski
ARTbio <http://artbio.fr/> - Tel +33 1 44 27 70 05
Drosophila Genetics & Epigenetics <http://drosophile.org/> - Tel +33 1 44 27 34 39
Mobile +33 6 68 60 51 50
https://twitter.com/ARTbio_IBPS
https://twitter.com/drosofff

2017-11-07 21:09 GMT+01:00 John Letaw <letaw@ohsu.edu>:

Hi Marius.

Ok, this was pretty much how I read the code as well. My first instinct was to do exactly as you suggested, and add that declaration in group_vars/all. This does stop the error, but then I just get stuck with jobs that never run. So, I am assuming I have some problem with my slurm configuration, does that sound accurate?

Thanks,
John

From: galaxy-dev <galaxy-dev-bounces@lists.galaxyproject.org> on behalf of Marius van den Beek <m.vandenbeek@gmail.com>
Date: Tuesday, November 7, 2017 at 10:06 AM
To: Christophe Antoniewski <drosofff@gmail.com>
Cc: galaxy-dev <galaxy-dev@lists.galaxyproject.org>
Subject: Re: [galaxy-dev] Fwd: Job Script Integrity with GalaxyKickStart (galaxy-dev Digest, Vol 137, Issue 5)

Hi John and Christophe,

The job script integrity check verifies that the script is ready to be executed. It sets the environment variable `ABC_TEST_JOB_SCRIPT_INTEGRITY_XYZ` to 1 and then executes the tool_script.sh script, which contains the following check:

```
if [ -n "$ABC_TEST_JOB_SCRIPT_INTEGRITY_XYZ" ]; then
    exit 42
fi
```

So if the script is ready to execute, it returns with the exit code 42. Now this can take a few seconds over NFS (I guess that'd be true for lustre as well). The check is run up to 35 times, with a sleep of 0.25 seconds between attempts. Unfortunately there was a bug in galaxy that would skip the sleep, so the job integrity check would fail frequently.
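
For illustration, the retry loop described above could be sketched in Python roughly as follows. This is not Galaxy's actual code: the function name `script_is_ready` and its parameters are made up here, and only the environment variable, the exit code 42, the 35 attempts, and the 0.25 second sleep come from the description above. The sketch also assumes the script carries a shebang line and the executable bit.

```python
import os
import subprocess
import time

INTEGRITY_ENV_VAR = "ABC_TEST_JOB_SCRIPT_INTEGRITY_XYZ"
READY_EXIT_CODE = 42


def script_is_ready(path, attempts=35, delay=0.25):
    """Return True once the script at `path` runs and exits with code 42."""
    # Run the script with the marker variable set so it exits early with 42.
    env = dict(os.environ)
    env[INTEGRITY_ENV_VAR] = "1"
    for attempt in range(attempts):
        try:
            result = subprocess.run([path], env=env)
            if result.returncode == READY_EXIT_CODE:
                return True
        except OSError:
            # The file may not be visible or executable yet, for example
            # on a slow shared filesystem; treat this as "not ready".
            pass
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller would raise an error, as in the traceback quoted further down this thread, when `script_is_ready(...)` returns False. With the default values the loop gives up after roughly 35 × 0.25 s of sleeping, a little under 9 seconds, which is the window a slow filesystem has to make the script visible.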
We fixed this in https://github.com/galaxyproject/galaxy/pull/4720 and the fix has been backported as far back as galaxy release 16.07, so if you just update to the latest galaxy commit on your branch it *may* work again. Now this has been broken for a long time, and it has never worked for me on our current cluster.

Should an update to galaxy not be enough, you can disable this check with `check_job_script_integrity = False` in the galaxy.ini or by adding `-e GALAXY_CONFIG_CHECK_JOB_SCRIPT_INTEGRITY=False` if you're running kickstart in docker. I have not seen any drawback of disabling the integrity check on our cluster.

Good luck,
Marius

On 7 November 2017 at 18:25, Christophe Antoniewski <drosofff@gmail.com> wrote:

Hi John,

Can you also raise an issue in https://github.com/ARTbio/GalaxyKickStart/issues ?

In order to help, I will need to know the configuration of your GalaxyKickStart (the variables you modified in the playbook, group_vars and inventory_files).

Did you use the cloud_setup role? In that case Enis Afgan (https://github.com/afgane) may help.

Best regards

Chris

Christophe Antoniewski
Institut de Biologie Paris Seine <http://www.ibps.upmc.fr/en>
9, Quai St Bernard, Boîte courrier 24
75252 Paris Cedex 05
ARTbio <http://artbio.fr/>
Bâtiment B, 7e étage, porte 725
Tel +33 1 44 27 70 05
Mobile +33 6 68 60 51 50
Directions to the platform: Bâtiment B, 7e étage, Porte 725 <https://www.google.com/maps/d/u/0/edit?mid=zmZz-3Vin5D0.kjRSV6vitXE8>
https://twitter.com/ARTbio_IBPS

2017-11-07 18:00 GMT+01:00 <galaxy-dev-request@lists.galaxyproject.org>:

Today's Topics:

   1. Job Script Integrity (John Letaw)

----------------------------------------------------------------------

Message: 1
Date: Tue, 7 Nov 2017 03:20:49 +0000
From: "John Letaw" <letaw@ohsu.edu>
To: "galaxy-dev@lists.galaxyproject.org" <galaxy-dev@lists.galaxyproject.org>
Subject: [galaxy-dev] Job Script Integrity
Message-ID: <FBF795C3-8F01-47EF-8033-F14DD8694328@ohsu.edu>
Content-Type: text/plain; charset="utf-8"

Hi all,

I’m installing via GalaxyKickStart…

I’m getting the following error:

galaxy.jobs.runners ERROR 2017-11-06 19:14:05,263 (19) Failure preparing job
Traceback (most recent call last):
  File "/home/exacloud/lustre1/galaxydev/galaxyuser/lib/galaxy/jobs/runners/__init__.py", line 175, in prepare_job
    modify_command_for_container=modify_command_for_container
  File "/home/exacloud/lustre1/galaxydev/galaxyuser/lib/galaxy/jobs/runners/__init__.py", line 209, in build_command_line
    container=container
  File "/home/exacloud/lustre1/galaxydev/galaxyuser/lib/galaxy/jobs/command_factory.py", line 84, in build_command
    externalized_commands = __externalize_commands(job_wrapper, external_command_shell, commands_builder, remote_command_params)
  File "/home/exacloud/lustre1/galaxydev/galaxyuser/lib/galaxy/jobs/command_factory.py", line 143, in __externalize_commands
    write_script(local_container_script, script_contents, config)
  File "/home/exacloud/lustre1/galaxydev/galaxyuser/lib/galaxy/jobs/runners/util/job_script/__init__.py", line 112, in write_script
    _handle_script_integrity(path, config)
  File "/home/exacloud/lustre1/galaxydev/galaxyuser/lib/galaxy/jobs/runners/util/job_script/__init__.py", line 147, in _handle_script_integrity
    raise Exception("Failed to write job script, could not verify job script integrity.")
Exception: Failed to write job script, could not verify job script integrity.
galaxy.model.metadata DEBUG 2017-11-06 19:14:05,541 Cleaning up external metadata files
galaxy.model.metadata DEBUG 2017-11-06 19:14:05,576 Failed to cleanup MetadataTempFile temp files from /home/exacloud/lustre1/galaxydev/galaxyuser/database/jobs/000/19/metadata_out_HistoryDatasetAssociation_16_I8bhLX: No JSON object could be decoded

I would like to further understand what it means to not verify integrity of a job script. Does this just mean there is a permissions error? Ownership doesn’t match up?

Thanks,
John

------------------------------

End of galaxy-dev Digest, Vol 137, Issue 5
******************************************

___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
  https://lists.galaxyproject.org/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/