Hey Carrie,

Thanks for starting this conversation. We talked for a while yesterday on IRC, but I wanted to summarize our conversation and give everyone an update.

I am not saying no, and certainly not saying no to all of it, but I would be very hesitant to include the job staging / rewriting stuff. The LWR has a lot of code to do this in a very tailored, flexible, and powerful way. The approach you outline only works with scp and rewrites the commands Galaxy generates after the fact, which may be error-prone. For comparison:

- The LWR has many different ways it can stage files (send to the LWR via HTTP, pull from Galaxy via HTTP, file system copy from Galaxy, file system copy from the remote host, disable staging), and it can be configured to do this on a per-path basis - which can be very powerful given the complexity of which file systems are shared where and mounted under what paths.
- The LWR can be configured to affect the tool evaluation process so commands do not need to be rewritten - they use the right paths from the get-go.
- The LWR properly handles paths not just in the command line but in configfiles, param files, metadata commands, from_work_dir outputs, Galaxy configured with the outputs_to_working_directory option, etc.
- The LWR can be configured to generate metadata remotely or locally, and to resolve dependencies (tool shed, Galaxy packages, modules, etc.) remotely or locally, or to disable them entirely.
- The LWR works with newer features like job metrics and per-destination env tweaking.

I understand that the need to run a permanent service on the remote compute login node can be problematic. So what I would like to see (and where I am sure Galaxy will evolve to over the next year) is a separation of "staging jobs" from "running jobs" - if we could do LWR or LWR-like staging without requiring the use of the LWR job runner (stage it like the LWR and submit directly via DRMAA or the CLI runner, for instance). (Aside: LWR staging combined with running jobs in Docker would be wonderful from a security/isolation perspective.)

So if one wanted to do something like what you are doing, I would be more eager to merge it if it were to somehow:

- Leverage LWR staging with the CLI runner.
- Add an LWR staging action type for scp-ing files.

That is a lot of work however :(. Given your use case, I think I added something to Galaxy today (https://bitbucket.org/galaxy/galaxy-central/commits/72bb9c96a5bad59ca79a669e...) that takes a pretty different approach to this but may work equally well (perhaps better in some ways).

For some background - thanks to great work by Nate, https://test.galaxyproject.org/ is now running the LWR on TACC's Stampede supercomputer. We wanted to do it without opening the firewall on the TACC login node, so the LWR can now be driven by a message queue instead. Galaxy sends the LWR a submit message; the LWR stages the job, submits it, sends status updates back to Galaxy via the message queue, and then sends the results back to Galaxy. While this process still requires a remote daemon running on the compute resource's login node, it doesn't really need one. So I added some new options to the LWR client to allow that initial submission message to be encoded in base64 and just passed to a simple command-line version of the LWR configured on the remote host.
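To give a rough idea of what I mean, something along these lines (just a sketch - the payload fields, the lwr_submit script name, and the host below are placeholders, not the LWR's actual message format or entry point):

    import base64
    import json
    import subprocess

    # Illustrative only: the real LWR client defines its own message format
    # and command-line entry point; these names are placeholders.
    submit_message = {
        "job_id": "42",
        "command_line": "blastn -query input.fasta -db nt -out output.tsv",
        "staging": "remote_transfer",
    }

    # Encode the submission as base64 so it can be passed as a single,
    # shell-safe argument to a command-line LWR invocation on the remote host.
    encoded = base64.b64encode(json.dumps(submit_message).encode("utf-8")).decode("ascii")

    # For example, kicked off over ssh against a hypothetical CLI entry point
    # on the login node.
    subprocess.check_call(["ssh", "login.example.org", "lwr_submit", "--base64", encoded])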
From there the rest of the process works pretty much identically to the MQ-driven LWR approach we are using with Stampede - the remote LWR script will pull the needed files down from Galaxy, submit the job (in your case your application submits itself to Condor in chunks, so you would want to just use the LWR equivalent of the local job runner - the queued_python manager - it's the default), send updates back to Galaxy via a message queue, and then exit once the job is finished.

If this version of the compute flow doesn't sit well with you, there are two changes I would definitely be eager to incorporate (feel free to request or contribute them):

- If you don't like requiring the message queue infrastructure, I would love to see a variant of this that extended the jobs API to allow status updates that way. (The file transfers for jobs use a single-purpose key scheme to secure job-related files - similar keys could be used for status updates.)
- If instead you don't like the HTTP transfer and would prefer scp/rcp, I would love to see more action types added to the LWR's staging setup to allow scp-ing files between Galaxy and the remote login node (either initiated on the Galaxy side or the remote host - the LWR contains example actions similar to either).

Hope this helps.

-John

On Tue, Jun 3, 2014 at 1:15 PM, Ganote, Carrie L <cganote@iu.edu> wrote:
Hi Devs,
I'd like to open up some discussion about incorporating some code bits into the Galaxy distribution.
My code is here: https://bitbucket.org/cganote/osg-blast-galaxy
First off, I'd like to say that these changes were made initially as hacks to get Galaxy working with a grid interface for our nefarious purposes. For us, the results have been spiffy, in that we can offload a bunch of Blast work from our own clusters onto the grid, which processes it quickly on a distributed set of computers.
In order to do this, I wanted to be able to take as much control over the process as I could. The destination uses Condor, but it uses condor_dag to submit jobs - that means I would have to modify the Condor job runner.
The destination needed to have the files shipped over to it first, so I had to be able to stage them. This made the LWR attractive, but then I would need to guarantee that the server at the other end was running the LWR, and since I don't have control of that server, this seemed less likely to be a good option.
The easiest thing for me to understand was the CLI runner. I could do ssh, I could do scp, so this seemed the best place to start. I started by trying to figure out which files needed to be sent to the server, and then implementing a way to send them. I start with the stdout, stderr, and exit code files. I also want to stage any datasets that are in the param_dict, and anything that is in extra_files_path. Then we alter the command line that is run so that all the paths make sense on the remote server, and to make sure that the right things are run remotely vs. locally (e.g., metadata.sh is run locally after the job is done). Right now, this is done by splitting the command line on a specific string, which is not robust to future changes to the command_factory, but I'm open to suggestions.
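Roughly, the idea looks something like this (a simplified sketch only - the host, path prefixes, and split marker below are made up, not the values actually used in the repository linked above):

    import subprocess

    # Simplified sketch of the scp staging + command rewriting idea; the host,
    # path prefixes, and split marker are illustrative placeholders.
    REMOTE_HOST = "login.osg.example.org"
    LOCAL_PREFIX = "/galaxy/database/files"
    REMOTE_PREFIX = "/home/osg_user/staging"
    SPLIT_MARKER = "; cd "  # stand-in for the string the command line is split on

    def stage_file(local_path):
        """Copy one job file (dataset, script, etc.) to the remote host via scp."""
        remote_path = local_path.replace(LOCAL_PREFIX, REMOTE_PREFIX)
        subprocess.check_call(["scp", local_path, "%s:%s" % (REMOTE_HOST, remote_path)])
        return remote_path

    def rewrite_command(command_line):
        """Split the command into a remote part and a locally-run tail (e.g. the
        metadata step), rewriting paths in the remote part for the remote server."""
        remote_part, _, local_part = command_line.partition(SPLIT_MARKER)
        return remote_part.replace(LOCAL_PREFIX, REMOTE_PREFIX), local_part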
So, here's one hack. The hidden data tool parameter is something I hijacked - as far as I can tell, hidden data is only used for Cufflinks, so it seemed safe. I use it to send the shell script that will be run on the server (but NOT sent to the worker nodes). It needed to be a DATA type so that my stager would pick it up and send it over. I wanted it to be hidden because it was only used by the tool and it should not need to be an HDA. I made changes to allow the value of the hidden data to be set in the tool - this would become the false_path of the data, which would then become its actual path.
Please have a look and ask questions, and if there are improvements needed before any of this is considered for a pull request, let me know. I'd like to present this at the Galaxy conference without having vegetables thrown at me. Thanks!
-Carrie Ganote
National Center for Genome Analysis Support
Indiana University