Pre pull-request: CLI and hidden data

3 Jun 2014

      Hi Devs,

I'd like to open up some discussion about incorporating some code bits into the Galaxy distribution.

My code is here:
https://bitbucket.org/cganote/osg-blast-galaxy

First off, I'd like to say that these changes were made initially as hacks to get Galaxy working with a grid interface for our nefarious purposes. For us, the results have been spiffy, in that we can offload a bunch of Blast work off of our own clusters and onto the grid, which processes them fast on a distributed set of computers.

In order to do this, I wanted to be able to take as much control over the process as I could. The destination uses Condor, but it used condor_dag to submit jobs - that means I would have to modify the condor job runner.

The destination needed to have the files shipped over to it first - so I had to be able to stage. This made lwr attractive, but then I would need to guarantee that the server at the other end was running lwr, and since I don't have control of that server, this seemed less likely to be a good option.

The easiest thing for me to understand was the cli runner. I could do ssh, I could do scp, so this seemed the best place to start. So I started by trying to figure out which files needed to be sent to the server, and then implementing a way to send them. I start with stdout, stderr and exit code files. I also want to stage any datasets that are in the param_dict, and anything that is in extra_files_path. Then we alter the command line that is run such that all the paths make sense on the remote server, and to make sure that the right things are run remotely vs. locally (i.e., metadata.sh is run locally after job is done). Right now, this is done by splitting the command line on a specific string, which is not robust for future changes to the command_factory, but I'm open to suggestions.

So, here's one hack. The hidden data tool parameter is something I hijacked - as far as I can tell, hidden data is only used for Cufflinks, so it seemed safe. I use it to send the shell script that will be run on the server (but NOT sent to the worker nodes). It needed to be a DATA type so that my stager would pick it up and send it over. I wanted it to be hidden because it was only used by the tool and it should not need to be an HDA. I made changes to allow the value of the hidden data to be set in the tool - this would become the false_path of the data, which would then become its actual path.

Please have a look, and ask questions, and if there are improvements needed before anything is considered for pulling, let me know. I'd like to present this at the Galaxy conference without having vegetables thrown at me. Thanks!

-Carrie Ganote
National Center for Genome Analysis Support
Indiana University

Ganote, Carrie L

John Chilton

tags

participants (2)