Hey Eric,

I think what you are proposing would be a major development effort, and it mirrors major development efforts already ongoing. There are, sort of, ways to do this already, with various trade-offs, and none particularly well documented. So before undertaking this effort I would dig into some alternatives.

If you are using PBS, the PBS runner contains some logic for delegating to PBS for doing this kind of thing - I have never tried it. https://bitbucket.org/galaxy/galaxy-central/src/default/lib/galaxy/jobs/runn...

It may be possible to use a specially configured handler and the Galaxy object store to stage files to a particular mount before running jobs - not sure it makes sense in this case. It might be worth looking into this (having the object store stage your files, instead of solving it at the job runner level).

My recommendation, however, would be to investigate the LWR job runner. There are a bunch of fairly recent developments to enable something like what you are describing. For specificity, let's say you are using DRMAA to talk to some HPC cluster, Galaxy's file data is stored in /galaxy/data on the Galaxy web server but not on the HPC, and there is some scratch space (/scratch) that is mounted on both the Galaxy web server and your HPC cluster.

I would stand up an LWR (http://lwr.readthedocs.org/en/latest/) server right beside Galaxy on your web server. The LWR has a concept of managers that roughly mirrors the concept of runners in Galaxy - see the sample config for guidance on how to get it to talk with your cluster. It can use DRMAA, Torque command-line tools, or Condor at this time (I could add new methods, e.g. the PBS library, if that would help). https://bitbucket.org/jmchilton/lwr/src/default/job_managers.ini.sample?at=d...

On the Galaxy side, I would then create a job_conf.xml file telling certain HPC tools to be sent to the LWR. Be sure to enable the LWR runner at the top (see the advanced example config) and then add at least one LWR destination.
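As a rough sketch, a minimal job_conf.xml wired up this way might look like the following - the plugin load path is taken from the advanced sample config, but the tool id ("bowtie2") and the destination details are illustrative and should be checked against the sample job_conf.xml shipped with your Galaxy version:

```xml
<job_conf>
    <plugins>
        <!-- Enable the LWR runner plugin (load path per the advanced sample config). -->
        <plugin id="lwr" type="runner" load="galaxy.jobs.runners.lwr:LwrJobRunner" />
    </plugins>
    <destinations>
        <!-- Point at the LWR server standing beside Galaxy on the web server. -->
        <destination id="lwr" runner="lwr">
            <param id="url">http://localhost:8913/</param>
        </destination>
    </destinations>
    <tools>
        <!-- Route a compute-intensive mapper to the LWR destination. -->
        <tool id="bowtie2" destination="lwr" />
    </tools>
</job_conf>
```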
<destinations>
    ....
    <destination id="lwr" runner="lwr">
        <param id="url">http://localhost:8913/</param>
        <!-- Leave Galaxy directory and data indices alone, assumes
             they are mounted in both places. -->
        <param id="default_file_action">none</param>
        <!-- Do stage everything in /galaxy/data though. -->
        <param id="file_action_config">file_actions.json</param>
    </destination>
</destinations>

Then create a file_actions.json file in the Galaxy root directory (the structure of this file is subject to change; the current JSON layout doesn't feel very Galaxy-ish):

{"paths": [
    {"path": "/galaxy/data", "action": "copy"}
]}

More details on the structure of this file_actions.json file can be found in the following changeset: https://bitbucket.org/galaxy/galaxy-central/commits/b0b83be30136e2939a4a4f5d...

I am really eager to see the LWR gain adoption and tackle tricky cases like this, so if there is anything I can do to help, please let me know - contributions in terms of development or documentation would be greatly appreciated as well.

Hope this helps,
-John

On Tue, Nov 5, 2013 at 8:23 AM, Paniagua, Eric <epaniagu@cshl.edu> wrote:
Dear Galaxy Developers,
I administer a Galaxy instance at Cold Spring Harbor Laboratory, which serves around 200 laboratory members. While our initial hardware purchase has scaled well for the last 3 years, we are finding that we can't quite keep up with the rising demand for compute-intensive jobs, such as mapping. We are hesitant to buy more hardware to support the load, since we can't expect that solution to scale.
Rather, we are attempting to set up Galaxy to queue jobs (especially mappers) out to the lab's HPCC to accommodate the increasing load. While there are a number of technical challenges involved in this strategy, I am only writing to ask about one: data locality.
Normally, all Galaxy datasets are stored directly on the private server hosting our Galaxy instance. The HPCC cannot mount our Galaxy server's storage (i.e., for the purpose of running jobs that read/write datasets) for security reasons. However, we can mount a small portion of the HPCC file system on our Galaxy server. Storage on the HPCC is at a premium, so we can't afford to let newly created (or copied) datasets just sit there. It follows that we need a mechanism for maintaining temporary storage in the (restricted) HPCC space which allows input datasets to be transferred to the HPCC (so they will be visible to jobs running there) and output datasets to be transferred back to persistent storage on our server.
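For concreteness, the stage-in/compute/stage-out cycle described here can be sketched as a plain shell script. Everything in it is a stand-in - the throwaway directories play the role of the Galaxy server's persistent storage and the shared scratch mount, and `tr` plays the role of the tool; a real mechanism would need to hook into Galaxy's job lifecycle rather than run as a standalone wrapper:

```shell
#!/bin/sh
set -e
# Throwaway directories stand in for persistent storage on the Galaxy
# server and for the scratch space shared with the HPCC (hypothetical).
DATA=$(mktemp -d)
SCRATCH=$(mktemp -d)

printf 'input reads\n' > "$DATA/dataset_1.dat"

# Stage in: copy the input dataset onto scratch before the job runs.
JOB_DIR="$SCRATCH/job_0001"
mkdir -p "$JOB_DIR"
cp "$DATA/dataset_1.dat" "$JOB_DIR/"

# "Run" the tool against the staged copy (a real job would be submitted
# to the HPCC scheduler and would read/write only paths under $JOB_DIR).
tr 'a-z' 'A-Z' < "$JOB_DIR/dataset_1.dat" > "$JOB_DIR/output_1.dat"

# Stage out: copy results back to persistent storage, then free the
# scratch space immediately, since HPCC storage is at a premium.
cp "$JOB_DIR/output_1.dat" "$DATA/"
rm -rf "$JOB_DIR"

RESULT=$(cat "$DATA/output_1.dat")
echo "$RESULT"
```

The interesting part is not the copies themselves but where in the job lifecycle they happen, which is exactly what the replies in this thread address.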
I am in the process of analyzing when/where/how exact path names are substituted into tool command lines, looking for potential hooks to facilitate the staging/unstaging of data before/after job execution on the HPCC. I have found a few places where I might try to insert logic for handling this case.
Before modifying too much of Galaxy's core code, I would like to know if there is a recommended method for handling this situation and whether other members of the Galaxy community have implemented fixes or workarounds for this or similar data locality issues. If you can offer either type of information, I shall be most grateful. Of course, if the answer is that there is no recommended or known technique, that would be valuable information too.
Thank you in advance, Eric Paniagua