Hey Ben,

Hmmm... I don't think Galaxy is doing that - not directly anyway. Unless I am mistaken, Galaxy puts each file in one location on the web server node or VM. Typically, this location is on a filesystem that is shared between the web server and the cluster's compute nodes, so I wouldn't describe that as Galaxy copying the data to all of the compute nodes. If your cluster doesn't have a shared filesystem and, to get around this, someone has configured the Galaxy data to be synced across all nodes - that would indeed be copying the data to all nodes, but it isn't Galaxy doing it. Alternatively, you might have Condor configured to transfer the Galaxy data to the remote nodes in such a way that every time a Galaxy job runs on a node, all the data is copied to that node? Does that make sense?

So I still don't entirely understand your setup, but my advice is pretty general - for now you may want to solve this problem at the Condor level. I am assuming this is a general-purpose Condor cluster and not one set up explicitly for Galaxy? Let's say you have 200 nodes in your Condor cluster and they cannot all mount the Galaxy filesystem, because that would overload the file server being used by Galaxy. You could set up a FileSystemDomain at the Condor level that only, say, 10 of the nodes belong to (these 10 nodes can continue to run anything in general, but Galaxy will only submit to them). This filesystem domain could have a name like galaxy.example.com if example.com is your default filesystem domain. Then you can set up the Galaxy Condor runner with a requirement such as FileSystemDomain == "galaxy.example.com", and Galaxy jobs will then only run on these 10 nodes. Having 10 nodes mount a file server is much more manageable than 200.

-John

On Tue, Dec 17, 2013 at 11:52 AM, Ben Gift <corn8bit2@gmail.com> wrote:
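[Editor's note: the FileSystemDomain approach John describes above might look roughly like the sketch below. FILESYSTEM_DOMAIN is a real HTCondor configuration knob, and Galaxy's Condor runner accepts a "requirements" destination parameter, but galaxy.example.com, the destination id, and the exact placement in your config are assumptions for illustration, not John's verbatim setup.]

```
# HTCondor config (e.g. condor_config.local) on ONLY the 10 nodes
# that mount the Galaxy file server; all other nodes keep the
# default domain (example.com here is a placeholder):
FILESYSTEM_DOMAIN = galaxy.example.com

# Galaxy side: a job_conf.xml destination for the condor runner.
# The "requirements" param is passed into the Condor submit
# description, restricting Galaxy jobs to the 10-node domain:
#
#   <destination id="condor_galaxy" runner="condor">
#     <param id="requirements">FileSystemDomain == "galaxy.example.com"</param>
#   </destination>
```

(Older Galaxy releases from this era configured runners in universe_wsgi.ini rather than job_conf.xml; the requirement expression itself is the same either way.)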
Hi John, thanks for the reply.
Yes, I mean Galaxy's default behavior of keeping all the data on all nodes of our Condor cluster. So, for instance, if I run a job, the output of that job is copied to every node in the cluster. Is this not the normal behavior?
On Tue, Dec 17, 2013 at 9:42 AM, John Chilton <chilton@msi.umn.edu> wrote:
Hey Ben,
Thanks for the e-mail. I did not promise anything was coming soon, I only said people were working on parts of it. Unfortunately it is not a feature yet - multiple people, including myself, are thinking about various parts of this problem though.
I would like to respond, but I am trying to understand this line: "We can't do this because Galaxy copies all intermediate steps to all no(d)es, which would bog down the servers too much."
Can you describe how you are doing this staging? Is data currently being copied around to all the nodes, and if so, how are you doing that? Or are you trying to say that Galaxy requires the data to be available on all of the nodes?
-John
On Tue, Dec 17, 2013 at 11:15 AM, Ben Gift <corn8bit2@gmail.com> wrote:
We've run into a scenario lately where we need to run a very large workflow (huge data in intermediate steps) many times. We can't do this because Galaxy copies all intermediate steps to all nodes, which would bog down the servers too much.
I asked about something similar before, and John mentioned that a feature to automatically delete intermediate-step data once a workflow completes was coming soon. Is that a feature now? That would help.
Ultimately, though, we can't be copying all this data around to all nodes. The network just isn't good enough, so I have an idea.
What if we have an option on the 'run workflow' screen to only run on one node (eliminating the neat Galaxy concurrency ability for that workflow unfortunately)? Then it just propagates the final step data.
Or maybe only copy to a couple other nodes, to keep concurrency.
If the job errors, I think it should in this case just throw out all the data, or propagate the data from the step where it stopped.
I've been trying to implement this myself, but it's taking me a long time. I've only just started understanding the Pyramid stack, and am putting the checkbox in the run.mako template. I still need to learn the database schema, the message passing, how jobs are stored, and how to tell Condor to use only one node (and more, I'm sure) in Galaxy. (I'm drowning.)
This seems like a really important feature, though, as Galaxy gains more traction as a research tool for bigger projects that demand working with huge data and running huge workflows many, many times.
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/