Dear all,
There have been a few posts lately about doing distributed computing via Galaxy - i.e. job splitters etc. - so below is a contribution of some ideas we have developed and applied in our work, where we have arranged for some Galaxy tools to execute in parallel on our cluster.
We have developed a job-splitter script "tardis.py" (available from https://bitbucket.org/agr-bifo/tardis), which takes marked-up standard unix commands that run an application or tool. The mark-up is prefixed to the input and output command-line options. Tardis strips off the mark-up and rewrites the commands to refer to split inputs and outputs, which are then executed in parallel, e.g. on a distributed compute resource. Tardis knows which output files to expect and how to join them back together.
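As a flavour of the mark-up (an illustrative sketch only - the exact directive names and options are documented in the PDF mentioned below):

    # original command, run as a single process:
    cutadapt -a AGATCGGAAGAGC -o trimmed.fastq reads.fastq.gz

    # the same command marked up for tardis:
    tardis.py cutadapt -a AGATCGGAAGAGC -o _condition_fastq_output_trimmed.fastq _condition_fastq_input_reads.fastq.gz

tardis strips the directives, splits reads.fastq.gz into chunk files, launches one rewritten command per chunk in parallel, and joins the per-chunk trimmed outputs back into the single expected output file.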
(This was referred to in our GCC2013 talk: http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FAbstracts.2FTalks.A_layered_genotyping-by-sequencing_pipeline_using_Galaxy )
Any reasonable unix-based data processing or analysis command may be marked up and run using tardis, though of course tardis needs to know how to split and join the data. Our approach also assumes a "symmetrical" HPC cluster configuration, in the sense that each node sees the same view of the file system (and has the required underlying application installed). We use tardis to support both Galaxy and command-line based compute.
Background / design pattern / motivating analogy: Galaxy provides a high-level "end to end" view of a workflow; the HPC resource that one uses then involves spraying chunks of data out into parallel processes, usually on some kind of distributed compute cluster. But an end-user looking at a Galaxy history should ideally not be able to tell whether the workflow was run as a single process on the server or via many parallel processes on the cluster (apart from the fact that when run in parallel on the cluster, it's a lot faster!). We noticed that the layered TCP/IP networking protocol stack provides a useful metaphor and design pattern, with the "end to end" topology of a Galaxy workflow corresponding to the transport layer of TCP/IP, and the distribution of computation across a cluster corresponding to the next TCP/IP layer down - the packet-routing layer.
This picture suggested a strongly layered approach to provisioning Galaxy with parallelised compute on split data, and hence an approach in which the footprint of parallel / distributed compute support in the Galaxy code-base should ideally (from the layered-design point of view) be minimal and superficial. Thus in our approach so far, the only footprint is in the tool-config files, where we arrange the templating to (optionally) prefix the required tardis mark-up to the input and output command options, and the tardis script name to the command as a whole. tardis then takes care of rewriting and launching all of the jobs, and finally joining the results back together and putting them where Galaxy expects them to be (along with housekeeping such as collating and passing up stderr and stdout, and appropriate process exit codes). For each Galaxy job, tardis creates a working folder in a designated scratch area, where input files are uncompressed and split, job files and their output are stored, logging is done, etc. Split data is cleaned up at the end unless there was an error in some part of the job, in which case everything is retained for debugging and in some cases restart.
(We modify Galaxy tool-configs so that the user can optionally choose to run the tool on our HPC cluster - there are three HPC-related input fields, appended to the input section of a tool. Here the user selects whether they want to use our cluster and, if so, specifies the chunk size, and can also at that point specify a sampling rate, since we often find it useful to be able to run preliminary analyses on a random sample of (for example) single or paired-end NGS sequence data, to obtain a fairly quick snapshot of the data before the expense of a complete run. We found it convenient to include support for input sampling in tardis.)
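In outline, the tool-config change looks something like this (a simplified, hypothetical sketch - field names and tardis options here are illustrative only; the real modified tool-config example is in the PDF):

    <command>
    #if $hpc.use_cluster == "yes"
      tardis.py -chunksize $hpc.chunksize -samplerate $hpc.samplerate
    #end if
      some_tool -i (marked-up input) -o (marked-up output)
    </command>
    <inputs>
      <!-- ...the tool's own parameters... -->
      <conditional name="hpc">
        <param name="use_cluster" type="select" label="Run on the HPC cluster?">
          <option value="no">No</option>
          <option value="yes">Yes</option>
        </param>
        <when value="yes">
          <param name="chunksize" type="integer" value="50000" label="Chunk size (records per split)"/>
          <param name="samplerate" type="float" value="1.0" label="Input sampling rate"/>
        </when>
      </conditional>
    </inputs>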
The PDF document at https://bitbucket.org/agr-bifo/tardis includes a number of examples of marking up a command, and also a simple example of a Galaxy tool-config that has been modified to include support for optionally running the job on our HPC cluster via the tardis pre-processor.
Known limitations:
* we have not yet attempted to integrate our approach with the existing Galaxy job-splitting / distributed compute support, partly because of our "layered" design goal (admittedly also partly because of ignorance about its details!)
* our current implementation is quite naive in the distributed compute API it uses - it supports launching Condor job files (and also native sub-processes); our plan is to replace that with the DRMAA API
* we would like to integrate it better with the Galaxy type system, probably via a Galaxy-tardis wrapper
We would be keen to contribute our approach to Galaxy if people are interested.
Cheers
Alan McCulloch
Bioinformatics Software Engineer
AgResearch NZ
Alan,
At first glance this looks promising. I am a little leery of tools that claim to do parallel processing; however, I would like to test it out on our HPC cluster here at UCI.
A few questions:
Could you explain how your tool actually does the parallel processing on something that is sequential? For example, in your PDF you mention the fastq example, but I do not see an explanation of how it "splits" up the work across multiple cores/nodes. Does it simply split the sequence string N times and then merge the results?
> * our current implementation is quite naive in the distributed compute API it uses - it supports launching Condor job files (and also native sub-processes); our plan is to replace that with the DRMAA API
We are strictly an SGE (Son of Grid Engine) cluster, with a lot of work done by Joseph Farran (checkpointing, freeq system, etc.). Using the DRMAA API would be great. If this tool can parallelize fastq jobs along with BAM as described, it would be a great improvement for a number of people here.
~Adam
--
Adam Brenner
Computer Science, Undergraduate Student
Donald Bren School of Information and Computer Sciences
Research Computing Support, Office of Information Technology
University of California, Irvine
http://www.oit.uci.edu/rcs/
www.ics.uci.edu/~aebrenne/
aebrenne@uci.edu
Hi Adam,
> Could you explain how your tool actually does the parallel processing on something that is sequential? For example, in your PDF you mention the fastq example, but I do not see an explanation of how it "splits" up the work across multiple cores/nodes. Does it simply split the sequence string N times and then merge the results?
Yes, if I understand you correctly. A typical use-case for us is trimming and reference-aligning large numbers of short reads, usually paired. Tardis reads the (usually compressed) fastq input file (files, if paired), and for every "chunksize" sequence records launches a "reconditioned" command as a job on the cluster. The reconditioned command is the same as the original, except that it refers to the chunk input file(s) and the chunk output file(s), rather than to the original input and output files. (Paired fastq files are split in "semantic lock step" - sequence names are required to match.)
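In rough Python, the paired split amounts to something like the following (a minimal sketch for illustration only, not the actual tardis code - helper names and chunk-file naming are invented):

    import gzip
    import itertools

    def fastq_records(path):
        # yield 4-line fastq records from a (possibly gzipped) file
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as f:
            while True:
                record = list(itertools.islice(f, 4))
                if not record:
                    return
                yield record

    def read_name(header):
        # sequence name up to whitespace, with any /1 or /2 mate suffix removed
        name = header.split()[0]
        return name[:-2] if name.endswith(("/1", "/2")) else name

    def split_pairs(r1_path, r2_path, chunksize):
        # write paired chunk files in "semantic lock step"
        out1 = out2 = None
        pairs = zip(fastq_records(r1_path), fastq_records(r2_path))
        for count, (rec1, rec2) in enumerate(pairs):
            if read_name(rec1[0]) != read_name(rec2[0]):
                raise ValueError("paired files out of step at record %d" % count)
            if count % chunksize == 0:
                if out1:
                    out1.close()
                    out2.close()
                chunk = count // chunksize
                out1 = open("chunk_%05d_1.fastq" % chunk, "w")
                out2 = open("chunk_%05d_2.fastq" % chunk, "w")
            out1.writelines(rec1)
            out2.writelines(rec2)
        if out1:
            out1.close()
            out2.close()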
After all chunks are launched, tardis polls for the expected chunk outputs (and the log, stdout and stderr files associated with each completed chunk). Once all results are in, the "join" step is done - chunk outputs are joined to yield the single output file(s) expected by Galaxy (or by the user, if running from the command line). stderr and stdout from each chunk are collated and emitted as the stdout and stderr of tardis. Process exit codes from the chunk jobs are collated, and tardis exits with an appropriate consensus exit code (i.e. 0 if all were 0, > 0 if not).
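The join and consensus steps are, in outline (again just a sketch - the real joining is data-type aware):

    import shutil

    def join_chunks(chunk_paths, final_path):
        # concatenate per-chunk outputs into the single file Galaxy expects
        # (valid for record-oriented text such as fastq; other data types
        # need format-aware joins)
        with open(final_path, "wb") as out:
            for path in chunk_paths:
                with open(path, "rb") as chunk:
                    shutil.copyfileobj(chunk, out)

    def consensus_exit_code(chunk_codes):
        # 0 if every chunk job exited 0, otherwise non-zero
        return 0 if all(code == 0 for code in chunk_codes) else 1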
Because we are typically dealing with many very large fastq files, we want to avoid, as far as possible, too much uncompressed data and redundant copies of data, as these can very quickly consume our disk resources. A few things we have included that we have found useful in practice:
* tardis reads and splits compressed fastq on the fly, so that the original can remain compressed in place

* fastq to fasta conversion is done on the fly and may be specified in the command mark-up - again this means, for example, that the original compressed fastq can remain compressed in place, and we do not need to format-convert the whole file first. This is good for use-cases such as inexpensively blasting a random sample of short reads against a database of potential contaminants.

* tardis can read a list file containing the names of (usually compressed fastq or fasta) files, and will treat this as one single input stream (each file is opened in turn and its contents processed as required). Again this is useful in that a number of compressed fastq files can be input to processes such as trimming, a blast contamination check or reference alignment, while remaining compressed in place. (Such a list file then works quite fine in Galaxy, provided one is not above cheating by telling Galaxy that a list file containing the names of compressed fastqsanger files is actually itself a fastqsanger file. This is also useful in avoiding an overly cluttered Galaxy history.)

* outputs are all compressed by default - e.g. mark-up of "_condition_fastq_output_myfile.fastq" will result in a file "myfile.fastq.gz" being delivered. Uncompressed output is specified by using mark-up such as "_condition_uncompressedfastq_output_myfile.fastq". (Trying to encourage good housekeeping.)

* support for random sampling of inputs - the idea again is that this can often help avoid disk and processing contention, by allowing users to obtain inexpensive initial summaries of and insights into the data. In some cases (e.g. bad data), this avoids the cost and delay of a complete analysis.
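Several of these reduce to simple generator plumbing - roughly the following (invented helper names, not the tardis internals):

    import gzip
    import random

    def stream_from_list_file(list_path):
        # treat a list of (possibly gzipped) fastq files as one input stream
        with open(list_path) as lf:
            paths = [line.strip() for line in lf if line.strip()]
        for path in paths:
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as f:
                while True:
                    record = [f.readline() for _ in range(4)]
                    if not record[0]:
                        break
                    yield record

    def sample(records, rate):
        # uniform random sampling of records at the given rate (0 < rate <= 1)
        for record in records:
            if random.random() < rate:
                yield record

    def fastq_to_fasta(record):
        # on-the-fly format conversion: keep the name and sequence lines only
        return [">" + record[0][1:], record[1]]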
I hope that answers your question - sorry if this post is a bit long.
Updating this to use the DRMAA API is about next on the to-do list (I'd be grateful for any tips on Python DRMAA - e.g. best library or approach?).
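(For what it's worth, the kind of thing I have in mind is roughly the following, using the drmaa-python package - an untested sketch, with invented paths and arguments:

    import drmaa

    def run_chunk_job(script_path, stdout_path):
        # submit one chunk job via DRMAA and wait for its exit status
        with drmaa.Session() as session:
            jt = session.createJobTemplate()
            jt.remoteCommand = "/bin/bash"
            jt.args = [script_path]
            jt.outputPath = ":" + stdout_path  # DRMAA paths start with ':'
            job_id = session.runJob(jt)
            info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
            session.deleteJobTemplate(jt)
            return info.exitStatus
)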
Also grateful for any general suggestions or comments.
Cheers
Alan
Very interesting approach - thanks for reaching out and open-sourcing your stuff!
What form would this contribution take? Are there changes that need to be made to the Galaxy core, or were you hoping to mark up the core tools with tardis mark-up? I would be happy to help adjust (generalize?) the framework to support your use cases if there are things to be done in that arena, but I would be skeptical about marking up the core tools with tardis descriptions and introducing a dependency between these tools and tardis.
I would imagine most Galaxy developers would agree with me that the Galaxy philosophy is that the tool should be agnostic to the job manager/DRM used - in fact they should all ideally work with the local job runner with no DRM available at all. Handling that interaction is the responsibility of job runners. Galaxy is structured in layers too; it is just that job manager interaction sits above the tool layer, not below it. I am not certain this is the correct philosophy - but I suspect it is the Galaxy philosophy, and it is how Galaxy main and the devteam tools are likely to continue to operate.
Like I said though, I would be eager for the platform itself to support alternative models of interaction between Galaxy, jobs, tools, parallelism, etc. One of the really nice things about the tardis approach is that it seems very self-contained and all at the tool level, so you should be able to create tardis tools pretty easily, use them inside a normal Galaxy instance, and distribute them on the tool shed.
I hope this is reasonable and we can find ways to collaborate even if we don't agree on every single point :).
-John
> Very interesting approach - thanks for reaching out and open-sourcing your stuff!
Thanks for your time having a look at this, and for your comments and advice!
> What form would this contribution take? Are there changes that need to be made to the Galaxy core, or were you hoping to mark up the core tools with tardis mark-up? I would be happy to help adjust (generalize?) the framework to support your use cases if there are things to be done in that arena, but I would be skeptical about marking up the core tools with tardis descriptions and introducing a dependency between these tools and tardis.
Yes, I agree.
I guess I am using this Galaxy-embedding hack (i.e. modifying tool-configs so that they can splice tardis-related mark-up into the command string) because it's a way I can try out Galaxy-embedding of this approach in an exploratory way (while also getting some good data-processing throughput gains along the way, of course!). I should have thought of and made that qualification.
> I would imagine most Galaxy developers would agree with me that the Galaxy philosophy is that the tool should be agnostic to the job manager/DRM used - in fact they should all ideally work with the local job runner with no DRM available at all. Handling that interaction is the responsibility of job runners. Galaxy is structured in layers too; it is just that job manager interaction sits above the tool layer, not below it. I am not certain this is the correct philosophy - but I suspect it is the Galaxy philosophy, and it is how Galaxy main and the devteam tools are likely to continue to operate.
thanks - that’s a good point about the ordering of the layers.
What I was thinking of exploring next was the possibility of the Galaxy engine using the tool input and output metadata, as supplied by the tool wrapper XML, in a similar way to the way tardis uses command-line mark-up. All that the tardis command-line mark-up provides is annotation of the command - i.e. which arguments represent input and output, and what the respective data-types are - so that the command and data can be factorised into split-commands and split-data. But when the command is embedded in Galaxy, that annotation is already supplied by the XML tool config. Therefore it should be possible to incorporate something like the tardis command-and-data factorisation approach into the Galaxy tool-config interpreter engine, with the required annotation provided by the tool XML rather than by command-line mark-up.
(A switch in the tool-config XML for each tool would control whether parallelisation / job splitting was to be enabled for that tool. If enabled, then two or three additional user input fields would be generated for the tool UI (e.g. parallelise or not? chunk size? - and I would include an option for randomly sampling the input there as well, as it's a handy place to put it); and then on submission the engine would factorise the job into splits, with a supervisor similar in function to tardis handling job submission to the cluster and output collation.)
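(Purely as a hypothetical sketch of the kind of switch I mean - not an existing Galaxy tag, and distinct from the existing Galaxy job-splitting support mentioned earlier:

    <tool id="some_tool" name="Some tool">
      <!-- hypothetical: enabling this would generate the extra UI fields
           and route the job through a tardis-like split/join supervisor -->
      <parallelism enabled="true" default_chunksize="50000" allow_sampling="true"/>
      ...
    </tool>
)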
> Like I said though, I would be eager for the platform itself to support alternative models of interaction between Galaxy, jobs, tools, parallelism, etc. One of the really nice things about the tardis approach is that it seems very self-contained and all at the tool level, so you should be able to create tardis tools pretty easily, use them inside a normal Galaxy instance, and distribute them on the tool shed.
Thanks for the feedback! I'll have a think about that approach as well - i.e. tardifying tools and putting these in the tool shed. This is what I am doing internally, with the tardified tools in our local tool shed.
> I hope this is reasonable and we can find ways to collaborate even if we don't agree on every single point :).
>
> -John
Thanks again for looking at this and for your encouraging comments. I probably won't make all that much progress with the "put it in the Galaxy engine" idea for a while - but that's a potential road map I had thought of. In the near future I need to tidy up a few things, such as getting tardis to use DRMAA. I'm also continuing to "tardify" a few tools where that's useful for us, using the current Galaxy-embedding hack, to get some processing done and also as further test cases. Usually each one flushes out another bug or two.
Cheers
Alan
On Thu, Nov 14, 2013 at 5:02 AM, McCulloch, Alan alan.mcculloch@agresearch.co.nz wrote:
> What I was thinking of exploring next was the possibility of the Galaxy engine using the tool input and output metadata, as supplied by the tool wrapper XML, in a similar way to the way tardis uses command-line mark-up. [...]
Placing this idea in a slightly broader context: I think this kind of extension point is something a lot of people want - not just for job splitting, but more generally for allowing users to supply guidance to vary aspects of job scheduling (most obviously, which queue to send a job to, for instance).
I have created a Trello card for this here: https://trello.com/c/H87LotF7.
Not a lot of specifics yet, but it is something I have been thinking about and discussing with people for a long time... I think the hard part is what to do in the context of workflows and workflow submission.
Thanks again for the detailed e-mail and response. Good luck with tardis and thanks for using Galaxy!
-John