Hi there
I see from the PRs landing in Galaxy and the comments on things like issue #1701 (https://github.com/galaxyproject/galaxy/issues/1701) that there's lots of work happening on the workflow side of Galaxy. This is an area of interest at SANBI too, so we'd like to coordinate development efforts as much as possible. To this end:
1) Are there forks to track so we can see what new code is landing?
2) Is there a roadmap for workflow work or perhaps can we have a Hangout to talk about this?
3) Specifically in terms of workflows and parallelisation: are there any plans to work on running workflows as opposed to just generating lots of jobs? I know this is a major change to how Galaxy works - it would mean something like submitting a workflow specification to a job runner that is located on the cluster, and then returning the results of workflow execution.
4) Currently parallelisation in Galaxy is supported using two mechanisms: collections and dataset splitters/tasks. Are there plans on extending and harmonising Galaxy's parallelisation capabilities?
Thanks, Peter
On Mon, Feb 22, 2016 at 7:57 AM, Peter van Heusden pvh@sanbi.ac.za wrote:
Hi there
...
4) Currently parallelisation in Galaxy is supported using two mechanisms: collections and dataset splitters/tasks. Are there plans on extending and harmonising Galaxy's parallelisation capabilities?
I'm not sure there is anything formal, but chatting to John and others at GCC2015, we recognised that the split/merge capabilities in the Python datatype classes overlap heavily with splitting datasets into collections and merging collections back into datasets.
https://wiki.galaxyproject.org/Events/GCC2015/BoFs/DataSplittingAndParalleli...
One idea we mooted was defining (pseudo) tools for dataset splitting and merging using the existing datatype classes, with similar integration into the framework as the datatype converter tools.
For example, you could in principle merge a collection of text files using the text datatype's merge functionality (which is essentially a cat command).
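As a rough illustration of what "essentially a cat command" means here, the following is a minimal standalone sketch in Python. It is not Galaxy's datatype code: the names merge_text_files and demo_merge are hypothetical and exist only for this example.

```python
import shutil
import tempfile
from pathlib import Path


def merge_text_files(part_paths, output_path):
    """Concatenate the parts in order, like `cat part1 part2 > output`.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(src, out)


def demo_merge():
    """Write two small parts to a temporary directory and merge them."""
    with tempfile.TemporaryDirectory() as tmp:
        parts = []
        for i, text in enumerate(["a\nb\n", "c\n"]):
            path = Path(tmp) / f"part{i}.txt"
            path.write_text(text)
            parts.append(path)
        merged = Path(tmp) / "merged.txt"
        merge_text_files(parts, merged)
        return merged.read_text()
```

A plain concatenation like this only suits line-oriented text. Formats with headers or indexes would need their own merge logic, which is presumably why merging lives on the datatype classes.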
There are a lot of details to think about, particularly for splitting where currently tool wrappers using parallelisation have some control (e.g. split a large FASTA file into chunks of 1000 sequences), which might need to be exposed in any UI for creating a collection from a single file.
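To make the FASTA case concrete, here is a minimal sketch of record-based splitting, assuming the chunk size is the one setting a user would need to supply. The function names read_fasta_records and split_fasta and the chunk_size parameter are hypothetical and are not taken from Galaxy's splitter code.

```python
from itertools import islice


def read_fasta_records(lines):
    """Yield each FASTA record as a list of lines (header plus sequence)."""
    record = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    if record:
        yield record


def split_fasta(lines, chunk_size=1000):
    """Yield chunks, each a list of at most chunk_size records.

    chunk_size is a hypothetical stand-in for the control that tool
    wrappers currently have over splitting.
    """
    records = read_fasta_records(lines)
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk could then be written out as one element of a collection. With open("big.fasta") as handle, list(split_fasta(handle)) would give chunks of up to 1000 sequences each.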
Peter
Peter -
My plans for pre-GCC workflow work are sort of outlined in this issue: https://github.com/galaxyproject/planemo/issues/408 (I want an abstract for GCC and BOSC like "Planemo – A Scientific Workflow SDK").
I've been doing most of my work out of this branch: https://github.com/galaxyproject/galaxy/compare/dev...common-workflow-langua.... It has my work in progress on CWL support; collection operations (rejected once from Galaxy here: https://github.com/galaxyproject/galaxy/pull/1313, but these are so important that I'm going to take another stab at pushing them into Galaxy); and expression tools to produce values that will hopefully tie back into workflows as connections for non-data parameters, both as Galaxy-native entities and as CWL-based entities.
There have been some completely valid complaints about the background workflow scheduling being slow and buggy. These will need to be fixed by 16.04, since all workflows will be executed this way as of then. I also hope to take another pass at subworkflows: better tracking of sources, allowing upgrading of subworkflow steps, and fixing glaring bugs like https://github.com/galaxyproject/galaxy/issues/1739.
Peter C. mentioned splitting and joining files into/from collections in workflows based on the datatype methods (so hooking into parallelism). I have some initial WIP on this here https://github.com/jmchilton/galaxy/commit/c4d93acdb3b0f89b970b7c3d17b965be8... as part of this branch https://github.com/jmchilton/galaxy/tree/split_merge_collections. I spent a couple of hours on it, and I think a day or two more would give me a usable prototype to hack on. I don't remember running into any big hurdles. (So the answer to your last question is a definitive yes.)
Sam started a bunch of work on completely replacing the workflow form with an API-driven one here: https://github.com/galaxyproject/galaxy/pull/1249. I know he hopes to have that done in 16.04. It will allow us to delete a bunch of paths through the workflow code and should allow future developments to be made more rapidly. It will also ensure everything comes through the API, which means Galaxy's test coverage of workflow stuff will be much higher (given our depth of workflow API tests).
I'm happy to have a hangout to discuss this more. I consider the planemo issue something of a roadmap for what I want to work on in the first half of 2016, but I might get pulled away or told the project has other priorities.
As for scheduling workflows instead of jobs: this is intriguing and would probably be needed to get streaming working well in Galaxy. So I would say I want to work on it someday, but I probably won't get to it in 2016. If others want to hack on it, that is fantastic, but it is also a difficult feat, at least when it comes to scheduling out and optimizing pieces of the workflow. Kyle Ellrott, Dannon, and I had some interesting ideas about scheduling whole workflows on local Galaxy instances running on a cluster and just collecting the outputs. That would be significantly more doable, given that I sort of sculpted the changes made to backgrounding workflows to preserve things for doing that, though the work left is probably still a hard task.
Hope this helps.
-John
galaxy-dev@lists.galaxyproject.org