Tool code for symlinking a data collection from input to output?
Has anyone written a tool, part of which simply takes in a data collection and provides it (symlinked) in a corresponding output data collection? We have a quality control tool that lets us stop a workflow, preventing subsequent jobs from running if a problem has been detected. Currently our tool only symlinks through an individual dataset. It would be great if it could work on a whole data collection of any sort.
I see there's a <collections> tag now. Is there an easy solution using this that doesn't lose metadata?
<collection name="genericCollectionSymlink" label="Workflow datasets">
    <discover_datasets pattern=???????????? visible="true" ......................?????????
</collection>
At the moment we use the parameters format_source="workflow_files" and metadata_source="workflow_files" to pass through all the info on an input dataset...
<inputs>
    …
    <param name="workflow_files" type="data" optional="True" multiple="False" label="Workflow data" help="Select dataset(s) that subsequent workflow stages can consume if report status is not 'fail'." />
</inputs>
<outputs>
    <data name="workflow_files_pass" format_source="workflow_files" label="Workflow datasets" metadata_source="workflow_files">
        <filter>(workflow_files)</filter>
    </data>
</outputs>
Much obliged!
Damion
Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for Disease Control 655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada
I'm slowly trying to catch up on e-mail after a lot of travel in November. I answered a variant of this to Damion directly; the most relevant snippet was:
" I would not symlink the files, though. I would just take the original collection and pipe it into the next tool and add a dummy input to the next tool ("passing_qc_text_file") that would cause the workflow to fail if the QC fails. This is a bit hacky, but symbolic linking will break Galaxy's deletion, purging, etc. You could delete the original dataset collection, and that would affect the files on disk for the output collection without Galaxy having any way to know.
The workflow subsystem has the ability to define a connection like this (just wait for one tool to pass before calling the next, without an input/output relationship), but it hasn't been exposed in the workflow editor yet."
-John
On Sun, Nov 1, 2015 at 8:12 PM, Dooley, Damion <Damion.Dooley@bccdc.ca> wrote:
Has anyone done a tool part of which simply takes in a data collection and provides it (symlinked) in a corresponding output data collection? We have a quality control tool that enables us to stop workflow, preventing subsequent jobs to be run if a problem has been detected. Currently our tool only sym-links through an individual dataset. It would be great if it could work on a whole data collection of any sort.
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
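The dummy-input approach John describes could look roughly like the following Galaxy tool XML fragments. This is a sketch under assumptions, not a tested wrapper: only the name passing_qc_text_file comes from the thread, while the tool ids, labels and the other parameter names are hypothetical, and the <command> sections are left out.

```xml
<!-- Fragment 1: hypothetical QC tool (its own tool file). -->
<tool id="qc_gate" name="QC gate" version="0.1.0">
    <stdio>
        <!-- Treat any non-zero exit code as a failed job. -->
        <exit_code range="1:" level="fatal" />
    </stdio>
    <!-- <command> omitted: the script would exit non-zero when QC fails. -->
    <inputs>
        <param name="qc_report" type="data" format="txt" label="QC report" />
    </inputs>
    <outputs>
        <!-- Small marker dataset; it ends up in an error state if the job fails. -->
        <data name="passing_qc_text_file" format="txt" label="QC pass marker" />
    </outputs>
</tool>

<!-- Fragment 2: hypothetical downstream tool (a separate tool file). -->
<tool id="next_step" name="Next step" version="0.1.0">
    <!-- <command> omitted: it reads workflow_files and ignores the marker. -->
    <inputs>
        <!-- The original collection is connected directly; nothing is copied or symlinked. -->
        <param name="workflow_files" type="data_collection" collection_type="list" label="Workflow data" />
        <!-- Dummy input: present only so this job waits on the QC job's output. -->
        <param name="passing_qc_text_file" type="data" format="txt" label="QC pass marker (content unused)" />
    </inputs>
    <outputs>
        <data name="result" format="txt" label="Result" />
    </outputs>
</tool>
```

In a workflow, the QC tool's passing_qc_text_file output would be connected to the downstream input of the same name. If the QC job errors, the marker dataset is in an error state and the downstream job should not run, while the collection itself stays under Galaxy's normal deletion and purging. The cost, as noted in the thread, is that each downstream tool needs the extra input.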
On 11/17/2015 01:18 PM, John Chilton wrote:
The workflow subsystem has the ability to define a connection like this (just wait for one tool to pass before calling the next without an input/output relationship) but it hasn't been exposed in the workflow editor yet.
That's incredibly exciting, John. I can't wait for that feature to be available in workflows. It was a huge problem for my tools when I was writing internal wrappers that performed order-dependent operations on external databases. Good to hear it's going to be possible.
-- Eric Rasche Programmer II Center for Phage Technology Rm 312A, BioBio Texas A&M University College Station, TX 77843 404-692-2048 esr@tamu.edu
Eric - yeah - it should be straightforward to add these to the workflow editor; it is just sort of a matter of how to represent this, I think. I don't have any clues. The relevant commit is at: https://github.com/galaxyproject/galaxy/commit/e0a5e82bdae407535b9d7c98e3dcf... This includes example workflows built up using YAML that establish these kinds of connections. These connections are what enabled Dan Blankenberg's work on workflows for data managers that he presented at GCC 2015 and that he will post to GitHub within a week or two of GCC 2015 ;).
More background on building up workflows from YAML can be found at https://github.com/galaxyproject/galaxy/pull/1096 and https://github.com/galaxyproject/bioblend/pull/143.
-John
On Tue, Nov 17, 2015 at 7:24 PM, Eric Rasche <esr@tamu.edu> wrote:
That's incredibly exciting John. Can't wait for workflow availability of that feature. It was a huge problem for my tools when I was writing internal wrappers which did ordering dependent operations on external databases. Good to hear it's going to be possible.
On 11/17/2015 01:32 PM, John Chilton wrote:
Eric - yeah - it should be straightforward to add these to the workflow editor - it is just sort of a matter of how to represent this I think. I don't have any clues.
I do!
Advanced settings -> "Enable tool ordering/dummy IO" -> a checkbox. When checked, an extra input and output would appear at the bottom of the tool (ideally both at the same vertical height, for a nice compact ordering selector). You could then connect them just like regular IO.
-- Eric Rasche Programmer II Center for Phage Technology Rm 312A, BioBio Texas A&M University College Station, TX 77843 404-692-2048 esr@tamu.edu
Ah, I can see how symlinking could lead to file management issues. Well, we were trying to avoid the situation where use of our QC tool would require customizing any subsequent tools in a workflow, and also to reduce the disk overhead of hundred-megabyte files being passed along in a workflow.
So wow on the second paragraph: enabling dependencies outside of tool file I/O. I agree with Eric, this will be great. In our current canned workflows we actually don't need this to be editable via the interface, so are there details on how to edit a workflow file directly to put this dependency of tool B on tool A in place?
Thanks,
Damion
On 2015-11-17, 11:18 AM, "John Chilton" <jmchilton@gmail.com> wrote:
The workflow subsystem has the ability to define a connection like this (just wait for one tool to pass before calling the next without a input/output relationship) but it hasn't been exposed in the workflow editor yet.
-John
participants (4)
- Aaron Petkau
- Dooley, Damion
- Eric Rasche
- John Chilton