Using input dataset names in output dataset names


Peter Cock

7 Nov 2013

12:21 p.m.

Hi all,

I'd like to change the output dataset labelling in Galaxy file format conversion tools.

e.g. If the input is history entry 1 (e.g. "My Genes") then the output from fasta_to_tabular.xml is currently named "FASTA-to-Tabular on data 1". I would prefer this was "FASTA-to-Tabular on data My Genes" or better "My Genes (as tabular)".

I've just done this for my BLAST XML to tabular tool, using the .display_name trick: https://github.com/peterjc/galaxy_blast/commit/31e31c4b5deadd60828ce6e6a381a...

Would a pull request doing this to the built-in conversion tools be favourably received?

Alternatively, would it be preferable to simply reuse the input dataset's name unchanged for simple format conversion tools (without text about the conversion)?

Related to this, would people prefer if the $on_string in the case of a single input file was the input file's name (e.g. "My Genes") rather than "data 1"? (When there are multiple input files, $on_string needs to be kept short).

Regards,

Peter


Bjoern Gruening

7 Nov 2013

12:29 p.m.

Hi Peter,

thanks for raising this important topic.

I think the following trello card has a similar idea and a patch attached.

https://trello.com/c/JnhOEqow

It would be great if we could simplify the naming of datasets. Especially if you run a workflow with several inputs, you would like to keep the input name through the whole workflow to the end.

Cheers,
Bjoern

___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/

Peter Cock

12:40 p.m.

On Thu, Nov 7, 2013 at 12:29 PM, Bjoern Gruening <bjoern.gruening@gmail.com> wrote:

...

Hi Peter,

thanks for raising this important topic.

I think the following trello card has a similar idea and a patch attached.

https://trello.com/c/JnhOEqow

It would be great if we can simplify the naming of datasets, especially if you run a workflow with several input, you would like to keep the input name through the whole workflow to the end.

Yes, I was aware of some more general ideas like that - and I agree this is important. However, with the conversion tool naming we can make a small improvement right now, without having to modify the Galaxy core.

Peter

Peter Cock

3:18 p.m.

On Thu, Nov 7, 2013 at 12:21 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

Related to this, would people prefer if the $on_string in the case of a single input file was the input file's name (e.g. "My Genes") rather than "data 1"? (When there are multiple input files, $on_string needs to be kept short).

That turned out to be quite an easy change (patch below), and personally I think this makes the $on_string much nicer.

Peter

--

$ hg diff lib/galaxy/tools/actions/__init__.py
diff -r 77d58fdd1c2e lib/galaxy/tools/actions/__init__.py
--- a/lib/galaxy/tools/actions/__init__.py Tue Oct 29 14:21:48 2013 -0400
+++ b/lib/galaxy/tools/actions/__init__.py Thu Nov 07 15:15:42 2013 +0000
@@ -181,6 +181,7 @@
         input_names = []
         input_ext = 'data'
         input_dbkey = incoming.get( "dbkey", "?" )
+        on_text = ''
         for name, data in inp_data.items():
             if not data:
                 data = NoneDataset( datatypes_registry = trans.app.datatypes_registry )
@@ -194,6 +195,7 @@
             else: # HDA
                 if data.hid:
                     input_names.append( 'data %s' % data.hid )
+                    on_text = data.name # Will use below if only one input dataset
             input_ext = data.ext

             if data.dbkey not in [None, '?']:
@@ -230,7 +232,10 @@
         output_permissions = trans.app.security_agent.history_get_default_permissions( history )
         # Build name for output datasets based on tool name and input names
         if len( input_names ) == 1:
-            on_text = input_names[0]
+            #We recorded the dataset name as on_text earlier...
+            if not on_text:
+                #Fall back on the shorter 'data %i' style:
+                on_text = input_names[0]
         elif len( input_names ) == 2:
             on_text = '%s and %s' % tuple(input_names[0:2])
         elif len( input_names ) == 3:
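To see the effect of the patch in isolation, here is a minimal standalone sketch of the naming logic, not the Galaxy code itself. The function name `build_on_text` and the `(hid, name)` tuples are assumptions made for illustration, and the branch for three or more inputs is a placeholder, since the patch context stops at that point.

```python
def build_on_text(inputs):
    """Build the text that follows "on" in a default output name.

    `inputs` is a list of (hid, name) pairs, one per input dataset.
    Hypothetical helper: Galaxy does this inline in the tool action.
    """
    input_names = ['data %s' % hid for hid, _name in inputs]
    if not input_names:
        return ''
    if len(input_names) == 1:
        # Single input: prefer the dataset's own name...
        name = inputs[0][1]
        # ...and fall back on the shorter 'data N' style if it is empty.
        return name or input_names[0]
    if len(input_names) == 2:
        return '%s and %s' % tuple(input_names[0:2])
    # Placeholder for three or more inputs (not taken from the patch).
    return '%s, %s, and others' % tuple(input_names[0:2])


print(build_on_text([(1, 'My Genes')]))
print(build_on_text([(1, 'My Genes'), (2, 'My Proteins')]))
```

With a single named input the name is used directly; with two inputs the short "data N" form is kept so the label stays compact.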

Peter Cock

3:50 p.m.

On Thu, Nov 7, 2013 at 3:18 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Thu, Nov 7, 2013 at 12:21 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
Related to this, would people prefer if the $on_string in the case of a single input file was the input file's name (e.g. "My Genes") rather than "data 1"? (When there are multiple input files, $on_string needs to be kept short).

That turned out to be quite an easy change (patch below), and personally I think this makes the $on_string much nicer.

Peter

Getting back to my motivating example, since fasta_to_tabular.xml does not give the output a label and depends on the default, the small change to $on_string should result in the conversion of a file named "My Genes" as "FASTA-to-Tabular on My Genes", rather than "FASTA-to-Tabular on data 1" as now.

Here's another variant to keep the "data 1" text in $on_string, if people are attached to this functionality. That would result in "FASTA-to-Tabular on data 1 (My Genes)".

Also, here's an outline patch to explicitly produce my preferred label of "My Genes (as tabular)" etc.

(Bjoern is right though - a more long term solution is needed to better address naming, like the tag idea on Trello.)

Peter

------------------------------------------------------------------------------------------

$ hg diff lib/galaxy/tools/actions/__init__.py
diff -r 77d58fdd1c2e lib/galaxy/tools/actions/__init__.py
--- a/lib/galaxy/tools/actions/__init__.py Tue Oct 29 14:21:48 2013 -0400
+++ b/lib/galaxy/tools/actions/__init__.py Thu Nov 07 15:49:15 2013 +0000
@@ -181,6 +181,7 @@
         input_names = []
         input_ext = 'data'
         input_dbkey = incoming.get( "dbkey", "?" )
+        on_text = ''
         for name, data in inp_data.items():
             if not data:
                 data = NoneDataset( datatypes_registry = trans.app.datatypes_registry )
@@ -194,6 +195,8 @@
             else: # HDA
                 if data.hid:
                     input_names.append( 'data %s' % data.hid )
+                    #Will use this on_text if only one input dataset:
+                    on_text = "data %s (%s)" % (data.id, data.name)
             input_ext = data.ext

             if data.dbkey not in [None, '?']:
@@ -230,7 +233,10 @@
         output_permissions = trans.app.security_agent.history_get_default_permissions( history )
         # Build name for output datasets based on tool name and input names
         if len( input_names ) == 1:
-            on_text = input_names[0]
+            #We recorded the dataset name as on_text earlier...
+            if not on_text:
+                #Fall back on the shorter 'data %i' style:
+                on_text = input_names[0]
         elif len( input_names ) == 2:
             on_text = '%s and %s' % tuple(input_names[0:2])
         elif len( input_names ) == 3:

------------------------------------------------------------------------------------------

$ hg diff tools
diff -r 77d58fdd1c2e tools/fasta_tools/fasta_to_tabular.xml
--- a/tools/fasta_tools/fasta_to_tabular.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/fasta_tools/fasta_to_tabular.xml Thu Nov 07 15:42:13 2013 +0000
@@ -11,7 +11,7 @@
     </param>
   </inputs>
   <outputs>
-    <data name="output" format="tabular"/>
+    <data name="output" format="tabular" label="$input.display_name (as tabular)"/>
   </outputs>
   <tests>
     <test>
diff -r 77d58fdd1c2e tools/fasta_tools/tabular_to_fasta.xml
--- a/tools/fasta_tools/tabular_to_fasta.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/fasta_tools/tabular_to_fasta.xml Thu Nov 07 15:42:13 2013 +0000
@@ -7,7 +7,7 @@
     <param name="seq_col" type="data_column" data_ref="input" numerical="False" label="Sequence column" />
   </inputs>
   <outputs>
-    <data name="output" format="fasta"/>
+    <data name="output" format="fasta" label="$input.display_name (as FASTA)" />
   </outputs>
   <tests>
     <test>
@@ -40,4 +40,4 @@
 GTGATATGTATGTTGACGGCCATAAGGCTGCTTCTT

   </help>
-</tool>
\ No newline at end of file
+</tool>
diff -r 77d58fdd1c2e tools/fastq/fastq_to_fasta.xml
--- a/tools/fastq/fastq_to_fasta.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/fastq/fastq_to_fasta.xml Thu Nov 07 15:42:13 2013 +0000
@@ -5,7 +5,7 @@
     <param name="input_file" type="data" format="fastq" label="FASTQ file to convert" />
   </inputs>
   <outputs>
-    <data name="output_file" format="fasta" />
+    <data name="output_file" format="fasta" label="$input_file.name (as FASTA)" />
   </outputs>
   <tests>

diff -r 77d58fdd1c2e tools/fastq/fastq_to_tabular.xml
--- a/tools/fastq/fastq_to_tabular.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/fastq/fastq_to_tabular.xml Thu Nov 07 15:42:13 2013 +0000
@@ -8,7 +8,7 @@
     </param>
   </inputs>
   <outputs>
-    <data name="output_file" format="tabular" />
+    <data name="output_file" format="tabular" label="$input_file.name (as tabular)" />
   </outputs>
   <tests>

diff -r 77d58fdd1c2e tools/fastq/tabular_to_fastq.xml
--- a/tools/fastq/tabular_to_fastq.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/fastq/tabular_to_fastq.xml Thu Nov 07 15:42:13 2013 +0000
@@ -8,7 +8,7 @@
     <param name="quality" label="Quality column" type="data_column" data_ref="input_file" />
   </inputs>
   <outputs>
-    <data name="output_file" format="fastq" />
+    <data name="output_file" format="fastq" label="$input_file.name (as FASTQ)" />
   </outputs>
   <tests>

diff -r 77d58fdd1c2e tools/filters/axt_to_concat_fasta.xml
--- a/tools/filters/axt_to_concat_fasta.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/filters/axt_to_concat_fasta.xml Thu Nov 07 15:42:13 2013 +0000
@@ -14,7 +14,7 @@
       <param name="axt_input" value="1.axt" ftype="axt" />
       <param name="dbkey_1" value='hg17' />
       <param name="dbkey_2" value="panTro1" />
-      <output name="out_file1" file="axt_to_concat_fasta.dat" />
+      <output name="out_file1" file="axt_to_concat_fasta.dat" label="$axt_input.name (as FASTA)"/>
     </test>
   </tests>
   <help>
diff -r 77d58fdd1c2e tools/filters/wig_to_bigwig.xml
--- a/tools/filters/wig_to_bigwig.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/filters/wig_to_bigwig.xml Thu Nov 07 15:42:13 2013 +0000
@@ -29,7 +29,7 @@
     </conditional>
   </inputs>
   <outputs>
-    <data format="bigwig" name="out_file1" />
+    <data format="bigwig" name="out_file1" label="$input1.name (as bigwig)" />
   </outputs>
   <tests>
     <test>
diff -r 77d58fdd1c2e tools/filters/wiggle_to_simple.xml
--- a/tools/filters/wiggle_to_simple.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/filters/wiggle_to_simple.xml Thu Nov 07 15:42:13 2013 +0000
@@ -5,7 +5,7 @@
     <param format="wig" name="input" type="data" label="Convert"/>
   </inputs>
   <outputs>
-    <data format="interval" name="out_file1" />
+    <data format="interval" name="out_file1" label="$input.name (as interval)" />
   </outputs>
   <tests>
     <test>
diff -r 77d58fdd1c2e tools/stats/wiggle_to_simple.xml
--- a/tools/stats/wiggle_to_simple.xml Tue Oct 29 14:21:48 2013 -0400
+++ b/tools/stats/wiggle_to_simple.xml Thu Nov 07 15:42:13 2013 +0000
@@ -5,7 +5,7 @@
     <param format="wig" name="input" type="data" label="Convert"/>
   </inputs>
   <outputs>
-    <data format="interval" name="out_file1" />
+    <data format="interval" name="out_file1" label="$input.name (as interval)" />
   </outputs>
   <tests>
     <test>
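For comparison, the two label styles discussed in this message can be sketched as plain string formatting. This is only an illustration: the helper names are hypothetical, the "tool name on $on_string" default is inferred from the "FASTA-to-Tabular on data 1" example in the thread, and the sketch uses the history number (hid) so that the result matches the "data 1" form quoted above.

```python
def on_text_with_hid(hid, name):
    # Variant that keeps the 'data N' text and appends the dataset name.
    return 'data %s (%s)' % (hid, name)


def default_label(tool_name, on_text):
    # Assumed shape of the default label, based on the thread's example.
    return '%s on %s' % (tool_name, on_text)


def conversion_label(input_name, target_format):
    # Explicit label in the style of label="$input.display_name (as tabular)".
    return '%s (as %s)' % (input_name, target_format)


print(default_label('FASTA-to-Tabular', on_text_with_hid(1, 'My Genes')))
print(conversion_label('My Genes', 'tabular'))
```

The first call gives the "data 1 (My Genes)" style default label; the second gives the shorter explicit label that the tool XML changes aim for.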

Peter Cock

11 Nov 2013

10:42 a.m.

On Thu, Nov 7, 2013 at 3:50 PM, Peter Cock wrote:

...

Getting back to my motivating example, since fasta_to_tabular.xml does not give the output a label and depends on the default, the small change to $on_string should result in the conversion of a file named "My Genes" as "FASTA-to-Tabular on My Genes", rather than "FASTA-to-Tabular on data 1" as now.

Here's another variant to keep the "data 1" text in $on_string, if people are attached to this functionality. That would result in "FASTA-to-Tabular on data 1 (My Genes)".


Would this patch be welcomed as a pull request? (Expanding $on_string to include the name as well as dataset number when there is only one input dataset)

How about renaming the outputs of the conversion tools?

Peter

James Taylor

4:09 p.m.

I have not tested the patch, just read it, but won't this result in dataset names like:

"Some operation on data 27 (Some operation on data 26 (Some other operation on data 25 (...(...(...))))"

(avoiding this is why we came up with HIDs in the first place).

--
James Taylor, Associate Professor, Biology/CS, Emory University
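The growth James describes can be reproduced with a small simulation. This is a sketch under assumptions: `run_chain` is a hypothetical helper, each tool is taken to use the default "tool name on $on_string" label, and every step consumes the previous step's single output.

```python
def run_chain(start_name, operations, include_name):
    """Return the final dataset name after running `operations` in sequence.

    Each step takes the previous output as its only input. When
    `include_name` is true, $on_string embeds the input's name.
    """
    name = start_name
    hid = 1
    for op in operations:
        if include_name:
            on_text = 'data %d (%s)' % (hid, name)
        else:
            on_text = 'data %d' % hid
        name = '%s on %s' % (op, on_text)
        hid += 1
    return name


ops = ['Filter', 'Sort', 'Cut']
print(run_chain('My Genes', ops, include_name=False))
print(run_chain('My Genes', ops, include_name=True))
```

With HIDs only, the final name stays the same length however long the chain is; with names embedded, each step wraps the previous name in another level of parentheses.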

Peter Cock

4:22 p.m.

On Mon, Nov 11, 2013 at 4:09 PM, James Taylor <james@jamestaylor.org> wrote:

...

I have not tested the patch, just read it, but won't this result in dataset names like:

"Some operation on data 27 (Some operation on data 26 (Some other operation on data 25 (...(...(...))))"

Potentially - it depends on how the tools use $on_string.

If the tools added a postscript you'd get:

"Original dataset (as tabular) (filtered) (...)"

Neither is ideal. I'd prefer to see something more like this tag idea: https://trello.com/c/JnhOEqow

What about my suggestion that for simple format conversion tools we simply reuse the input dataset's name unchanged (without text about the conversion)? That seems a good compromise.

...

(avoiding this is why we came up with HIDs in the first place).

I don't like the HIDs - unlike dataset names, the HIDs are not entirely reproducible - they depend on the order of upload, whether the history was clear to begin with, etc.

Peter
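The postscript style, and the compromise of leaving the name untouched for pure format conversions, can be sketched as follows. The helper names and the `keep_name` switch are hypothetical and only illustrate the idea from the message above.

```python
def add_postscript(name, postscript):
    # Append a short note in parentheses, e.g. "(filtered)".
    return '%s (%s)' % (name, postscript)


def convert(name, target_format, keep_name=False):
    # keep_name=True models the compromise: a simple format conversion
    # reuses the input dataset's name unchanged.
    if keep_name:
        return name
    return add_postscript(name, 'as %s' % target_format)


def filter_rows(name):
    return add_postscript(name, 'filtered')


print(filter_rows(convert('Original dataset', 'tabular')))
print(filter_rows(convert('Original dataset', 'tabular', keep_name=True)))
```

Postscripts still accumulate along a workflow, but skipping them for conversions removes one source of growth.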

John Chilton

4:30 p.m.

*I had been composing this e-mail for a while, so it is rather awkward given this morning's responses, but I felt it best to just get it out there rather than continue to bake the ideas :)*

If I were not employed by Penn State, I would say you guys should be using galaxy-extras - these problems are all solved by multiple file datasets :), but since I am, I am not going to mention that.

I agree with Peter, the "tag" idea is probably a better way to get around this and probably represents an improvement on HIDs. There are a lot of open tickets related to things like this, so I have picked one at random and sketched out what I think the path forward should maybe be.

https://trello.com/c/dQA7Y5vS

As mentioned by James, the problem with Peter's first attached patch is that after several iterations the name gets bigger and bigger. The tags patch put together, or at least linked to, by Bjoern should limit the size of output names over a workflow, right? The downside is that it is not used by default - tool authors have to use it.

So, my vote would be to combine the approaches. Specify this new labeling attribute (I would call it on_name_tag_string instead of on_tag_string because tags have other meanings in Galaxy), then provide a Galaxy configuration option that would use this instead of on_string by default for all tools (or maybe just replace on_string with on_name_tag_string) so that tools that explicitly use on_string would pick up the enhancements as well. Galaxy Main wouldn't have to change its default, but institutions who deem the name tag more important could.

-John
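A rough sketch of the combined approach follows, assuming a per-instance switch. The option name `prefer_name_tag` and both helpers are hypothetical; only on_string and on_name_tag_string come from the message above, and how the name tag itself would be derived is left open.

```python
def pick_on_string(on_string, on_name_tag_string, prefer_name_tag=False):
    """Choose which text a tool sees as its $on_string.

    `prefer_name_tag` stands in for the proposed configuration option.
    """
    if prefer_name_tag and on_name_tag_string:
        return on_name_tag_string
    # Default behaviour, and the fallback when no name tag is available.
    return on_string


def default_output_name(tool_name, on_string, on_name_tag_string,
                        prefer_name_tag=False):
    chosen = pick_on_string(on_string, on_name_tag_string, prefer_name_tag)
    return '%s on %s' % (tool_name, chosen)


print(default_output_name('FASTA-to-Tabular', 'data 1', 'My Genes'))
print(default_output_name('FASTA-to-Tabular', 'data 1', 'My Genes',
                          prefer_name_tag=True))
```

Because the substitution happens where $on_string is resolved, tools that already reference on_string would pick up the name tag without any change to their XML.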
