All,

Certainly this is a very real and important problem. The devteam hasn't moved on the tagging approach outlined in the dev thread referenced by Peter, and I suspect that is because the prevailing thought on the team is that dataset naming is not the most appropriate abstraction for addressing it (though personally I would be keen to merge a pull request for the compromise approach I outlined if someone wants to put it together).

Outside the realm of dataset naming, however, the devteam is actively working on this problem in at least two ways.

- If one is performing an interactive analysis with a few initial inputs, I suspect that showing the structure and connections between datasets in the history will be a more robust way to track connections and inputs throughout an analysis than dataset names. Carl has prototyped and demonstrated some stuff internally for showing such structures - I would assume it is coming in a future release.

- If you have many samples, I suspect no approach based around individual datasets will be sufficient. Dataset collections, however, have been designed from the ground up with sample tracking in mind, and I think that with very little effort on the part of tool developers, users get very effective sample tracking. Dataset lists and lists of paired datasets (say representing replicates, samples, conditions, or patients, etc.) or more deeply nested data structures (representing hierarchical combinations of those things) are created with element identifiers at each level of the hierarchy. These identifiers are preserved transparently throughout a complex analysis in a way that names are not, and with very little effort tool developers can leverage them at merging steps to produce reports, etc. (bit.ly/gcc2014workflows).

I am not claiming the problem has been solved, but I did want to express that the devteam is working on it very actively and things will continue to improve in this realm.
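To make the element identifier idea concrete, here is a minimal sketch in plain Python. It is not Galaxy's data model or API: the `Collection` type, `map_over`, `merge_report`, and the sample names are all hypothetical, chosen only to show identifiers surviving several steps and being reused at a merging step.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a dataset list collection:
# element identifier -> dataset content (NOT Galaxy's real data model).
Collection = Dict[str, str]


def map_over(collection: Collection, tool: Callable[[str], str]) -> Collection:
    """Run a tool on every element; the identifiers carry over unchanged."""
    return {identifier: tool(data) for identifier, data in collection.items()}


def merge_report(collection: Collection) -> str:
    """Merging step: label each row with its element identifier."""
    return "\n".join(
        f"{identifier}\t{data}" for identifier, data in collection.items()
    )


# Hypothetical samples and two made-up "tools" applied in sequence.
samples = {"patient1": "acgt", "patient2": "ttga"}
trimmed = map_over(samples, lambda seq: seq[1:])
upper = map_over(trimmed, str.upper)

report = merge_report(upper)
print(report)
```

However many steps are chained, the report rows are still keyed by the original identifiers, with no renaming step involved.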
Thanks for the comments,
-John

On Fri, Nov 21, 2014 at 4:20 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Yes :(
There's been some past discussion of this from a tool developer perspective, e.g. https://trello.com/c/JnhOEqow and http://dev.list.galaxyproject.org/Using-input-dataset-names-in-output-datase...
The best individual tool authors can do is something like "$input.name processed with XXX" or "XXX on $input.name" which in a long pipeline results in extremely long names with tools sometimes prefixed and sometimes postfixed. :(
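As a rough illustration of how those names grow, the following Python sketch chains four made-up tools that alternate between the two labelling styles. The tool names and the starting file name are hypothetical; only the two label patterns come from the text above.

```python
def prefix_name(input_name: str, tool: str) -> str:
    """Label style 'XXX on $input.name'."""
    return f"{tool} on {input_name}"


def postfix_name(input_name: str, tool: str) -> str:
    """Label style '$input.name processed with XXX'."""
    return f"{input_name} processed with {tool}"


# Hypothetical four-step pipeline mixing both styles.
steps = [
    ("Trim", prefix_name),
    ("Map", postfix_name),
    ("Sort", prefix_name),
    ("Call variants", postfix_name),
]

name = "sample1.fastq"
for tool, style in steps:
    name = style(name, tool)

print(name)
```

After only four steps the original sample name is buried in the middle of the label.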
Of course, things get really complicated when a tool has multiple input files - in some cases the tool author could regard one set of files as primary and preserve their name/tag only,
Naming things is hard.
Peter
On Wed, Nov 19, 2014 at 8:34 PM, Curtis Hendrickson (Campus) <curtish@uab.edu> wrote:
Brad et al,
I would like to second the issue you raise so succinctly. The failure to automatically track the original sample name throughout the analysis (that and array selection of paired end reads) is one of the biggest barriers people face when working on many samples in Galaxy. It just gets very confusing unless you spend a lot of time on workarounds (creating workflows to rename things, editing datasets individually, etc.) – especially for non-programmer users, for whom workflows with variables and API calls are beyond the pale.
Regards,
Curtis