On Wed, Jun 10, 2015 at 4:04 PM, Alexander Vowinkel <vowinkel.alexander@gmail.com> wrote:
Hi Folks,
thank you so far for the previous help. I got much further. Now I'm stuck with data collections.
Because this is quite a list, I also appreciate answers to only parts of my questions ;)
I have two issues:
A) manual definition of data collections (any type) by user and/or admin
B) definition of data collections as input/output of a tool and inside a workflow
A) manual
Basically I would like to create
i) a list of fastq files (unpaired)
ii) a paired set of two fastq files
iii) a list of paired fastq files (two files per pair)
How can I do that? By using the web app? As user? As admin? By working via ssh on the server?
Each of these got much easier and more robust with the most recent release.

From the user perspective: for any of these options you will want to load the fastq files into a history, open the "manage multiple datasets" option (https://wiki.galaxyproject.org/Histories#Managing_Multiple_Datasets_Easily), select the datasets, and then choose the collection type from the menu. Each choice pops up a widget that lets you group the datasets (into a list, a pair, or a list of pairs, depending on your selection). The most complicated option is the list of pairs; it is demonstrated in the first video of Anton's recent NGS 101 - Reference-based RNA-seq series (https://vimeo.com/channels/884356/128265983). More information at https://wiki.galaxyproject.org/Learn/GalaxyNGS101.

For all user-centric scenarios you will need to get the plain datasets into a history first. FTP upload, for instance, doesn't support creating collections directly; you can import the datasets and then create the collections. Likewise, data libraries do not currently support dataset collections. I believe there are Trello cards for both of these issues.

For admins there is a dataset collection API. I can point you at examples if you want, but this doesn't seem to be your interest.
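To make the three collection shapes from the question concrete, here is a minimal Python sketch that only builds the JSON bodies one might send to the dataset collection API. The field names (`collection_type`, `element_identifiers`, `src`, `instance_type`) reflect my understanding of what `POST /api/dataset_collections` accepts and should be checked against your Galaxy release; the helper names and all IDs are hypothetical placeholders, and no request is actually sent.

```python
from typing import Any, Dict, List, Tuple


def hda(name: str, dataset_id: str) -> Dict[str, Any]:
    """Element identifier pointing at a dataset already in the history."""
    return {"name": name, "src": "hda", "id": dataset_id}


def _payload(history_id: str, name: str, collection_type: str,
             elements: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Assumed request body layout; verify against your Galaxy version.
    return {
        "instance_type": "history",
        "history_id": history_id,
        "name": name,
        "collection_type": collection_type,
        "element_identifiers": elements,
    }


def list_payload(history_id: str, name: str,
                 datasets: List[Tuple[str, str]]) -> Dict[str, Any]:
    """i) a flat list of (element name, dataset id) tuples."""
    return _payload(history_id, name, "list",
                    [hda(n, i) for n, i in datasets])


def paired_payload(history_id: str, name: str,
                   forward_id: str, reverse_id: str) -> Dict[str, Any]:
    """ii) a single forward/reverse pair."""
    return _payload(history_id, name, "paired",
                    [hda("forward", forward_id), hda("reverse", reverse_id)])


def list_of_pairs_payload(history_id: str, name: str,
                          pairs: Dict[str, Tuple[str, str]]) -> Dict[str, Any]:
    """iii) a list whose elements are themselves pairs."""
    elements = [
        {
            "name": sample,
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [hda("forward", fwd), hda("reverse", rev)],
        }
        for sample, (fwd, rev) in pairs.items()
    ]
    return _payload(history_id, name, "list:paired", elements)
```

The resulting dictionary would be serialized to JSON and posted together with an API key; the datasets themselves still have to exist in the history beforehand, exactly as in the web interface route.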
B) in tool/workflow
Here I also have different approaches I would like to realize:
i) use a collection as input for a tool
ii) create a collection as output of a tool
ii.1) from known # of output parameters
ii.2) from unknown # of output parameters
For these things I was trying to find some tools in the toolshed to see how they do it, but I couldn't quite adapt them.
I would look in the following directory instead of the tool shed - https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools. These are the tools used to drive the testing of the collections implementation and contain some very stripped down examples of what is possible.
i) use a collection as input for a tool
This is well documented - realizable by type="data_collection" and the collection_type. Unfortunately I can't test this because I can't create a collection so far ;) - see A
Indeed :). Good examples here are the tools in the RNA-seq pipeline - Tophat, Bowtie2, etc.
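For reference, the input declaration discussed above comes down to a single `<param>` element. The sketch below generates one with Python's standard library so the attributes are explicit; the parameter name `reads` and the label are made-up placeholders, and only `type="data_collection"` and `collection_type` come from this thread.

```python
import xml.etree.ElementTree as ET


def collection_param(name: str, collection_type: str, label: str) -> ET.Element:
    """Build a tool <param> element that accepts a dataset collection."""
    return ET.Element("param", {
        "name": name,
        "type": "data_collection",           # marks the input as a collection
        "collection_type": collection_type,  # e.g. "list", "paired", "list:paired"
        "label": label,
    })


def render(elem: ET.Element) -> str:
    """Serialize the element to an XML string."""
    return ET.tostring(elem, encoding="unicode")


# Hypothetical paired-end input for a tool's <inputs> section.
paired_param = collection_param("reads", "paired", "Paired reads")
print(render(paired_param))
```

As I understand it, the stripped-down test tools in the Galaxy repository then refer to the two halves of a paired input in the command template as `$reads.forward` and `$reads.reverse`; please confirm against those examples.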
ii) create a collection as output of a tool Here it gets blurry for me.
One can get very far without ever explicitly creating a collection as a tool output. I contend that most of the time, if you have a list of bam files and you want to create another list of bam files, you just want to map some operation over them. This is demonstrated in that RNA-seq outline and discussed in a more theoretical way in my GCC talk from last year: http://bit.ly/gcc2014workflows.

There are definitely cases when you want to explicitly create collections though. The current best documentation on this is the pull request that added them; not the implementation but the description, which lays out these same categories and how to handle them with explicit, complete examples. https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-...

Hopefully this helps - please follow up with additional questions as you have them. I am keen to see more developers leveraging dataset collections.

Thanks a bunch.
-John
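The "map an operation over a list" idea in the reply above can be modelled in a few lines of plain Python. This is only a conceptual sketch, not Galaxy code: the collection is represented as a dict from element identifier to file name, and `fake_sort_bam` is a hypothetical stand-in for a tool that takes one bam and produces one bam.

```python
from typing import Callable, Dict


def map_over(collection: Dict[str, str],
             tool: Callable[[str], str]) -> Dict[str, str]:
    """Apply a single-dataset tool to every element of a list collection.

    The output has the same element identifiers as the input, which is
    what lets the next workflow step be mapped over it in turn.
    """
    return {identifier: tool(dataset)
            for identifier, dataset in collection.items()}


def fake_sort_bam(path: str) -> str:
    """Hypothetical one-in, one-out tool; only renames the file here."""
    return path.replace(".bam", ".sorted.bam")


bams = {"sample1": "sample1.bam", "sample2": "sample2.bam"}
sorted_bams = map_over(bams, fake_sort_bam)
```

The point of the model is that the tool itself never declares a collection output; the list structure is carried along by the mapping.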
ii.1) from known # of output parameters Here I didn't find a tool. I just thought, it might be a simpler case than ii.2 and good to understand the concept. I would be glad if someone could explain the way(s) to do this.
ii.2) from unknown # of output parameters For this I found barcode splitter tools (also from devteam) that take different approaches. However, their output (as defined in the XML) is only a report file. The output files seem to be fed into the history, and I don't know how to get hold of these files when I want to feed them into the next step of a workflow.
Help highly appreciated!
Thanks! Alexander