Thank you for this detailed description!

I already have a follow-up question.
I'm working on Galaxy CloudMan:
Galaxy is at revision 93cda3eb81 (master branch, from 11 Jun 2015).

But I can only find "Build dataset pair|list", not "List of Dataset Pairs" as
in the video. In which version was that implemented?

Best,
Alexander

2015-06-15 10:17 GMT-05:00 John Chilton <jmchilton@gmail.com>:
On Wed, Jun 10, 2015 at 4:04 PM, Alexander Vowinkel
<vowinkel.alexander@gmail.com> wrote:
> Hi Folks,
>
> thank you so far for the previous help. I got much further.
> Now I'm stuck with data collections.
>
> Because this is quite a list, I appreciate also answers to parts of my
> questions ;)
>
> I have two issues:
> A) manual definition of data collections (any type) by user and/or admin
> B) definition of data collections as input/output of a tool and inside a
> workflow
>
>
> A) manual
> Basically I would like to create
> i) a list of fastq files (unpaired)
> ii) a paired set of two fastq files
> iii) a list of pairs, each pair consisting of two fastq files
>
> How can I do that?
> By using the web app? As user? As admin?
> By working via ssh on the server?

So each of these got much easier/more robust with the most recent release.

From the user perspective - for any of these options you will want to
load the fastq files into a history, open the manage multiple datasets
option (https://wiki.galaxyproject.org/Histories#Managing_Multiple_Datasets_Easily),
select the datasets, and then choose the collection type from the menu.
Each choice will cause a widget to pop up allowing you to group the
datasets (into a list, a pair, or a list of pairs, depending on your
selection).
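
To make the grouping step concrete, here is a small standalone Python
sketch of how file names could be sorted into forward/reverse pairs. It
is only an illustration and not Galaxy's implementation; the
<sample>_R1.fastq / <sample>_R2.fastq naming convention and the function
name group_into_pairs are assumptions made up for this example.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) naming convention: <sample>_R1.fastq / <sample>_R2.fastq
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq$")


def group_into_pairs(names):
    """Return (complete_pairs, unpaired_names) for a list of dataset names."""
    pairs = defaultdict(dict)
    unpaired = []
    for name in names:
        match = PAIR_PATTERN.match(name)
        if match is None:
            # Name does not follow the assumed convention at all.
            unpaired.append(name)
            continue
        role = "forward" if match.group("read") == "1" else "reverse"
        pairs[match.group("sample")][role] = name
    # Keep only samples that have both a forward and a reverse read.
    complete = {s: p for s, p in pairs.items() if len(p) == 2}
    unpaired.extend(n for p in pairs.values() if len(p) < 2 for n in p.values())
    return complete, sorted(unpaired)
```

Names that do not match the pattern, or samples with only one read, end
up in the second return value so they can be paired by hand.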

The most complicated option is the list of pairs - this option is
demonstrated in the first video in Anton's recent NGS 101 -
Reference-based RNA-seq series
(https://vimeo.com/channels/884356/128265983). More information is at
https://wiki.galaxyproject.org/Learn/GalaxyNGS101.

For all user-centric scenarios - you will need to get the plain
datasets into a history first. FTP upload for instance doesn't support
creating collections directly - you can import datasets and then
create them. Likewise - data libraries do not currently support
dataset collections. I believe there are Trello cards for both of
these issues.

For admins - there is a dataset collection API - I can point you at
examples if you want - but this doesn't seem to be your interest.
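
As a rough sketch only, this is what a request body for creating a list
of pairs through the API might look like. The field names
(collection_type, element_identifiers, src, and so on) and the idea of
POSTing the body to /api/histories/<history_id>/contents are written
from memory and should be treated as assumptions to verify against the
Galaxy API documentation for your release. The helper name
list_of_pairs_payload and the dataset IDs are placeholders invented for
this example.

```python
import json


def list_of_pairs_payload(name, pairs):
    """Build a request body for a list:paired collection.

    pairs maps a sample name to (forward_id, reverse_id), where the IDs
    are encoded history dataset IDs. Field names are assumptions; check
    them against the API docs of your Galaxy release.
    """
    elements = []
    for sample, (forward_id, reverse_id) in pairs.items():
        elements.append({
            "name": sample,
            "src": "new_collection",
            "collection_type": "paired",
            "element_identifiers": [
                {"name": "forward", "src": "hda", "id": forward_id},
                {"name": "reverse", "src": "hda", "id": reverse_id},
            ],
        })
    return {
        "type": "dataset_collection",
        "collection_type": "list:paired",
        "name": name,
        "element_identifiers": elements,
    }


# Placeholder IDs; real ones come from the history contents listing.
payload = list_of_pairs_payload("rnaseq reads", {"sample1": ("FWD_ID", "REV_ID")})
print(json.dumps(payload, indent=2))
```

A plain list or a single pair would presumably use the same shape with
collection_type set to "list" or "paired" and no nested elements.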

>
>
> B) in tool/workflow
> Here I also have different approaches I would like to realize:
> i) use a collection as input for a tool
> ii) create a collection as output of a tool
> ii.1) from known # of output parameters
> ii.2) from unknown # of output parameters
>
> For these things I was trying to find some tools in the toolshed to see how
> they do it, but I couldn't quite adapt them.

I would look in the following directory instead of the tool shed -
https://github.com/galaxyproject/galaxy/tree/dev/test/functional/tools.
These are the tools used to drive the testing of the collections
implementation and contain some very stripped down examples of what is
possible.

>
> i) use a collection as input for a tool
> this is well documented - realizable by type="data_collection" and the
> collection_type.
> Unfortunately I can't test this because I can't create a collection so far
> ;) - see A

Indeed :). Some good examples here are the tools in the RNA-seq
pipeline - Tophat, Bowtie2, etc.

>
> ii) create a collection as output of a tool
> Here it gets blurry for me.

So one can get very far without ever creating an output from a tool
explicitly. I contend most of the time - if you have a list of bam
files and you want to create another list of bam files - you just want
to map some operation over them. This is demonstrated in that RNA-seq
outline - and talked about in a more theoretical way in my GCC talk
from last year http://bit.ly/gcc2014workflows.
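
Conceptually, mapping over a collection behaves like the following toy
Python sketch: the same operation is applied to every element and the
element identifiers are carried over to the output. This is only an
analogy, not Galaxy code; map_over and the file names are made up for
illustration.

```python
def map_over(collection, operation):
    """Apply operation to each element, keeping the element identifiers."""
    return {identifier: operation(dataset)
            for identifier, dataset in collection.items()}


# Toy "list of bam files" keyed by element identifier.
bams = {"sample1": "sample1.bam", "sample2": "sample2.bam"}
# Stand-in for running a tool on each element.
sorted_bams = map_over(bams, lambda path: path.replace(".bam", ".sorted.bam"))
```

The result is again a list with the same identifiers, which is why no
explicit collection output is needed in the tool for this case.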

There are definitely cases when you want to explicitly create
collections though - the current best documentation on this is going
to be the pull request that added them - not the implementation but
the description which actually lays out these same categories and how
to handle them with explicit complete examples.
https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-explicitly-produce-dataset

Hopefully this helps - please follow up with additional questions as
you have them. I am keen to see more developers leveraging dataset
collections.

Thanks a bunch.
-John

>
> ii.1) from known # of output parameters
> Here I didn't find a tool. I just thought, it might be a simpler case than
> ii.2 and
> good to understand the concept.
> I would be glad if someone could explain the way(s) to do this.
>
> ii.2) from unknown # of output parameters
> For this I found barcode splitter tools (also from devteam) that have
> different approaches.
> But. Their output (defined in xml) is only some report file.
> The output files seem to be fed into the history.
> And here I don't know how to get hands on these files when I want to use
> them to feed them into the next step during a workflow.
>
> Help highly appreciated!
>
> Thanks!
> Alexander
>