Re: [galaxy-dev] dataset collections

21 Jan 2015

On Wed, Jan 21, 2015 at 11:02 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
...
Hi all,
I’m toying around a little in galaxy-dist with the dataset collections feature. Since I know this is work in progress, I was wondering about some things I haven’t really found online.
It seems to work really well to run a tool on a list of datasets, and a new job is run for each list item. But when I want to reduce to a smaller amount of list items, I understand I need to write some sort of merge tool myself, dependent on the data (all proteomics data here currently). This works well for reducing a dataset to a single file, but I am not sure about how to reduce to a new smaller collection. In the tool I’m writing, I let the user choose the size of the collection.
Is there some way to tell galaxy dynamically how many outputs to expect AND put them in a collection? Something like:
<outputs>
<output type="data_collection" amount_of_files="3"/>
</outputs>
Where 3 is set by the user in a param also.
Through the January 2015 release, this is not possible. It is now
possible in central for tools to explicitly produce collections and
this functionality will be in the next release (I think it is still an
open question as to whether the team is aiming for February or March
2015 for that). There are a lot of details and examples linked to in
the following pull request that I merged last week:

https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-...

There are three styles of outputs of increasing complexity: tools that
produce static collections (pairs or lists of fixed size), N->N
operations (like normalization), and finally fully dynamic
collections. Since the size can be determined up front, this would
ideally fall under the first type, but I had not foreseen this use
case, so there is currently no syntax like the one you described. You
will have to use the third, most complicated style.

I am going to assume this is some sort of binning operation? Let's
imagine the user selects 3 bins and your tool creates a directory
and populates it with mzml files:

output/bin1.mzml
output/bin2.mzml
output/bin3.mzml
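
As a rough sketch of the tool side, assuming a hypothetical script that gets the bin count from a user parameter and distributes items round-robin. The function name, the round-robin rule, and the plain-text file contents are all illustrative stand-ins, not Galaxy code and not real mzML writing:

```python
import os
import tempfile


def write_bins(items, n_bins, out_dir):
    """Distribute items round-robin into bin1.mzml .. binN.mzml under out_dir.

    Hypothetical helper: real mzML serialization is replaced by text lines.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(n_bins):
        path = os.path.join(out_dir, "bin%d.mzml" % (i + 1))
        with open(path, "w") as handle:
            # Every n_bins-th item, starting at offset i, goes to this bin.
            for item in items[i::n_bins]:
                handle.write(item + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    scans = ["scan%d" % k for k in range(7)]
    created = write_bins(scans, 3, os.path.join(workdir, "output"))
    print([os.path.basename(p) for p in created])
```

The only part that matters for the collection is that the file names in the directory are predictable, since the element identifiers are taken from them.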

Then you could create an output collection like this:

<outputs>
  <collection name="binned_output" type="list" label="Binned Outputs">
    <discover_datasets pattern="__name_and_ext__" directory="output" />
  </collection>
</outputs>

After the job is complete a dataset collection will be populated with
three elements of type mzml and element identifiers bin1, bin2, and
bin3 (inferred from the name).
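
To illustrate how the identifiers and datatype could be inferred from the file names, here is a minimal sketch. It assumes the __name_and_ext__ shorthand behaves roughly like a regular expression with a name group and an ext group; the regex below is my own approximation, not copied from Galaxy, and the sorting is likewise an assumption of the sketch:

```python
import os
import re

# Assumed approximation of the __name_and_ext__ shorthand (not Galaxy's source).
NAME_AND_EXT = re.compile(r"^(?P<name>.+)\.(?P<ext>[^._]+)$")


def discover(filenames):
    """Return (identifier, ext) pairs for file names that match the pattern."""
    found = []
    for filename in sorted(filenames):
        match = NAME_AND_EXT.match(os.path.basename(filename))
        if match:  # files without an extension are skipped
            found.append((match.group("name"), match.group("ext")))
    return found


if __name__ == "__main__":
    files = ["output/bin2.mzml", "output/bin1.mzml", "output/bin3.mzml"]
    for identifier, ext in discover(files):
        print(identifier, ext)
```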

The discover_datasets syntax can be quite complex: you can hard-code
properties like datatype, name, etc., or infer them from the file
name on the file system.
...
Also, when running with two or more lists as input, is there some sort of correlation between the lists? It seems like it takes the files in dataset number order, so just checking.
Yes, definitely. The UI for creating lists is pretty limited and needs
to be updated to look a lot more like the UI for creating lists of
paired datasets; I think this would become a lot clearer then. Lists
are ordered data structures, and element identifiers are preserved
across executions.

So if you start with a list of raw files with identifiers sample1,
sample2, and sample3 and map an operation like peak picking over
them, you would get a new list with the same order and identifiers
(sample1, sample2, and sample3 in that order). If you then map an
operation like peptide identification over the picked files, you
would again get a list with identifiers sample1, sample2, and sample3.
Then if you have some sort of summary operation that takes in a raw
file and an identification, and you pass it the original list and the
result of the identifications, Galaxy should match everything up and
create a resulting list with sample1, sample2, and sample3.
(The API lets you do more complicated things like cross-products over
subsets of inputs, but this isn't exposed in the GUI yet.)

If you have two ordered lists and the identifiers don't match up,
Galaxy's behavior should be considered undefined, but it will assign
the resulting list the identifiers from one or the other of the inputs.
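
The matching described above can be modelled with a small sketch. This is only a toy model of ordered lists keyed by element identifier, with hypothetical function names; it is not Galaxy's implementation. Raising an error on mismatched identifiers is a choice made for the sketch, since Galaxy's actual behavior in that case should be treated as undefined:

```python
from collections import OrderedDict


def map_over(collection, func):
    """Apply func to every element, keeping order and identifiers."""
    return OrderedDict((ident, func(value)) for ident, value in collection.items())


def match_lists(first, second):
    """Pair up elements of two lists that share the same ordered identifiers."""
    if list(first) != list(second):
        # Sketch-only policy: refuse to guess when identifiers differ.
        raise ValueError("identifiers differ: %r vs %r" % (list(first), list(second)))
    return OrderedDict((ident, (first[ident], second[ident])) for ident in first)


if __name__ == "__main__":
    raw = OrderedDict(
        [("sample1", "raw1"), ("sample2", "raw2"), ("sample3", "raw3")]
    )
    picked = map_over(raw, lambda v: v + ".picked")       # peak picking stand-in
    identified = map_over(picked, lambda v: v + ".ids")   # identification stand-in
    summary = match_lists(raw, identified)
    print(list(summary))
```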
...
By the way, thanks very much John and everyone else involved in collections for doing and pushing this stuff.
Thanks - and it would be just a playground for me to write API tests
against without all the excellent work Carl has put into the UI to
make everything actually usable and useful.
...
If there are smaller issues I can help with, I’d be thrilled.
The number one thing I encourage everyone to do to help is to build
awesome tools and workflows and put them in the tool shed.
...
Can’t stress enough how much this feature means for galaxy adoption in our lab and possibly field.
Shhh... don't tell them I am secretly wasting money they would like to
put into building a platform for sequencing data analysis to address
mass spec use cases - they will stop paying me.

Thanks for the kind note,
-John
...
cheers,
—
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden

Re: [galaxy-dev] dataset collections

John Chilton