On Wed, Jan 21, 2015 at 11:02 AM, Jorrit Boekel email@example.com wrote:
I’m toying around a little in galaxy-dist with the dataset collections feature. Since I know this is work in progress, I was wondering about some things I haven’t really found online.
It seems to work really well to run a tool on a list of datasets, and a new job is run for each list item. But when I want to reduce to a smaller number of list items, I understand I need to write some sort of merge tool myself, dependent on the data (all proteomics data here currently). This works well for reducing a collection to a single file, but I am not sure how to reduce to a new, smaller collection. In the tool I’m writing, I let the user choose the size of the collection.
Is there some way to tell galaxy dynamically how many outputs to expect AND put them in a collection? Something like:
<outputs>
  <output type="data_collection" amount_of_files="3"/>
</outputs>
Where 3 is set by the user in a param also.
Through the January 2015 release - this is not possible. It is now possible in central for tools to explicitly produce collections and this functionality will be in the next release (I think it is still an open question as to whether the team is aiming for February or March 2015 for that). There are a lot of details and examples linked to in the following pull request that I merged last week:
There are three styles of outputs of increasing complexity - tools that produce static collections (pairs or lists of fixed size), N->N operations (like normalization), and finally fully dynamic collections. Since the size can be determined in advance, your case would ideally fall under the first type, but I had not foreseen this use case, so there is currently no syntax like the one you described and you have to use the third, most complicated style.
I am going to assume this is some sort of binning operation? Let's imagine the user selects 3 bins and your tool creates a directory and populates it with mzml files:
output/bin1.mzml
output/bin2.mzml
output/bin3.mzml
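As a rough illustration of the tool side of this scenario, a script along the following lines could create those files. It is only a sketch under stated assumptions: the function name `write_bins`, the round-robin assignment, and the placeholder file contents are all hypothetical, and a real tool would write valid mzML rather than plain text.

```python
import os
import tempfile


def write_bins(records, n_bins, out_dir):
    """Distribute records round-robin into files bin1.mzml .. binN.mzml."""
    os.makedirs(out_dir, exist_ok=True)
    bins = [[] for _ in range(n_bins)]
    for index, record in enumerate(records):
        bins[index % n_bins].append(record)

    paths = []
    for number, contents in enumerate(bins, start=1):
        path = os.path.join(out_dir, "bin%d.mzml" % number)
        with open(path, "w") as handle:
            # Placeholder content only; a real tool would emit valid mzML here.
            handle.write("\n".join(contents))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        out_dir = os.path.join(work_dir, "output")
        created = write_bins(["s1", "s2", "s3", "s4", "s5"], 3, out_dir)
        print([os.path.basename(p) for p in created])
```

The number of bins would come from the user-facing param; the only thing that matters for the collection is that the files end up in one directory with names that encode the identifier and extension.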
Then you could create an output collection like this:
<outputs>
  <collection name="binned_output" type="list" label="Binned Outputs">
    <discover_datasets pattern="__name_and_ext__" directory="output" />
  </collection>
</outputs>
After the job is complete a dataset collection will be populated with three elements of type mzml and element identifiers bin1, bin2, and bin3 (inferred from the name).
The syntax of the discover_datasets thing can be quite complex and you can variously hard code properties like datatype, name, etc... or infer them from the name on the file system.
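To make the "inferred from the name" part concrete, here is a small Python sketch of how element identifiers and datatypes could be derived from file names. The regular expression is an assumed approximation of what `__name_and_ext__` captures, and the sorting by file name is also an assumption; neither is taken from Galaxy's source, and `discover` is a hypothetical helper.

```python
import re

# Assumed approximation of the __name_and_ext__ pattern: everything before
# the last dot is the name, everything after it is the extension.
NAME_AND_EXT = re.compile(r"^(?P<name>.*)\.(?P<ext>[^.]+)$")


def discover(filenames):
    """Return (element_identifier, datatype) pairs for matching file names."""
    elements = []
    for filename in sorted(filenames):
        match = NAME_AND_EXT.match(filename)
        if match:  # names without an extension do not match and are skipped
            elements.append((match.group("name"), match.group("ext")))
    return elements


print(discover(["bin2.mzml", "bin1.mzml", "bin3.mzml", "README"]))
# [('bin1', 'mzml'), ('bin2', 'mzml'), ('bin3', 'mzml')]
```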
Also, when running with two or more lists as input, is there some sort of correlation between the lists? It seems like it takes the files in dataset number order, so just checking.
Yes, definitely. The UI for creating lists is pretty limited and needs to be updated to look a lot more like the UI for creating lists of paired datasets; I think this would become a lot clearer then. Lists are ordered data structures and element identifiers are preserved across executions.
So if you start with a list of raw files with identifiers sample1, sample2, and sample3 and map an operation like peak picking over them, you would get a new list with the same order and identifiers (sample1, sample2, and sample3 in that order). If you then map an operation like peptide identification over the picked files, you would again get a list with identifiers (sample1, sample2, and sample3). Then if you have some sort of summary operation that takes in a raw file and an identification and you pass it the original list and the result of the identifications - Galaxy should match everything up and create a resulting list with sample1, sample2, and sample3. (The API lets you do more complicated things like cross-products over subsets of inputs - but this isn't exposed in the GUI yet).
If you have two ordered lists and identifiers don't match up - Galaxy's behavior should be considered undefined but it will assign the resulting list the identifiers from one or the other inputs.
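The matching described above can be pictured with a short Python sketch. This is not Galaxy's implementation: `map_over` is a hypothetical helper, lists are modeled as sequences of (identifier, value) tuples, and because mismatched identifiers are described as undefined behavior, the sketch simply refuses them instead of guessing.

```python
def map_over(raw, identifications, summarize):
    """Pair elements of two ordered lists and apply summarize to each pair.

    Each list is a sequence of (identifier, value) tuples.
    """
    raw_ids = [identifier for identifier, _ in raw]
    ident_ids = [identifier for identifier, _ in identifications]
    if raw_ids != ident_ids:
        # The thread calls this case undefined in Galaxy; this sketch
        # raises rather than picking identifiers from one input.
        raise ValueError(
            "element identifiers do not match: %r vs %r" % (raw_ids, ident_ids)
        )
    return [
        (identifier, summarize(raw_value, ident_value))
        for (identifier, raw_value), (_, ident_value) in zip(raw, identifications)
    ]


raw_files = [("sample1", "raw1"), ("sample2", "raw2"), ("sample3", "raw3")]
ident_files = [("sample1", "id1"), ("sample2", "id2"), ("sample3", "id3")]
summary = map_over(raw_files, ident_files, lambda r, i: "%s+%s" % (r, i))
print([identifier for identifier, _ in summary])
```

The resulting list keeps the order and identifiers of the inputs, which is the property that makes chained map-over steps line up.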
By the way, thanks very much John and everyone else involved in collections for doing and pushing this stuff.
Thanks - and it would be just a playground for me to write API tests against without all the excellent work Carl has put into the UI to make everything actually usable and useful.
If there are smaller issues I can help with, I’d be thrilled.
The number one thing I encourage everyone to do to help is to build awesome tools and workflows and put them in the tool shed.
Can’t stress enough how much this feature means for galaxy adoption in our lab and possibly field.
Shhh... don't tell them I am secretly wasting money they would like to put into building a platform for sequencing data analysis to address mass spec use cases - they will stop paying me.
Thanks for the kind note, -John
cheers, — Jorrit Boekel Proteomics systems developer BILS / Lehtiö lab Scilifelab Stockholm, Sweden