Re: [galaxy-dev] Consuming dataset collections

26 Jul 2016

Thanks for the questions - I have tried to revise the planemo docs to
be more explicit about what collection identifiers are and where they
come from (https://github.com/galaxyproject/planemo/commit/a811e652f23d31682f862f858dc7...).
http://planemo.readthedocs.io/en/latest/writing_advanced.html#collections

I think this might be a case where I'm too close to the problem - I
implemented collections, the tooling around them, and planemo docs so
there is probably a lot that I just assume is implicit when it is
completely non-obvious.

The collection identifier in your case is going to be something like:
1_1308_1_2_092-ch6-speaker16. The designation in the previous step -
the one outputting a collection with discovered datasets - should
actually be called "identifier" if that step is producing a
collection. The terms "designation" and "identifier" are
interchangeable from a Galaxy perspective - but I prefer using the
term "identifier" for collections and the older "designation" when
discovering un-collected individual datasets.
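
To make the filename-to-identifier step concrete, here is a minimal Python sketch of how a regular expression with named groups can split a discovered filename into a designation (identifier) and an extension. The pattern below is a hypothetical one written for the filenames in this thread, not the pattern from the actual tool XML, and the helper name split_filename is made up for illustration.

```python
import re

# Hypothetical pattern in the style of a discover_datasets regex with
# named groups; it is an assumption, not the pattern from the real tool.
PATTERN = re.compile(r"(?P<designation>.+)\.(?P<ext>TextGrid)$")


def split_filename(filename):
    """Return (designation, ext) for a matching filename, else None."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return match.group("designation"), match.group("ext")


if __name__ == "__main__":
    print(split_filename("1_1308_1_2_092-ch6-speaker16.TextGrid"))
```

With a pattern shaped like this, the extension stays out of the designation, so the identifier seen downstream would be 1_1308_1_2_092-ch6-speaker16 rather than the full filename.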

There was a little warning explaining the odd whitespace replacement
stuff that got shifted around at some point in the planemo docs - I
think I have corrected that now. The explanation for fixing up the
identifier was this:

"Here we are rewriting the element identifiers to assure everything is
safe to put on the command-line. In the future, collections will not
be able to contain keys that are potentially harmful and this won't be
necessary."

So yes this is the name you are after.
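
As a rough illustration of the script side, the sketch below applies the same substitution as the planemo example and then pairs identifiers with dataset paths. It assumes the tool passes two comma-separated strings in matching order, one of paths and one of identifiers; that second argument and the function names safe_identifier and pair_inputs are hypothetical, not part of query_textgrids.py as posted.

```python
import re


def safe_identifier(identifier):
    """Replace anything other than word characters and hyphens with '_'.

    This mirrors the re.sub call in the planemo example.
    """
    return re.sub(r"[^\w\-_]", "_", identifier)


def pair_inputs(paths_arg, identifiers_arg):
    """Pair comma-separated identifiers with comma-separated paths.

    Both arguments are assumed to list the collection elements in the
    same order (a hypothetical calling convention).
    """
    paths = paths_arg.split(",")
    identifiers = identifiers_arg.split(",")
    if len(paths) != len(identifiers):
        raise ValueError("expected one identifier per input path")
    return list(zip(identifiers, paths))
```

Sanitizing the identifiers before joining them also removes any commas, which keeps the comma-separated form unambiguous.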

As for the question, "Is a manifest-based approach a silly idea?" -
not at all, not in the least. I'd prefer to have both options
available - this current option is nice because it doesn't require a
"wrapper" script - you can build command lines and such from the
cheetah template - but definitely people already working with and
thinking about collections from inside some sort of script should have
the option to consume and produce manifests of files.
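
To give a sense of what consuming and producing a manifest from a script might look like, here is a minimal Python sketch. The JSON layout (an "elements" list of identifier/path pairs) and the function names are purely hypothetical; Galaxy does not define this format, and it is only meant to show the shape of the idea.

```python
import json
import os
import tempfile


def write_manifest(manifest_path, elements):
    """Write (identifier, path) pairs to a JSON manifest (hypothetical format)."""
    payload = {
        "elements": [
            {"identifier": identifier, "path": path}
            for identifier, path in elements
        ]
    }
    with open(manifest_path, "w") as handle:
        json.dump(payload, handle, indent=2)


def read_manifest(manifest_path):
    """Read the manifest back into a list of (identifier, path) pairs."""
    with open(manifest_path) as handle:
        payload = json.load(handle)
    return [(e["identifier"], e["path"]) for e in payload["elements"]]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        manifest = os.path.join(tmp, "manifest.json")
        write_manifest(manifest, [("1_1308_1_2_092-ch6-speaker16", "dataset_1.dat")])
        print(read_manifest(manifest))
```

A layout like this would also leave room for collection-level metadata as extra top-level keys.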

I've created an issue for this idea here -
https://github.com/galaxyproject/galaxy/issues/2658. I'm not sure if
I'll have time to get to it anytime soon - but if you or someone else
is eager to tackle the problem I could scope out an implementation
plan for this.

Thanks for the e-mail and I hope this helps,

-John

On Tue, Jul 26, 2016 at 1:59 AM, Steve Cassidy <steve.cassidy@mq.edu.au> wrote:
...
Hi all,
   I’m staring at the discussion of handling dataset collections:
http://planemo.readthedocs.io/en/latest/_writing_collections.html
but failing to see the solution to my problem.
I have a tool that creates a dataset collection, a group of files with names
like 1_1308_1_2_092-ch6-speaker16.TextGrid where the 1_1308_1_2_092 part is
a unique identifier that I’d like to keep track of.  I’ve used a
discover_datasets tag in the tool xml file to match my output filenames and
extract the designation (1_1308_1_2_092-ch6-speaker16.TextGrid) and the ext
(TextGrid).
I have another tool that runs a query over these files and generates a
single tabular result that will ideally include the identifier in some form.
Here’s the command section for that tool:
query_textgrids.py --textgrid "${",".join(map(str, $textgrid))}"
--tier $tier --regex '$regex' --output_path $output
where '$textgrid' is one of my input parameters that has multiple="true" set
so that it can be a dataset collection.  That works ok but the inputs I get
are the filenames (dataset_1.dat, etc.), not the names of the datasets.
The page above mentions something called the ‘element_identifier’ and gives
this funky example:
merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}"
--file "$input" --to $output;
I can’t see what this element_identifier thing is - the suggestion is that
it might be the dataset name, but I’m not sure.  Also I don’t understand why
the command above is doing replacement of whitespace with underscores.
If this is the name I’m after, it would seem that I’d need to pass these
names along with the textgrid files and then pair them up inside my script -
is that what I need to do?
All of this cries out to me for a more explicit representation of a dataset
collection that my tool can create and consume rather than this hacky
treatment of filenames.   If I could generate a manifest file of some kind
describing my dataset collection then none of this parsing of filenames
would be needed.  I could also consume the manifest file as well and it
could be used for collection level metadata.  Is this a silly idea?
Anyway, any help with my immediate problem would be appreciated.
Thanks,
Steve
—
Department of Computing, Macquarie University
http://web.science.mq.edu.au/~cassidy

John Chilton