Hi all, I’m staring at the discussion of handling dataset collections:
http://planemo.readthedocs.io/en/latest/_writing_collections.html
but failing to see the solution to my problem.
I have a tool that creates a dataset collection, a group of files with names like 1_1308_1_2_092-ch6-speaker16.TextGrid where the 1_1308_1_2_092 part is a unique identifier that I’d like to keep track of. I’ve used a discover_datasets tag in the tool xml file to match my output filenames and extract the designation (1_1308_1_2_092-ch6-speaker16.TextGrid) and the ext (TextGrid).
I have another tool that runs a query over these files and generates a single tabular result that will ideally include the identifier in some form. Here’s the command section for that tool:
query_textgrids.py --textgrid "${",".join(map(str, $textgrid))}" --tier $tier --regex '$regex' --output_path $output
where '$textgrid' is one of my input parameters that has multiple="true" set so that it can be a dataset collection. That works ok, but the inputs I get are the filenames (dataset_1.dat, etc.), not the names of the datasets.
The page above mentions something called the 'element_identifier' and gives this funky example:
merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}" --file "$input" --to $output;
I can’t see what this element_identifier thing is - the suggestion is that it might be the dataset name, but I’m not sure. Also I don’t understand why the command above is doing replacement of whitespace with underscores.
If this is the name I’m after, it would seem that I’d need to pass these names along with the textgrid files and then pair them up inside my script - is that what I need to do?
All of this cries out to me for a more explicit representation of a dataset collection that my tool can create and consume rather than this hacky treatment of filenames. If I could generate a manifest file of some kind describing my dataset collection then none of this parsing of filenames would be needed. I could also consume the manifest file, and it could be used for collection-level metadata. Is this a silly idea?
Anyway, any help with my immediate problem would be appreciated.
Thanks,
Steve
Department of Computing, Macquarie University http://web.science.mq.edu.au/~cassidy
Thanks for the questions - I have tried to revise the planemo docs to be more explicit about what collection identifiers are and where they come from (https://github.com/galaxyproject/planemo/commit/a811e652f23d31682f862f858dc7...). http://planemo.readthedocs.io/en/latest/writing_advanced.html#collections
I think this might be a case where I'm too close to the problem - I implemented collections, the tooling around them, and the planemo docs, so there is probably a lot that I just assume is implicit when it is completely non-obvious.
The collection identifier in your case is going to be something like: 1_1308_1_2_092-ch6-speaker16. The designation in the previous step - the outputting of a collection with discovered datasets - should actually be called "identifier" if that step is producing a collection. The terms "designation" and "identifier" are interchangeable from a Galaxy perspective - but I prefer using the term "identifier" for collections and the older "designation" for discovered, un-collected individual datasets.
There was a little warning explaining the odd whitespace replacement stuff that got shifted around at some point in the planemo docs - I think I have corrected that now. The explanation for fixing up the identifier was this:
"Here we are rewriting the element identifiers to assure everything is safe to put on the command-line. In the future, collections will not be able to contain keys that are potentially harmful and this won't be necessary."
So yes, this is the name you are after.
As for the question, "Is a manifest-based approach a silly idea?" - not at all, not in the least. I'd prefer to have both options available - the current option is nice because it doesn't require a "wrapper" script - you can build command lines and such from the cheetah template - but people already working with and thinking about collections from inside some sort of script should definitely have the option to consume and produce manifests of files.
I've created an issue for this idea here - https://github.com/galaxyproject/galaxy/issues/2658. I'm not sure if I'll have time to get to it anytime soon - but if you or someone else is eager to tackle the problem I could scope out an implementation plan for this.
Thanks for the e-mail and I hope this helps,
-John
In reply to Steve Cassidy <steve.cassidy@mq.edu.au>, Tue, Jul 26, 2016 at 1:59 AM
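To make the sanitizing expression from the docs example concrete, here is a minimal Python sketch of what that regular expression does when applied to an identifier. The function name sanitize_identifier is made up for illustration; only the pattern itself comes from the merge_rows example quoted in this thread. Note that it replaces every character that is not a letter, digit, underscore, or hyphen, so dots are rewritten as well as whitespace.

```python
import re


def sanitize_identifier(identifier: str) -> str:
    """Replace anything that is not a word character or hyphen with '_'.

    Same pattern as the merge_rows example; the function name is hypothetical.
    """
    return re.sub(r"[^\w\-_]", "_", identifier)


# An identifier made only of letters, digits, '_' and '-' passes through unchanged.
print(sanitize_identifier("1_1308_1_2_092-ch6-speaker16"))

# Spaces, semicolons and dots are all rewritten, which is what makes the
# result safe to place inside a quoted command-line argument.
print(sanitize_identifier("my sample; rm -rf x.TextGrid"))
```

With an identifier that still carries its extension, such as 1_1308_1_2_092-ch6-speaker16.TextGrid, the dot would also become an underscore.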
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Hi John, thanks for the response.
So based on your updated documentation I’ve modified my script to take the identifiers as a second argument, and with a bit of juggling I now have the command line:
query_textgrids.py --textgrid "${",".join(map(str, $textgrid))}" --identifier "${",".join(map(str, [t.element_identifier for t in $textgrid]))}" --tier $tier --regex '$regex' --output_path $output
Note the juggling needed with a list comprehension to get the list of identifiers from the textgrid argument. This works ok and I can now get a result from my tool that includes the identifier:
start  end   duration  label  identifier
0.59   0.83  0.24      6:     1_1308_1_2_092-ch6-speaker16.TextGrid
1.56   1.77  0.21      I@     1_1308_1_2_094-ch6-speaker16.TextGrid
1.64   1.87  0.23      3:     1_1308_1_2_173-ch6-speaker16.TextGrid
In fact I’ll probably take the .TextGrid off the identifier so that it just names the recording I’m working with.
What I’d like to do now is to write another tool that takes this as input along with another dataset collection whose elements also have similar (or the same) identifiers but a different type (they will be acoustic features derived from an audio file). I think I can see how to do this: the input to this tool will be similar to query_textgrids above and I’ll work through the identifiers and the table together.
I saw your note on the issue re. galaxy.json and took a look in the sources for it, so this is a secret way of communicating dataset metadata back to the system? Sounds like it might be useful. I may be able to get someone to work on this, so if you have time to elaborate your ideas then please go ahead.
It seems that if I were to write my python script in the .xml file (as cheetah) I’d get access to a bunch of things that are opaque to a separate script. Would it be a useful goal to have a richer galaxy-tool interface that could make all information available to the tool wrapper visible to my Python script? One way to do that would just be to bundle everything up in JSON and send it to the script.
Again, thanks for the help.
Steve
Department of Computing, Macquarie University http://web.science.mq.edu.au/~cassidy
In reply to John Chilton <jmchilton@gmail.com>, 27 Jul 2016 at 12:29 AM
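For anyone following along, the script side of the pairing described above might look roughly like the sketch below. This is not the actual query_textgrids.py, which is not shown in the thread. The flag names are taken from the command line above, while pair_inputs, parse_args, and the example paths are hypothetical. It assumes that neither paths nor identifiers contain commas, since the command line joins both lists with ",".

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags used on the command line shown in the thread."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--textgrid", required=True)    # comma-separated dataset paths
    parser.add_argument("--identifier", required=True)  # comma-separated element identifiers
    parser.add_argument("--tier")
    parser.add_argument("--regex")
    parser.add_argument("--output_path")
    return parser.parse_args(argv)


def pair_inputs(textgrid_arg, identifier_arg):
    """Return (identifier, path) pairs, matched by position."""
    paths = textgrid_arg.split(",")
    identifiers = identifier_arg.split(",")
    if len(paths) != len(identifiers):
        raise ValueError("got %d paths but %d identifiers" % (len(paths), len(identifiers)))
    pairs = []
    for path, identifier in zip(paths, identifiers):
        # Drop a trailing .TextGrid so the identifier just names the recording.
        stem, ext = os.path.splitext(identifier)
        if ext == ".TextGrid":
            identifier = stem
        pairs.append((identifier, path))
    return pairs


# Example invocation with made-up dataset paths.
args = parse_args([
    "--textgrid", "/data/dataset_1.dat,/data/dataset_2.dat",
    "--identifier", "1_1308_1_2_092-ch6-speaker16.TextGrid,1_1308_1_2_094-ch6-speaker16.TextGrid",
])
for identifier, path in pair_inputs(args.textgrid, args.identifier):
    print(identifier, path)
```

Pairing by position relies on both lists being built from the same $textgrid parameter in the same order, which is the case for the two join expressions in the command above.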
Hey Steve -
About sketching out some work on defining output collections this way: It took a while but I just got this pull request merged (https://github.com/galaxyproject/galaxy/pull/2697#issuecomment-240331966), which includes some fixes and examples for discovering datasets with galaxy.json. I also added some planemo documentation about this topic - http://planemo.readthedocs.io/en/latest/writing_advanced.html#tool-provided-....
I think there still needs to be more documentation and more examples, but hopefully it is a start if you want to explore using galaxy.json to implement something that describes output collections. There are a number of formats that could work - maybe looking for lines in galaxy.json of the format:
{"output_name": "out1", "collection_type": "list", "elements": [{"identifier": "a1", "path": "workdir/a1.fastq", "format": "fastqsanger", "dbkey": "hg19"}, ...]}
Things like format and dbkey could be defaultable on the output element description and so totally optional in the output galaxy.json. "output_name" could reference the output name in the tool XML to match this entry up to an entry described in the XML.
If you are more just concerned about consuming collections this way - it might even be easier. I'd add, say, an index_file="input.json" to the param element in the tool XML and dump out the structure of the collection as JSON (maybe mirroring the API description - but with path elements). Here is a PR that adds some syntax to write json descriptions out to the tool working directory - it may be a template to follow. - https://github.com/galaxyproject/galaxy/pull/1405
-John
In reply to Steve Cassidy <steve.cassidy@mq.edu.au>, Wed, Jul 27, 2016 at 2:58 AM
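As a rough illustration of the output side, a tool script could emit a line in the shape proposed above along these lines. This is only a sketch of the proposed format: the thread does not establish that Galaxy parses such lines, and the function names collection_manifest_line and append_to_galaxy_json are hypothetical. The sample values are the ones from the proposal.

```python
import json


def collection_manifest_line(output_name, elements, collection_type="list"):
    """Serialize one collection description as a single JSON line.

    Each element needs "identifier" and "path"; "format" and "dbkey" are
    treated as optional and left out when not given, as the proposal suggests.
    """
    cleaned = []
    for element in elements:
        entry = {"identifier": element["identifier"], "path": element["path"]}
        for key in ("format", "dbkey"):
            if element.get(key) is not None:
                entry[key] = element[key]
        cleaned.append(entry)
    return json.dumps({
        "output_name": output_name,
        "collection_type": collection_type,
        "elements": cleaned,
    })


def append_to_galaxy_json(line, path="galaxy.json"):
    """Append one line to galaxy.json in the tool working directory (assumed location)."""
    with open(path, "a") as handle:
        handle.write(line + "\n")


line = collection_manifest_line("out1", [
    {"identifier": "a1", "path": "workdir/a1.fastq", "format": "fastqsanger", "dbkey": "hg19"},
    {"identifier": "a2", "path": "workdir/a2.fastq"},  # relies on defaults for format/dbkey
])
print(line)
```

Writing one JSON object per line keeps each entry independent, so a reader can process galaxy.json line by line.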
participants (2): John Chilton, Steve Cassidy