Hi Keith, On 13.04.2015 at 20:12, Keith Suderman wrote:
Hello,
Our group is investigating using Galaxy as a workflow engine for NLP (Natural Language Processing) tasks.
Good choice! :)
I have installed a local Galaxy instance and created wrappers for the services we use and so far everything is working great. I do have a few questions and they all fall under the “Advanced Topics” section as defined at the end of the tutorial for creating a Histogram [1]
1. parameter validation:
Many of our tools rely on additions made by previous tools in the workflow; for example, a tool that identifies noun phrases may require that the input has been run through a part of speech (POS) tagger, the POS tagger may require that the input has been run through a tokenizer, etc. Our tools can do this validation, I am just looking for a way to wire this into Galaxy so a user can only connect tools in the workflow editor if this validation passes.
I have been looking at the code for lib/galaxy/tools/parameters/validation.py and I don’t see anything that I can (easily) bend to our use case. What I was hoping for was something like:
    <input type="data" format="our_custom_format" name="input">
        <validator type="dataset_custom">
            <command interpreter="bash">validate.sh $input</command>
            <!-- OR -->
            <tool file="custom_validator.xml"/>
        </validator>
    </input>
Can you tell me how your tool detects whether its input was processed before by another tool? Metadata detection? Is it a different file type? If so, you can define your own datatype(s). Each of your tools can then only consume the file types that another tool outputs, and so on.
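To illustrate the datatype idea: if each processing stage produces its own format and later formats are subtypes of earlier ones, a connection is valid whenever the producer's format is the same as, or a subtype of, what the consumer accepts. The sketch below models only that rule in plain Python. The class names and `can_connect` are hypothetical and are not Galaxy's API; in Galaxy the formats would be declared as datatypes instead.

```python
# Hypothetical format hierarchy: each stage adds annotations to the previous one.
class Text:
    """Plain, unprocessed text."""


class Tokenized(Text):
    """Text that has been run through a tokenizer."""


class PosTagged(Tokenized):
    """Tokenized text that also carries part-of-speech tags."""


def can_connect(produced, accepted):
    """Return True if an output of format `produced` may feed an input
    that declares `accepted` (same format or a subtype of it)."""
    return issubclass(produced, accepted)


if __name__ == "__main__":
    # A noun-phrase tool that needs POS tags declares PosTagged as its input.
    print(can_connect(PosTagged, PosTagged))  # True: exact match
    print(can_connect(Tokenized, PosTagged))  # False: POS tags are missing
    print(can_connect(PosTagged, Tokenized))  # True: tagged text is still tokenized
```

With this layout the "has the tagger been run?" check becomes a property of the file type, so no per-connection validation script is needed.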
I also see the tantalizing sentence, "Custom code execution at various time points of the workflow that allows a fine grained control over the execution process", but I can't find any examples of how this is done.
This is currently only accessible via the API, I think. The backend is still being tested and, afaik, it will be integrated over the next releases.
2. data repositories / data collections
I need to be able to process collections of data pulled from remote servers. I have been looking at DataManagers and data collections in Galaxy, but everything seems to assume the data is local to the server, or can be copied/uploaded to the server.
This is the preferred way, for reproducibility reasons.
For practical and legal reasons beyond my pay grade this is not a solution in our case.
For example, an organization may be willing to allow our users to query their service for documents, run the documents through our workflow, and store the intermediate results; but they will not allow us to copy their data to another server verbatim.
There are possibilities for me to cache data, but the general use case is that I have to call an external service to fetch documents one at a time and then run the same workflow on each document.
I don't think you can use data collections for this :( What you can do is simply write a tool which takes a URL, fetches the document, and does the first step. But in the end this result/document will be stored on a server.
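A minimal sketch of what the script behind such a "takes a URL" tool might look like, using only the Python standard library. The function name `fetch_document` and its parameters are assumptions for illustration; the optional `token` stands in for the access token that the user currently pastes into a text field, and it is sent as a standard OAuth 2.0 bearer header.

```python
import urllib.request


def fetch_document(url, out_path, token=None):
    """Download `url` and write the raw bytes to `out_path`.

    `token` is an optional OAuth 2.0 bearer token (hypothetical parameter);
    when given, it is sent in the Authorization header.
    Returns the number of bytes written.
    """
    request = urllib.request.Request(url)
    if token:
        request.add_header("Authorization", "Bearer " + token)
    # Read the whole response; fine for single documents, not for huge files.
    with urllib.request.urlopen(request) as response:
        data = response.read()
    with open(out_path, "wb") as handle:
        handle.write(data)
    return len(data)
```

A tool wrapper would call this from a small command-line entry point, passing the URL parameter and the path of the output dataset. The fetched document still ends up stored as a dataset on the Galaxy server, as noted above.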
Any suggestions on how to accomplish this in Galaxy? I can do single documents, I just need to expand this to include collections of documents. A typical workflow might look something like:
a) Query Tool -> Server, find all documents that contain the word “cheese”
b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
c) WorkFlow -> for each id in the list do
   c1) Download document
   c2) Work work work work…
   c3) Persist output
I can do all of the above except the most important bit; iterating…
Oh yes, this is simple. Just create one workflow that deals with one ID. This workflow you can run on multiple ids.
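The "one workflow per ID" approach can be sketched as a small driver loop. This is only an illustration under assumptions: `run_for_each` is a hypothetical helper, and `run_one` stands for whatever call actually submits the single-document workflow (for example through the Galaxy API), which is deliberately left out because it depends on the Galaxy version and client in use.

```python
def run_for_each(doc_ids, run_one):
    """Invoke `run_one(doc_id)` once per ID and collect the results by ID.

    Failures are recorded instead of raised, so one bad document does
    not stop the rest of the batch.
    """
    results = {}
    for doc_id in doc_ids:
        try:
            results[doc_id] = ("ok", run_one(doc_id))
        except Exception as error:
            results[doc_id] = ("error", str(error))
    return results


if __name__ == "__main__":
    # Stand-in for the real workflow submission (hypothetical).
    def fake_workflow(doc_id):
        return "processed " + doc_id

    for doc_id, outcome in run_for_each(["id1", "id2"], fake_workflow).items():
        print(doc_id, outcome)
```

The list of IDs would come from the query step (b); each call then covers steps c1 to c3 for a single document.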
3. format conversion:
Is it possible for Galaxy to automatically convert between formats when designing a workflow? I see the <change_format/> tag, but that seems to change the output format of a tool based on the input (or some other condition) in the same tool; I need to be able to change the format based on the input requirements of the next tool in the workflow. For example, if Tool A produces format X, Tool B requires format Y, and a converter from X to Y has been defined in the datatypes_conf.xml; I would like for Galaxy to implicitly insert the converter from X to Y when I drag the output noodle from Tool A to Tool B in the designer. Is this possible?
Oh yes, this is supported out of the box! See here for some brief documentation: https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox#support... Here is an example of how you can write your own datatypes: https://github.com/bgruening/galaxytools/tree/master/chemicaltoolbox/datatyp...
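As a rough mental model of implicit conversion (not Galaxy's actual implementation): the datatype registry knows which converters exist, and when a produced format does not match the required one, a registered converter is placed between the two tools. The registry contents and the names `CONVERTERS` and `plan_connection` below are hypothetical.

```python
# Hypothetical registry of converters keyed by (source format, target format).
CONVERTERS = {
    ("x", "y"): "x_to_y_converter",
}


def plan_connection(produced, required, converters=CONVERTERS):
    """Return the converter steps needed to connect an output of format
    `produced` to an input that requires `required`.

    An empty list means the formats already match. A ValueError means
    no direct converter is registered for that pair.
    """
    if produced == required:
        return []
    try:
        return [converters[(produced, required)]]
    except KeyError:
        raise ValueError(f"no converter from {produced!r} to {required!r}")


if __name__ == "__main__":
    print(plan_connection("x", "x"))  # [] (no conversion needed)
    print(plan_connection("x", "y"))  # ['x_to_y_converter']
```

This sketch only handles direct, single-step conversions; whether longer converter chains are resolved automatically is not covered here.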
4. OAuth 2.0 / OpenID Connect:
I need to be able to fetch documents from data providers that require an OAuth 2.0 access token. Currently, I use a separate service to go through the OAuth authentication/authorization process and then have the user copy/paste their access token into a text field in Galaxy. Is there a way to perform the OAuth authentication dance required by the remote service inside Galaxy itself?
I don't think so, but maybe someone else has an idea here.
I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and OpenID Connect are on the radar; hopefully this use case is being considered as well.
I’m sure to have more questions after working through some visualization examples, but this should keep me busy for now.
Hope you are busy now :) Cheers and keep us up to date! Bjoern
Sincerely, Keith Suderman
REFERENCES
1. https://wiki.galaxyproject.org/Admin/Tools/AddingTools
------------------------------ Research Associate Department of Computer Science Vassar College Poughkeepsie, NY