Galaxy for Natural Language Processing

13 Apr 2015

Hello,

Our group is investigating using Galaxy as a workflow engine for NLP (Natural Language Processing) tasks. I have installed a local Galaxy instance and created wrappers for the services we use, and so far everything is working well. I do have a few questions, all of which fall under the “Advanced Topics” section at the end of the tutorial for creating a Histogram [1].

1. parameter validation: 

Many of our tools rely on additions made by previous tools in the workflow. For example, a tool that identifies noun phrases may require that the input has been run through a part-of-speech (POS) tagger, the POS tagger may require that the input has been run through a tokenizer, and so on. Our tools can do this validation themselves; I am just looking for a way to wire it into Galaxy so that a user can only connect tools in the workflow editor if the validation passes.
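
As a rough illustration of the kind of prerequisite check described above, here is a minimal sketch. It is not Galaxy code: the tool names, the annotation layer names, and the `REQUIRES`/`PRODUCES` tables are all hypothetical and only stand in for whatever our tools actually record.

```python
# Hypothetical tables: the annotation layers each tool needs and the ones it adds.
REQUIRES = {
    "tokenizer": set(),
    "pos_tagger": {"tokens"},
    "np_chunker": {"tokens", "pos"},
}
PRODUCES = {
    "tokenizer": {"tokens"},
    "pos_tagger": {"pos"},
    "np_chunker": {"noun_phrases"},
}


def missing_layers(upstream_tools, tool):
    """Return the layers `tool` needs that no upstream tool has produced."""
    available = set()
    for name in upstream_tools:
        available |= PRODUCES[name]
    return REQUIRES[tool] - available


def can_connect(upstream_tools, tool):
    """True if every layer required by `tool` is already available."""
    return not missing_layers(upstream_tools, tool)
```

With these tables, connecting the noun phrase tool directly after the tokenizer would be rejected because the `pos` layer is missing, which is the behaviour I would like the workflow editor to enforce.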

I have been looking at the code in lib/galaxy/tools/parameters/validation.py and I don’t see anything that I can (easily) bend to our use case. What I was hoping for was something like:

	<input type="data" format="our_custom_format" name="input">
		<validator type="dataset_custom">
			<command interpreter="bash">validate.sh $input</command>
			<!-- OR -->
			<tool file="custom_validator.xml"/>
		</validator>
	</input>

I also see the tantalizing sentence, "Custom code execution at various time points of the workflow that allows a fine grained control over the execution process", but I can't find any examples of how this is done.

2. data repositories / data collections

I need to be able to process collections of data pulled from remote servers. I have been looking at DataManagers and data collections in Galaxy, but everything seems to assume the data is local to the server, or can be copied/uploaded to the server. For practical and legal reasons beyond my pay grade, copying the data is not an option in our case. For example, an organization may be willing to allow our users to query their service for documents, run the documents through our workflow, and store the intermediate results, but they will not allow us to copy their data to another server verbatim. There are possibilities for me to cache data, but the general use case is that I have to call an external service to fetch documents one at a time and then run the same workflow on each document.

Any suggestions on how to accomplish this in Galaxy?  I can do single documents, I just need to expand this to include collections of documents.  A typical workflow might look something like:

a) Query Tool -> Server, find all documents that contain the word “cheese”
b) Server -> Here is the list of document IDs [ id1, id2, …, idn ]
c) Workflow -> for each id in the list do
	c1) Download document
	c2) Work work work work…
	c3) Persist output

I can do all of the above except the most important bit: iterating…
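
Outside Galaxy, the loop in steps a) to c3) can be sketched roughly as follows. This only illustrates the control flow I am after, not a Galaxy API: `search`, `fetch`, `process`, and `persist` are hypothetical callables standing in for the query tool, the remote document service, the workflow, and the output store.

```python
def run_over_collection(query, search, fetch, process, persist):
    """Run the same per-document workflow on every document matching `query`.

    All four callables are hypothetical stand-ins supplied by the caller.
    Returns a dict mapping each document ID to "ok" or a failure message.
    """
    results = {}
    for doc_id in search(query):        # a) and b): query, get the list of IDs
        try:
            document = fetch(doc_id)    # c1) download one document at a time
            output = process(document)  # c2) run the workflow on it
            persist(doc_id, output)     # c3) store only the derived output
            results[doc_id] = "ok"
        except Exception as exc:        # one bad document should not stop the rest
            results[doc_id] = f"failed: {exc}"
    return results


# Usage with in-memory stand-ins for the remote service and the output store.
corpus = {"id1": "blue cheese", "id2": "cheese board"}
store = {}
status = run_over_collection(
    "cheese",
    search=lambda q: [i for i, text in corpus.items() if q in text],
    fetch=lambda i: corpus[i],
    process=lambda text: text.upper(),
    persist=lambda i, out: store.update({i: out}),
)
```

Each document is fetched, processed, and released before the next one is requested, so the source text never has to be stored verbatim on our side.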

3. format conversion:  

Is it possible for Galaxy to automatically convert between formats when designing a workflow? I see the <change_format/> tag, but that seems to change the output format of a tool based on the input (or some other condition) within the same tool; I need to change the format based on the input requirements of the next tool in the workflow. For example, if Tool A produces format X, Tool B requires format Y, and a converter from X to Y has been defined in datatypes_conf.xml, I would like Galaxy to implicitly insert the converter from X to Y when I drag the output noodle from Tool A to Tool B in the designer. Is this possible?
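
To make the request concrete, here is a small sketch of the lookup I would expect the designer to perform. It is not Galaxy's implementation. The XML fragment is modelled on the converter entries in datatypes_conf.xml, but the format names, the converter file name, and the `load_converters`/`plan_connection` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of datatypes_conf.xml.
DATATYPES_FRAGMENT = """
<datatypes>
  <registration>
    <datatype extension="format_x" type="galaxy.datatypes.data:Text">
      <converter file="x_to_y.xml" target_datatype="format_y"/>
    </datatype>
    <datatype extension="format_y" type="galaxy.datatypes.data:Text"/>
  </registration>
</datatypes>
"""


def load_converters(xml_text):
    """Map (source_format, target_format) to the converter tool file."""
    converters = {}
    root = ET.fromstring(xml_text)
    for datatype in root.iter("datatype"):
        source = datatype.get("extension")
        for conv in datatype.findall("converter"):
            converters[(source, conv.get("target_datatype"))] = conv.get("file")
    return converters


def plan_connection(output_format, input_format, converters):
    """Return the converter steps to insert between two tools.

    [] means the formats already match, [file] means one converter is
    inserted implicitly, and None means the tools cannot be connected.
    """
    if output_format == input_format:
        return []
    converter = converters.get((output_format, input_format))
    return [converter] if converter is not None else None


converters = load_converters(DATATYPES_FRAGMENT)
steps = plan_connection("format_x", "format_y", converters)
```

In this sketch, connecting Tool A (format X) to Tool B (format Y) yields a single implicit step, `x_to_y.xml`, while the reverse direction has no registered converter and would be refused.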

4. OAuth 2.0 / OpenID Connect: 

I need to be able to fetch documents from data providers that require an OAuth 2.0 access token. Currently, I use a separate service to go through the OAuth authentication/authorization process and then have the user copy/paste their access token into a text field in Galaxy. Is there a way to perform the OAuth authentication dance required by the remote service inside Galaxy itself? I’ve looked at the Trello site for Galaxy and see that both OAuth 2.0 and OpenID Connect are on the radar; hopefully this use case is being considered as well.
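
For reference, the “dance” I mean is the standard OAuth 2.0 authorization code flow. The sketch below only builds the two requests involved and sends nothing over the network. The endpoint URL, client ID, redirect URI, and scope passed in are placeholders, and none of this is an existing Galaxy feature.

```python
import secrets
from urllib.parse import urlencode


def build_authorization_url(authorize_endpoint, client_id, redirect_uri, scope):
    """Step 1: the URL the user's browser is sent to at the data provider.

    Returns the URL and the random `state` value to keep in the session.
    """
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{authorize_endpoint}?{urlencode(params)}", state


def state_matches(expected_state, returned_state):
    """Step 2: check the `state` echoed back on the redirect (CSRF guard)."""
    return secrets.compare_digest(expected_state, returned_state)


def build_token_request(code, client_id, redirect_uri):
    """Step 3: form fields to POST to the provider's token endpoint."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
    }
```

The access token returned in step 3 is what the user currently has to copy and paste by hand; ideally Galaxy would run these steps and hand the token to the tool.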

I’m sure to have more questions after working through some visualization examples, but this should keep me busy for now.

Sincerely,
Keith Suderman

REFERENCES

1. https://wiki.galaxyproject.org/Admin/Tools/AddingTools

------------------------------
Research Associate
Department of Computer Science
Vassar College
Poughkeepsie, NY
