Ease-of-Use with galaxy and workflows (long rant)

3 Oct 2009

Hello,

I've recently tutored a genomics course, and part of the exercise in that course was to annotate short-read libraries and compare the number of reads and base-pairs in exons, introns, UTRs and intergenic regions (the exact details are irrelevant).

Galaxy seemed like the perfect tool for the exercise, and indeed was used for deriving the answers.
However, as the students and I used Galaxy, some usability issues surfaced.

Here's the problem set (some details omitted for brevity):

Given three libraries of long and short RNAs, what is the number of reads (and base-pairs) that overlap annotated coding exons (same strand and opposite strand), introns, 3'UTRs and intergenic regions?
The libraries were given as BED files (already mapped to hg18).

Ostensibly, it should be easy and straightforward to do this in Galaxy:
 1. Upload a library BED file
 2. Download a gene table from UCSC
 3. Use "Extract Features" to extract exons/introns/UTR3
 4. Use "Filter" to separate negative/positive genes and intervals
 5. Intersect the above datasets
 6. Use "Base Coverage" to count covered base-pairs.
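
For readers who want to see what steps 4-6 compute, here is a rough Python sketch. It is not how the Galaxy tools are implemented; the function names are made up for illustration, it assumes standard BED coordinates (0-based, half-open), and it uses a naive all-against-all comparison that is only suitable for small inputs. It also assumes that a read with no strand information never counts as "same strand".

```python
from collections import namedtuple

Interval = namedtuple("Interval", "chrom start end strand")

def parse_bed(lines):
    """Parse BED lines into Interval tuples (hypothetical helper)."""
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        f = line.split("\t")
        strand = f[5] if len(f) > 5 else "."
        out.append(Interval(f[0], int(f[1]), int(f[2]), strand))
    return out

def overlapping(reads, features, same_strand=None):
    """Return the reads that overlap at least one feature by >= 1 bp.

    same_strand=None ignores strand, True keeps same-strand overlaps,
    False keeps overlaps where the strands differ.
    """
    hits = []
    for r in reads:
        for f in features:
            if r.chrom != f.chrom or r.start >= f.end or f.start >= r.end:
                continue  # different chromosome or no overlap
            if same_strand is None or (r.strand == f.strand) == same_strand:
                hits.append(r)
                break  # count each read once
    return hits

def base_coverage(intervals):
    """Count distinct covered bases, merging overlaps per chromosome."""
    by_chrom = {}
    for iv in intervals:
        by_chrom.setdefault(iv.chrom, []).append((iv.start, iv.end))
    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for s, e in spans[1:]:
            if s > cur_end:
                total += cur_end - cur_start
                cur_start, cur_end = s, e
            else:
                cur_end = max(cur_end, e)
        total += cur_end - cur_start
    return total
```

With a parsed library in `reads` and extracted exons in `exons`, the same-strand answer would be `len(overlapping(reads, exons, True))` for reads and `base_coverage(overlapping(reads, exons, True))` for base-pairs.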

For the first library, the above process is fine. 
It takes longer if you've never used Galaxy, but that's understandable.
One annoyance is that the labels of the datasets are not useful ("Intersect on data 12 and data 4").
If you're very organized, immediately after running each tool you rename the dataset to something meaningful.
It's annoying, because you waste more time renaming and organizing your datasets than actually running tools.
It took me no less than 30 minutes to run everything, rename and organize it (and I'm pretty good with Galaxy...).

The real problem is with the other two libraries:
I already know exactly what needs to be done - so doing it manually will be very frustrating.

But, constructing a workflow from the current history is not as useful as it first seems.

First issue (minor but annoying):
When I clicked "Extract Workflow", galaxy did a perfect job of reconstructing all my steps.
The workflow takes two input datasets, and produces many many others.
The only problem:
 The two datasets (one - my library, the other - UCSC's Gene track) are indistinguishable in the workflow editor.
 Which one is which?
 Each input dataset has a very useful "name" attribute, but it is not set...
 The only way to tell them apart is to follow the workflow and see which tool each one is connected to next.
 A novice Galaxy user might not even notice this issue - and might mix up the two input datasets.

 A possible work-around (in the "Extract Workflow" step) is to name each input dataset after the label of the dataset in the history (something like "Input data, based on UCSC KnownGenes dataset").

Second issue:
 I have extracted the workflow and used it on the other two libraries.
 But the generated datasets - oh boy. Making sense out of them is a real pain.
 I have uploaded the library, imported the UCSC KnownGene track, and executed the workflow.
 Take a look at the following history: http://main.g2.bx.psu.edu/history/imp?id=93de35e1ee121686 .
 Can you tell me which datasets answer the question of how many of my reads intersect introns? Intergenic regions? Same-strand exons?
 (13, 15 and 22, respectively).
 Using the workflow and trying to understand which dataset is which can take almost as long as just running everything from scratch without a workflow.

Third issue:
 I personally created this workflow six days ago, by extracting it from a history that I've made.
 Looking at it now, I have no way of adding/changing/updating it. It's almost "write once - read never".
 http://main.g2.bx.psu.edu/workflow/imp?id=6e55935a2b1f3d59

 I will need to carefully trace each chain of datasets and tools to understand it - that's really annoying (I realize nobody cares about my annoyance level, but in practice it means I won't use it and won't share it with others - it's not useful).

I don't have a good solution (this is a rant, not a bugfix...), but I think that as Galaxy goes into the Next-Gen sequencing arena, this situation will become more common and problematic. This kind of basic analysis (taking a library and annotating it) is a very standard procedure - in our lab it is done almost automatically on every next-gen sequenced library. To do it in Galaxy, the process needs to be much simpler...

A small improvement would be to add two description labels for each tool in the workflow: one (detailed) would be shown inside the workflow editor, and the other would be used as the output label for the generated dataset.
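
To make the two-label idea concrete, here is a small Python sketch. Everything in it is hypothetical (the `StepLabels` class, its field names and the `{library}` placeholder are invented for illustration and are not part of Galaxy); it only shows how an output label could be filled in from the name of the input dataset, so the history would read "Library 2: reads in introns" instead of "Intersect on data 12 and data 4".

```python
from dataclasses import dataclass

@dataclass
class StepLabels:
    """Hypothetical pair of labels attached to one workflow step."""
    editor_note: str   # detailed text shown only in the workflow editor
    output_label: str  # template for the name of the generated dataset

def render_output_label(step, **input_names):
    """Fill the output-label template with the names of the input datasets."""
    return step.output_label.format(**input_names)

intron_step = StepLabels(
    editor_note="Intersect the library with introns extracted from the gene table",
    output_label="{library}: reads in introns",
)

print(render_output_label(intron_step, library="Library 2"))
```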

Thanks for reading so far,
  gordon.

Assaf Gordon
