critical feedback

26 Sep 2006

      This student was more adventurous. I think he actually could do more  
of what he tried with more experience (if he wants to get GNF data he  
should use that table), but there is some good feedback.

I do agree with one comment - that we should mask out buttons that  
don't work on the Table Browser proxy.

"My goal is to use Galaxy to retrieve gene expression data from GNF  
human gene expression atlas from UCSC genome table browser. If  
possible, I will do a gene co-expression analysis for a group of  
genes that are annotated as G-protein coupled receptors (GPCRs).

The first problem I met was that I could not find a place in UCSC  
genome table browser to retrieve genes with annotations. I think it  
is reasonable for a biologist to ask questions like: what are the  
annotated G-protein-coupled receptors or MAP kinases in the human  
genome. I went through a number of table schemas in the UCSC genome table
browser, but there are no annotations in those tables. The lack of gene
annotations in the UCSC table browser may reflect the scope of UCSC
designers and is not related to Galaxy, so I will not further discuss
this issue. I went to the GNF atlas official website, downloaded the
annotated probe file, and extracted all the genes annotated as G
protein coupled receptors. There are 385 such annotated probesets in
total in the publicly available annotation file (found with custom Python
code that searches for annotations containing both “coupled” and
“receptor”). I thought it might be possible to do this operation in
Galaxy, but I found that in the “Filter, sort, join and compare: Select”
tool there is no combinatorial pattern search. It would be helpful to
add a place where people with more expertise can customize their query
commands and save them. In other words, if people already know a little
about Perl or Python, it may be desirable for them to write a short query in
those programming languages.
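
As a rough illustration of the kind of combined ("coupled" AND "receptor") query described above, here is a minimal Python sketch. It is not the code that was actually used: the function name rows_matching_all, the tab-delimited layout, and the sample lines are all assumptions made up for the example.

```python
import csv
import io


def rows_matching_all(lines, keywords, delimiter="\t"):
    """Yield rows in which every keyword occurs somewhere (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    for row in csv.reader(lines, delimiter=delimiter):
        text = " ".join(row).lower()
        # Keep the row only if ALL keywords are present (logical AND).
        if all(k in text for k in wanted):
            yield row


# Made-up annotation lines for illustration; not real GNF atlas data.
sample = io.StringIO(
    "probe_A\tG protein-coupled receptor 1\n"
    "probe_B\tMAP kinase 3\n"
    "probe_C\tolfactory receptor\n"
)

hits = list(rows_matching_all(sample, ["coupled", "receptor"]))
hit_ids = [row[0] for row in hits]
```

With a real annotation file, the io.StringIO object would be replaced by an open file handle, and the delimiter adjusted to whatever the file actually uses.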

After obtaining the list of GPCRs, I tried to use Galaxy to
retrieve data from the UCSC table browser. I clicked on GET DATA -> UCSC
main -> Execute, which brought up a proxy page in the middle panel of
the Galaxy interface. I found at least three functions on this proxy that
did not work properly. First, the “describe table schema” button did
not work. Second, when I clicked on “upload list”, an uploading
interface appeared in the middle panel. When I clicked “submit” after
I specified the file name, the middle panel showed “This is a proxy
to the data services provided by the UCSC Genome Browser's Table
Browser.” and no data were received by Galaxy. Third, when I specified
a file name in the “output file: (leave blank to keep output in
browser)” field, I could not find the output file anywhere. Clearly
there is some “mis-communication” between Galaxy and the UCSC browser.
It may be useful to “mask” these functions in Galaxy to avoid
frustration. What I finally did was to “PASTE LIST” all the probe names
I had and retrieve the data from the UCSC browser. The resulting file in
Galaxy is the same as what I downloaded directly from the UCSC browser
website using the same set of 385 probeset names. In total, 405
expression records were extracted from the UCSC table browser.

The gene expression data was in column 16 of the output file. I
did “cut columns from a table”, then “Remove beginning of a file”, and
then “convert delimiters to tab”. These operations gave me a file
with tab-delimited expression data for 405 probesets (genes). I found
that in the data browsers of Galaxy the maximum number of columns is
27; above that, there was no straightforward way of finding out how
many columns were in the data file. Although there is a statistical
tool in Galaxy, we can only calculate “correlation for numeric
COLUMNS”, and there is no way to transpose the data matrix.
Therefore, I could not find a way to finish my goal completely in  
Galaxy.
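
To make the missing step concrete, the following is a minimal Python sketch of transposing a matrix and computing pairwise correlations between rows (probesets) rather than columns. It assumes Pearson correlation is the measure wanted, which the text does not specify; the function names and the small expression matrix are invented for illustration and are not Galaxy tools or real GNF values.

```python
import math


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return cov / (sd_x * sd_y)


def row_correlations(matrix):
    """Correlate every row with every other row (rows = probesets here)."""
    return [[pearson(a, b) for b in matrix] for a in matrix]


# Invented expression values: 3 probesets (rows) across 4 tissues (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]

by_tissue = transpose(expression)    # 4 rows of 3 values each
corr = row_correlations(expression)  # 3 x 3 probeset-by-probeset matrix
```

A column-wise correlation tool applied to the transposed matrix would give the same probeset-by-probeset result, which is why a transpose operation alone would have been enough to work around the limitation.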

I have learned one important lesson from the experience of using
Galaxy: the diversity of GENOMICS data is a great challenge for
people working on genomic research and for any software. Clearly,
Galaxy is a sequence-analysis centric environment. I am sure Galaxy
can do a great job in sequence analysis. However, when we are looking
for other types of genomics data, for example, gene expression data
from microarrays (or mass spectrometry spectra from proteomics), Galaxy
may not be a suitable tool. For example, microarray data usually comes
as an M by N matrix of numerical data, with more than 10,000
rows and hundreds of columns. The Galaxy history frame is
not suitable for the display of such data. The statistical tools and
graph data tools in Galaxy are clearly designed for sequence data,
because in those tools we can only do column-wise statistics and
graphics. The query tools in Galaxy are limited to simple queries,
where a combinatorial query such as “protein AND receptor” cannot be
effectively performed. Again, this is not a drawback for Galaxy if
our starting point is a set of “gene identifiers” and our goal is
“sequence analysis”. Yet, I think biologists would like to start
exploring the genomics data from a couple of words that describe the
“function” of a group of genes. It would also be useful to extract
gene expression data, and to search for correlations between
the gene expression pattern and the sequence features of those genes,
in a single platform, such as Galaxy."

Ross Hardison
