This student was more adventurous. I think he actually could do more
of what he tried with more experience (if he wants to get GNF data he
should use that table), but there is some good feedback.
I do agree with one comment - that we should mask out buttons that
don't work on the Table Browser proxy.
"My goal is to use Galaxy to retrieve gene expression data from GNF
human gene expression atlas through the UCSC genome table browser. If
possible, I will do a gene co-expression analysis for a group of
genes that are annotated as G-protein coupled receptors (GPCRs).
The first problem I met was that I could not find a place in the UCSC
genome table browser to retrieve genes with annotations. I think it
is reasonable for a biologist to ask questions like: what are the
annotated G-protein-coupled receptors or MAP kinases in the human
genome? I went through a number of table schemas in the UCSC genome
table browser, but there are no annotations in those tables. The lack
of gene annotations in the UCSC table browser may reflect the scope
chosen by the UCSC designers and is not related to Galaxy, so I will
not discuss this issue further. I went to the official GNF atlas
website, downloaded the annotated probe file, and extracted all the
genes annotated as G protein coupled receptors. There are 385 such
annotated probesets in total in the publicly available annotation
file (found with custom Python code that searched for annotations
containing both “coupled” and “receptor”). I thought it might be
possible to do this operation in Galaxy, but I found that the
“Filter, sort, join and compare: Select” tool has no combinatorial
pattern search. It would be helpful to add a place where people with
more expertise can customize their query commands and save them. In
other words, if people already know a little Perl or Python, it may
be desirable for them to be able to enter a short query in those
programming languages.
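
The kind of combinatorial (AND) search described above could look
roughly like the following Python sketch. It is not the student's
actual script, which is not shown. It assumes a tab-delimited
annotation file with the probeset ID in the first column and a
free-text description in the second. The function name
rows_matching_all and the sample probe IDs are hypothetical, chosen
only for illustration.

```python
import csv
import io


def rows_matching_all(lines, keywords, column=None, delimiter="\t"):
    """Yield rows whose text contains every keyword (case-insensitive AND).

    If `column` is None the whole row is searched; otherwise only that
    zero-based column is searched.
    """
    wanted = [k.lower() for k in keywords]
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if column is None:
            text = " ".join(row)
        elif column < len(row):
            text = row[column]
        else:
            continue  # row is too short to have the requested column
        text = text.lower()
        if all(k in text for k in wanted):
            yield row


# Made-up sample data standing in for the annotation file (assumption).
sample = io.StringIO(
    "1001_at\tG protein-coupled receptor 12\n"
    "1002_at\tmitogen-activated protein kinase 1\n"
    "1003_at\tglutamate receptor, ionotropic\n"
    "1004_at\tCoupled RECEPTOR test entry\n"
)

hits = list(rows_matching_all(sample, ["coupled", "receptor"], column=1))
probe_ids = [row[0] for row in hits]
print(probe_ids)
```

Only rows containing both words are kept; a row that mentions
“receptor” without “coupled” is dropped, which is the behaviour a
single-pattern Select does not give directly.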
After obtaining the list of GPCRs, I tried to use Galaxy to retrieve
data from the UCSC table browser. I clicked on GET DATA → UCSC main →
Execute, which brought up a proxy page in the middle panel of the
Galaxy interface. I found at least three functions on this proxy that
did not work properly. First, the “describe table schema” button did
not work. Second, when I clicked on “upload list”, an upload
interface appeared in the middle panel. When I clicked “submit” after
specifying the file name, the middle panel showed “This is a proxy
to the data services provided by the UCSC Genome Browser's Table
Browser.” and no data were received by Galaxy. Third, when I
specified a file name in the “output file: (leave blank to keep
output in browser)” field, I could not find the output file
anywhere. Clearly there is some “mis-communication” between Galaxy
and the UCSC browser. It may be useful to “mask” these functions in
Galaxy to avoid frustration. What I finally did was to “PASTE LIST”
all the probe names I had and retrieve the data from the UCSC
browser. The resulting file in Galaxy is the same as the one I
downloaded directly from the UCSC browser website using the same set
of 385 probeset names. In total, 405 expression records were
extracted from the UCSC table browser.
The gene expression data was in column 16 of the output file. I
did “cut columns from a table”, then “Remove beginning of a file”,
and then “convert delimiters to tab”. These operations gave me a
file with tab-delimited expression data for 405 probesets (genes). I
found that in the data browsers of Galaxy the maximum number of
columns is 27; above that, there was no straightforward way of
finding out how many columns were in the data file. Although there is
a statistical tool in Galaxy, it can only calculate “correlation for
numeric COLUMNS”, and there is no way to transpose the data matrix.
Therefore, I could not find a way to finish my goal completely in
Galaxy.
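
For reference, the missing steps (counting columns, transposing the
matrix, and correlating probesets rather than columns) can be
sketched in a few lines of plain Python. This is only an
illustration under stated assumptions: the input is taken to be
tab-delimited numeric text with one probeset per row and one tissue
per column, the helper names read_matrix, transpose and pearson are
hypothetical, and the numbers are made up.

```python
import math


def read_matrix(lines, delimiter="\t"):
    """Parse delimited numeric text into a list of rows of floats."""
    return [
        [float(x) for x in line.rstrip("\n").split(delimiter)]
        for line in lines
        if line.strip()
    ]


def transpose(matrix):
    """Swap rows and columns, so row-wise data becomes column-wise."""
    return [list(col) for col in zip(*matrix)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Made-up data: 3 probesets (rows) measured in 4 tissues (columns).
lines = [
    "1.0\t2.0\t3.0\t4.0",
    "2.0\t4.0\t6.0\t8.0",
    "4.0\t3.0\t2.0\t1.0",
]

matrix = read_matrix(lines)
n_columns = len(matrix[0])        # number of columns in the file
by_tissue = transpose(matrix)     # now one row per tissue

r_01 = pearson(matrix[0], matrix[1])  # probeset 0 vs probeset 1
r_02 = pearson(matrix[0], matrix[2])  # probeset 0 vs probeset 2
print(n_columns, len(by_tissue), r_01, r_02)
```

After transposition each probeset becomes a column, so a tool that
only computes “correlation for numeric COLUMNS” could in principle be
applied to the transposed file to get probeset-to-probeset
correlations.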
I have learned one important lesson from the experience of using
Galaxy: the diversity of GENOMICS data is a great challenge for
people working on genomic research and for any software. Clearly,
Galaxy is a sequence-analysis centric environment. I am sure Galaxy
can do a great job in sequence analysis. However, when we are looking
for other types of genomics data, for example gene expression data
from microarrays (or mass spectrometry spectra of proteome data),
Galaxy may not be a suitable tool. For example, microarray data
usually comes as an M by N matrix of numerical data, with more than
10,000 rows and hundreds of columns. The Galaxy history frame is not
suitable for displaying such data. The statistical tools and graphing
tools in Galaxy are clearly designed for sequence data, because in
those tools we can only do column-wise statistics and graphics. The
query tools in Galaxy are limited to simple queries, and
combinatorial queries such as “protein AND receptor” cannot be
performed effectively. Again, this is not a drawback for Galaxy if
our starting point is a set of “gene identifiers” and our goal is
“sequence analysis”. Yet I think biologists would like to start
exploring genomics data from a couple of words that describe the
“function” of a group of genes. It would also be useful to extract
gene expression data, and to search for correlations between the
gene expression pattern and the sequence features of those genes,
in a single platform such as Galaxy."