The first problem I met was that I could not find a place in the UCSC Table Browser to retrieve genes by annotation. I think it is reasonable for a biologist to ask questions such as: what are the annotated G-protein-coupled receptors or MAP kinases in the human genome? I went through a number of table schemas in the UCSC Table Browser, and there are no annotations in those tables. The lack of gene annotations in the UCSC Table Browser may reflect the scope chosen by the UCSC designers and is not related to Galaxy, so I will not discuss this issue further. I went to the official GNF Atlas website, downloaded the annotated probe file, and extracted all the genes annotated as G-protein-coupled receptors. There are 385 such annotated probesets in the publicly available annotation file (found with custom Python code that searched for annotations containing both “coupled” and “receptor”). I thought it might be possible to do this operation in Galaxy, but I found that the “Filter, sort, join and compare: Select” tool offers no combinatorial pattern search. It would be helpful to add a place where people with more expertise can customize their query commands and save them. In other words, people who already know a little Perl or Python may want to write a short query in one of those languages.
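
As a rough illustration of the kind of short query meant here, the Python sketch below keeps only the annotation lines that contain every one of a set of terms (an AND search). It is not the code used for the search described above. The function names are hypothetical, and the probe identifiers in the sample records are made up for the example.

```python
def matches_all(text, terms):
    """Return True if every term occurs in text (case-insensitive)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def select_lines(lines, terms):
    """Keep the lines that contain all of the given terms."""
    return [line.rstrip("\n") for line in lines if matches_all(line, terms)]


# Illustrative tab-delimited records: probe ID, gene symbol, description.
# The probe IDs are invented and do not come from the GNF Atlas file.
sample = [
    "1000_at\tMAPK3\tmitogen-activated protein kinase 3\n",
    "1001_at\tADRB2\tadrenergic beta-2 receptor, G protein-coupled\n",
    "1002_at\tINSR\tinsulin receptor\n",
    "1003_at\tGPR15\tG protein-coupled receptor 15\n",
]

# "coupled" AND "receptor": the kinase and the insulin receptor are dropped.
hits = select_lines(sample, ["coupled", "receptor"])
for hit in hits:
    print(hit)
```

With a real annotation file, the `sample` list would be replaced by an open file handle, since `select_lines` accepts any iterable of lines.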

After obtaining the list of GPCRs, I tried to use Galaxy to retrieve data from the UCSC Table Browser. I clicked on GET DATA → UCSC main → Execute, which brought up a proxy page in the middle panel of the Galaxy interface. I found at least three functions on this proxy that did not work properly. First, the “describe table schema” button did not work. Second, when I clicked on “upload list”, an upload interface appeared in the middle panel, but when I clicked “submit” after specifying the file name, the middle panel showed “This is a proxy to the data services provided by the UCSC Genome Browser's Table Browser.” and no data were received by Galaxy. Third, when I specified a file name in the “output file: (leave blank to keep output in browser)” field, I could not find the output file anywhere. Clearly there is some miscommunication between Galaxy and the UCSC browser. It may be useful to hide these functions in Galaxy to avoid frustration. What I finally did was to use “PASTE LIST” with all the probe names I had and retrieve the data from the UCSC browser. The resulting file in Galaxy is the same as the one I downloaded directly from the UCSC browser website using the same set of 385 probeset names. In total, 405 expression records were extracted from the UCSC Table Browser.

The gene expression data were in column 16 of the output file. I applied “cut columns from a table”, then “Remove beginning of a file”, and then “convert delimiters to tab”. These operations gave me a tab-delimited file of expression data for the 405 probesets (genes). I found that the data browsers of Galaxy show at most 27 columns; beyond that, there was no straightforward way to find out how many columns the data file contained. Although Galaxy has a statistical tool, it can only calculate “correlation for numeric COLUMNS”, and there is no way to transpose the data matrix. Therefore, I could not find a way to reach my goal entirely within Galaxy.
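
For comparison, the missing steps are small when done outside Galaxy. The Python sketch below counts the columns of a matrix, transposes it, and computes a Pearson correlation between two rows (probesets) rather than two columns. It assumes the expression values have already been parsed into a rectangular list of lists. The function names are hypothetical and the numbers are made up for the example.

```python
import math


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: 3 probesets (rows) measured in 4 tissues (columns).
expression = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]

n_columns = len(expression[0])      # number of columns in the matrix
by_tissue = transpose(expression)   # now 4 rows (tissues) x 3 columns

# Row-wise correlation between probesets, with no transposition needed.
r_01 = pearson(expression[0], expression[1])
r_02 = pearson(expression[0], expression[2])
print(n_columns, r_01, r_02)
```

After transposition, a tool that only offers column-wise correlation would compare probesets instead of tissues, which is why a transpose step alone would have been enough for my purpose.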

I have learned one important lesson from using Galaxy: the diversity of genomics data is a great challenge for people working in genomic research and for any software. Clearly, Galaxy is a sequence-analysis-centric environment, and I am sure it can do a great job in sequence analysis. However, when we are looking at other types of genomics data, for example gene expression data from microarrays (or mass-spec spectra from proteomics), Galaxy may not be a suitable tool. Microarray data usually come as an M-by-N matrix of numerical values, with more than 10,000 rows and hundreds of columns. The Galaxy history frame is not suited to displaying such data. The statistical and graphing tools in Galaxy are clearly designed for sequence data, because they can only do column-wise statistics and graphics. The query tools in Galaxy are limited to simple queries, so a combinatorial query such as “protein AND receptor” cannot be performed effectively. Again, this is not a drawback of Galaxy if our starting point is a set of “gene identifiers” and our goal is “sequence analysis”. Yet I think biologists would like to start exploring genomics data from a couple of words that describe the “function” of a group of genes. It would also be useful to be able to extract gene expression data and to search for correlations between gene expression patterns and the sequence features of those genes in a single platform such as Galaxy.