Hello,

As I wrote in my previous email, I'm trying to find the reasons for my local Galaxy's slowness.

First, I'd like to say that the public Galaxy (http://main.g2.bx.psu.edu/) is blazingly fast. I'm curious how you did it, and what the differences are between your Galaxy and mine. Is it a special PostgreSQL configuration, an Apache configuration, or Galaxy options? It is faster than mine, and I'm sure your database is substantially bigger than my local one. Any tips would be highly appreciated.

So far I have found two possible reasons for the slowness.

The minor one is the number of datasets in the current history. A history with many datasets makes tool forms slow to appear, while an empty history (or one with just a few items) lets them appear instantly. You can dismiss this as 'obvious', because each tool needs to enumerate the possible inputs for its HTML <form>, but on my local Galaxy a history with 100 items takes up to 7 seconds to display a tool's form. On the public Galaxy it takes just 3 seconds, which is why I'm curious about your optimizations. But even 3 seconds is much more than the instantaneous display you get with an empty history.

You can try this out:
History with 100 items: http://main.g2.bx.psu.edu/history/imp?id=762ca4b287ff6fdf
History with 60 items:  http://main.g2.bx.psu.edu/history/imp?id=3466aa4202c8e460
History with 1 item:    http://main.g2.bx.psu.edu/history/imp?id=e1e13f98a402aaa2

As a 'control' for this test, try clicking on a tool that doesn't require listing input datasets (e.g. "Build Custom Track" in the "Graph/Display Data" category). It displays instantly regardless of how many items are in the current history.

Please remember that 100 items is not unheard of: running a workflow with 25 steps on 4 FASTQ files gives you 100 items - a common scenario in our lab.

The bigger problem (IMHO) is the new set-metadata mechanism, especially in tabular.py, which scans the entire file.

Try (cautiously) uploading the following file:
http://cancan.cshl.edu/labmembers/gordon/files/big_fake_tabular_file.txt.gz

It is a tabular file with 700,000 lines and 150 columns. The compressed size is 89MB; uncompressed it is 300MB. Uploading it to (my) Galaxy takes less than a minute, then Galaxy runs an uncompress job which takes a couple of seconds, and then Galaxy spends thirty minutes (!!) at 100% CPU. This is one of Galaxy's threads, not some external process.
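(In case downloading the file from our server is inconvenient, a file of roughly the same shape can be generated locally. This is only a rough sketch - the field values are arbitrary placeholders, not the real file's contents; only the number of lines and columns match. Gzip the result if you also want to exercise the uncompress step.)

# Rough sketch: write a tab-delimited file with 700,000 lines and 150 columns.
# The values are arbitrary; only the shape of the file matters for this test.
out = open("big_fake_tabular_file.txt", "w")
for i in range(700000):
    fields = [str((i + c) % 1000) for c in range(150)]
    out.write("\t".join(fields) + "\n")
out.close()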
Looking at the heartbeat log, the only thread which is not waiting on a lock is this:

==============
Thread 1098774864, <Thread(Thread-2, started)>:
  File "/usr/lib/python2.5/threading.py", line 462, in __bootstrap
    self.__bootstrap_inner()
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.5/threading.py", line 446, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/jobs/runners/local.py", line 46, in run_next
    self.run_job( job_wrapper )
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/jobs/runners/local.py", line 122, in run_job
    job_wrapper.finish( stdout, stderr )
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/jobs/__init__.py", line 541, in finish
    dataset.set_meta( overwrite = False )
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/model/__init__.py", line 546, in set_meta
    return self.datatype.set_meta( self, **kwd )
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/datatypes/tabular.py", line 111, in set_meta
    column_type = guess_column_type( field )
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/datatypes/tabular.py", line 91, in guess_column_type
    if is_column_type[column_type]( column_text ):
  File "/home/gordon/temp/slowgalaxy/lib/galaxy/datatypes/tabular.py", line 80, in is_list
    return "," in column_text
==============

The other threads are in "waiter.acquire()", "sock.accept()" or "_sleep(delay)" states. The active thread loops inside tabular.py's set_meta() method.

Now, on my Galaxy, this brings the entire server to a grinding halt. It could be a misconfiguration on my part, a locking issue, or something bad with MySQL/PostgreSQL, but the fact is that once such a metadata thread starts, Galaxy stops responding almost completely. Refreshing the web page takes minutes(!), and sometimes Apache just chokes and says:

====
"Bad Gateway
The proxy server received an invalid response from an upstream server."
====

Set-metadata is also triggered when creating a new file (e.g. try "Remove beginning of file" on this big file - it will take another half an hour) and when changing file types.

Another issue with the set-metadata mechanism is that the 'job' is considered 'ok' (in the job table) while the dataset is still marked 'running' (in the 'dataset' table). My users were complaining that their jobs had been running for a very long time (telling me they *are* looking at the history and seeing the item as yellow), while I was telling them that's impossible :) because the 'galaxy reports' page says no jobs are running (and indeed no external processes are running).

If I kill Galaxy during this stage (which I did a couple of times), the dataset is left in limbo: it remains 'running' in the 'dataset' table and will never reach 'ok', even though the file content is fine and the job completed.

That's what I've found so far. Thank you for reading; I'd appreciate any suggestion or solution.

-gordon
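P.S.
One idea that might help with the tabular case (only a sketch of the approach, not a patch against the real tabular.py - the function name, the max_guess_lines limit and the reduced type set are all made up here): guess the column types from only the first few thousand lines instead of scanning the whole file. Galaxy's real guess_column_type() knows more types (e.g. the 'list' check visible in the trace above); this only distinguishes int/float/str to show the idea:

def guess_column_types(path, max_guess_lines=10000):
    """Guess a type ('int', 'float' or 'str') for every tab-separated
    column by looking at only the first max_guess_lines lines, instead
    of scanning the entire file."""
    order = {"int": 0, "float": 1, "str": 2}   # widening order

    def field_type(field):
        for caster, name in ((int, "int"), (float, "float")):
            try:
                caster(field)
                return name
            except ValueError:
                pass
        return "str"

    types = []
    f = open(path)
    try:
        for line_num, line in enumerate(f):
            if line_num >= max_guess_lines:
                break
            for i, field in enumerate(line.rstrip("\r\n").split("\t")):
                t = field_type(field)
                if i >= len(types):
                    types.append(t)
                elif order[t] > order[types[i]]:
                    types[i] = t    # widen int -> float -> str as needed
    finally:
        f.close()
    return types

On the 700,000-line file above this would look at roughly 1% of the lines. The trade-off is that a column which changes type after the first 10,000 lines would be guessed wrong, but for display purposes that seems like a reasonable price for not pegging a Galaxy thread at 100% CPU for half an hour.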