Glen Beane wrote:
On Feb 11, 2011, at 9:32 AM, Nate Coraor wrote:
Glen Beane wrote:
On Feb 9, 2011, at 9:44 AM, Glen Beane wrote:
I've been doing some testing with a Galaxy instance running on my laptop for some tools we are developing. I am uploading a file into Galaxy from a URL to use as test input (~1.5GB tabular). I can download this file to my laptop in ~30 seconds with wget, while if I pull from the same URL into Galaxy it takes about 30 minutes. I set the file type so Galaxy did not have to auto-detect.
This seems very slow considering it only takes about 30 seconds to get the file over the network and write it to disk. What is Galaxy doing that makes this file upload so slow? We also tried defining our own datatype (data, not tabular, with the thought that maybe Galaxy tried to examine tabular files), but it is still very slow. In production our input files will grow to be much larger than this (although we'll probably abandon tabular for a more compact binary format by then).
So no insight as to why a 1.5GB file takes 60 times as long to load into Galaxy via URL as it takes to download the file from the same URL outside of Galaxy? I'm assuming it has to do with detecting metadata, since changing the file type from our custom tabular type to the Galaxy tabular type causes a set metadata job that takes at least 20 minutes (I didn't time it). However, I changed our data type from tabular to "data" hoping Galaxy would just ignore the file contents, and it still takes 30 minutes to load into Galaxy.
We haven't updated to the latest galaxy-dist (it is on our todo list to sync up), but this seems like it takes much longer than it should and is a problem with the implementation.
Hi Glen,
Sorry, I haven't had a chance to address your question yet. The reason is most likely metadata as you have surmised. Do you have:
set_metadata_externally = True
set in universe_wsgi.ini?
I'm not sure. I'll check. What does this setting do?
Python has a limitation when using threads in that it's not true threading: only one thread can actually be on CPU at a time. Because detecting metadata can be very CPU-intensive, it has to contend with, and often suffers from (and blocks the operation of), other threads in the Galaxy process. set_metadata_externally = True moves the operation of detecting metadata to a separate OS process, meaning it does not contend for the same resources as Galaxy itself.
This should yield a performance increase, but I suspect the main cause of the slowness is trying to detect column types for the entire 1.5GB file. The enhancements in the newest dist release will cause it to only check the first 100,000 lines. Some metadata elements are also optional, and choosing not to set them for large files can be configured using 'max_optional_metadata_filesize' in datatypes_conf.xml. This also requires the latest stable distribution.
--nate
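To illustrate the difference between running CPU-bound work in a thread and moving it to a separate OS process, here is a minimal sketch. It is not Galaxy's code: the function names (cpu_bound, run_in_thread, run_in_subprocess) are hypothetical, and the sum-of-squares loop is only a stand-in for metadata detection.

```python
import subprocess
import sys
import threading


def cpu_bound(n):
    """Stand-in for a CPU-heavy task such as metadata detection."""
    return sum(i * i for i in range(n))


def run_in_thread(n):
    """Run the work in a thread of the current process.

    In CPython the thread must hold the interpreter lock while it
    executes Python bytecode, so it competes with every other thread
    in the same process for CPU time.
    """
    box = []
    worker = threading.Thread(target=lambda: box.append(cpu_bound(n)))
    worker.start()
    worker.join()
    return box[0]


def run_in_subprocess(n):
    """Run the same work in a separate Python interpreter.

    The child is its own OS process with its own interpreter lock,
    so it does not block threads in the parent.
    """
    code = "print(sum(i * i for i in range(%d)))" % n
    done = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(done.stdout)


if __name__ == "__main__":
    n = 2_000_000
    print(run_in_thread(n) == run_in_subprocess(n))
```

Both calls compute the same value; the difference is only in where the CPU time is spent. The subprocess variant pays a start-up cost for the new interpreter, which is negligible next to a job that runs for minutes.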
Also, there are some recent changes in the newest dist release which limit the number of lines checked for metadata that should make this process much faster.
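As a rough sketch of what limiting the number of lines checked can look like, the snippet below guesses column types for a tab-separated file while reading at most max_lines lines. This is an illustration under assumptions, not Galaxy's implementation: guess_type, sniff_column_types, and the int/float/str type names are all hypothetical.

```python
import io
from itertools import islice

# Order from most to least specific; a column keeps the least
# specific type seen in any of its sampled values.
RANK = {"int": 0, "float": 1, "str": 2}


def guess_type(value):
    """Return 'int', 'float', or 'str' for a single field."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "str"


def sniff_column_types(handle, max_lines=100000):
    """Guess column types from at most max_lines lines of a tabular file.

    Lines beyond the limit are never read, so the cost is bounded
    regardless of how large the file is.
    """
    types = []
    for line in islice(handle, max_lines):
        if not line.strip():
            continue  # skip blank lines
        for i, field in enumerate(line.rstrip("\r\n").split("\t")):
            guessed = guess_type(field)
            if i >= len(types):
                types.append(guessed)
            elif RANK[guessed] > RANK[types[i]]:
                types[i] = guessed
    return types


if __name__ == "__main__":
    sample = io.StringIO("1\t2.5\tabc\n2\t3\tdef\nx\t4\t5\n")
    print(sniff_column_types(sample, max_lines=2))
```

The trade-off is that a value past the sampled region (the "x" in the third line above) can contradict the guessed type, so the result is a best guess for the sample rather than a guarantee for the whole file.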
Thanks, we'll try to update our test Galaxy instance to the newest dist release to see if that helps.
-- Glen L. Beane Software Engineer The Jackson Laboratory Phone (207) 288-6153
_______________________________________________ galaxy-dev mailing list galaxy-dev@lists.bx.psu.edu http://lists.bx.psu.edu/listinfo/galaxy-dev