Galaxy job scheduler slows down when too many jobs are in the queue
Hi

Over the last few days, I have encountered a very bizarre behavior of the internal Galaxy job scheduler.

It all started with using the API to generate Data Libraries: instead of generating individual Data Libraries on request, I decided to make all HiSEQ and MiSEQ data produced in our institute over the last two years available. I used the following BioBlend call

upload_from_galaxy_filesystem(library_id, filesystem_paths, folder[0]["id"], type, dbkey='?', link_data_only='link_to_files', roles='')

to link ~19000 (fastq and metadata) files into several sub-folders of either the 'HiSEQ' or the 'MiSEQ' Data Library. That all worked very well - btw a big thank you to all the BioBlend developers! Using the "Data libraries Beta" page I could nicely follow how my script was working through all the files.

Unfortunately, I realized too late that although the files showed up correctly in the "Data libraries Beta" page (i.e. with the right path to the original file), the actual 'upload' jobs had not finished. So, when my script was done, I ended up with ~16000 unfinished jobs waiting in the queue.

We use the internal scheduler, and job_conf.xml was set to <limit type="registered_user_concurrent_jobs">2</limit>. At the beginning, the 'upload' jobs were running one after the other. However, the more jobs there were in the queue, the longer the delay before the next two jobs were started. At the peak, two jobs were started only every ~60 minutes. During that hour nothing happened and no job was set to "running". Even if someone else was using the Galaxy server, their job had to wait an hour before being executed. Luckily I did all this on our development server, so no actual user was affected.

I then changed job_conf.xml to allow 100 jobs per user with a total of 105 concurrent jobs and restarted the server. Now 100 'upload' jobs were executed every hour, but again there were about 60 minutes in between when nothing happened. I also played with the 'cache_user_job_count' setting ("True"/"False"), but that didn't change anything.

With 100 jobs executed every hour, the queue eventually became smaller and smaller. At about 5000 jobs to go, the gap shrank to ~30 minutes; at about 2000 jobs to go, the waiting time was about 10 minutes; and eventually it went down to zero again.

Has anyone else seen such behavior before?

Thank you very much for any help or suggestions.

Regards, Hans-Rudolf

PS: I am now modifying the script to check the database for whether all previous jobs have finished before making the next call to upload more files to the Data Libraries.

--
Hans-Rudolf Hotz, PhD
Bioinformatics Support
Friedrich Miescher Institute for Biomedical Research
Maulbeerstrasse 66
4058 Basel/Switzerland
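For reference, here is a minimal, self-contained sketch of the kind of BioBlend call described above. It is not the original script: the server URL, API key, library name, folder path, file paths and file type are placeholders, and uploading from the server's filesystem normally requires an admin API key and the 'allow_library_path_paste' option enabled in the Galaxy configuration.

from bioblend.galaxy import GalaxyInstance

# Placeholder URL and key -- use your own Galaxy server and an admin API key.
gi = GalaxyInstance(url='http://galaxy.example.org', key='ADMIN_API_KEY')

# Look up the target Data Library and one of its sub-folders.
library = gi.libraries.get_libraries(name='HiSEQ')[0]
folder = gi.libraries.get_folders(library['id'], name='/run_2014_001')

# Newline-separated paths that must already exist on the Galaxy server's
# filesystem; link_data_only='link_to_files' symlinks them instead of copying.
filesystem_paths = '\n'.join([
    '/data/hiseq/run_2014_001/sample_1.fastq',
    '/data/hiseq/run_2014_001/sample_2.fastq',
])

gi.libraries.upload_from_galaxy_filesystem(
    library['id'],
    filesystem_paths,
    folder[0]['id'],
    'fastqsanger',            # the 'type' argument in the call quoted above
    dbkey='?',
    link_data_only='link_to_files',
    roles='')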
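The check mentioned in the PS can also be done without querying the database directly. Below is a rough sketch, assuming BioBlend's jobs client (gi.jobs.get_jobs() lists the current user's jobs); the set of unfinished states, the threshold and the polling interval are illustrative values, not taken from the original post.

import time
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url='http://galaxy.example.org', key='YOUR_API_KEY')

def wait_for_uploads(max_unfinished=0, poll_seconds=60):
    # Block until at most 'max_unfinished' of this user's jobs are still
    # in a non-terminal state, then return so the next batch can be sent.
    unfinished_states = ('new', 'upload', 'waiting', 'queued', 'running')
    while True:
        jobs = gi.jobs.get_jobs()
        pending = [j for j in jobs if j['state'] in unfinished_states]
        if len(pending) <= max_unfinished:
            return
        time.sleep(poll_seconds)

# Example: call this between batches of upload_from_galaxy_filesystem()
# calls, so the queue never fills up with thousands of pending 'upload' jobs.
# wait_for_uploads(max_unfinished=100, poll_seconds=300)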