Hi
Over the last few days, I have encountered some very bizarre behavior
of the internal Galaxy job scheduler. It all started with using the API
to generate Data Libraries:
Instead of generating individual Data Libraries when requested, I
decided to make available all HiSEQ and MiSEQ data produced in our
institute over the last two years.
I used the following call from BioBlend:
upload_from_galaxy_filesystem(library_id, filesystem_paths,
folder[0]["id"], type, dbkey='?', link_data_only='link_to_files', roles='')
to link ~19000 (fastq and metadata) files into several sub-folders of
either the 'HiSEQ' or the 'MiSEQ' Data Library. That all worked very
well - btw a big Thank You to all the BioBlend developers!
Using the "Data libraries Beta" page, I could nicely follow how my
script was working through all the files.
Unfortunately, I realized too late that, although the files were
showing up correctly (i.e. with the right path to the original file) in
the "Data libraries Beta" page, the actual 'upload' jobs had not
finished. So, when my script was done, I ended up with ~16000
unfinished jobs waiting in the queue.
We use the internal scheduler, and job_conf.xml was set to
<limit type="registered_user_concurrent_jobs">2</limit>. At the
beginning, the 'upload' jobs ran one after the other. However, the more
jobs were in the queue, the longer the gap became before the next two
jobs were started. At its height, two jobs were started only every
~60 minutes.
During that hour, nothing happened and no job was set to "running". Even
if someone else was using the Galaxy server, there was a wait of an hour
for that job to be executed. Luckily I did all this on our development
server, so no actual user was affected.
I changed the settings in the job_conf.xml file to allow 100 jobs per
user with a total of 105 concurrent jobs, and restarted the server.
After that, 100 'upload' jobs were executed every hour, but again with
a gap of about 60 minutes in between during which nothing happened.
I also tried toggling the 'cache_user_job_count' setting
("True"/"False"), but that didn't change anything.
With 100 jobs executed every hour, the queue eventually became smaller
and smaller. At about 5000 jobs to go, the gap was down to ~30 minutes,
at about 2000 jobs to go it was about 10 minutes, and eventually it
went down to zero again.
Has anyone else seen such behavior before?
Thank you very much for any help or suggestions.
Regards, Hans-Rudolf
PS: I am now modifying the script to add a call to the database that
checks whether all jobs have finished before making the call to upload
more files to the Data Libraries.
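
A minimal sketch of the throttling I have in mind, in Python. The names
throttled_upload, upload_batch and count_unfinished_jobs are
hypothetical and only for illustration: upload_batch would wrap the
BioBlend call above for one chunk of paths, and count_unfinished_jobs
would wrap whatever query reports the number of jobs still waiting or
running. The batch size and polling interval are arbitrary assumptions.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batched(paths: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])


def throttled_upload(
    paths: Sequence[str],
    upload_batch: Callable[[List[str]], object],
    count_unfinished_jobs: Callable[[], int],
    batch_size: int = 100,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], object] = time.sleep,
) -> int:
    """Submit paths in chunks, waiting for the queue to drain in between.

    upload_batch and count_unfinished_jobs are placeholders supplied by
    the caller; nothing here talks to Galaxy directly.
    Returns the number of paths handed to upload_batch.
    """
    submitted = 0
    for chunk in batched(paths, batch_size):
        # Do not add more work while earlier jobs are still unfinished.
        while count_unfinished_jobs() > 0:
            sleep(poll_seconds)
        upload_batch(chunk)
        submitted += len(chunk)
    return submitted
```

The idea is simply to keep the number of queued jobs small at any one
time, since the gaps only appeared once the queue had grown very long.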
--
Hans-Rudolf Hotz, PhD
Bioinformatics Support
Friedrich Miescher Institute for Biomedical Research
Maulbeerstrasse 66
4058 Basel/Switzerland