SGE & PBS job dependency
I am looking at the code inside lib/galaxy/jobs. It appears that dependency tracking between jobs is done entirely by Galaxy. In other words, the option in e.g. SGE to submit a job with a dependency on previously submitted jobs is not used. Galaxy simply waits until the jobs it depends on have finished, and then submits the dependent job. This seems inefficient to me on a busy cluster. E.g. if a user has one "reducer" job in a workflow that depends on 100 "mapper" jobs, and all are submitted at once with an SGE dependency of the reducer on the mappers, then the reducer can be at the top of the SGE queue by the time all mappers have finished, and run immediately. In the current Galaxy implementation, the reducer will only be submitted to the SGE queue after all mappers have finished, and will have to wait for all the jobs in front of it (assuming a simple FIFO policy on the SGE side). Why is such a design necessary?
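For concreteness, the scheduler-side dependency described above could be requested roughly as follows. This is a minimal sketch, not Galaxy code: the helper names `sge_submit_args` and `pbs_submit_args` and the script name are hypothetical, and the only scheduler features assumed are the `-hold_jid` option of SGE's `qsub` and the `-W depend=afterok:...` option of PBS/Torque's `qsub`.

```python
def sge_submit_args(script, hold_job_ids=()):
    """Build an SGE qsub command line (hypothetical helper).

    -hold_jid takes a comma-separated list of job ids; the new job is
    held until all of them have left the queue.
    """
    args = ["qsub"]
    if hold_job_ids:
        args += ["-hold_jid", ",".join(str(j) for j in hold_job_ids)]
    args.append(script)
    return args


def pbs_submit_args(script, after_ok_job_ids=()):
    """Build a PBS/Torque qsub command line (hypothetical helper).

    depend=afterok lists job ids separated by colons; the new job becomes
    eligible only after all of them have finished successfully.
    """
    args = ["qsub"]
    if after_ok_job_ids:
        spec = "depend=afterok:" + ":".join(str(j) for j in after_ok_job_ids)
        args += ["-W", spec]
    args.append(script)
    return args


# The reducer is queued immediately but held until the mappers are done.
mapper_ids = [101, 102, 103]  # ids returned by earlier submissions (made up here)
reducer_sge = sge_submit_args("reduce.sh", mapper_ids)
reducer_pbs = pbs_submit_args("reduce.sh", mapper_ids)
```

One difference worth keeping in mind: `-hold_jid` releases the held job once its predecessors are gone from the queue whatever their exit status, whereas `afterok` requires them to succeed, so the two are not exact equivalents.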
Andrey,

The design works this way because a single Galaxy instance can submit to multiple clusters / job runners. Thus, Galaxy needs to manage dependencies itself when they cross job runners.

We are also trying to avoid Galaxy jobs being so fine-grained that you would have a 100 mapper / 1 reducer situation. A Galaxy job is supposed to be a fairly high-level unit of analysis. We are actively working on functionality to allow one Galaxy job to actually produce a graph of many smaller cluster jobs, which could then be submitted as a unit to the job queue (or a more sophisticated workflow planning engine) to be scheduled.

We would welcome an enhancement that allows Galaxy to delegate dependencies to the underlying job queue when it can be determined that all jobs belong to the same runner.

Thanks,
James

On Jul 1, 2010, at 7:17 PM, Andrey Tovchigrechko wrote:
> Why is such a design necessary?
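The check behind the enhancement mentioned in the reply (delegate only when all jobs belong to the same runner) might look something like the sketch below. Everything here is illustrative: the `Job` class, its fields, and `can_delegate` are hypothetical names and do not correspond to anything in lib/galaxy/jobs.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a Galaxy job record."""
    job_id: str
    runner: str                      # e.g. "sge" or "pbs"
    depends_on: list = field(default_factory=list)


def can_delegate(job):
    """Return True if the job's dependencies can be handed to its runner.

    Delegation only makes sense when the job has dependencies and every
    one of them was submitted through the same runner; otherwise the
    scheduler cannot see them and Galaxy has to do the waiting itself.
    """
    return bool(job.depends_on) and all(
        dep.runner == job.runner for dep in job.depends_on
    )


mappers = [Job(f"map{i}", "sge") for i in range(3)]
same_runner = Job("reduce", "sge", depends_on=mappers)
mixed_runner = Job("reduce", "sge", depends_on=mappers + [Job("extra", "pbs")])

delegate_same = can_delegate(same_runner)    # all dependencies on "sge"
delegate_mixed = can_delegate(mixed_runner)  # one dependency on "pbs"
```

When the check fails, the existing behaviour (Galaxy waits for the dependencies and then submits) stays as the fallback, which keeps the cross-runner case working unchanged.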
participants (2)
- Andrey Tovchigrechko
- James Taylor