Hey Evan,

Galaxy should perhaps be able to retry submissions that fail - especially if they fail quickly - and I have created a Trello card for this here (https://trello.com/c/hxy2bcIb). Nate has added some features for job state handling plugins (https://bitbucket.org/galaxy/galaxy-central/commits/7b209e06ddb944e953d34075...) and it may be possible to write a plugin to do this today, though immediate submission failures maybe should be handled a level above this by the framework... not sure.

I am not really sure this is the appropriate solution for this particular problem though - this seems like an unfortunate interplay between your file system and your cluster manager, and it would seem that any script or platform that automates the creation and submission of jobs would potentially be subject to the same problem. Solving it in Galaxy would be an application-level solution to a system-level configuration problem, in my opinion. Have you run this problem by the systems staff? It seems like it should be possible to delay each submission by half a second or change the flushing settings of the file system.

As you mentioned - a local workaround might be to `time.sleep(1)` before `external_job_id = self.ds.runJob(jt)` in lib/galaxy/jobs/runners/drmaa.py, or before the similar line in pbs.py. Do you want to try that and let us know if it addresses the problem?

Finally, in terms of the workflow - if you rerun the failed step in the GUI you should be given the option, via a new checkbox on the tool form, to resume the workflow.

-John

On Mon, Jan 5, 2015 at 4:48 PM, Evan Bollig PhD <boll0107@umn.edu> wrote:
I get this error occasionally:
"/bin/sh: 1: /opt/galaxy/web/database/job_working_directory/000/100/galaxy_100.sh: Text file busy"
When this occurs, the step fails outright. Resubmitting the step resolves the issue and things run no problem. If this error appears early in a long workflow, I have to manually resubmit ALL dependent steps... what a pain!
Perhaps this is something the Galaxy job scheduler can look out for: flush() the system, sleep() a second or two to let the file write and close, and then rerun. That would be a more fault-tolerant way of running workflows without unnecessary human intervention.
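A rough sketch of that detect, sleep and rerun idea in Python. This is not Galaxy's job-handling code: `run_with_retry`, `run_once`, `BUSY_MARKER` and the default attempt count and delay are all hypothetical names and values chosen for illustration, and `run_once` stands in for whatever actually runs the step and reports its exit code and stderr.

```python
import time

# Hypothetical marker: the stderr text that signals the transient failure.
BUSY_MARKER = "Text file busy"


def run_with_retry(run_once, max_attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Run a step, retrying only when it fails with the transient marker.

    run_once is a zero-argument callable returning (exit_code, stderr_text).
    Returns (exit_code, stderr_text, attempts_used) for the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        exit_code, stderr_text = run_once()
        transient = exit_code != 0 and BUSY_MARKER in stderr_text
        # Stop on success, on any other kind of failure, or when out of attempts.
        if not transient or attempt == max_attempts:
            return exit_code, stderr_text, attempt
        # Give the file system time to finish writing and closing the script.
        sleep(delay_seconds)
```

Only failures that carry the marker are retried, so a genuine tool error would still fail on the first attempt rather than being rerun.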
Cheers,
-Evan Bollig
Research Associate | Application Developer | User Support Consultant
Minnesota Supercomputing Institute
599 Walter Library
612 624 1447
evan@msi.umn.edu
boll0107@umn.edu