Thanks Nate,

I tried that commit and it seems to fix the issue.

Thanks

Derrick


On Tue, Jan 22, 2013 at 7:50 AM, Nate Coraor <nate@bx.psu.edu> wrote:
Hi all,

The commit[1] that fixes this is not in the January 11 distribution.  It'll be part of the next distribution.

--nate

[1] https://bitbucket.org/galaxy/galaxy-central/commits/c015b82b3944f967e2c859d5552c00e3e38a2da0

On Jan 21, 2013, at 3:10 PM, Anthonius deBoer wrote:

> I have seen exactly this same issue: Python just dies without any errors in the log. I am using the latest galaxy-dist.
>
> Sent from my iPhone
>
> On Jan 20, 2013, at 8:35 PM, Derrick Lin <klin938@gmail.com> wrote:
>
>> Updating to the 11 Jan 2013 dist does not help with this issue. :(
>>
>> I checked the database and had a look at the job entries that handler0 tried to stop before shutting down:
>>
>> | 3088 | 2013-01-03 14:25:38 | 2013-01-03 14:27:05 |        531 | toolshed.g2.bx.psu.edu/repos/kevyin/homer/homer_findPeaks/0.1.2            | 0.1.2        | deleted_new | Job output deleted by user before job completed. | NULL         | NULL           | NULL        | NULL   | NULL   | NULL      |       1659 | drmaa://-V -j n -R y -q intel.q/ | NULL                   |              NULL |      76 |        0 | NULL            | NULL   | handler0 |      NULL |
>> | 3091 | 2013-01-04 10:52:19 | 2013-01-07 09:14:34 |        531 | toolshed.g2.bx.psu.edu/repos/kevyin/homer/homer_findPeaks/0.1.2            | 0.1.2        | deleted_new | Job output deleted by user before job completed. | NULL         | NULL           | NULL        | NULL   | NULL   | NULL      |       1659 | drmaa://-V -j n -R y -q intel.q/ | NULL                   |              NULL |      76 |        0 | NULL            | NULL   | handler0 |      NULL |
>> | 3093 | 2013-01-07 22:02:21 | 2013-01-07 22:16:27 |        531 | toolshed.g2.bx.psu.edu/repos/kevyin/homer/homer_pos2bed/1.0.0              | 1.0.0        | deleted_new | Job output deleted by user before job completed. | NULL         | NULL           | NULL        | NULL   | NULL   | NULL      |       1749 | drmaa://-V -j n -R y -q intel.q/ | NULL                   |              NULL |      76 |        0 | NULL            | NULL   | handler0 |      NULL |
>>
>> So basically the job table has several of these entries that are assigned to handler0 and marked as "deleted_new". When handler0 comes up, it starts stopping these jobs; after the first job has been "stopped", handler0 crashes and dies, but that job is then marked as "deleted".
>>
>> I think if I manually change the job state from "deleted_new" to "deleted" in the db, handler0 will be fine. I am just concerned about how these jobs ended up in this state (assigned to a handler but marked as "deleted_new").
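>>
>> A one-off script along these lines should do it (just a sketch: the connection URL is a placeholder, and I am assuming the job table's "state" and "handler" columns match the rows above):
>>
>> # flip stuck jobs from "deleted_new" to "deleted" so that handler0
>> # no longer tries to stop them on startup
>> from sqlalchemy import create_engine, text
>>
>> engine = create_engine("postgresql://galaxy@localhost/galaxy")  # placeholder URL
>> with engine.begin() as conn:  # single transaction, commits on success
>>     conn.execute(text(
>>         "UPDATE job SET state = 'deleted' "
>>         "WHERE state = 'deleted_new' AND handler = 'handler0'"
>>     ))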
>>
>> Cheers,
>> D
>>
>>
>> On Mon, Jan 21, 2013 at 1:49 PM, Derrick Lin <klin938@gmail.com> wrote:
>> I had a close look at the code in
>>
>> galaxy-dist/lib/galaxy/jobs/handler.py
>> galaxy-dist/lib/galaxy/jobs/runners/drmaa.py
>>
>> and found that stopping "deleted" and "deleted_new" jobs seems to be a normal routine for the job handler. I could not find any exception that caused the shutdown.
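>>
>> For reference, that recovery step looks roughly like this (an illustrative sketch only, not Galaxy's actual handler.py code):
>>
>> # on startup the handler walks the jobs still assigned to it and
>> # stops any that a user deleted while the handler was down
>> def recover_deleted_jobs(jobs, runner):
>>     for job in jobs:
>>         if job.state in ("deleted", "deleted_new"):
>>             runner.stop_job(job)   # e.g. the drmaa runner kills the cluster job
>>             job.state = "deleted"  # a crash here leaves the remaining jobs stuck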
>>
>> I do notice that in galaxy-dist on Bitbucket there is one commit with the message "Fix shutdown on python >= 2.6.2 by calling setDaemon when creating threads (these are still..."; it seems relevant?
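>>
>> The pattern that commit message describes would be something like the following (a minimal sketch; the thread name and target are illustrative, not Galaxy's actual code, but setDaemon is the call named in the commit):
>>
>> import threading
>>
>> def monitor():
>>     pass  # stands in for a long-running job-monitor loop
>>
>> worker = threading.Thread(target=monitor, name="JobHandlerMonitor")
>> worker.setDaemon(True)  # daemon threads no longer block interpreter shutdown
>> worker.start()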
>>
>> I will update to the 11 Jan release and see if it fixes the issue.
>>
>> D
>>
>>
>> On Fri, Jan 18, 2013 at 4:03 PM, Derrick Lin <klin938@gmail.com> wrote:
>> Hi guys,
>>
>> We have updated our Galaxy to the 20 Dec 2012 release. Recently we found that some submitted jobs could not start (they stay gray forever).
>>
>> We found that it was caused by the job manager sending jobs to a handler (handler0) whose Python process had crashed and died.
>>
>> From the handler log we found the last messages right before the crash:
>>
>> galaxy.jobs.handler DEBUG 2013-01-18 15:00:34,481 Stopping job 3032:
>> galaxy.jobs.handler DEBUG 2013-01-18 15:00:34,481 stopping job 3032 in drmaa runner
>>
>> We restarted Galaxy; handler0 was up for a few seconds and then died again with the same log messages, except that the job number had moved on to the next one.
>>
>> We observed that the jobs it was trying to stop were all previous jobs whose status was either "deleted" or "deleted_new".
>>
>> We have never seen this in the past, so we are wondering if there is a bug in the new release?
>>
>> Cheers,
>> Derrick
>>
>>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>
>  http://lists.bx.psu.edu/