Thanks for looking into this, John.  I'm pretty sure there are multiple causes of stuck jobs.  (In fact, I'm running into another that I'll discuss separately, likely in a Trello ticket.)  But in my most recent run-in with this issue, the problem turned out to be caused by datasets that were marked as deleted in the dataset table but not marked as deleted in the history dataset association table, and thus were still being used as inputs to jobs.

The following query fixed the stuck jobs:

update dataset
set deleted = 'f',
    purgable = 'f'
where id in
    (select distinct(d.id)
     from dataset d
     join history_dataset_association hda on d.id = hda.dataset_id
     join job_to_input_dataset jtid on hda.id = jtid.dataset_id
     join job j on jtid.job_id = j.id
     where d.deleted = 't' and hda.deleted = 'f' and j.state = 'new');
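(For anyone trying this, it may be worth previewing the affected rows first.  A read-only query along these lines, using the same tables and flags as the update above, lists the stuck jobs and their inconsistent datasets without changing anything:)

```sql
-- Preview which 'new' jobs are blocked by datasets that are marked
-- deleted in the dataset table but not in the history dataset
-- association table.  Read-only; run before the UPDATE above.
select j.id as job_id, hda.id as hda_id, d.id as dataset_id
from dataset d
join history_dataset_association hda on d.id = hda.dataset_id
join job_to_input_dataset jtid on hda.id = jtid.dataset_id
join job j on jtid.job_id = j.id
where d.deleted = 't' and hda.deleted = 'f' and j.state = 'new';
```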

However, there seem to be many instances where I find datasets marked as deleted but where the history dataset association is not marked as deleted.  I'm wary of updating them all without knowing how they got set this way (or even whether this is sometimes an appropriate state).  Is this ever a valid state for a dataset?
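(To gauge how widespread the inconsistency is before touching anything, a read-only count along these lines should work; it uses the same tables and flags as the query above, so adjust if your schema version differs:)

```sql
-- Count datasets marked deleted whose history dataset association is
-- not marked deleted (the possibly-inconsistent state described above).
select count(distinct d.id)
from dataset d
join history_dataset_association hda on d.id = hda.dataset_id
where d.deleted = 't' and hda.deleted = 'f';
```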

BTW: There is also a discussion at https://biostar.usegalaxy.org/p/9608/ about this.

Lance


John Chilton wrote:
Hello Lance,

  I cannot think of a good way to rescue these jobs. If you are
curious about the code where jobs are selected for execution - I would
check out the job handler (lib/galaxy/jobs/handler.py) - see
__monitor_step for instance.

  It seems like, to prevent this from happening in the future, we
should only allow copying datasets from libraries into histories if
the library dataset is in an 'OK' state
(https://trello.com/c/0vxbP4El).

-John

On Thu, Nov 6, 2014 at 11:13 AM, Lance Parsons <lparsons@princeton.edu> wrote:
I've run into this same issue again (just with some other Data Library
datasets).  This time, there are a few users involved with quite a few
"stuck" jobs.  Does anyone have any advice on pushing these jobs through?
Maybe even a pointer to the relevant code?  I'm running latest_2014.08.11.
Thanks in advance.

Lance


Lance Parsons wrote:

Thanks, that was the first thing I checked.  However, restarting the handler
didn't help.  Downloading the offending data, re-uploading it as a new
dataset, and then rerunning with the new dataset as input did work.  Also,
all other jobs continued to run fine.

Lance

Kandalaft, Iyad wrote:

I’ve had jobs get stuck in the new state when one of the handler servers
crashes.  If you have dedicated handlers, check to make sure they are still
running.

Restart the handler to see if the jobs get resumed automatically.

Iyad Kandalaft

From: galaxy-dev-bounces@lists.bx.psu.edu
[mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Aaron Petkau
Sent: Wednesday, October 01, 2014 5:32 PM
To: Lance Parsons
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] Jobs stuck in "new" state - Data Library datasets
to blame?

Are you attempting to upload datasets to a Data Library, and then copy to a
history and run jobs on them right away?  I've run into issues before where
if I attempt to run a job on a dataset in a library before it is finished
being uploaded and processed, then the job gets stuck in a queued state and
never executes.

Aaron

On Wed, Oct 1, 2014 at 2:51 PM, Lance Parsons <lparsons@princeton.edu>
wrote:

Recently, I updated our Galaxy instance to use two processes (one for web,
the other as a job handler).  This has been working well, except in a few
cases.  I've noticed that a number of jobs get stuck in the "new" status.

In a number of cases, I've resolved the issue by downloading and uploading
one of the input files and rerunning the job using the newly uploaded file.
In at least one of these cases, the offending input file was one that was
copied from a Data Library.

Can anyone point me to something to look for in the database, etc. that
would cause a job to think a dataset was not ready for use as a job input?
I'd very much like to fix these datasets since having to re-upload data
libraries would be very tedious.

Thanks in advance.

--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University

___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
 http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
 http://galaxyproject.org/search/mailinglists/





