Thank you, Iyad. Indeed, setting ulimit -s to unlimited got me further.
I can now see that a job gets generated and submitted. However, Galaxy crashes immediately after that:
Job <108038> is submitted to queue <short>.
*** glibc detected *** python: free(): invalid pointer: 0x00007fff79f10b64 ***
======= Backtrace: =========
< further output is omitted >

Tracking the job through the scheduler reveals that the job finished successfully.

The command in the job script is something like this:

python /galaxy-dist/tools/data_source/upload.py /galaxy-dist /galaxy-dist/database/tmp/tmpGY5_lI /galaxy-dist/database/tmp/tmpr7VKGy         1:/galaxy-dist/database/job_working_directory/000/1/dataset_1_files:/galaxy-dist/database/files/000/dataset_1.dat

usage: upload.py <root> <datatypes_conf> <json paramfile> <output spec> ...

I cannot re-run it because only the first file in the tmp folder is there. The second (json paramfile, tmpr7VKGy) is gone. I presume dataset_1.dat is the output and it's there.
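
A minimal sketch for checking which of the paths in a job command are still on disk. The helper name check_job_paths is hypothetical, and reading the last argument as <id>:<files dir>:<dataset path> is an assumption inferred from the command above, not from upload.py's documentation.

```python
import os
import shlex


def check_job_paths(command):
    """Return a list of (path, exists) for every absolute path in the command.

    Tokens are also split on ':' so that an output spec such as
    '1:/some/dir:/some/file.dat' yields both paths (assumed format).
    """
    results = []
    for token in shlex.split(command):
        for part in token.split(":"):
            if part.startswith("/"):
                results.append((part, os.path.exists(part)))
    return results


if __name__ == "__main__":
    # The command copied from the job script quoted above.
    cmd = (
        "python /galaxy-dist/tools/data_source/upload.py /galaxy-dist "
        "/galaxy-dist/database/tmp/tmpGY5_lI /galaxy-dist/database/tmp/tmpr7VKGy "
        "1:/galaxy-dist/database/job_working_directory/000/1/dataset_1_files:"
        "/galaxy-dist/database/files/000/dataset_1.dat"
    )
    for path, exists in check_job_paths(cmd):
        print("%-8s %s" % ("ok" if exists else "MISSING", path))
```

Running this right after a crash should show at a glance whether only the json paramfile has disappeared or whether other inputs are gone too.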

The second half of the job script is the execution of set_metadata.sh.
I can execute it without issues (is this a db update?).

One significant difference between the setup that works and the one that doesn't is that the working setup sits on local disk, whereas the failing one sits on Lustre. Could that be relevant?
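
To confirm which filesystem a given Galaxy directory actually lives on, something like the sketch below could be used. It assumes a Linux host where /proc/mounts is readable. The function name filesystem_type and the example directories are illustrative only, and mount points containing escaped spaces are not handled.

```python
import os


def filesystem_type(path, mounts_text):
    """Return (mount_point, fs_type) of the deepest mount that contains path.

    mounts_text is expected in /proc/mounts format:
    '<device> <mount_point> <fs_type> <options> ...' on each line.
    """
    path = os.path.realpath(path)
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # '>=' lets a later entry for the same mount point win, since a
        # later mount shadows an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


if __name__ == "__main__":
    mounts_file = "/proc/mounts"  # Linux-specific
    if os.path.exists(mounts_file):
        with open(mounts_file) as handle:
            mounts = handle.read()
        # Example directories taken from the job command; adjust as needed.
        for p in ("/galaxy-dist/database/tmp", "/galaxy-dist/database/files"):
            print(p, filesystem_type(p, mounts))
```

A Lustre mount typically reports the type "lustre", so comparing the output for the working and the failing installation would show whether the tmp and files directories really differ in that respect.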

By the way, is there a method for removing the pending job?
When I re-run Galaxy, it promptly crashes again due to the stuck job.

When Galaxy starts, the only error that I see is this:
IOError: [Errno 2] No such file or directory: './tools/mutation/visualize.xml'
While it might be a good question why the mutation directory is not there, the error is very likely not relevant to the issue.

So I'm open to further suggestions as to how to understand what's going on.

Thank you


On 10 June 2014 19:24, Kandalaft, Iyad <Iyad.Kandalaft@agr.gc.ca> wrote:

This is just a guess, which may help you troubleshoot.

It could be that Python is reaching a stack limit: run ulimit -s and set it to a higher value if required.

I’m completely guessing here, but is it possible that DRMAA is missing a linked library on the Red Hat system? You can check with ldd.

 

Regards,

Iyad Kandalaft

 

Microbial Biodiversity Bioinformatics

Agriculture and Agri-Food Canada | Agriculture et Agroalimentaire Canada
960 Carling Ave.| 960 Ave. Carling

Ottawa, ON| Ottawa (ON) K1A 0C6

E-mail Address / Adresse courriel  Iyad.Kandalaft@agr.gc.ca
Telephone | Téléphone 613-759-1228
Facsimile | Télécopieur 613-759-1701
Teletypewriter | Téléimprimeur 613-773-2600
Government of Canada | Gouvernement du Canada