Hello Nate,

I had that parameter set to 1, but I upped it to 5. I also added -noac to the NFS mounts for /nextgen3.

That appears to have fixed it.

Thank you!!!


On 3/6/14, 1:57 PM, Nate Coraor wrote:
Hi Pete,

I'd suggest setting retry_job_output_collection > 0 in universe_wsgi.ini. This is usually a symptom of attribute caching on network filesystems.
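
To illustrate the idea behind retrying: with attribute caching, an NFS client can briefly report a freshly written file as missing, so a short retry loop is often enough to bridge the gap. The following is a minimal sketch of that pattern only, not Galaxy's actual implementation; the function name `read_with_retries` and its `retries` and `delay` parameters are hypothetical.

```python
import time

def read_with_retries(path, retries=5, delay=1.0):
    """Read `path`, retrying while the file is not yet visible.

    Makes one initial attempt plus up to `retries` further attempts,
    sleeping `delay` seconds between attempts. (Hypothetical helper.)
    """
    for attempt in range(retries + 1):
        try:
            with open(path) as handle:
                return handle.read()
        except FileNotFoundError:
            if attempt == retries:
                raise  # still missing after the last retry
            time.sleep(delay)
```

With `retries=0` this behaves like a plain `open()`, which would be the failure mode when no retrying is configured.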

--nate


On Wed, Mar 5, 2014 at 8:06 PM, Pete Schmitt <Peter.R.Schmitt@dartmouth.edu> wrote:


While trying something simple in Galaxy, I downloaded data from UCSC Main. The data gets downloaded, but the job errors out. I verified that the job actually ran and completed successfully according to the scheduler, but I get errors like this:

galaxy.jobs.runners.drmaa DEBUG 2014-03-05 18:17:35,941 (624/46.dirigo.mdibl.org) state change: job finished normally
galaxy.jobs.runners ERROR 2014-03-05 18:17:36,060 (624/46.dirigo.mdibl.org) Job output not returned from cluster: [Errno 2] No such file or directory: '/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/624/galaxy_624.o'

There are no directories being created below the 000 directory.   I verified that the directory tree is owned by galaxy and that the galaxy user can run jobs from the command line as a normal user.

I set the parameter "cleanup_job = never". It had been set to "always", which is probably why the files were never there. Now the files are there, including the galaxy_###.o file, but Galaxy still reports the same error as above.

I had set the parameter "cluster_files_directory = database/pbs", but that doesn't seem to work any longer.  The .o and .e files used to end up there.

Here is an example:

(galaxyvenv)[galaxy@dirigo 630]$ ll
total 16
-rw------- 1 galaxy galaxy    0 Mar  5 19:29 galaxy_630.e
-rw-rw-r-- 1 galaxy galaxy    2 Mar  5 19:29 galaxy_630.ec
-rw------- 1 galaxy galaxy  940 Mar  5 19:29 galaxy_630.o
-rwxr-xr-x 1 galaxy galaxy 2429 Mar  5 19:29 galaxy_630.sh
-rw-rw-r-- 1 galaxy galaxy  138 Mar  5 19:29 galaxy.json
-rw-rw-r-- 1 galaxy galaxy 2139 Mar  5 19:29 metadata_in_HistoryDatasetAssociation_1182_o830e3
-rw-rw-r-- 1 galaxy galaxy   20 Mar  5 19:29 metadata_kwds_HistoryDatasetAssociation_1182_hOhPp7
-rw-rw-r-- 1 galaxy galaxy   55 Mar  5 19:29 metadata_out_HistoryDatasetAssociation_1182_Ynb70M
-rw-rw-r-- 1 galaxy galaxy    2 Mar  5 19:29 metadata_override_HistoryDatasetAssociation_1182_HsMljG
-rw-rw-r-- 1 galaxy galaxy   44 Mar  5 19:29 metadata_results_HistoryDatasetAssociation_1182_LxdsAZ
(galaxyvenv)[galaxy@dirigo 630]$ pwd
/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630

Here is the error from this:

galaxy.jobs.runners.drmaa DEBUG 2014-03-05 19:31:37,731 (630/51.dirigo.mdibl.org) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2014-03-05 19:31:49,119 (630/51.dirigo.mdibl.org) state change: job finished normally
galaxy.jobs.runners ERROR 2014-03-05 19:31:50,225 (630/51.dirigo.mdibl.org) Job output not returned from cluster: [Errno 2] No such file or directory: '/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630/galaxy_630.o'
galaxy.jobs DEBUG 2014-03-05 19:31:50,252 finish(): Moved /nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630/galaxy_dataset_856.dat to /nextgen3/galaxy/galaxy-dist/database/files/000/dataset_856.dat
galaxy.jobs DEBUG 2014-03-05 19:31:50,351 job 630 ended
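
One way to check whether this is a timing problem rather than a genuinely missing file is to pull the path out of the error line and test it again a moment later on the Galaxy server host. A rough sketch, assuming the log format shown above; `missing_path` is a hypothetical helper, not part of Galaxy.

```python
import os
import re

# Captures the quoted path at the end of an "[Errno 2]" message.
ERRNO2 = re.compile(r"\[Errno 2\] No such file or directory: '([^']+)'")

def missing_path(log_line):
    """Return the path named in an Errno 2 log line, or None if absent."""
    match = ERRNO2.search(log_line)
    return match.group(1) if match else None

line = ("galaxy.jobs.runners ERROR 2014-03-05 19:31:50,225 (630/51.dirigo.mdibl.org) "
        "Job output not returned from cluster: [Errno 2] No such file or directory: "
        "'/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630/galaxy_630.o'")
path = missing_path(line)
# If this reports True shortly after the error was logged, the file simply
# was not yet visible to this host when Galaxy looked for it.
print(path, "exists now:", os.path.exists(path))
```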

On the Galaxy page, the history shows this in pink:
1 UCSC Main on Human: knownGene (chr22:1-51304566)
error
An error occurred with this dataset:
Job output not returned from cluster

But the dataset is there.