Hi Pete,
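I'd suggest setting retry_job_output_collection to a value greater than 0 in universe_wsgi.ini. This error is usually a symptom of attribute caching on network filesystems: the job's output files exist on the cluster node, but the Galaxy server does not see them yet when it first tries to collect them.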
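The idea behind retrying can be sketched roughly as below. This is only an illustration of the general technique, not Galaxy's actual implementation; the function name read_with_retries and its retries/delay parameters are hypothetical and chosen for this example.

```python
import time


def read_with_retries(path, retries=5, delay=1.0):
    """Read a file, retrying if it is not yet visible on this host.

    Hypothetical helper: on a network filesystem with attribute caching,
    a file written by a compute node may briefly appear missing here.
    """
    for attempt in range(retries + 1):
        try:
            with open(path) as handle:
                return handle.read()
        except FileNotFoundError:
            # Give up only after the last allowed retry.
            if attempt == retries:
                raise
            # Wait so the client-side cache has a chance to refresh.
            time.sleep(delay)
```

With retries=0 this behaves like a single open() call, which would match the failure in the log below if the file only becomes visible a moment later.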
--nate
On Wed, Mar 5, 2014 at 8:06 PM, Pete Schmitt <Peter.R.Schmitt@dartmouth.edu> wrote:
While trying something simple in Galaxy, I downloaded data from UCSC Main. The data gets downloaded, but the job errors out. I verified that the job actually ran and completed successfully according to the scheduler, but I get errors like this:
galaxy.jobs.runners.drmaa DEBUG 2014-03-05 18:17:35,941 (624/46.dirigo.mdibl.org) state change: job finished normally
galaxy.jobs.runners ERROR 2014-03-05 18:17:36,060 (624/46.dirigo.mdibl.org) Job output not returned from cluster: [Errno 2] No such file or directory: '/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/624/galaxy_624.o'
There are no directories being created below the 000 directory. I verified that the directory tree is owned by galaxy and that the galaxy user can run jobs from the command line as a normal user.
I set the parameter "cleanup_job = never". It was set to "always", which is probably why the files were never there. Now the files are there, including the galaxy_###.o file, but Galaxy still reports the same error as above.
I had set the parameter "cluster_files_directory = database/pbs", but that doesn't seem to work any longer. The .o and .e files used to end up there.
Here is an example:
(galaxyvenv)[galaxy@dirigo 630]$ ll
total 16
-rw------- 1 galaxy galaxy 0 Mar 5 19:29 galaxy_630.e
-rw-rw-r-- 1 galaxy galaxy 2 Mar 5 19:29 galaxy_630.ec
-rw------- 1 galaxy galaxy 940 Mar 5 19:29 galaxy_630.o
-rwxr-xr-x 1 galaxy galaxy 2429 Mar 5 19:29 galaxy_630.sh
-rw-rw-r-- 1 galaxy galaxy 138 Mar 5 19:29 galaxy.json
-rw-rw-r-- 1 galaxy galaxy 2139 Mar 5 19:29 metadata_in_HistoryDatasetAssociation_1182_o830e3
-rw-rw-r-- 1 galaxy galaxy 20 Mar 5 19:29 metadata_kwds_HistoryDatasetAssociation_1182_hOhPp7
-rw-rw-r-- 1 galaxy galaxy 55 Mar 5 19:29 metadata_out_HistoryDatasetAssociation_1182_Ynb70M
-rw-rw-r-- 1 galaxy galaxy 2 Mar 5 19:29 metadata_override_HistoryDatasetAssociation_1182_HsMljG
-rw-rw-r-- 1 galaxy galaxy 44 Mar 5 19:29 metadata_results_HistoryDatasetAssociation_1182_LxdsAZ
(galaxyvenv)[galaxy@dirigo 630]$ pwd
/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630
Here is the error from this:
galaxy.jobs.runners.drmaa DEBUG 2014-03-05 19:31:37,731 (630/51.dirigo.mdibl.org) state change: job is running
galaxy.jobs.runners.drmaa DEBUG 2014-03-05 19:31:49,119 (630/51.dirigo.mdibl.org) state change: job finished normally
galaxy.jobs.runners ERROR 2014-03-05 19:31:50,225 (630/51.dirigo.mdibl.org) Job output not returned from cluster: [Errno 2] No such file or directory: '/nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630/galaxy_630.o'
galaxy.jobs DEBUG 2014-03-05 19:31:50,252 finish(): Moved /nextgen3/galaxy/galaxy-dist/database/job_working_directory/000/630/galaxy_dataset_856.dat to /nextgen3/galaxy/galaxy-dist/database/files/000/dataset_856.dat
galaxy.jobs DEBUG 2014-03-05 19:31:50,351 job 630 ended
On the Galaxy page, the history shows this in pink:
1 UCSC Main on Human: knownGene (chr22:1-51304566)
error
An error occurred with this dataset: Job output not returned from cluster
But the dataset is there.