On Feb 13, 2012, at 11:52 AM, Fields, Christopher J wrote:
On Feb 13, 2012, at 9:45 AM, Nate Coraor wrote:
On Feb 8, 2012, at 9:32 PM, Fields, Christopher J wrote:
'samtools sort' seems to be running on our server end as well (not on the cluster). I may look into it a bit more myself. Snapshot of top off our server (you can see our local runner as well):
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 3950 galaxy    20   0 1303m 1.2g  676 R 99.7 15.2 234:48.07 samtools sort /home/a-m/galaxy/dist-database/file/000/dataset_587.dat /home/a-m/galaxy/dist-database/tmp/tmp9tv6zc/sorted
 5417 galaxy    20   0 1186m 104m 5384 S  0.3  1.3   0:15.08 python ./scripts/paster.py serve universe_wsgi.runner.ini --server-name=runner0 --pid-file=runner0.pid --log-file=runner0.log --daemon
Hi Chris,
'samtools sort' is run by groom_dataset_contents, which should only be called from within the upload tool, which should run on the cluster unless you still have the default local override for it in your job runner's config file.
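To spot such a local override, something like the following could be used. It is only a sketch: it assumes the per-tool runner overrides live in a `[galaxy:tool_runners]` section of the job runner's ini file, with the tool id as the key (e.g. `upload1`) and a runner URL as the value. The function name, the sample text and the `pbs:///` URLs are made up for illustration.

```python
import configparser


def local_tool_overrides(ini_text, section="galaxy:tool_runners"):
    """Return {tool_id: runner_url} for tools pinned to the local runner.

    The section name is an assumption about the config layout, not
    something confirmed in this thread.
    """
    # interpolation=None: ini values may contain literal '%' characters.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        return {}
    return {
        tool: url
        for tool, url in parser.items(section)
        if url.strip().startswith("local:")
    }


# Hypothetical config fragment, for illustration only.
SAMPLE = """
[app:main]
default_cluster_job_runner = pbs:///

[galaxy:tool_runners]
upload1 = local:///
biomart = local:///
some_cluster_tool = pbs:///
"""

if __name__ == "__main__":
    for tool, url in sorted(local_tool_overrides(SAMPLE).items()):
        print(tool, "->", url)
```

If `upload1` shows up in the result, uploads (and therefore the BAM grooming step) would still run on the Galaxy server rather than on the cluster.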
Yes, that is likely the problem. Our cluster was running an old version of Python (v2.4) that was also a UCS2 build, which broke bx_python, so we were running locally. That was fixed this past week (the admins insisted on not installing a Python version locally, so we insisted in turn that they install something modern built with UCS4). I tested a single upload off the cluster successfully, so I would guess this is resolved (I'll confirm that).
Is there any information on data grooming on the wiki? I only found info relevant to FASTQ grooming, not SAM/BAM.
FASTQ grooming runs voluntarily as a tool. The datatype grooming method is only called at the end of the upload tool, and is only defined for the Bam datatype (although other datatypes could define it). I believe it's implemented this way because it was deemed inefficient to force FASTQ grooming when the FASTQ may already be in an acceptable format. I am not sure why the same determination was not made for BAM, so perhaps one of my colleagues will clarify that.
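As a rough illustration of what "needs grooming" could mean for a BAM file, the sketch below checks whether the header declares coordinate sort order. This is not Galaxy's actual implementation; it assumes the header text was obtained with `samtools view -H`, and the function names are hypothetical.

```python
import subprocess


def header_says_coordinate_sorted(header_text):
    """True if the @HD header line carries the tag SO:coordinate."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # SAM header fields are tab-separated TAG:VALUE pairs.
            return "SO:coordinate" in line.split("\t")[1:]
    # No @HD line: the sort order is not declared.
    return False


def bam_needs_sorting(bam_path):
    """Hypothetical helper: read the header via samtools (must be on PATH)."""
    header = subprocess.check_output(
        ["samtools", "view", "-H", bam_path], universal_newlines=True
    )
    return not header_says_coordinate_sorted(header)
```

A file whose header does not declare coordinate order would then be a candidate for 'samtools sort', which is the expensive step seen in the top output.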
Ryan's instance is running 'samtools index', which is called from set_meta. That is supposed to run on the cluster if set_metadata_externally = True, but it can run locally under certain conditions.
--nate
Will have to check, but I believe we have not set that yet either. We are in the midst of moving all jobs to the cluster and have been working through issues with disparate Python versions, etc. Those now seem to be fixed, so this should be resolved shortly as well.
set_metadata_externally = True should "just work" and will significantly decrease the performance penalty taken on the server and by the (effectively single-threaded) Galaxy process.
--nate
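A quick way to confirm whether the option is enabled is sketched below. It assumes the setting sits in the `[app:main]` section of the Galaxy ini file and is treated as off when absent or commented out; both are assumptions rather than something stated in this thread, and the function name is hypothetical.

```python
import configparser


def metadata_set_externally(ini_text, section="app:main"):
    """Return the boolean value of set_metadata_externally (False if unset)."""
    # interpolation=None: ini values may contain literal '%' characters.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.getboolean(section, "set_metadata_externally", fallback=False)


# Hypothetical config fragments, for illustration only.
ENABLED = "[app:main]\nset_metadata_externally = True\n"
COMMENTED_OUT = "[app:main]\n#set_metadata_externally = False\n"

if __name__ == "__main__":
    print(metadata_set_externally(ENABLED))
    print(metadata_set_externally(COMMENTED_OUT))
```

In practice the text would come from reading the instance's own ini file, for example the universe_wsgi.runner.ini named in the top output above.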
chris