Hi again,

I have looked into this matter a little bit more, and it looks like this is what is happening:

- the tasked job is split
- the task commands are sent to workers (I am running 8-core high-CPU extra large workers on EC2)
- per task, the worker runs env.sh for the respective tool
- per task, the worker runs scripts/extract_dataset_part.py
- this script issues import statements (the ones for simplejson and galaxy.model.mapping have caused me problems)
- these imports lead to unzipping .so libraries from python eggs into the node's /home/galaxy/.python-eggs
- this goes through lib/pkg_resources.py and its _bypass_ensure_directory method, which creates the temporary dir for the egg unzip
- since there are 8 processes on the node, this method sometimes tries to mkdir a directory that another process created right after the isdir check

That last point is my guess. I don't really know how to solve this in a non-hackish way, so until someone figures it out, I may read an 'eggs_extracted.txt' file to determine whether the eggs have already been extracted, and of course lock the file when writing to it. Rough sketches of both the race and the workaround are below.
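To make that guess concrete: the check-then-create pattern in _bypass_ensure_directory means two of the eight task processes can both see the directory missing and both call mkdir, and the loser gets Errno 17. This is not the actual setuptools code, just a rough, untested sketch of what a tolerant version of that pattern could look like (the function name is mine):

import errno
import os

def ensure_dir_racefree(path, mode=0o755):
    # Tolerant equivalent of the isdir()-then-mkdir() pattern: if another
    # task process wins the race and creates the directory first, treat
    # the resulting EEXIST as success instead of failing the task.
    if os.path.isdir(path):
        return
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        ensure_dir_racefree(parent, mode)
    try:
        os.mkdir(path, mode)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise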
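And the hackish workaround I mentioned would be along these lines, using an flock around a marker file; again untested, and the file names/paths are just placeholders:

import fcntl
import os

# Hypothetical paths; the real location would be wherever the node's tasks share.
MARKER = '/home/galaxy/eggs_extracted.txt'
LOCK = MARKER + '.lock'

def extract_eggs_once(do_extract):
    # Only the first task process on the node runs do_extract(); the other
    # seven block on the flock, then see the marker and skip extraction.
    lock_file = open(LOCK, 'a')
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        if not os.path.exists(MARKER):
            do_extract()                  # e.g. the imports that unzip the eggs
            open(MARKER, 'w').close()     # record that extraction is done
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()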
cheers, jorrit

On 09/14/2012 10:57 AM, Jorrit Boekel wrote:
Dear list,
I am running galaxy-dist on Amazon EC2 through Cloudman, and am using the enable_tasked_jobs option to run jobs in parallel. Yes, I know it's not recommended in production. My jobs usually get split into 72 parts, and sometimes (but not always, maybe in 30-50% of cases) errors are returned concerning the python egg cache, usually:
[Errno 17] File exists: '/home/galaxy/.python-eggs'
or something like
[Errno 17] File exists: '/home/galaxy/.python-eggs/simplejson-2.1.1-py2.7-linux-x86_64-ucs4.egg-tmp'
AFAIK the errors arise when scripts/extract_dataset_part.py is run. I am guessing that the tmp python egg dir is created for each of the 72 tasks mentioned, that these creations sometimes coincide, and that this leads to the error.
I would like to solve this problem, but before doing so, I'd like to know if someone else has already fixed it in a galaxy-central changeset.
cheers, jorrit