Hi again,

I have looked into this matter a little bit more, and it looks like this is what is happening:

- the tasked job is split
- the task commands are sent to workers (I am running 8-core high-CPU extra large workers on EC2)
- per task, the worker runs env.sh for the respective tool
- per task, the worker runs scripts/extract_dataset_part.py
- this script issues import statements (the ones for simplejson and galaxy.model.mapping have caused me problems)
- these imports lead to unzipping .so libraries from python eggs into the node's /home/galaxy/.python-eggs
- this goes through lib/pkg_resources.py and its _bypass_ensure_directory method, which creates the temporary dir for the egg unzip
- since there are 8 processes on the node, this method sometimes tries to mkdir a directory that another process created right after the isdir check

That last point is my guess. I don't really know how to solve this in a non-hackish way, so until someone figures it out, I may read an 'eggs_extracted.txt' file to determine whether the eggs have already been extracted, and of course lock the file when writing to it. Rough sketches of both the race and the workaround are below.
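To make that guess concrete: the check-then-create pattern in _bypass_ensure_directory means two of the eight task processes can both see the directory missing and both call mkdir, and the loser gets Errno 17. This is not the actual setuptools code, just a rough, untested sketch of what a tolerant version of that pattern could look like (the function name is mine):

import errno
import os

def ensure_dir_racefree(path, mode=0o755):
    # Tolerant equivalent of the isdir()-then-mkdir() pattern: if another
    # task process wins the race and creates the directory first, treat
    # the resulting EEXIST as success instead of failing the task.
    if os.path.isdir(path):
        return
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        ensure_dir_racefree(parent, mode)
    try:
        os.mkdir(path, mode)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise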
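And the hackish workaround I mentioned would be along these lines, using an flock around a marker file; again untested, and the file names/paths are just placeholders:

import fcntl
import os

# Hypothetical paths; the real location would be wherever the node's tasks share.
MARKER = '/home/galaxy/eggs_extracted.txt'
LOCK = MARKER + '.lock'

def extract_eggs_once(do_extract):
    # Only the first task process on the node runs do_extract(); the other
    # seven block on the flock, then see the marker and skip extraction.
    lock_file = open(LOCK, 'a')
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        if not os.path.exists(MARKER):
            do_extract()                  # e.g. the imports that unzip the eggs
            open(MARKER, 'w').close()     # record that extraction is done
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()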
cheers, jorrit

On 09/14/2012 10:57 AM, Jorrit Boekel wrote:
Dear list,
I am running galaxy-dist on Amazon EC2 through Cloudman, and am using the enable_tasked_jobs option to run jobs in parallel. Yes, I know it's not recommended in production. My jobs usually get split into 72 parts, and sometimes (but not always, maybe in 30-50% of cases) errors are returned concerning the python egg cache, usually:
[Errno 17] File exists: '/home/galaxy/.python-eggs'
or something like
[Errno 17] File exists: '/home/galaxy/.python-eggs/simplejson-2.1.1-py2.7-linux-x86_64-ucs4.egg-tmp'
AFAIK the errors arise when scripts/extract_dataset_part.py is run. I am guessing that the tmp python egg dir is created for each of the 72 tasks mentioned, that these creations sometimes coincide, and that this leads to the error.
I would like to solve this problem, but before doing so, I'd like to know if someone else has already fixed it in a galaxy-central changeset.
cheers, jorrit