There appears to be some inconsistent use of tempfile.mkstemp() within upload.py that causes problems when users import data files into Galaxy from a cluster directory via the upload process. The issue manifests when Galaxy's job temporary directory, dataset directory, and import directory are on different file systems (common in cluster environments), in conjunction with a configuration where users copy their data files directly into the import directory from which Galaxy selects datasets to upload (as opposed to using an FTP gateway).

While allowing users to copy files into an import directory rather than using the FTP gateway may not be that common, we use this configuration locally to build a more seamless interface with our local collection of HPC resources: users can be logged into their cluster account and move data into Galaxy with a file copy command rather than having to use FTP. This configuration has worked well in our environment as long as the correct ownership configuration existed on the import directory and as long as the import directory, job temporary directory, and Galaxy dataset directory were all on the same file system.

We now have our Galaxy dataset directory on a different file system and are seeing inconsistent behavior during upload.py runs depending on whether the data is plain text, a BAM file, or gzipped. A subset of uploads fail because of the way Galaxy creates temporary files to facilitate the import and any associated conversion of different file types. During the import:

1) Galaxy copies the original file to a temporary target file (converting as needed during the copy).
2) Once this conversion step is complete, Galaxy attempts to move the temporary file back to the original location, i.e. the import directory.
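As a rough sketch of the flow described above (this is illustrative, not Galaxy's actual code; the function name is hypothetical):

```python
import os
import shutil
import tempfile

def import_with_conversion(import_path):
    """Hypothetical sketch of the two-step upload flow described above."""
    # Step 1: write the (possibly converted) content to a temporary file.
    # A bare mkstemp() honors $TMPDIR, so this lands in the job temp dir.
    fd, temp_name = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as out, open(import_path, "rb") as src:
        # Real code would convert newlines, gunzip, etc. here.
        out.write(src.read())
    # Step 2: move the temp file back over the original import file.
    shutil.move(temp_name, import_path)
```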
3) If this move succeeds, Galaxy completes the upload processing and the data becomes a registered dataset in the user's dataset collection.

Galaxy uses the Python shutil.move method to move the temporary file. This results in a simple os.rename if the files remain on the same file system. However, if os.rename raises OSError because the move was attempted across a file system boundary, shutil.move falls back to shutil.copy2, which copies the data to the original import file and then tries to copy the file attributes (permissions and utimes) onto the original import file from the source file (the temporary file Galaxy created in step 1 to begin the conversion process).

os.rename and shutil.copy2 behave (and fail) differently depending on the ownership of the original import file. os.rename will succeed even if the Galaxy upload.py job process only maps to the group-owner of the original import file (which can be ensured with a group sticky bit on the import directory or with ACLs). shutil.copy2, however, will fail if the upload.py job process UID is not the user-owner of the original import file.

We could ensure os.rename succeeds by keeping the job temporary directory and the import directory on the same file system. However, the temporary directories used by upload.py are inconsistent across data types, which prevents this simple fix from working for all of them. When text files are imported, upload.py calls the sniff.* methods to perform conversion. These methods use a bare call to tempfile.mkstemp(), which creates the file in the default temporary directory (specified by the $TMPDIR environment variable). For example, in sniff.convert_newlines:

  fd, temp_name = tempfile.mkstemp()

https://bitbucket.org/galaxy/galaxy-central/src/5884328e91724e9bdf4b43f012eb...
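The shutil.move fallback behavior described above can be modeled roughly like this (a simplified sketch for illustration, not the stdlib source):

```python
import os
import shutil

def move_like_shutil(src, dst):
    """Simplified model of shutil.move's cross-device fallback."""
    try:
        # Same file system: a rename only needs write access to the
        # directories involved, so group ownership of dst is enough.
        os.rename(src, dst)
    except OSError:
        # Cross-device move: copy the data, then copy permissions and
        # utimes (copy2 calls copystat) -- the step that fails when the
        # calling UID is not the user-owner of dst.
        shutil.copy2(src, dst)
        os.unlink(src)
```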
However, for compressed files, upload.py creates the temp files directly, and here it specifies the target directory to be the dataset directory:

  fd, uncompressed = tempfile.mkstemp( prefix='data_id_%s_upload_gunzip_' % dataset.dataset_id, dir=os.path.dirname( output_path ), text=False )

https://bitbucket.org/galaxy/galaxy-central/src/5884328e91724e9bdf4b43f012eb...

It's not clear whether there is any significance to using the dataset directory as the temp dir for compressed files versus the job temporary directory for other data files. It seems all temporary files created by upload.py should be created consistently in the same temporary location, preferably the job temp directory. Is there a reason these file types use different temporary file locations? If they used the same tempfile location, we could use one consistent system configuration and ensure all our data files can be imported even when the import and temp directories are not on the same file system as the Galaxy dataset directory. It seems reasonable that all tempfile.mkstemp() calls should be unadorned and inherit the temporary directory location from the environment.

A more comprehensive solution, which would correct the inconsistency in failures between os.rename and shutil.copy2 and also remove any constraint that Galaxy's import, temp, and dataset directories be on the same file system, would be to simply delete the original import file before attempting the shutil.move. This would ensure the file that the upload.py job creates in step 2 is new and created with full Galaxy process ownership.

Finally, it seems odd that Galaxy attempts to reuse the user's original import file in the first place. Once Galaxy begins processing the content of the to-be-imported file, it should never write back to that file. What's the motivation here?
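The delete-before-move workaround could look something like this (an illustrative sketch; `safe_replace` is a hypothetical name, not an existing Galaxy function):

```python
import os
import shutil

def safe_replace(temp_name, import_path):
    """Replace the user's import file with the converted temp file.

    Removing the original first means shutil.move always creates a
    fresh file owned by the Galaxy process, so the cross-device
    copy2/copystat fallback never has to modify a file owned by the
    uploading user.
    """
    if os.path.exists(import_path):
        os.unlink(import_path)
    shutil.move(temp_name, import_path)
```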
I'll be interested to learn more about the motivations behind these different tempfile conventions and whether this can be fixed upstream.

Thanks,
~jpr