Hi everyone,

I've been getting my feet wet with Galaxy development, working to get some of the rexpression tools online, and I've run into a snag that I've traced back to a datatype's set_meta method being unable to find the file from which it wants to extract metadata. After reading the code, I believe this would also be a problem for non-composite datatypes.

The specific test case I've been looking at is uploading an affybatch file (and associated pheno file) using Galaxy's built-in upload tool and selecting the File Format manually (i.e., choosing "affybatch" in the dropdown). I am using the unmodified datatype definitions provided in lib/galaxy/datatypes/genetics.py and unmodified core Galaxy upload code as of 5955:949e4f5fa03a. (I am also testing with modified versions, but I can reproduce and track this bug in the specified clean version.)

The crux of the error is that in JobWrapper.finish(), dataset.set_meta() is called (lib/galaxy/jobs/__init__.py:607) before the uploaded composite dataset files are moved from the job working directory to their final destination under config.file_path (which defaults to "database/files"); the move happens later, in the call to the Tool method "self.tool.collect_associated_files(out_data, self.working_directory)" on line 670. In my test case, "dataset.set_meta( overwrite = False )" eventually calls lib/galaxy/datatypes/genetics.py:Rexp.set_meta(dataset, **kwd).

As far as I can tell, the only way to construct a path to a file (or the files) of a dataset without relying on hard-coded paths from external knowledge is to use the Dataset.get_file_name or Dataset.extra_files_path properties. Unless explicitly told otherwise, both of these construct a path based on the Dataset.file_path class data member, whose value is set during Galaxy startup to config.file_path (default "database/files"). However, at the time set_meta is called in this case, the files are not under config.file_path but rather under the job working directory, so attempting to open the dataset's files via these paths fails. Moreover, unless the job working directory is passed to set_meta or supplied when the underlying Dataset object is constructed, there doesn't appear to be any way for a Dataset method to access the currently running job (for instance, to get its job ID or working directory). (The second option is actually impossible: since the standard upload is asynchronous, the Dataset object is created and persisted before the Job that will process it is created.) Thoughts?

This issue also affects Rexp.set_peek, as well as any other functions that may want to read data from the uploaded files before they are moved to their permanent location. This is why, if you take an affybatch file and its associated pheno file and test this on, say, the public Galaxy server at http://main.g2.bx.psu.edu/, the peek info says (for example):

    ##failed to find /galaxy/main_database/files/002/948/dataset_2948818_files/affybatch_test.pheno

It seems that if the current behavior of Dataset.file_path, Dataset.file_name, and Dataset.extra_files_path is part of the intended design of Galaxy, then methods like set_meta should be run after the files have been moved to config.file_path, so that they can set metadata based on file contents. From lib/galaxy/jobs/__init__.py:568-586, it looks like this is intended to happen at least in some cases; however, in my tests that code is not kicking in because hda_tool_output is None.
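To make the path problem concrete, here is a simplified sketch of my understanding (this is not the actual Galaxy implementation; the directory names and the numeric dataset id are made up for illustration):

    import os

    FILE_PATH = "database/files"                            # config.file_path (default)
    JOB_WORKING_DIR = "database/job_working_directory/42"   # hypothetical job working dir

    def extra_files_path(dataset_id):
        # Roughly what Dataset.extra_files_path resolves to: always rooted at
        # Dataset.file_path, regardless of where the job actually wrote the files.
        return os.path.join(FILE_PATH, "dataset_%d_files" % dataset_id)

    # Where the composite files actually live when JobWrapper.finish() calls
    # dataset.set_meta() -- still under the job working directory:
    actual = os.path.join(JOB_WORKING_DIR, "dataset_42_files", "affybatch_test.pheno")

    # Where Rexp.set_meta() looks for them, via extra_files_path -- this location
    # is not populated until collect_associated_files() runs later in finish():
    attempted = os.path.join(extra_files_path(42), "affybatch_test.pheno")

    print(os.path.exists(attempted))   # False at set_meta time -> "failed to find ..."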
Any clarification on what is happening here, what is supposed to happen when setting metadata on (potentially composite) uploads, why dataset.set_meta() isn't already being called after the files are moved to config.file_path, or any insight into related Galaxy design decisions or constraints I may have missed would be very greatly appreciated. I'd also be glad to provide further detail or test files on request.

Thank you,
Eric Paniagua

PS: Further notes on passing the job working directory to set_meta or set_peek: I have successfully modified the code to do this for set_meta, since the call chain from dataset.set_meta() in JobWrapper.finish() down to Rexp.set_meta() accepts and forwards keyword argument dictionaries along the way. However, set_peek does not accept arbitrary keyword arguments, which makes it harder to pass along the job working directory when needed without stepping on the toes of other code.
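For reference, my set_meta workaround looks roughly like the sketch below. The 'working_directory' keyword and the stand-in class are my own additions for illustration, not existing Galaxy API; the real change threads the keyword from JobWrapper.finish() through the existing **kwd chain.

    import os

    class RexpLike(object):
        # Stand-in for genetics.py:Rexp; only the workaround is sketched here.
        def set_meta(self, dataset, overwrite=True, **kwd):
            # Prefer an explicitly supplied job working directory, falling back
            # to the usual location under config.file_path.
            base_path = kwd.get('working_directory') or dataset.extra_files_path
            pheno_path = os.path.join(base_path, 'affybatch_test.pheno')  # illustrative name
            if os.path.exists(pheno_path):
                # ...read the pheno file and set the relevant metadata elements...
                pass
            # set_peek() has a fixed signature, so the same trick does not apply
            # there without changing its interface.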