New subject: More meaningful dataset names/easier method of identifying?

24 Apr 2012

Hello,

For a while now, on the Galaxy mirror that we run, I have often needed to
identify which dataset_*.dat files on the file system (in the
"[galaxy_dist]/database/files/000/" directory) belong to which user, and,
for a single user, to tell their various datasets apart. Files uploaded
directly by the user have a Galaxy job name and a dataset file name that
match: a Galaxy job named "data 18" (for example) corresponds to the file
'dataset_18.dat' on the file system. However, any later analysis on that
file that produces another dataset gives no clue to the corresponding file
name. For example, a "Clip on data 18" run some time later may be written
to 'dataset_44.dat' on the file system, and a "Map with Bowtie on data 18"
that runs on the clipped 'dataset_44.dat' may produce the output file
'dataset_53.dat'.

When debugging failed jobs, after the user has rerun them for the
umpteenth time, there may be dozens of identical or near-identical files to
weed through. The generic naming scheme does not help even though it is
sequential, because it is hard to match files to jobs unless you are
watching the file writes in the directory live. The current implementation
makes sense for internal use and for the code that relies on it, but it is
difficult for a human to tell which files match which jobs in Galaxy.

It would be useful to have more meaningful dataset file names, or an easier
way to identify them (a record that maps the "internal" names to the
"external" ones), for administrative maintenance, so that I can delete
files or possibly even export those .dat files to a network share where our
users can analyze them manually. Could anyone point me to where in the
code I could look to make the dataset names more meaningful? Or perhaps I
should ask the Galaxy developers for a feature that lets users see, under
the "metadata name" of their job (like "Map with Bowtie on data 18") in
the right side pane, the *actual* corresponding file and its path on the
file system (dataset_53.dat, for example). If not for users, then at least
something for administrators. Even a database table with four columns (the
internal/file system dataset name, the job metadata name, the Galaxy job
number that the user sees, and the user that the dataset belongs to) would
be helpful. Many of our users are heavily into informatics, though, and
would probably prefer to be able to see that information themselves. Does
anyone have any suggestions or thoughts about this?
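
To make the four-column record described above concrete, here is a minimal
sketch using Python's built-in sqlite3 module. The table name
`dataset_lookup`, its column names, and the `describe` helper are all
hypothetical and do not reflect Galaxy's actual database schema. The file
and job names are the examples from this message; the job numbers 19 and 20
and the user address are made-up placeholders.

```python
import sqlite3

# Hypothetical lookup table; the names below are illustrative only and
# do not reflect Galaxy's real database schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_lookup (
        file_name    TEXT PRIMARY KEY,  -- name on the file system
        display_name TEXT NOT NULL,     -- job "metadata name" the user sees
        job_number   INTEGER NOT NULL,  -- job number shown to the user
        user_email   TEXT NOT NULL      -- owner of the dataset
    )
    """
)

# Example rows. The job numbers 19 and 20 and the email address are
# placeholders invented for this sketch.
rows = [
    ("dataset_18.dat", "data 18", 18, "user@example.org"),
    ("dataset_44.dat", "Clip on data 18", 19, "user@example.org"),
    ("dataset_53.dat", "Map with Bowtie on data 18", 20, "user@example.org"),
]
conn.executemany("INSERT INTO dataset_lookup VALUES (?, ?, ?, ?)", rows)


def describe(file_name):
    """Return (display_name, job_number, user_email) for a file, or None."""
    cur = conn.execute(
        "SELECT display_name, job_number, user_email "
        "FROM dataset_lookup WHERE file_name = ?",
        (file_name,),
    )
    return cur.fetchone()


print(describe("dataset_53.dat"))
```

With something like this, an administrator could start from a file name
found on disk and get back the job name, job number, and owner in one
lookup.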

Thanks,
Josh Nielsen
