Hello,
On our Galaxy mirror I have often needed to identify which dataset_*.dat files on the file system (in the "[galaxy_dist]/database/files/000/" directory) belong to which user and, for a single user, to tell their various datasets apart. For files uploaded directly by the user, the Galaxy job name and the dataset file name match: a job named "data 18", for example, corresponds to the file 'dataset_18.dat' on the file system. However, any later analysis of that file that produces a new dataset gives no clue to the corresponding file name. For example, a "Clip on data 18" run some time later may be stored as 'dataset_44.dat' on the file system, and a "Map with Bowtie on data 18" that runs on the clipped 'dataset_44.dat' may produce an output file named 'dataset_53.dat'.
When debugging failed jobs, especially after the user has rerun them for the umpteenth time, there may be dozens of identical or near-identical files to weed through. The generic naming scheme is sequential, but that does not help much: it is hard to match files to jobs unless you are watching the files being written to the directory live. The current implementation makes sense for internal use and for the code that relies on it, but it is difficult for a human to tell which files match which jobs in Galaxy.
It would be useful to have more meaningful dataset file names, or an easier way to identify the files (a record that matches the "internal" and "external" names), for administrative maintenance: so that I can delete files, or possibly even export those .dat files to a network share where our users can analyze them manually. Could anyone point me to where in the code I should look to make the dataset names more meaningful? Or perhaps I should ask the Galaxy developers, as a feature request, for a way for users themselves to see, under the "metadata name" of their job in the right side pane (like "Map with Bowtie on data 18"), the *actual* corresponding file and its path on the file system (dataset_53.dat, for example). If not for users, then at least something for administrators. Even a database table with four columns would be helpful: the internal/file system dataset name, the job metadata name, the Galaxy job number (the one the user sees), and the user the dataset belongs to. Many of our users are heavily into informatics, though, and would probably prefer to see that information themselves. Does anyone have any suggestions or thoughts about this?
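To make the four-column idea concrete, here is a rough sketch of the kind of lookup I have in mind. It is only an illustration and does not use Galaxy's own database: the table name `dataset_lookup`, its column names, the helper functions, the user address, and the history numbers 19 and 20 are all made up for the example. Only the three file names and job names come from the scenario described above.

```python
import re
import sqlite3

# Hypothetical four-column table (NOT Galaxy's real schema): one row per
# dataset file, pairing the internal name with what the user sees.
SCHEMA = """
CREATE TABLE dataset_lookup (
    file_name      TEXT PRIMARY KEY,  -- internal name on disk
    display_name   TEXT NOT NULL,     -- job "metadata name" shown to the user
    history_number INTEGER NOT NULL,  -- job number the user sees
    user_email     TEXT NOT NULL      -- owner of the dataset
)
"""

# Matches the generic on-disk names such as dataset_53.dat.
DATASET_FILE = re.compile(r"^dataset_(\d+)\.dat$")

# Sample rows: file and job names follow the example in this post; the
# user address and the history numbers 19 and 20 are invented.
SAMPLE_ROWS = [
    ("dataset_18.dat", "data 18", 18, "alice@example.org"),
    ("dataset_44.dat", "Clip on data 18", 19, "alice@example.org"),
    ("dataset_53.dat", "Map with Bowtie on data 18", 20, "alice@example.org"),
]


def build_lookup(rows):
    """Create an in-memory lookup table and load the given rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO dataset_lookup VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def describe_file(conn, file_name):
    """Return (display_name, history_number, user_email) for a file,
    or None if the file is not recorded in the table."""
    if not DATASET_FILE.match(file_name):
        raise ValueError(f"not a dataset file name: {file_name!r}")
    cur = conn.execute(
        "SELECT display_name, history_number, user_email "
        "FROM dataset_lookup WHERE file_name = ?",
        (file_name,),
    )
    return cur.fetchone()


def files_for_user(conn, user_email):
    """Return the file names owned by a user, ordered by history number."""
    cur = conn.execute(
        "SELECT file_name FROM dataset_lookup "
        "WHERE user_email = ? ORDER BY history_number",
        (user_email,),
    )
    return [row[0] for row in cur.fetchall()]
```

With something like this, `describe_file(conn, "dataset_53.dat")` would tell an administrator which job and user a file belongs to, and `files_for_user(conn, ...)` would list the files to export or delete for one user. How such a table would be populated from Galaxy itself is exactly the part I do not know.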