More meaningful dataset names/easier method of identifying?
Hello,

For a while now, on the Galaxy mirror that we run, I have often needed to identify which dataset_*.dat files on the file system (in the "[galaxy_dist]/database/files/000/" directory) belong to which user, and, for a single user, to distinguish between their various datasets. Files uploaded directly by the user have a Galaxy job name and a dataset file name that match: a Galaxy job name of "data 18" (for example) reflects the file name 'dataset_18.dat' on the file system. However, any later analysis of that file that produces another dataset gives no clue to the corresponding file name. For example, a "Clip on data 18" run some time later may be called 'dataset_44.dat' on the file system, and a "Map with Bowtie on data 18" that runs on the clipped 'dataset_44.dat' may produce an output file of 'dataset_53.dat'.

When debugging failed jobs, and after the user has rerun them for the umpteenth time, there may be dozens of identical or near-identical files to weed through. The generic naming scheme is not helpful even though it is sequential, and it is not easy to match files to jobs unless you are watching the file writes in the directory live. The current implementation makes sense for internal usage and the code that uses it, but it is difficult for a human to tell which files match which jobs in Galaxy.

It would be useful to have more meaningful dataset file names, or an easier way to identify them (a record that matches the "internal" and "external" names), for administrative maintenance, so that I can delete files or possibly even export those .dat files to a network share where our users can perform manual analysis on them. Could anyone point me to where in the code I could look to make the dataset names more meaningful?

Or perhaps I should request from the Galaxy developers, as a feature, a way for users themselves to see, under the "metadata name" of their job (like "Map with Bowtie on data 18") in the right side pane, the *actual* corresponding file and its path on the file system (dataset_53.dat, for example). Or, if not for users, at least something for administrators. Even a database with four columns (the internal/file system dataset name, the job metadata name, the Galaxy job number that the user sees, and the user that the dataset belongs to) would be helpful. A lot of our users are heavily into informatics, though, and would probably prefer to see that information themselves. Does anyone have any suggestions or thoughts about this?

Thanks,
Josh Nielsen
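The four-column record described in the post can be sketched as a small lookup table. This is only an illustration: the table name `dataset_map`, the column names, the user `example_user`, and the history numbers 19 and 20 are all hypothetical, and none of it reflects Galaxy's actual database schema. The file and job names are the ones used as examples in the post.

```python
import sqlite3

# Hypothetical rows: (file name, job metadata name, job number seen by the
# user, owning user). Only the file and job names come from the post.
EXAMPLE_ROWS = [
    ("dataset_18.dat", "data 18", 18, "example_user"),
    ("dataset_44.dat", "Clip on data 18", 19, "example_user"),
    ("dataset_53.dat", "Map with Bowtie on data 18", 20, "example_user"),
]


def build_mapping(rows):
    """Create an in-memory table holding the four columns from the post."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dataset_map ("
        " file_name TEXT PRIMARY KEY,"
        " job_name TEXT,"
        " history_number INTEGER,"
        " user_name TEXT)"
    )
    conn.executemany("INSERT INTO dataset_map VALUES (?, ?, ?, ?)", rows)
    return conn


def files_for_user(conn, user_name):
    """Return (file_name, job_name) pairs for one user, in history order."""
    cur = conn.execute(
        "SELECT file_name, job_name FROM dataset_map"
        " WHERE user_name = ? ORDER BY history_number",
        (user_name,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_mapping(EXAMPLE_ROWS)
    for file_name, job_name in files_for_user(conn, "example_user"):
        print(file_name, "<-", job_name)
```

With a record like this, an administrator could look up a user's files by name before deleting or exporting them, rather than guessing from the sequential file numbers.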
In changeset 7013:dae7eefe2f71 I added the full file path to the dataset "View Details" page. Galaxy administrators will always see this, and if you set expose_dataset_path to True in your universe_wsgi.ini, users will see it as well. Hopefully that's what you're looking for, but let me know if I've misunderstood what you're after and I can take another look.

-Dannon

(In reply to Josh Nielsen's message of Apr 24, 2012, 4:41 PM.)
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
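As a rough illustration of the expose_dataset_path setting mentioned in Dannon's reply, the sketch below parses a config fragment and reports whether users would see the path. The `[app:main]` section name and the fallback of False when the option is absent are assumptions made for this example, and the helper `users_can_see_dataset_path` is hypothetical; check your own universe_wsgi.ini for the actual layout.

```python
import configparser

# Assumed fragment of universe_wsgi.ini; the [app:main] section name is an
# assumption for this sketch, not taken from the thread.
EXAMPLE_INI = """
[app:main]
expose_dataset_path = True
"""


def users_can_see_dataset_path(ini_text):
    """Return True if the fragment enables expose_dataset_path."""
    # interpolation=None so '%' characters elsewhere in a real config file
    # are not treated as interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    # Assumption: treat a missing option or section as disabled.
    return parser.getboolean("app:main", "expose_dataset_path", fallback=False)


if __name__ == "__main__":
    print(users_can_see_dataset_path(EXAMPLE_INI))
```

According to the reply, administrators see the file path regardless of this setting; it only controls what ordinary users see.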
Hi Josh,

Are you running the additional "reports web site"? See 'run_reports.sh' and 'reports_wsgi.ini'. We use this extra web site a lot for debugging. It helps with tracking what an individual user is doing, in a 'big brother is watching you' kind of way.

Regards,
Hans

(In reply to Dannon Baker's message of 04/24/2012, 10:51 PM.)
Participants (3): Dannon Baker, Hans-Rudolf Hotz, Josh Nielsen