Idea for user-based dataset subdirectories
Hello,

Please forgive the length of this proposition as I try to explain my reasoning. Let me say first that I understand Galaxy is not meant to be everything to everyone, and that feature requests may not suit everyone who uses it. That said, I have an idea that I think would make dealing with users' datasets from a file-system perspective more convenient.

Compared to manual job submission for tools on a cluster, Galaxy has the obvious advantage of providing an interface for all the analysis tools, plus the history of the operations done on your data, in one place. However, I have found that putting all output datasets in one directory on the file system (the files/000/ directory) causes a problem for users who specifically want to interact with them *on the file system*, and not just through the Web interface, for whatever complicated or diverse reasons.

Since Galaxy runs on a cluster of its own in our environment, and we do not allow users to connect to it remotely to submit manual jobs (with output going to their separate home directories) as we do on our main cluster, it is essentially a black box beyond Galaxy's GUI. That is what we want, except for how users can interact with the output files.

The issue is that our users would like an easy means of copying their files off the Galaxy cluster to other servers from a command line (possibly even automated by scripts). Even if we set up an FTP share of the output directory for that purpose, the common [galaxy-dist]/database/files/000/ directory lumps all files for all users together in one directory and uses a sequential naming scheme (dataset_N++) that gives no indication of who owns each file.

Could the dataset output directory locations be designed (or set optionally?) like the FTP upload feature's expected directory structure, where files are dropped into a subdirectory corresponding to the user who produced them? For example, under database/files/ there could be subdirectories named after each user's Galaxy account id ([galaxy-dist]/database/files/jsmith, [galaxy-dist]/database/files/sparker, etc.). If datasets were segregated by user, it would be much easier to keep track of what belongs to whom on the file system. I could then set up a read-only FTP share to the files/ directory on the cluster, from which users could copy the files in their personal subdirectory to other systems, and perhaps batch-download them, rather than relying solely on the Web interface.

I understand that in Galaxy's current design the files are generically named (the "behind-the-scenes" handling of data is a black box), and it is the database that keeps track of which files belong to whom and holds the metadata for more meaningful dataset/job names. But a file-system hierarchy alternative would also be welcome in a heavily command-line-oriented computational environment.

Would setting up a more user-representative output directory hierarchy on the file system like that be possible?

Best Regards,
Josh Nielsen
How about a completely separate daemon that periodically monitors the Galaxy database to determine which datasets belong to which user(s)? It would then move each dataset to an area owned by the user and group-accessible to Galaxy, replacing the dataset with a symlink. This would require no changes to the Galaxy build, but it would require a constant monitoring system.

There is already a mechanism for users to move their files into a joint user/galaxy directory, but it is (as far as I know) only allowed for libraries, not histories. It would be better if there were a way for users to browse through their own directories as a tool and load files directly into their history.

David Hoover

On May 15, 2012, at 7:40 PM, Josh Nielsen wrote:
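The daemon David describes could be sketched roughly as below. The ownership query's table and column names (`dataset`, `history_dataset_association`, `history`, `galaxy_user`) are my reading of Galaxy's schema and should be verified against your own database; the `relocate` helper is hypothetical and shows only the move-and-symlink step.

```python
import os
import shutil

# Example ownership query (schema names are assumptions -- verify them
# against your Galaxy database before relying on this):
OWNER_QUERY = """
SELECT d.id, u.email
  FROM dataset d
  JOIN history_dataset_association hda ON hda.dataset_id = d.id
  JOIN history h ON hda.history_id = h.id
  JOIN galaxy_user u ON h.user_id = u.id
"""

def relocate(dataset_path, user_dir):
    """Move one dataset file into the user's directory and leave a
    symlink at the original path so Galaxy still finds the data."""
    if os.path.islink(dataset_path):
        # Already relocated on a previous polling pass; nothing to do.
        return os.readlink(dataset_path)
    os.makedirs(user_dir, exist_ok=True)
    target = os.path.join(user_dir, os.path.basename(dataset_path))
    shutil.move(dataset_path, target)
    os.symlink(target, dataset_path)
    return target
```

A polling loop would run the query against the database, map each dataset id to its on-disk path, and call `relocate` for each file; making it idempotent (as the symlink check above does) matters because the daemon revisits the same files on every pass.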
___________________________________________________________
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
Hi David,

Actually, that is an interesting idea, using a daemon to move the files into associated user directories. Is that something the Galaxy dev team is working on or could work on, or was it just a suggestion? I'm not opposed to doing some dev work of my own, but I don't know Python very well, and I know most of the Galaxy code is Python.

I'm not sure I follow what you mean about the joint user/galaxy directory, though. I of course want the output not to be unified (all in the same directory) but rather segregated into per-user subdirectories. I think you already caught that, so I guess I just didn't understand what you were getting at.

Josh Nielsen
No, this was all an idea I've had for a while but never did anything about. I'm pretty sure the Galaxy developers are not interested in anything this locally centric, and I don't blame them. It ought to be something completely outside the Galaxy build, because Galaxy is meant to be system-independent.

What I meant by a 'joint user/galaxy directory' is a directory that is owned by a user but that the galaxy user has read (and possibly write) access to. This is entirely possible given either a well-informed user population or an iron-clad suexec executable.

The mechanism I alluded to is a feature by which a user can upload a directory of files all at once. There is a configuration directive in universe_wsgi.ini, user_library_import_dir, that allows non-administrative users to upload an entire directory of files into a library. The directive identifies the base directory, within which subdirectories named for the Galaxy user login (email address) are searched. The user_library_import_dir directory is owned by the galaxy user; the subdirectories are owned by the user but group-owned by the galaxy user. A user copies files to the subdirectory, logs in to Galaxy, switches to their library, and uploads all the files in the directory into a single library folder.

There isn't much documentation about it in the main Galaxy wiki, so forget that. I haven't enabled it on our local production site, and I haven't played with it in a long time. I'm pretty sure the files are not removed after uploading, and a user is free to re-upload them again and again, so it's kind of quirky. Also, if the files are not readable by the galaxy user, a bizarre and unhelpful error is thrown.

If this functionality could be extended and elaborated, it could do what you want. user_library_import_dir requires that the user's login in Galaxy be identical to the user's login on the cluster, and that the permissions be kept correct. Typically users have no idea what is going on with their permissions, so what are you going to do?

David

On May 16, 2012, at 1:33 PM, Josh Nielsen wrote:
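The on-disk layout David describes could be prepared as below. The base path and login are examples only; `user_library_import_dir` is the directive he cites, and the ownership step is shown as a comment because it needs root and real uids/gids.

```python
import os

# Base directory; in universe_wsgi.ini you would point the directive
# at it, e.g.:
#   user_library_import_dir = /tmp/galaxy-import
# (the path here is only an example)
BASE = os.environ.get("GALAXY_IMPORT_DIR", "/tmp/galaxy-import")

# Subdirectory named for the user's Galaxy login (email address);
# this user is hypothetical.
LOGIN = "jsmith@example.org"

subdir = os.path.join(BASE, LOGIN)
os.makedirs(subdir, exist_ok=True)

# Owned by the user, group-owned by galaxy, group-readable. The chown
# requires privileges, so it is left as an illustrative comment:
#   os.chown(subdir, uid_of_jsmith, gid_of_galaxy)
os.chmod(subdir, 0o750)
```

With that in place, files the user drops into their subdirectory become visible in the library upload form once the directive is enabled.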
Thanks for breaking that down for me. We are setting up some dev machines in our environment in a few weeks, and I may create a clone of our production Galaxy mirror and play around with that version to see if I can get the functionality I'm looking for. I'll take the daemon idea into consideration.

Regards,
Josh

On Wed, May 16, 2012 at 1:08 PM, David Hoover <hooverdm@helix.nih.gov> wrote:
This seems like a nice application of the API: (1) use a daemon to periodically query Galaxy for user histories and/or datasets (I think this is possible with the API right now); (2) create symbolic links to users' datasets, perhaps organized into subdirectories based on history.

Best,
J.

On May 16, 2012, at 4:34 PM, Josh Nielsen wrote:
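A rough sketch of step (2), with the API fetch separated from the link building so the latter stands alone. The `/api/histories` routes exist in Galaxy's API, but whether a response exposes the on-disk path depends on your instance's configuration, so the `'name'` and `'file_path'` fields assumed below are hypothetical and must be adapted.

```python
import json
import os
from urllib.request import urlopen


def get_json(base_url, path, api_key):
    """GET one Galaxy API endpoint, e.g. /api/histories or
    /api/histories/<id>/contents, authenticating with an API key."""
    url = "%s%s?key=%s" % (base_url, path, api_key)
    with urlopen(url) as response:
        return json.load(response)


def link_history(history_name, datasets, link_root):
    """Create a per-history directory of symlinks pointing at the real
    dataset files. Each dataset dict is assumed to carry 'name' and
    'file_path' keys -- treat those field names as placeholders for
    whatever your API actually returns."""
    hist_dir = os.path.join(link_root, history_name)
    os.makedirs(hist_dir, exist_ok=True)
    links = []
    for ds in datasets:
        link = os.path.join(hist_dir, ds["name"])
        if not os.path.islink(link):
            os.symlink(ds["file_path"], link)
        links.append(link)
    return links
```

A daemon would call `get_json` for each user's histories, then `link_history` for each one, leaving a browsable per-user, per-history tree of symlinks without touching Galaxy's own files.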
participants (3)
- David Hoover
- Jeremy Goecks
- Josh Nielsen