Idea for user-based dataset subdirectories

15 May 2012

      Hello,

Please forgive the length of this proposition as I try to explain my
reasoning behind this. Let me say first of all that I understand that
Galaxy is not meant to be everything to everyone and that requests for
features may not suit everyone who uses Galaxy. That being said I have an
idea or request that I think would be convenient for dealing with user's
datasets from a file-system perspective.

Galaxy has the obvious benefit and advantage (compared to manual
job-submission for tools on a cluster) of providing an interface for using
all the analysis tools, and the history of the operations done on your
data, all in one place. However I have found that putting all the output &
datasets in one directory (the files/000/ directory) on the file-system
causes a problem for the users if they specifically want to interact with
it *on the file-system*, and not just through the Web interface - for
whatever complicated or diverse reasons.

Since Galaxy runs on a cluster of its own in our environment, and we do not
allow users to remote connect into it to submit manual jobs (and
individually output it to their separate home directories) like we do our
main cluster, it is essentially a black box beyond the GUI interface of
Galaxy. That is essentially what we want except for how they can interact
with the output files.

The issue is that our users would like an easy means of copying their files
off of the Galaxy cluster to other servers from a command line (possibly
even automated by scripts). Even if we allow an FTP share of the output
directory for users to do that, the common
[galaxy-dist]/database/files/000/ directory clumps all of the files for all
users together in one directory and uses a sequential file-naming scheme
(dataset_N++) that is not easy to discriminate between as to who the owner
is for each file.

Is there a way that the dataset output directory locations could be
designed (or set optionally?) like the FTP upload feature's expected
directory structure: where the files are dropped into the corresponding
subdirectory of the user who produced it? For example having under
database/files/ subdirectories named according to the user's Galaxy account
id
(like [galaxy-dist]/database/files/jsmith,
[galaxy-dist]/database/files/sparker,
etc.). If they could be segregated by user it would be much easier to keep
track of what datasets belong to whom on the file-system. Then I could
possibly set up a read-only FTP share to the files/ directory on the
cluster, from which the users could directly copy the files in their
personal subdirectory to other systems, and perhaps batch download them,
rather than having to rely solely on the Web interface.

I understand that the way Galaxy is currently designed is that the files
are just generically named (the "behind-the-scenes" handling of data is a
black box) and it is the database that keeps track of which files belong to
whom, and which has the metadata for more meaningful dataset/job names,
etc. But a file-system hierarchy alternative would also be welcome in a
heavily command-line oriented computational environment too.

Would setting up a more user-representative output directory hierarchy on
the file-system like that be possible?

Best Regards,
Josh Nielsen

Josh Nielsen

David Hoover

Josh Nielsen

David Hoover

Josh Nielsen

Jeremy Goecks

tags

participants (3)