user data upload directory structure
All-
Is there a way to change the upload default directory structure (/database/files) to organize files per user_id instead? Something along the following lines:
~galaxy-dist/database/files/postgresql_user_id0
~galaxy-dist/database/files/postgresql_user_id1
Thank you
JC
On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom <jcducom@scripps.edu> wrote:
All-
Is there a way to change the upload default directory structure (/database/files) to organize files per user_id instead? Something along the following lines:
~galaxy-dist/database/files/postgresql_user_id0
~galaxy-dist/database/files/postgresql_user_id1
Thank you JC
I can see that being useful for a quick way to look at per user disk usage - although you'd have problems counting with shared data. Is that your motivation?
Another concern would be overly long filenames, which has a direct impact on the command line lengths used to call the tools - there are OS limits on this.
Peter
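Peter's point about command line length can be checked roughly as follows. This is only a sketch: the tool name `tool_wrapper.py`, the helper `command_line_budget`, and the example paths (including the extra three-digit directory level) are assumptions made for illustration, not anything Galaxy provides. `os.sysconf("SC_ARG_MAX")` is the POSIX limit on the combined size of arguments and environment, so the comparison is approximate.

```python
import os


def command_line_budget(paths, tool="tool_wrapper.py"):
    """Return (bytes_needed, limit) for a command made of `tool` plus `paths`.

    `tool` is a hypothetical wrapper name used only for illustration.
    """
    # Each argument costs its encoded length plus a terminating NUL byte.
    needed = sum(len(os.fsencode(arg)) + 1 for arg in [tool, *paths])
    # POSIX limit on the combined size of arguments and environment.
    limit = os.sysconf("SC_ARG_MAX")
    return needed, limit


# Hypothetical dataset paths: a flat layout versus one extra directory level.
flat = [f"/galaxy-dist/database/files/dataset_{i:04d}.dat" for i in range(1, 8)]
per_user = [
    f"/galaxy-dist/database/files/{i % 3:03d}/dataset_{i:04d}.dat"
    for i in range(1, 8)
]

needed_flat, limit = command_line_budget(flat)
needed_per_user, _ = command_line_budget(per_user)
print(f"flat: {needed_flat} bytes, per-user: {needed_per_user} bytes, limit: {limit}")
```

An extra directory level of a few characters adds only a few bytes per file argument, so it matters mainly for tools that receive very large numbers of input files.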
Thank you for your email, Peter.
We have implemented Galaxy to interface with our HPC cluster via PBS/Torque. Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The motivation is indeed what you describe, in addition to managing cost and disk performance on a per-user/per-project basis, as we have tiered storage.
Our filesystem is GPFS which, as you might know, has one (among many) nice feature called a fileset: it's basically a data bucket that reports usage regardless of Unix ownership. It works great for project-type directories.
The file name length concern is indeed a legitimate one given the command line limitation (GPFS has the same name length limit as ext3/4). The current filename can remain unmodified: the requested schema would only introduce the user database ID (usually 3-4 digits) in the path, e.g.
~/galaxy-dist/database/files/000/dataset_0001.dat
~/galaxy-dist/database/files/001/dataset_0002.dat
~/galaxy-dist/database/files/002/dataset_0003.dat
~/galaxy-dist/database/files/000/dataset_0004.dat
~/galaxy-dist/database/files/000/dataset_0005.dat
~/galaxy-dist/database/files/000/dataset_0006.dat
~/galaxy-dist/database/files/002/dataset_0007.dat
[...]
Any thoughts?
Thanks again
JC
________________________________________
From: Peter Cock [p.j.a.cock@googlemail.com]
Sent: Tuesday, May 15, 2012 1:38 AM
To: Jean-Christophe Ducom
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] user data upload directory structure
On Mon, May 14, 2012 at 10:22 PM, Jean-Christophe Ducom <jcducom@scripps.edu> wrote:
All-
Is there a way to change the upload default directory structure (/database/files) to organize files per user_id instead? Something along the following lines:
~galaxy-dist/database/files/postgresql_user_id0
~galaxy-dist/database/files/postgresql_user_id1
Thank you JC
I can see that being useful for a quick way to look at per user disk usage - although you'd have problems counting with shared data. Is that your motivation?
Another concern would be overly long filenames, which has a direct impact on the command line lengths used to call the tools - there are OS limits on this.
Peter
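The layout JC proposes in this message (user database ID as a directory, filename unchanged) can be sketched as below. The helper `per_user_dataset_path` is hypothetical and is not part of Galaxy; the zero padding simply mirrors the example paths in the email.

```python
from pathlib import PurePosixPath


def per_user_dataset_path(files_root, user_id, dataset_id):
    """Build a dataset path with the user's database ID as a directory level.

    Hypothetical helper: the padding widths follow the email's example paths.
    """
    user_dir = f"{user_id:03d}"                # e.g. 2 -> "002"
    filename = f"dataset_{dataset_id:04d}.dat"  # filename itself is unchanged
    return PurePosixPath(files_root) / user_dir / filename


example = per_user_dataset_path("~/galaxy-dist/database/files", 2, 7)
print(example)
```

With this scheme the user ID has to be known at the time the path is built, which is why shared datasets are awkward: a dataset has one path but possibly several users.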
On May 15, 2012, at 10:15 AM, Jean-Christophe Ducom wrote:
Thank you for your email, Peter.
We have implemented Galaxy to interface with our HPC cluster via PBS/Torque. Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The motivation is indeed what you describe, in addition to managing cost and disk performance on a per-user/per-project basis, as we have tiered storage.
Our filesystem is GPFS which, as you might know, has one (among many) nice feature called a fileset: it's basically a data bucket that reports usage regardless of Unix ownership. It works great for project-type directories.
The file name length concern is indeed a legitimate one given the command line limitation (GPFS has the same name length limit as ext3/4). The current filename can remain unmodified: the requested schema would only introduce the user database ID (usually 3-4 digits) in the path, e.g.
~/galaxy-dist/database/files/000/dataset_0001.dat
~/galaxy-dist/database/files/001/dataset_0002.dat
~/galaxy-dist/database/files/002/dataset_0003.dat
~/galaxy-dist/database/files/000/dataset_0004.dat
~/galaxy-dist/database/files/000/dataset_0005.dat
~/galaxy-dist/database/files/000/dataset_0006.dat
~/galaxy-dist/database/files/002/dataset_0007.dat
[...]
Any thoughts?
Thanks again
JC
Hi JC,
As Peter mentions, there's no clear way to determine ownership when data is shared. The best you could do is identify the user that originally created a dataset. If you wanted to go this route, the best place to start would be an enhancement of the Object Store framework, at galaxy-dist/lib/galaxy/objectstore/__init__.py
If you're not aware, Galaxy does have internal disk accounting and quota features:
http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas
--nate
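Nate's caveat about shared data can be illustrated with a small sketch that attributes each dataset's size to the user who originally created it. The record layout (`creator_id`, `size`, `shared_with`) and the function `usage_by_creator` are assumptions for illustration only and do not reflect Galaxy's data model.

```python
from collections import defaultdict


def usage_by_creator(datasets):
    """Sum dataset sizes per creating user; sharing is ignored on purpose."""
    totals = defaultdict(int)
    for ds in datasets:
        totals[ds["creator_id"]] += ds["size"]
    return dict(totals)


# Hypothetical records: user 2 only uses data shared by user 0.
datasets = [
    {"id": 1, "creator_id": 0, "size": 100, "shared_with": [1, 2]},
    {"id": 2, "creator_id": 1, "size": 50, "shared_with": []},
    {"id": 3, "creator_id": 0, "size": 25, "shared_with": [2]},
]

usage = usage_by_creator(datasets)
print(usage)
```

In this example user 2 shows no usage at all despite working with two datasets, which is the counting problem Peter and Nate describe.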
Nate-
I do know about the disk accounting/quota features of Galaxy. As I alluded to in my previous email, it actually goes beyond accounting. I wanted to be able to implement something like:
~/galaxy-dist/database/files/user_id_000 -> /one_data_pool_set/id_000
~/galaxy-dist/database/files/user_id_001 -> /another_data_pool_set/id_001
which would match the usual data placement from a scheduler perspective too.
I'll look at galaxy-dist/lib/galaxy/objectstore/__init__.py
Thanks a lot
JC
On 05/15/2012 07:26 AM, Nate Coraor wrote:
Hi JC,
As Peter mentions, there's no clear way to determine ownership when data is shared. The best you could do is identify the user that originally created a dataset. If you wanted to go this route, the best place to start would be an enhancement of the Object Store framework, at galaxy-dist/lib/galaxy/objectstore/__init__.py
If you're not aware, Galaxy does have internal disk accounting and quota features:
http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas
--nate
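The mapping JC describes in his last message (one directory per user under database/files, each pointing at a different storage pool) could be set up outside Galaxy with symbolic links. The sketch below runs in a temporary directory; the function `link_user_dirs` and the user-to-pool assignment are hypothetical, and Galaxy itself would still need an Object Store change to write datasets into these directories.

```python
import os
import tempfile


def link_user_dirs(files_root, pool_for_user):
    """Create files_root/user_id_NNN symlinks pointing at <pool>/id_NNN."""
    created = {}
    for user_id, pool in sorted(pool_for_user.items()):
        target = os.path.join(pool, f"id_{user_id:03d}")
        os.makedirs(target, exist_ok=True)
        link = os.path.join(files_root, f"user_id_{user_id:03d}")
        if not os.path.islink(link):
            os.symlink(target, link)
        created[link] = target
    return created


with tempfile.TemporaryDirectory() as scratch:
    files_root = os.path.join(scratch, "database", "files")
    os.makedirs(files_root)
    # Hypothetical assignment of users to storage pools.
    pools = {
        0: os.path.join(scratch, "one_data_pool_set"),
        1: os.path.join(scratch, "another_data_pool_set"),
    }
    links = link_user_dirs(files_root, pools)
    # Resolve each link while the temporary tree still exists.
    resolved = {os.path.basename(l): os.path.realpath(l) for l in links}

for name, path in sorted(resolved.items()):
    print(name, "->", path)
```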
participants (3)
- Jean-Christophe Ducom
- Nate Coraor
- Peter Cock