user data upload directory structure


Jean-Christophe Ducom

14 May 2012

9:22 p.m.

All-

Is there a way to change the upload default directory structure (/database/files) to organize files per user_id instead? Something along the following lines:

~galaxy-dist/database/files/postgresql_user_id0
~galaxy-dist/database/files/postgresql_user_id1

Thank you
JC


Peter Cock

15 May 2012

8:38 a.m.


I can see that being useful for a quick way to look at per user disk usage - although you'd have problems counting with shared data. Is that your motivation?

Another concern would be overly long filenames, which has a direct impact on the command line lengths used to call the tools - there are OS limits on this.

Peter
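The command line concern can be estimated with a minimal Python sketch. The helper names `command_line_bytes` and `fits_in_arg_max` and the 4096-byte `reserve` for the environment are assumptions made for illustration, not part of Galaxy; `os.sysconf("SC_ARG_MAX")` is the POSIX limit on the combined size of arguments and environment.

```python
import os


def command_line_bytes(argv):
    """Approximate the bytes an argv list occupies: each argument plus a NUL terminator."""
    return sum(len(arg.encode("utf-8")) + 1 for arg in argv)


def fits_in_arg_max(argv, reserve=4096):
    """Return True if argv fits under the OS limit (POSIX only).

    `reserve` is a crude, assumed allowance for the environment variables,
    which share the same limit.
    """
    arg_max = os.sysconf("SC_ARG_MAX")
    return command_line_bytes(argv) + reserve <= arg_max


# Compare a flat layout with one that adds a three-digit directory per path.
flat = ["tool"] + ["database/files/dataset_%04d.dat" % i for i in range(1, 101)]
per_user = ["tool"] + [
    "database/files/%03d/dataset_%04d.dat" % (i % 3, i) for i in range(1, 101)
]
extra = command_line_bytes(per_user) - command_line_bytes(flat)
```

Under these assumptions each path grows by only the length of the added directory component, so the overhead scales with the number of datasets passed to a tool rather than with the path depth alone.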

Jean-Christophe Ducom

2:15 p.m.

Thank you for your email, Peter.

We have implemented Galaxy to interface with our HPC cluster via PBS/Torque. Thanks to DRMAA (not PBS python), all user CPU usage can be accounted for. The motivation is indeed what you describe, in addition to managing cost and disk performance on a per-user/per-project basis, as we have tiered storage. Our filesystem is GPFS which, as you might know, has (among many others) a nice feature called a fileset: it is basically a data bucket that reports usage regardless of Unix ownership. It works great for project-type directories.

The file name length concern is indeed a legitimate one given command line limits (GPFS has the same name length limit as ext3/4). The current filename can remain unmodified: the requested schema would only introduce the user database ID (usually 3-4 digits) in the path, e.g.

~/galaxy-dist/database/files/000/dataset_0001.dat
~/galaxy-dist/database/files/001/dataset_0002.dat
~/galaxy-dist/database/files/002/dataset_0003.dat
~/galaxy-dist/database/files/000/dataset_0004.dat
~/galaxy-dist/database/files/000/dataset_0005.dat
~/galaxy-dist/database/files/000/dataset_0006.dat
~/galaxy-dist/database/files/002/dataset_0007.dat
[...]

Any thoughts?

Thanks again
JC
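The requested layout could be sketched in Python roughly as follows. The function name `per_user_dataset_path` is hypothetical and is not Galaxy's object store API; the zero padding simply follows the example paths in the message, and `user_id` is assumed to be the database ID of the user who created the dataset.

```python
import os


def per_user_dataset_path(files_root, user_id, dataset_id):
    """Build <files_root>/<user_id>/dataset_<dataset_id>.dat.

    Hypothetical helper for illustration only. The padding widths
    (3 digits for the user, 4 for the dataset) mirror the example paths
    in the message above and are not a Galaxy convention.
    """
    user_dir = "%03d" % user_id
    filename = "dataset_%04d.dat" % dataset_id
    return os.path.join(files_root, user_dir, filename)


# Dataset 7 created by the user with database ID 2.
example = per_user_dataset_path("galaxy-dist/database/files", 2, 7)
```

The dataset filename itself is unchanged; only one short directory component is added, which keeps the increase in path length small.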


Nate Coraor

2:26 p.m.


Hi JC,

As Peter mentions, there's no clear way to determine ownership when data is shared. The best you could do is identify the user that originally created a dataset. If you wanted to go this route, the best place to start would be an enhancement of the Object Store framework, at galaxy-dist/lib/galaxy/objectstore/__init__.py

If you're not aware, Galaxy does have internal disk accounting and quota features:

http://wiki.g2.bx.psu.edu/Admin/Disk%20Quotas

--nate

___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/

Jean-Christophe Ducom

3:45 p.m.

Nate-

I do know about the disk accounting/quota features of Galaxy. As I alluded to in my previous email, it actually goes beyond accounting. I wanted to be able to implement something like:

~/galaxy-dist/database/files/user_id_000 -> /one_data_pool_set/id_000
~/galaxy-dist/database/files/user_id_001 -> /another_data_pool_set/id_001

which would match the usual data placement from a scheduler perspective too. I'll look at galaxy-dist/lib/galaxy/objectstore/__init__.py

Thanks a lot
JC
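One way to picture the mapping described above is a small Python sketch that creates one symlink per user pointing into a chosen storage pool. The function `link_user_dirs` and the `pool_for_user` dictionary are assumptions made for illustration; this is a filesystem-level sketch, not a change to Galaxy's object store.

```python
import os


def link_user_dirs(files_root, pool_for_user):
    """Create <files_root>/user_id_NNN symlinks pointing at <pool>/id_NNN.

    `pool_for_user` maps a user database ID to the root directory of the
    storage pool that should hold that user's data (hypothetical input).
    Returns the list of (link, target) pairs handled.
    """
    os.makedirs(files_root, exist_ok=True)
    handled = []
    for user_id, pool_root in sorted(pool_for_user.items()):
        target = os.path.join(pool_root, "id_%03d" % user_id)
        link = os.path.join(files_root, "user_id_%03d" % user_id)
        os.makedirs(target, exist_ok=True)
        # Skip links that already exist; a regular file or directory at
        # `link` would make os.symlink raise FileExistsError.
        if not os.path.islink(link):
            os.symlink(target, link)
        handled.append((link, target))
    return handled
```

A call such as `link_user_dirs("database/files", {0: "/one_data_pool_set", 1: "/another_data_pool_set"})` would produce the two links shown in the message, assuming the process may write to both pool directories.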

