Database cleanup and user management
Dear Galaxy Developers, we are experimenting with cleanup strategies for our production galaxy server that is used by the public. How are you treating the user-uploaded data on galaxy-main? We would like to see how much space is used by an individual user and eventually be able to remove them after some time of inactivity (>180 days?), ideally preceded by 1-2 warning emails. Can the current cleanup scripts handle that? Thanks for your help! -- Sebastian
Hello Sebastian, On Mar 14, 2011, at 9:28 AM, Sebastian J. Schultheiss wrote:
Dear Galaxy Developers,
we are experimenting with cleanup strategies for our production galaxy server that is used by the public. How are you treating the user-uploaded data on galaxy-main?
We currently do not limit disk space per user. We run the cleanup scripts once per day using the default settings in cleanup_datasets.py.
We would like to see how much space is used by an individual user and eventually be able to remove them after some time of inactivity (>180 days?), ideally preceded by 1-2 warning emails. Can the current cleanup scripts handle that?
No, the cleanup scripts do not determine space used per user, and they do not include the ability to send email notification. It would probably be best to implement these features in a script separate from the cleanup_datasets.py script. We have no current plans to implement this.
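As a starting point for such a separate script, a query along the following lines could identify accounts with no recent activity. This is only a sketch against a PostgreSQL backend: it assumes the galaxy_user and history tables carry the usual update_time timestamps (verify against your schema), and the 180-day threshold is just the example value from your question.

====
-- Sketch: list users whose most recent history update is older than 180 days.
-- Assumes galaxy_user.email, history.user_id and history.update_time exist;
-- adjust the interval to match your expiration policy.
select galaxy_user.email,
       max(history.update_time) as last_activity
from galaxy_user
join history on history.user_id = galaxy_user.id
group by galaxy_user.email
having max(history.update_time) < now() - interval '180 days'
order by last_activity;
====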
Thanks for your help!
-- Sebastian
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
Dear Greg and Gordon, thanks for the fast replies. Your query ideas are very helpful, Gordon, thanks a lot.
We currently do not limit disk space per user. We run the cleanup scripts once per day using the default settings in cleanup_datasets.py.
We will start to do the same. We hadn't run the cleanup scripts at all for several months, and when we finally did, they used about 4.6 GB of RAM while deleting and took about 12 days to complete.
No, the cleanup scripts do not determine space used per user, and they do not include the ability to send email notification. It would probably be best to implement these features in a script separate from the cleanup_datasets.py script. We have no current plans to implement this.
Is there a way to completely delete a user and their data at the same time? We would like to keep published pages and shared data, though. I know it sort of counters the dogma of having everything reproducible, but our users upload really big datasets and we cannot keep adding new disk space every month. We would therefore like an expiration date on anything that is not published, shared, or otherwise in use. I guess we can modify the SQL queries to include some of the dates stored. Thank you for your time. -- Sebastian
Hi Sebastian,

If you uncomment the following lines (lines 92-94 in the latest Galaxy rev) in ~/lib/galaxy/web/controllers/admin.py, your Galaxy administrators will be able to delete / undelete / purge specific users from the Admin -> Manage users view. You'll need to "Delete" selected users before you "Purge" them. See the comments in the purge_user() method at about line 1925 in ~/lib/galaxy/web/base/controller.py.

#operations.append( grids.GridOperation( "Delete", condition=( lambda item: not item.deleted ), allow_multiple=True ) )
#operations.append( grids.GridOperation( "Undelete", condition=( lambda item: item.deleted and not item.purged ), allow_multiple=True ) )
#operations.append( grids.GridOperation( "Purge", condition=( lambda item: item.deleted and not item.purged ), allow_multiple=True ) )

On Mar 14, 2011, at 10:31 AM, Sebastian J. Schultheiss wrote:
Is there a way to completely delete a user and their data at the same time? We would like to keep published pages and shared data though.
Thank you for your time.
-- Sebastian
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
Sebastian J. Schultheiss wrote, On 03/14/2011 10:31 AM:
Is there a way to completely delete a user and their data at the same time? We would like to keep published pages and shared data though.
Do you really want to delete all the user's records from the database? I think the database size is tiny compared to the actual files on disk, so keeping all database records forever shouldn't be such a problem.

Regarding files, it's my understanding (galaxy people, correct me if I'm wrong) that once a dataset is marked as "deleted" in the "history_dataset_association" table, it's as if the user deleted the dataset themselves. So if you run a query that sets "deleted=true", the galaxy clean-up scripts will take it from there and will eventually delete the dataset ("eventually", because there are a couple of clean-up steps and scripts). Something like:

update history_dataset_association set deleted = true where id = NNNNNN ;

(after finding the ID with the previous queries). Same caveat as before: this is very messy, don't do it without testing and verification.
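To make this kind of manual update a little safer, one option is to wrap it in an explicit transaction and inspect the affected row before committing. The sketch below only illustrates that idea in PostgreSQL; NNNNNN is the same placeholder dataset id as above, and nothing here is part of the official cleanup scripts.

====
-- Sketch: mark one HDA as deleted inside a transaction so the change
-- can be inspected, and rolled back if it touched the wrong row.
begin;

update history_dataset_association
set deleted = true
where id = NNNNNN;

-- Verify that exactly the intended row changed before committing.
select id, name, deleted, purged
from history_dataset_association
where id = NNNNNN;

-- If the result looks wrong, run ROLLBACK; instead of COMMIT;.
commit;
====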
I know it sort of counters the dogma of having everything reproducible, but our users upload really big datasets and we cannot keep adding new disk space every month, therefore we would like an expiration date on anything that is not published or shared and used. I guess we can modify the SQL queries to include some of the dates stored.
I agree, same problem for us. This is actually something I would like to request as a feature: "reproducibility" doesn't require all the files, all the time - only the first file (let's say a FASTQ file) and the meta-data for downstream files (jobs, tools, parameters) are needed. It would be great if there was a way for users to see the datasets (and the jobs, parameters, etc.) of all their datasets (ever), even if I deleted the underlying physical file. Unfortunately, the current behavior is that once a dataset is marked as "purged" (meaning the file was deleted), it will never appear again inside galaxy. -gordon
Sebastian and Assaf, On Mar 14, 2011, at 10:52 AM, Assaf Gordon wrote:
Sebastian J. Schultheiss wrote, On 03/14/2011 10:31 AM:
Is there a way to completely delete a user and their data at the same time? We would like to keep published pages and shared data though.
Do you really want to delete all the user's records from the database? I think the database size is tiny compared to the actual files on disk, so keeping all database records forever shouldn't be such a problem.
Regarding files, it's my understanding (galaxy people, correct me if I'm wrong) that once a dataset is marked as "deleted" in the "history_dataset_association" table, it's as if the user deleted the dataset themselves.
So if you run a query that sets "deleted=true", the galaxy clean-up scripts will take it from there and will eventually delete the dataset ("eventually", because there are a couple of clean-up steps and scripts).
Something like: update history_dataset_association set deleted = true where id = NNNNNN ;
(after finding the ID with the previous queries). Same caveat as before: this is very messy, don't do it without testing and verification.
For the above, I recommend the approach discussed in my previous email (uncomment the Delete / Undelete / Purge operation buttons in the admin controller). This will do what you want, and you won't need to execute any SQL commands manually, which can be very dangerous.
I know it sort of counters the dogma of having everything reproducible, but our users upload really big datasets and we cannot keep adding new disk space every month, therefore we would like an expiration date on anything that is not published or shared and used. I guess we can modify the SQL queries to include some of the dates stored.
I agree, same problem for us.
This is actually something I would like to request as a feature: "reproducibility" doesn't require all the files, all the time - only the first file (let's say a FASTQ file) and the meta-data for downstream files (jobs, tools, parameters) are needed.
For the above, determining whether a dataset is shared is currently available, but what is your definition of a "published dataset"?
It would be great if there was a way for users to see the datasets (and the jobs, parameters, etc.) of all their datasets (ever), even if I deleted the underlying physical file.
Where are you talking about "seeing the datasets, jobs, parameters, etc" whose underlying file has been removed from disk? Would this be in the history, where you can currently see "deleted" datasets, but not "purged" datasets?
Unfortunately, the current method is that once a dataset is marked as "purged" (meaning the file was deleted), it will never appear again inside galaxy.
-gordon
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
On Mar 14, 2011, at 16:09 , Greg Von Kuster wrote:
On Mar 14, 2011, at 10:52 AM, Assaf Gordon wrote:
Do you really want to delete all the user's records from the database ? I think the database size is tiny compared to the actual files on disk, keeping all database records forever shouldn't be such a problem.
You're right, of course. There would be no problem with keeping the user's credentials and other records.
Regarding files, It's my understanding (galaxy people, correct me if I'm wrong), that once a dataset is marked as "deleted" in "history_dataset_association" table, it's as if the user deleted the dataset by himself.
So if you run a query that sets "deleted=true", the galaxy clean-up scripts will take it from there and will eventually delete the dataset ("eventually", because there are a couple of clean-up steps and scripts).
For the above, I recommend the approach discussed in my previous email (uncomment the Delete / Undelete / Purge operation buttons in the admin controller). This will do what you want, and you won't need to execute any SQL commands manually, which can be very dangerous.
Yes, I believe that's how we can do it. But there should be some kind of feedback for a user when this is about to happen... and a way to prevent it.
This is actually something I would like to request as a feature: "reproducibility" doesn't require all the files, all the time - only the first file (let's say a FASTQ file) and the meta-data for downstream files (jobs, tools, parameters) are needed.
For the above, determining whether a dataset is shared is currently available, but what is your definition of a "published dataset"?
A history item that is used on a Galaxy page, as supplementary material to a publication, for example, or a history that is listed among Galaxy's published histories.
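For reference, a check along the following lines could flag histories that fall under that definition and should therefore be excluded from any expiration policy. This is only a sketch: it assumes the history table has a boolean published column and that per-user sharing is recorded in history_user_share_association, which is how the Galaxy schema models publishing and sharing, but both names should be verified against your own database before relying on them.

====
-- Sketch: histories that are published or explicitly shared with other users,
-- i.e. candidates to exclude from automatic expiration.
-- Assumes history.published (boolean) and the history_user_share_association table.
select history.id, history.name, galaxy_user.email as owner,
       history.published,
       count(husa.id) as shared_with_users
from history
join galaxy_user on galaxy_user.id = history.user_id
left join history_user_share_association husa on husa.history_id = history.id
group by history.id, history.name, galaxy_user.email, history.published
having history.published = true or count(husa.id) > 0;
====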
It would be great if there was a way for users to see the datasets (and the jobs, parameters, etc.) of all their datasets (ever), even if I deleted the underlying physical file.
Where are you talking about "seeing the datasets, jobs, parameters, etc" whose underlying file has been removed from disk? Would this be in the history, where you can currently see "deleted" datasets, but not "purged" datasets?
In an ideal world, we would be able to eventually remove all the intermediates and keep only the initial dataset from which all the results were generated. For that to work 100%, you would need a snapshot of the galaxy installation with all the tools and their versions at the time of execution... it sounds very messy to implement and test, though. The dataset without the physical file could get its own color and all analysis steps could be re-run transparently if someone really wanted to access a specific intermediate file. Thanks for discussing this with us! -- Sebastian
Hello all, Greg Von Kuster wrote, On 03/14/2011 11:09 AM:
Sebastian and Assaf,
For the above, determining whether a dataset is shared is currently available, but what is your definition of a "published dataset"?
We use the following query to see if a dataset is shared between more than one history:

==
select count(*)
from history_dataset_association hda
where hda.dataset_id = NNNNN
==

And the following to see if it's shared between more than one user:

==
select count( distinct history.user_id )
from history, dataset, history_dataset_association hda
where dataset.id = hda.dataset_id
  AND history.id = hda.history_id
  AND dataset.id = NNNN
==

Are there other types of sharing (between users)?
It would be great if there was a way for users to see the datasets (and the jobs, parameters, etc.) of all their datasets (ever), even if I deleted the underlying physical file.
Where are you talking about "seeing the datasets, jobs, parameters, etc" whose underlying file has been removed from disk? Would this be in the history, where you can currently see "deleted" datasets, but not "purged" datasets?
I'm not sure about the best UI way of displaying those purged datasets, but something like showing the datasets just like the current green rectangles (or the way deleted datasets are displayed, with a warning), showing the user what the analysis, the tools, and the parameters were; and if/when the user clicks on the "eye" icon or the "download" icon, he will get a message saying "this dataset has been purged". But he will still be able to "re-run" or view the tool's parameters.

The rationale behind this: I'm in the same situation as Sebastian - I have users running big jobs, on many FASTQ files, on crazy datasets (example: Genomic Intervals' "Join" on two 300MB BED files, producing a 70GB file, then filtering the results with GREP, or 8 x PE100 FASTQ files that go through the same workflow). Many times they don't need the intermediate files, but never bother to delete them (this was before the workflows had an option to delete intermediate files, and even now not many people are using this feature - and I can't force them).

I want to delete those intermediate files, but once I delete+purge them - all records of the jobs/datasets are gone. Users (looking at their histories) can't tell/remember how they got from the first FASTQ file to the final file. They don't always use workflows, so "just take a look at the workflow" is not a good solution.

Having a way to display the meta-data of a purged dataset (especially since each dataset has a "peek" and "info" data) would be very helpful (helpful - but not top priority - I don't want to create the wrong impression).

-gordon
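As one way to locate those intermediate files, a query like the one below lists large datasets that were produced by one job and then consumed as input by a later job. This is only a sketch: it relies on the job_to_output_dataset and job_to_input_dataset tables that Galaxy uses to link jobs to history datasets, and the size threshold is an arbitrary example value.

====
-- Sketch: large datasets that are both the output of one job and the input
-- of another, i.e. intermediates that may be deletable once the final
-- results exist.
select distinct dataset.id, hda.name, dataset.file_size, galaxy_user.email
from dataset
join history_dataset_association hda on hda.dataset_id = dataset.id
join history on history.id = hda.history_id
join galaxy_user on galaxy_user.id = history.user_id
join job_to_output_dataset jtod on jtod.dataset_id = hda.id
join job_to_input_dataset jtid on jtid.dataset_id = hda.id
where dataset.purged = false
  and dataset.file_size > 1000000000   -- example threshold: roughly 1 GB
order by dataset.file_size desc;
====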
Dear all,
We use the following query to see if a dataset is shared between more than one history:

==
select count(*)
from history_dataset_association hda
where hda.dataset_id = NNNNN
==

And the following to see if it's shared between more than one user:

==
select count( distinct history.user_id )
from history, dataset, history_dataset_association hda
where dataset.id = hda.dataset_id
  AND history.id = hda.history_id
  AND dataset.id = NNNN
==
Thanks a lot for these helpful scripts. I've just tried some of them out and they work wonderfully.
Are there other types of sharing (between users) ?
You can generally publish a history, so anyone can access it.
I'm not sure about the best UI way of displaying those purged datasets, but something like showing the datasets just like the current green rectangles (or the way deleted datasets are displayed, with a warning), showing the user what was the analysis, the tools, the parameters, and if/when the user clicks on the "eye" icon or the "download" icon, he will get a message saying "this dataset has been purged".
--- or an option to rerun all the tools to get this dataset again.
But he will still be able to "re-run" or view the tool's parameters.
The rationale behind this: I'm in the same situation as Sebastian - I have users running big jobs, on many FASTQ files, on crazy datasets (example: Genomic Intervals' "Join" on two 300MB BED files, producing a 70GB file, then filtering the results with GREP, or 8 x PE100 FASTQ files that go through the same workflow). Many times they don't need the intermediate files, but never bother to delete them (this was before the workflows had an option to delete intermediate files, and even now not many people are using this feature - and I can't force them).
This could be the default... especially if the GUI provides a way to "restore" or re-generate the intermediates.
Having a way to display the meta-data of a purged dataset (especially since each dataset has a "peek" and "info" data) would be very helpful (helpful - but not top priority - I don't want to create the wrong impression).
True, I'm sure there are more pressing things. However, you could put it on the nice-to-have list or do a minimal implementation where purged datasets show up the way deleted datasets do. An expiration date may be very hard to implement. Thanks again! -- Sebastian
I've created the following ticket for this, which you can follow if you want: https://bitbucket.org/galaxy/galaxy-central/issue/492/add-the-ability-to-dis... Thanks for all of your input, Greg Von Kuster
Greg Von Kuster Galaxy Development Team greg@bx.psu.edu
Hello Sebastian, We are doing clean-ups similar to what you've requested (except the automated email notifications, which will be implemented soon).
We would like to see how much space is used by an individual user and eventually be able to remove them after some time of inactivity (>180 days?), ideally preceded by 1-2 warning emails. Can the current cleanup scripts handle that?
If your galaxy is using PostgreSQL, you can use the following queries:

1. Estimated size per user, including datasets marked as deleted.

In my limited experience, users rarely delete their files, and with "remote_user" on, there are no orphaned histories to clean up. This query shows how much space each user uses, and how much space he/she has already marked as deleted (usually, it's zero or just a tiny amount for me).

====
select galaxy_user.email,
       sum(dataset.file_size) as total_size,
       sum(CASE WHEN hda.deleted THEN dataset.file_size ELSE 0 END) as deletable_size
from dataset, galaxy_user, history, history_dataset_association hda
where history.user_id = galaxy_user.id
  AND hda.history_id = history.id
  AND hda.dataset_id = dataset.id
  AND dataset.purged = false
group by galaxy_user.email
order by total_size desc
====
2. Find the biggest files, their owner, and the tool which created them. Useful for spotting offending users :) Change the "dataset.file_size > 100000000" condition to set the minimum size of files to find.

====
select dataset.id,
       timezone('UTC', dataset.create_time)::date,
       hda.id,
       CASE WHEN hda.deleted THEN 'Y' ELSE '' END as hda_deleted,
       CASE WHEN dataset.deleted THEN 'Y' ELSE '' END as dataset_deleted,
       galaxy_user.email as "user",
       history.name,
       history.id,
       hda.name,
       job.tool_id as "created_by_tool",
       dataset.file_size
from dataset, history_dataset_association hda, history, galaxy_user, job_to_output_dataset, job
where hda.dataset_id = dataset.id
  AND hda.history_id = history.id
  AND history.user_id = galaxy_user.id
  AND job_to_output_dataset.dataset_id = hda.id
  AND job.id = job_to_output_dataset.job_id
  AND dataset.file_size > 100000000
  AND dataset.purged = false
order by dataset.file_size desc ;
====

3. Find all FASTQ files

A variation on the above query; replace:

dataset.file_size > 100000000

with:

hda.extension like 'fast%'

to list all the FASTQ files (which are a big storage waste for us).

There are a couple of caveats:

1. You should never meddle directly with the DB unless you know what you're doing (and have a good backup).
2. The total size is only an approximation, because:
   a. It ignores the size of metadata files and data libraries.
   b. For some files (mostly failed jobs) the size is NULL or zero but the file on disk is not necessarily empty.

Still, the approximation is quite reliable.

-gordon
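Building on the queries above, one could also approximate the "expiration date" idea from earlier in the thread by restricting the per-user totals to datasets that have not been touched for a given period. The following is only a sketch: it reuses the tables from query 1, assumes hda.update_time records the last modification of a history dataset, and uses 180 days purely as an example threshold; sharing and publication would still have to be checked separately.

====
-- Sketch: per-user total size of non-purged datasets whose history entry
-- has not been updated in the last 180 days (candidate data for expiration).
select galaxy_user.email,
       sum(dataset.file_size) as stale_size
from dataset, galaxy_user, history, history_dataset_association hda
where history.user_id = galaxy_user.id
  AND hda.history_id = history.id
  AND hda.dataset_id = dataset.id
  AND dataset.purged = false
  AND hda.update_time < now() - interval '180 days'
group by galaxy_user.email
order by stale_size desc;
====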
participants (3)
- Assaf Gordon
- Greg Von Kuster
- Sebastian J. Schultheiss