Hello Sebastian,

We are doing clean-ups similar to the ones you've requested (except for the automated email notification, which will be implemented soon).
We would like to see how much space is used by an individual user and eventually be able to remove them after some period of inactivity (>180 days?), ideally preceded by 1-2 warning emails. Can the current cleanup scripts handle that?
If your Galaxy is using PostgreSQL, you can use the following queries:

1. Estimated size per user, including datasets marked as deleted.
In my limited experience, users rarely delete their files, and with "remote_user" turned on there are no orphaned histories to clean up. This query shows how much space each user uses, and how much of that space they have already marked as deleted (for me it is usually zero or just a tiny amount).
====
select galaxy_user.email,
       sum(dataset.file_size) as total_size,
       sum(CASE WHEN hda.deleted THEN dataset.file_size ELSE 0 END) as deletable_size
from dataset, galaxy_user, history, history_dataset_association hda
where history.user_id = galaxy_user.id
  AND hda.history_id = history.id
  AND hda.dataset_id = dataset.id
  AND dataset.purged = false
group by galaxy_user.email
order by total_size desc
====
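If you prefer human-readable sizes, the same query can be wrapped with PostgreSQL's pg_size_pretty(). The sketch below is just that idea applied to query 1; it assumes nothing beyond the tables already used above, and note that the ordering has to stay on the raw byte count rather than the formatted text.
====
-- query 1 with sizes formatted for humans (e.g. "12 GB" instead of 12884901888)
select galaxy_user.email,
       pg_size_pretty(sum(dataset.file_size)::bigint) as total_size,
       pg_size_pretty(sum(CASE WHEN hda.deleted THEN dataset.file_size ELSE 0 END)::bigint) as deletable_size
from dataset, galaxy_user, history, history_dataset_association hda
where history.user_id = galaxy_user.id
  AND hda.history_id = history.id
  AND hda.dataset_id = dataset.id
  AND dataset.purged = false
group by galaxy_user.email
-- order by the raw byte count, not the pretty-printed string
order by sum(dataset.file_size) desc
====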
2. Find the biggest files, their owners, and the tools that created them.
Useful for spotting offending users :) Change "dataset.file_size > 100000000" to set the minimum size of the files to find.
====
select dataset.id,
       timezone('UTC', dataset.create_time)::date,
       hda.id,
       CASE WHEN hda.deleted THEN 'Y' ELSE '' END as hda_deleted,
       CASE WHEN dataset.deleted THEN 'Y' ELSE '' END as dataset_deleted,
       galaxy_user.email as "user",
       history.name,
       history.id,
       hda.name,
       job.tool_id as "created_by_tool",
       dataset.file_size
from dataset, history_dataset_association hda, history, galaxy_user, job_to_output_dataset, job
where hda.dataset_id = dataset.id
  AND hda.history_id = history.id
  AND history.user_id = galaxy_user.id
  AND job_to_output_dataset.dataset_id = hda.id
  AND job.id = job_to_output_dataset.job_id
  AND dataset.file_size > 100000000
  AND dataset.purged = false
order by dataset.file_size desc;
====

3. Find all FASTQ files.
A variation on the above query. Replace:
   dataset.file_size > 100000000
with:
   hda.extension like 'fast%'
to list all the FASTQ files (which are a big storage waste for us); the full query is spelled out below.

There are a couple of caveats:
1. You should never meddle directly with the DB unless you know what you're doing (and have a good backup).
2. The total size is only an approximation, because:
   a. it ignores the size of metadata files and data libraries;
   b. for some files (mostly failed jobs) the size is NULL or zero, but the file on disk is not necessarily empty.
Still, the approximation is quite reliable.

-gordon
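For reference, here is the FASTQ variation from item 3 written out in full. It is only a sketch: query 2 with the size filter swapped for the extension filter, under the same schema assumptions, and nothing else changed.
====
-- query 2 with the size filter replaced by an extension filter
-- note: 'fast%' also matches fasta datasets; narrow the pattern (e.g. 'fastq%') if that is not wanted
select dataset.id,
       timezone('UTC', dataset.create_time)::date,
       hda.id,
       CASE WHEN hda.deleted THEN 'Y' ELSE '' END as hda_deleted,
       CASE WHEN dataset.deleted THEN 'Y' ELSE '' END as dataset_deleted,
       galaxy_user.email as "user",
       history.name,
       history.id,
       hda.name,
       job.tool_id as "created_by_tool",
       dataset.file_size
from dataset, history_dataset_association hda, history, galaxy_user, job_to_output_dataset, job
where hda.dataset_id = dataset.id
  AND hda.history_id = history.id
  AND history.user_id = galaxy_user.id
  AND job_to_output_dataset.dataset_id = hda.id
  AND job.id = job_to_output_dataset.job_id
  AND hda.extension like 'fast%'
  AND dataset.purged = false
order by dataset.file_size desc;
====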