Hello Sebastian,

We are doing clean-ups similar to the ones you've requested (except for the automated email notification, which will be implemented soon).
We would like to see how much space is used by an individual user and eventually be able to remove them after some period of inactivity (>180 days?), ideally preceded by 1-2 warning emails. Can the current cleanup scripts handle that?
If your Galaxy is using PostgreSQL, you can use the following queries:

1. Estimated size per user, including datasets marked as deleted.
In my limited experience, users rarely delete their files, and with "remote_user" turned on there are no orphaned histories to clean up. This query shows how much space each user uses, and how much of that space they have already marked as deleted (for me it is usually zero or just a tiny amount).
====
select galaxy_user.email,
       sum(dataset.file_size) as total_size,
       sum(CASE WHEN hda.deleted THEN dataset.file_size ELSE 0 END) as deletable_size
from dataset, galaxy_user, history, history_dataset_association hda
where history.user_id = galaxy_user.id
  AND hda.history_id = history.id
  AND hda.dataset_id = dataset.id
  AND dataset.purged = false
group by galaxy_user.email
order by total_size desc
====
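If you prefer human-readable sizes, the same query can be wrapped with PostgreSQL's pg_size_pretty(). The sketch below is just that idea applied to query 1; it assumes nothing beyond the tables already used above, and note that the ordering has to stay on the raw byte count rather than the formatted text.
====
-- query 1 with sizes formatted for humans (e.g. "12 GB" instead of 12884901888)
select galaxy_user.email,
       pg_size_pretty(sum(dataset.file_size)::bigint) as total_size,
       pg_size_pretty(sum(CASE WHEN hda.deleted THEN dataset.file_size ELSE 0 END)::bigint) as deletable_size
from dataset, galaxy_user, history, history_dataset_association hda
where history.user_id = galaxy_user.id
  AND hda.history_id = history.id
  AND hda.dataset_id = dataset.id
  AND dataset.purged = false
group by galaxy_user.email
-- order by the raw byte count, not the pretty-printed string
order by sum(dataset.file_size) desc
====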
2. Find the biggest files, their owners, and the tools that created them.
Useful for spotting offending users :) Change "dataset.file_size > 100000000" to set the minimum size of the files to find.
====
select dataset.id,
       timezone('UTC', dataset.create_time)::date,
       hda.id,
       CASE WHEN hda.deleted THEN 'Y' ELSE '' END as hda_deleted,
       CASE WHEN dataset.deleted THEN 'Y' ELSE '' END as dataset_deleted,
       galaxy_user.email as "user",
       history.name,
       history.id,
       hda.name,
       job.tool_id as "created_by_tool",
       dataset.file_size
from dataset, history_dataset_association hda, history, galaxy_user, job_to_output_dataset, job
where hda.dataset_id = dataset.id
  AND hda.history_id = history.id
  AND history.user_id = galaxy_user.id
  AND job_to_output_dataset.dataset_id = hda.id
  AND job.id = job_to_output_dataset.job_id
  AND dataset.file_size > 100000000
  AND dataset.purged = false
order by dataset.file_size desc;
====

3. Find all FASTQ files.
A variation on the above query. Replace:
   dataset.file_size > 100000000
with:
   hda.extension like 'fast%'
to list all the FASTQ files (which are a big storage waste for us); the full query is spelled out below.

There are a couple of caveats:
1. You should never meddle directly with the DB unless you know what you're doing (and have a good backup).
2. The total size is only an approximation, because:
   a. it ignores the size of metadata files and data libraries;
   b. for some files (mostly failed jobs) the size is NULL or zero, but the file on disk is not necessarily empty.
Still, the approximation is quite reliable.

-gordon
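For reference, here is the FASTQ variation from item 3 written out in full. It is only a sketch: query 2 with the size filter swapped for the extension filter, under the same schema assumptions, and nothing else changed.
====
-- query 2 with the size filter replaced by an extension filter
-- note: 'fast%' also matches fasta datasets; narrow the pattern (e.g. 'fastq%') if that is not wanted
select dataset.id,
       timezone('UTC', dataset.create_time)::date,
       hda.id,
       CASE WHEN hda.deleted THEN 'Y' ELSE '' END as hda_deleted,
       CASE WHEN dataset.deleted THEN 'Y' ELSE '' END as dataset_deleted,
       galaxy_user.email as "user",
       history.name,
       history.id,
       hda.name,
       job.tool_id as "created_by_tool",
       dataset.file_size
from dataset, history_dataset_association hda, history, galaxy_user, job_to_output_dataset, job
where hda.dataset_id = dataset.id
  AND hda.history_id = history.id
  AND history.user_id = galaxy_user.id
  AND job_to_output_dataset.dataset_id = hda.id
  AND job.id = job_to_output_dataset.job_id
  AND hda.extension like 'fast%'
  AND dataset.purged = false
order by dataset.file_size desc;
====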