I have been running a Galaxy server for our sequencing researchers for a while now, and it has become increasingly successful. The biggest resource challenge for us has been, and continues to be, disk space. As such, I'd like to implement some additional cleanup scripts, and I thought I'd run a few questions by this list before I get too far into things. In general, I'm wondering how to implement updates/additions to the cleanup system that will be in line with the direction the Galaxy project is headed.

The pgcleanup.py script is the newest piece of code in this area (it even adds cleanup of exported histories, which is absent from the older cleanup scripts). It also uses a "cleanup_event" table that I don't believe is used by the older cleanup_datasets.py script. However, pgcleanup.py only works with Postgres, and worse, only with version 9.1+. I run my system on RedHat (CentOS), so we are on Postgres 8.4. Are there plans to support other databases or older versions of Postgres?

I'd like to implement a script that deletes (i.e., sets the deleted flag on) certain datasets, e.g. raw data imported from our archive, or data belonging to old, inactive users. I'm wondering whether it would make sense to extend pgcleanup.py or cleanup_datasets.py, or whether it would be best to implement a separate script, though that way it seems I'd have to re-implement a lot of boilerplate for configuration reading, database connections, logging, etc. Any tips on generally accepted (supported) procedures for marking a dataset as deleted? (See the P.S. below for a rough sketch of what I have in mind.)

Of course, I'll make any enhancements available (and would be happy to submit pull requests if there is interest).

--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University
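
P.S. For concreteness, here is a rough sketch of the kind of flagging script I have in mind, assuming direct access to the Galaxy database via psycopg2. The connection parameters, the 365-day cutoff, and the use of galaxy_user.update_time as a proxy for user activity are all placeholders rather than settled choices:

#!/usr/bin/env python
# Sketch: set the deleted flag on datasets belonging to long-inactive
# users. This only flags rows; the existing purge steps in the cleanup
# scripts would reclaim the actual disk space later.
import psycopg2

# Placeholder connection parameters
conn = psycopg2.connect(host="localhost", dbname="galaxy", user="galaxy")
cur = conn.cursor()

# Flag history_dataset_association rows whose owning user has been
# inactive for more than a year (using galaxy_user.update_time as an
# activity proxy, which is an assumption on my part).
cur.execute("""
    UPDATE history_dataset_association
    SET deleted = true, update_time = now()
    WHERE deleted = false
      AND history_id IN (
          SELECT h.id
          FROM history h
          JOIN galaxy_user u ON h.user_id = u.id
          WHERE u.update_time < now() - interval '365 days')
""")
print("Flagged %d datasets as deleted" % cur.rowcount)

conn.commit()
cur.close()
conn.close()

The intent is to only set the deleted flag and let the normal purge machinery do the real space reclamation, but I'd welcome corrections if there is a more supported way to do this.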