Improving Administrative Data Clean Up (pgcleanup.py vs cleanup_datasets.py)
I have been running a Galaxy server for our sequencing researchers for a while now, and it's become increasingly successful. The biggest resource challenge for us has been, and continues to be, disk space. As such, I'd like to implement some additional cleanup scripts, and I thought I'd run a few questions by this list before I got too far into things.

In general, I'm wondering how to implement updates/additions to the cleanup system that will be in line with the direction the Galaxy project is headed. The pgcleanup.py script is the newest piece of code in this area (and even adds cleanup of exported histories, which is absent from the older cleanup scripts). Also, the pgcleanup.py script uses a "cleanup_event" table that I don't believe is used by the older cleanup_datasets.py script. However, the new pgcleanup.py script only works for Postgres, and worse, only for version 9.1+. I run my system on RedHat (CentOS), and thus we use version 8.4 of Postgres. Are there plans to support other databases or older versions of Postgres?

I'd like to implement a script to delete (set the deleted flag on) certain datasets (e.g. raw data imported from our archive, data belonging to old, inactive users, etc.). I'm wondering if it would make sense to try to extend pgcleanup.py or cleanup_datasets.py. Or perhaps it would be best to implement a separate script, though that way it seems I'd have to re-implement a lot of boilerplate code for configuration reading, connections, logging, etc. Any tips on generally acceptable (supported) procedures for marking a dataset as deleted?

Of course, I'll make any of these enhancements available (and would be happy to submit pull requests if there is interest).

--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University
On Mar 22, 2013, at 11:56 AM, Lance Parsons wrote:
> However, the new pgcleanup.py script only works for Postgres, and worse, only for version 9.1+. [...] Are there plans to support other databases or older versions of Postgres?
Hi Lance,

pgcleanup.py makes extensive use of writable CTEs, so there is not really a way to port it to older versions. For 8.4 or MySQL, you can still use the older cleanup_datasets.py.
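For readers who haven't met the feature: writable (data-modifying) CTEs let a single SQL statement modify rows and hand the affected ids to the rest of the query, and they only arrived in PostgreSQL 9.1. Below is a minimal sketch of the pattern from Python via psycopg2. This is not pgcleanup.py's actual query; the table, columns, DSN, and 60-day cutoff are all illustrative.

    import psycopg2

    conn = psycopg2.connect("dbname=galaxy user=galaxy")  # illustrative DSN
    cur = conn.cursor()

    # One round trip: flag old rows, then read back the ids that were
    # touched. The WITH ... UPDATE ... RETURNING form needs PostgreSQL 9.1+.
    cur.execute("""
        WITH marked AS (
            UPDATE dataset
               SET deleted = true
             WHERE NOT deleted
               AND update_time < now() - interval '60 days'
            RETURNING id
        )
        SELECT id FROM marked
    """)
    for (dataset_id,) in cur.fetchall():
        print("marked dataset %d as deleted" % dataset_id)

    conn.commit()
    cur.close()
    conn.close()

On 8.4, getting the same effect means a SELECT followed by separate UPDATEs and bookkeeping in the application, which is why a straight port isn't practical.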
> I'm wondering if it would make sense to try and extend pgcleanup.py or cleanup_datasets.py. [...] Any tips on generally acceptable (supported) procedures for marking a dataset as deleted?
You could probably reuse a lot of the code from either of the cleanup scripts for this.

Thanks,
--nate
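For anyone picking up the older script for the first time: cleanup_datasets.py is typically driven in stages from cron. The invocations below are reconstructed from memory of the 2013-era scripts/cleanup_datasets/ directory, so treat the exact flags as an assumption and verify them against your distribution ("-d" is the age cutoff in days, "-r" removes files from disk, and the numbered flag selects the action).

    # Assumed flags; confirm with:
    #   python scripts/cleanup_datasets/cleanup_datasets.py --help
    python scripts/cleanup_datasets/cleanup_datasets.py universe_wsgi.ini -d 10 -1     # delete userless histories
    python scripts/cleanup_datasets/cleanup_datasets.py universe_wsgi.ini -d 10 -2 -r  # purge deleted histories
    python scripts/cleanup_datasets/cleanup_datasets.py universe_wsgi.ini -d 10 -3 -r  # purge deleted datasets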
Nate Coraor wrote:
> pgcleanup.py makes extensive use of writable CTEs, so there is not really a way to port it to older versions. For 8.4 or MySQL, you can still use the older cleanup_datasets.py.

After looking at it a bit more, I see what you mean. Are there plans to implement any additional cleanup scripts for non-Postgres-9.1 users? Just curious so I don't reinvent the wheel; I'd be happy to help with existing efforts.
> I'm wondering if it would make sense to try and extend pgcleanup.py or cleanup_datasets.py. [...]
> You could probably reuse a lot of the code from either of the cleanup scripts for this.

Right. It seems to make sense to me to focus on cleanup_datasets.py, since that will work for everyone. I would like to essentially mimic the user deleting a dataset. I'd then email them to let them know that some old data had been marked for deletion, and let the rest of the scripts proceed as normal, cleaning that data up if they don't undelete it.

It looks like I would want to mark the HistoryDatasetAssociations as deleted? Is that correct? Would I need to do anything else to simulate the user deleting the dataset?

Thanks for the help,
Lance
--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University
On Mar 22, 2013, at 4:57 PM, Lance Parsons wrote:
> Are there plans to implement any additional cleanup scripts for non-Postgres-9.1 users? Just curious so I don't reinvent the wheel; I'd be happy to help with existing efforts.
No, there aren't any plans as long as the alternative (cleanup_datasets.py) still works for other versions.
> It looks like I would want to mark the HistoryDatasetAssociations as deleted? Is that correct? Would I need to do anything else to simulate the user deleting the dataset?
That's correct.

--nate
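In code terms, what's being confirmed here reduces to flipping the deleted flag on HistoryDatasetAssociation rows through Galaxy's model layer. A minimal sketch, assuming an already-configured model session (called sa_session below, much as the cleanup scripts set one up) and an illustrative 60-day cutoff:

    from datetime import datetime, timedelta

    from galaxy import model

    cutoff = datetime.utcnow() - timedelta(days=60)

    # Find old HDAs the user hasn't already deleted themselves. The
    # filter criteria are illustrative, not the final script's logic.
    hdas = sa_session.query(model.HistoryDatasetAssociation) \
        .filter(model.HistoryDatasetAssociation.table.c.update_time < cutoff) \
        .filter(model.HistoryDatasetAssociation.table.c.deleted == False)

    for hda in hdas:
        # Setting this flag is essentially what a user-initiated delete
        # does; the regular cleanup scripts later purge whatever is
        # still flagged once the grace period has passed.
        hda.deleted = True
        sa_session.add(hda)

    sa_session.flush()

Undeleting from the history panel just clears the same flag, which is what makes the notify-then-wait workflow described above possible.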
Just an update on the cleanup script. I have implemented a basic script to perform administrative dataset deletion and email notification. Right now it's limited to the history_dataset_association update_time and an optional tool_id string. I have pushed it to a galaxy-central fork and issued a pull request to galaxy-central (https://bitbucket.org/galaxy/galaxy-central/pull-request/158/basic-administr...). I'm open to comments or suggestions; it could certainly be extended. Hopefully people find this useful.

admin_dataset_cleanup.py Documentation
---------------------------------------------------------
Mark datasets as deleted that are older than the specified cutoff and (optionally) with a tool_id that matches the specified search string.

This script is useful for administrators to clean up after users who leave many old datasets around. It was modeled after the cleanup_datasets.py script originally distributed with Galaxy.

Basic Usage:
    admin_cleanup_datasets.py universe_wsgi.ini -d 60 \
        --template=email_template.txt

Required Arguments:
    config_file - the Galaxy configuration file (universe_wsgi.ini)

Optional Arguments:
    -d --days       - number of days old the dataset must be (default: 60)
    --tool_id       - string to search for in dataset tool_id
    --template      - Mako template file to use for email notification
    -i --info_only  - print results, but don't email or delete anything
    -e --email_only - email notifications, but don't delete anything
                      (useful for notifying users of pending deletion)
    --smtp          - specify SMTP server
                      (if not specified, use smtp settings from config file)
    --fromaddr      - specify from address
                      (if not specified, use error_email_to from config file)

Email Template Variables:
    cutoff   - the cutoff in days
    email    - the user's email address
    datasets - a list of tuples containing 'dataset' and 'history' names
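To make the template variables concrete, here is a hypothetical email_template.txt rendered through Mako. Only the three variable names (cutoff, email, datasets) come from the documentation above; the wording, sample data, and rendering harness are made up for illustration.

    from mako.template import Template

    # Hypothetical template text; in practice this would live in the
    # file passed via --template.
    template_text = """\
    Dear ${email},

    The following datasets have not been updated in ${cutoff} days and
    have been marked as deleted. If you still need them, please undelete
    them before they are permanently removed.

    % for dataset, history in datasets:
        ${dataset} (in history: ${history})
    % endfor
    """

    body = Template(template_text).render(
        cutoff=60,
        email="user@example.org",
        datasets=[("FASTQ Groomer on data 1", "RNA-seq batch 3")],
    )
    print(body)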
--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University