Hi Dan,

Thanks! I'll update right away :)...

Cheers,

Pi

On 10 Aug 2009, at 6:20 PM, Daniel Blankenberg wrote:
Hi Pi,
An updated wiki on this topic is available at http://bitbucket.org/galaxy/galaxy-central/wiki/PurgeHistoriesAndDatasets .
Apparently the script does not check whether it is being executed with proper permissions to clean,
Executing the scripts with the -f flag will cause the script to attempt to re-purge datasets that are already marked as purged (changing the time delay would also be required, since the table entries would have been marked as updated at the failed attempt).
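For example (just a sketch, assuming the wrapper scripts live in scripts/cleanup_datasets, are run from the Galaxy root, and pass options such as -d, -f and -r through to cleanup_datasets.py):

# retry the purge, including datasets already flagged as purged (-f),
# and drop the usual time delay to 0 days so the re-marked entries qualify
sh ./scripts/cleanup_datasets/purge_datasets.sh -d 0 -f -r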
It appears to me that Galaxy can not clean datasets unless the history or library it was once assigned to is also deleted. Is this correct?
Using the -6 flag / delete_datasets.sh script added in changeset 2551:5b405a43c406 will allow a base dataset to be marked as deleted without requiring a history/library/folder to be purged. This script could take considerable time, depending on the number of datasets in Galaxy.
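A rough sketch of that, under the same assumptions about the wrapper scripts as above:

# mark base datasets as deleted even though the histories/libraries that
# reference them still exist, then remove the deleted datasets from disk
sh ./scripts/cleanup_datasets/delete_datasets.sh
sh ./scripts/cleanup_datasets/purge_datasets.sh -d 0 -r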
When I look in the galaxy/database/files/ directory I still see duplicated files, which are easily identified, because they have exactly the same size. These appear to be associated with libraries. As I was experimenting I uploaded and deleted the same files several times and eventually only kept a single copy of these files in my libraries..
When history items are shared or library items are imported (to/from a history or between libraries), the base dataset (file) is shared, preventing duplication of file data on disk. Uploading a file several times creates a different base dataset (file) each time; each of the instances associated with each individually uploaded file will need to be deleted before a specific file will be purged from disk.
Thanks for using Galaxy,
Dan
Hi Dan and Erick,
Here's a follow-up on my attempts to free up some disk space. I deleted all histories of all users (in a test installation :)). All I have left now is a few datasets in libraries. More than 10 days later I ran the scripts. At first this didn't help. Apparently the script does not check whether it is being executed with proper permissions to clean, because in the logs I found numerous entries like this one:
# Error, file has already been removed: [Errno 13] Permission denied: 'database/files'
At the end of the log it still claims:
# Freed disk space: 609064379
But it did not free a single byte. Re-running the script as root did not help either, because the script had already modified the database... So I manually modified the database by setting the purged column back to 0 and the update_time to something at least 10 days in the past. Then re-running the script did free up some disk space :).
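For anyone wanting to reproduce that manual fix, a rough sketch assuming a MySQL backend and that the relevant table is called dataset with purged and update_time columns (check your own schema and make a backup first):

# WARNING: illustration only -- restrict the WHERE clause to the
# datasets that failed to purge before running anything like this
mysql -u galaxy -p galaxy -e "UPDATE dataset SET purged = 0, update_time = DATE_SUB(NOW(), INTERVAL 11 DAY) WHERE purged = 1;"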
When I look in the galaxy/database/files/ directory I still see duplicated files, which are easily identified, because they have exactly the same size. These appear to be associated with libraries. As I was experimenting I uploaded and deleted the same files several times and eventually only kept a single copy of these files in my libraries... Maybe if I delete all libraries as well I'll be able to get rid of the binary trash.
It appears to me that Galaxy can not clean datasets unless the history or library it was once assigned to is also deleted. Is this correct? If so, this is quite a pain, because it is only natural for users to experiment, producing several failed attempts before arriving at the datasets they want to keep. As long as they keep the histories / libraries with the good data, all the intermediate failures associated with those histories / libraries will claim disk space too :(...
Cheers,
Pi
On 30 Jul 2009, at 1:33 PM, Erick Antezana wrote:
Hi Dan,
I have been facing the same problem as Pieter (as I reported some time ago) while trying to purge several NGS data files I was playing with... At that time I used the scripts directly with different options (-d 1, -d 0, etc.) and the deleted files were still there. I have just tried once again, executing them in the order you indicated, but unfortunately the files are still there...
I am using mysql to store my data. I see the same behaviour when using the default db (sqlite).
cheers, Erick
2009/7/29 Daniel Blankenberg <dan@bx.psu.edu>

Hi Pi,
The wiki for deleting datasets is out of date, and I will be updating it shortly.
There is a collection of shell scripts included in the scripts/cleanup_datasets directory. In order to delete datasets that are no longer needed from disk, the scripts can be used in the following order (assuming you have not used library functions):
delete_userless_histories.sh
purge_histories.sh
purge_datasets.sh
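For example (assuming you run them from the Galaxy root directory and that the wrappers pass options such as -d and -r, mentioned elsewhere in this thread, through to cleanup_datasets.py), a cleanup run might look like:

cd /path/to/galaxy_dist                                    # hypothetical install path
sh ./scripts/cleanup_datasets/delete_userless_histories.sh -d 10
sh ./scripts/cleanup_datasets/purge_histories.sh -d 10 -r
sh ./scripts/cleanup_datasets/purge_datasets.sh -d 10 -r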
I will send a message after the wiki has been updated.
In addition: 1. What if I ran the script without -r and later decide I want to delete the associated files anyway to free up some space? How do I then know what files to delete?
This is an excellent feature for us to add to the script.
2. If I understand correctly, I should be able to remove associated data sets with -r, but even when purging stuff the entries will still remain in the database... How do I really, really, Yes-OK-I-accept-I-know-what-I'm-doing-delete outdated stuff :)?
There are several database tables which Galaxy expects to exist (for Job reporting, etc.) and should not have entries deleted. Datasets are an example of this: when a Dataset is purged, the purged flag is set to True, but the entry is kept. Deleting entries from the dataset tables is not recommended.
Thanks for using Galaxy,
Dan
Hi Erick, Greg et alia,
I've set up Galaxy with a MySQL DB too, but I cannot get rid of old stuff. According to the wiki, running the script with ... -1, -3 or -5 should show me what the script would do with -2, -4 or -6. When I ran it with -1 it told me:
--------
# 2009-07-29 14:03:22 - Handling stuff older than 1 days
# Datasets will NOT be removed from disk.
# The following datasets and associated userless histories have been deleted
# Deleted 0 histories.
Elapsed time: 0.21
--------
That was a bit weird, because I know there should be stuff to delete. So I tried my luck with -2 to perform the actual cleanup and voilà:
--------
# 2009-07-29 14:04:25 - Handling stuff older than 1 days
# Datasets will NOT be removed from disk.
# The following datasets and associated deleted histories have been purged
1 4 5 6 7 8 9 10 11 12 13 14
<..cut a lot of white space..>
15 16
# Purged 14 histories.
Elapsed time: 1.17
--------
Running with -3, -4 and -5 all gave me 0 purged data sets or folders, but I know there must be stuff associated with user accounts older than 1 day that should be purged... The -6 option does not seem to work at all, as I got this error: "cleanup_datasets.py: error: no such option: -6". Am I missing something?
In addition:

1. What if I ran the script without -r and later decide I want to delete the associated files anyway to free up some space? How do I then know what files to delete?

2. If I understand correctly, I should be able to remove associated data sets with -r, but even when purging stuff the entries will still remain in the database... How do I really, really, Yes-OK-I-accept-I-know-what-I'm-doing-delete outdated stuff :)?
Cheers,
Pi
On 23 Jul 2009, at 5:17 PM, Erick Antezana wrote:
Greg,
please see in-line:
2009/7/23 Greg Von Kuster <ghv2@psu.edu>

Hi Erick,
Erick Antezana wrote:

Greg,
I managed to set my connection string so that we could use a remote mysql server. Thanks.
w.r.t. the dataset purging, I used the scripts to clean deleted libraries, folders, datasets, userless histories... I've seen that one must specify the span of time in days. What about data that was added by mistake, for instance today, and that we want to delete immediately? I tried to launch the script with "-d 0" but the data is still there... Am I missing something?
No, I don't think so. It's possible that your system clock is off from your database time.
both servers (mysql and the one where galaxy is running) have the same time.
Is your database storing time as local time?
how can I see that?
The cleanup script uses the update_time for the objects being deleted.
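One way to check what the script would be comparing (assuming a MySQL backend and a table named dataset; adjust names to your own schema) is to look at the database clock and the stored timestamps side by side:

date                                             # local time on the Galaxy host
mysql -u galaxy -p -e "SELECT NOW(), UTC_TIMESTAMP();"
mysql -u galaxy -p galaxy -e "SELECT id, update_time, deleted, purged FROM dataset ORDER BY update_time DESC LIMIT 10;"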
In which file can I find the SQL command that actually deletes and purges the data?
I am no longer using the sqlite DB created in our first trials. I guess I can safely delete (from the command line) all the files under the directory database?
Maybe. Did you keep any data that refers to them in your tables when you migrated to mysql? If so, you'll need to keep them.
no, I have no data referring to anything... I just deleted (to save space) all those files and I have no problems at all (so far ;-) )
have the purge_*.sh scripts been tested with mysql?
Yes
last question (already asked before): are there any plans to support Oracle?
Not sure why it wouldn't already be supported, although we don't use it here. Just needs a different URL - sqlalchemy supports Oracle.
good to know that, I will try to find some time to test it and let you know.
cheers, Erick
thanks, Erick
2009/7/22 Greg Von Kuster <ghv2@psu.edu>
Erick,
To use a different database than the sqlite that comes with the Galaxy distribution, all that is needed is to change the config setting, providing the URL that points to your mysql database. See the mysql documentation for the connection URL, as the URL differs depending upon whether your database is installed locally or not.
The config setting is the "database_connection" setting, and could look something like this:
database_connection = mysql:///greg_test?unix_socket=/var/run/mysqld/mysqld.sock
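If the MySQL server runs on a different host, the same setting takes the usual SQLAlchemy URL form; the user, password, host and database name below are just placeholders:

database_connection = mysql://galaxy_user:secret@dbhost.example.org/galaxy_db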
Greg Von Kuster
Galaxy Development Team
Erick Antezana wrote:
Hello,
I would like to use MySQL instead of sqlite to store my data. I couldn't find a HOWTO or guidelines for this on the Galaxy web site. I only found some lines that might need to be changed/enabled in the universe_wsgi.ini file:
#database_file = database/universe.sqlite
database_connection = mysql:///galaxy
#database_engine_option_echo = true
#database_engine_option_echo_pool = true
#database_engine_option_pool_size = 10
#database_engine_option_max_overflow = 20
Could you point me to some docs or briefly describe what I need to do in order to move to mysql?
Are there any plans to support other DBMS's (like Oracle for instance)?
thanks, Erick
-------------------------------------------------------------
Biomolecular Mass Spectrometry and Proteomics
Utrecht University

Visiting address: H.R. Kruyt building, room O607, Padualaan 8, 3584 CH Utrecht, The Netherlands
Mail address: P.O. box 80.082, 3508 TB Utrecht, The Netherlands
phone: +31 (0)6-143 66 783
email: pieter.neerincx@gmail.com
skype: pieter.online
------------------------------------------------------------