Hello Shaun,

On Jun 4, 2010, at 5:06 AM, SHAUN WEBB wrote:


Hi,
I recently ran the cleanup datasets script to free up some storage space, and this resulted in some of the active library datasets being purged from disk.


This should not be possible, so perhaps you have found a corner case that needs to be handled.



This library was loaded from an external file path.


Did you just use the "Upload a directory of files" option on the upload form, or did you have the "allow_library_path_paste" config setting turned on?  In either case, did you leave the "Copy data into Galaxy?" checkbox on the upload form unchecked, eliminating the copy of the files into Galaxy's default files directory?
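For reference, that setting lives in universe_wsgi.ini and looks roughly like this ( a minimal sketch; the section name and placement are from memory, so check your own config ):

    # universe_wsgi.ini
    [app:main]
    # Adds the "Upload files from filesystem paths" option to the
    # library upload form for admin users.
    allow_library_path_paste = True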


When I wanted to add more files to Galaxy from the same path, it was easier to load the whole directory again and delete the duplicated files.

I'm assuming that the cleanup script looks at these deleted datasets and purges the files they are associated with, even though another current dataset also links to the same file.


This is not the case.  Any dataset file that has at least one active link ( an undeleted link from either a history item or a library dataset ) will not be removed from disk by the cleanup_datasets.py script.  It does not matter how many deleted links to the file exist; as long as one active link exists, the file will not be removed from disk ( unless you've found a bug ).
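In other words, the rule amounts to something like this ( a toy illustration with hypothetical names, not Galaxy's actual code ):

    from collections import namedtuple

    # A "link" is any history item or library dataset that points at a
    # dataset file on disk.
    Link = namedtuple( "Link", [ "file_name", "deleted" ] )

    def file_is_removable( links ):
        # The file may be removed from disk only when every link to it
        # has been deleted.
        return all( link.deleted for link in links )

    links = [ Link( "/data/seq.fastq", deleted=True ),    # the duplicate you deleted
              Link( "/data/seq.fastq", deleted=False ) ]  # the active library dataset
    assert not file_is_removable( links )  # one active link keeps the file on disk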



Is there a way to check if a file is referenced by another dataset before purging, or to prohibit the script from deleting files outside the default Galaxy files directory?

The cleanup_datasets.py script already performs these types of checks.
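You can see the operations and options it supports in your distribution with ( the flags vary between versions, so I won't list them here ):

    python ./scripts/cleanup_datasets/cleanup_datasets.py --help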

I've also added a very recent new feature to the library dataset information page ( the page that is displayed if you click on the library dataset link within the library ) which displays all history items and other library datasets that point to the disk file for the current library dataset.  This is useful for manually inspecting the various linked items, but is not related to the cleanup_datasets.py script.  This new feature has not yet made it out to the distribution, but will be there soon.

There is also a new Galaxy report ( not yet in the distribution, but coming soon ) which shows disk usage for the file system on which the Galaxy data files exist.  From it you can view large datasets ( over 4 GB ) that are not purged, and see all history and library items that point to each disk file ( similar to the library dataset information page above, but it also displays datasets linked to only by histories ).
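The disk usage figure is the same sort of number you could compute yourself ( a minimal sketch; the files path below is an assumption, substitute your own file_path setting ):

    import os

    # Usage of the file system holding Galaxy's data files.
    st = os.statvfs( "/path/to/galaxy/database/files" )
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    print( "used %.1f GB of %.1f GB" % ( ( total - free ) / 1e9, total / 1e9 ) )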



If I leave out the purge_libraries script, will this stop library datasets from being removed, or only deleted libraries as a whole?

The purge_libraries option to the cleanup_datasets.py script handles removing data files from disk that have been deleted for the appropriate number of days, as well as purging the library record from the database ( as long as the library's contents have all been purged ).
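The ordering looks roughly like this ( another toy sketch with hypothetical names, not Galaxy's actual code ):

    from collections import namedtuple

    LibraryDataset = namedtuple( "LibraryDataset", [ "days_deleted", "purged" ] )

    def purge_library( contents, cutoff_days ):
        # First remove from disk any contents whose links were deleted
        # at least cutoff_days ago...
        contents = [ ds._replace( purged=True )
                     if ds.days_deleted >= cutoff_days else ds
                     for ds in contents ]
        # ...then the library record itself is purged only once every
        # item it contains has been purged.
        library_purged = all( ds.purged for ds in contents )
        return contents, library_purged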



Thanks for your help

Shaun




Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu