Hi,

I recently ran the cleanup_datasets.py script to free up some storage space, and this resulted in some of the active library datasets being purged from disk. This library was loaded from an external file path. When I wanted to add more files to Galaxy from the same path, it was easier to load the whole directory again and delete the duplicated files.

I'm assuming that the cleanup script looks at these deleted datasets and purges the files they are associated with, even though another current dataset also links to the same file. Is there a way to check whether a file is referenced by another dataset before purging, or to prohibit the script from deleting files outwith the default Galaxy files directory? If I leave out the purge_libraries script, will this stop library datasets from being removed, or only deleted libraries as a whole?

Thanks for your help

Shaun

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Hello Shaun,

On Jun 4, 2010, at 5:06 AM, SHAUN WEBB wrote:
> Hi, I recently ran the cleanup_datasets.py script to free up some storage space, and this resulted in some of the active library datasets being purged from disk.
This should not be possible, so perhaps you have found a corner case scenario that needs to be handled.
> This library was loaded from an external file path.
Did you just use the "Upload a directory of files" option on the upload form, or did you have the "allow_library_path_paste" config setting turned on? In either case, did you check the "Copy data into Galaxy?" checkbox on the upload form, eliminating the copy of the files into Galaxy's default files directory?
> When I wanted to add more files to Galaxy from the same path, it was easier to load the whole directory again and delete the duplicated files.
>
> I'm assuming that the cleanup script looks at these deleted datasets and purges the files they are associated with, even though another current dataset also links to the same file.
This is not the case. Any dataset file that has at least one active link (an undeleted link from either a history item or a library) will not be removed from disk by the cleanup_datasets.py script. It does not matter how many deleted links to the file exist; as long as one active link exists, the file will not be removed from disk (unless you've found a bug).
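To illustrate, the rule amounts to reference counting over undeleted links. Here is a minimal sketch of that logic (class and attribute names are illustrative stand-ins, not Galaxy's actual model or the script's real code):

    import os

    class Association:
        # One link to a dataset file: a history item or a library dataset.
        def __init__(self, deleted=False):
            self.deleted = deleted

    class Dataset:
        def __init__(self, file_name, associations):
            self.file_name = file_name        # path of the file on disk
            self.associations = associations  # every link, deleted or not

    def can_remove_from_disk(dataset):
        # A single undeleted link is enough to keep the file on disk,
        # no matter how many deleted links also point at it.
        return all(assoc.deleted for assoc in dataset.associations)

    def purge(dataset):
        if can_remove_from_disk(dataset):
            os.unlink(dataset.file_name)

So a file shared by one deleted library item and one active history item should survive a purge run.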
> Is there a way to check whether a file is referenced by another dataset before purging, or to prohibit the script from deleting files outwith the default Galaxy files directory?
The cleanup_datasets.py script already performs these types of checks.

I've also recently added a new feature to the library dataset information page (the page that is displayed if you click on the library dataset link within the library) which displays all history items and other library datasets that point to the disk file for the current library dataset. This is useful for manually inspecting the various linked items, but is not related to the cleanup_datasets.py script. This new feature has not yet made it out to the distribution, but will be there soon.

There is also a new Galaxy report (not yet in the distribution, but it will be soon) which shows disk usage for the file system on which the Galaxy data files exist. From it you can view large datasets (over 4 GB) that are not purged, and see all history and library items that point to the disk file (similar to the library dataset information page above, but it also displays datasets linked to only by histories).
> If I leave out the purge_libraries script, will this stop library datasets from being removed, or only deleted libraries as a whole?
The purge_libraries option to the cleanup_datasets.py script handles removing data files from disk once they have been deleted for the appropriate number of days, as well as purging the library record from the database (as long as the library's contents have all been purged).
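In outline, that pass behaves something like the following sketch (names and attributes are illustrative stand-ins; the real logic lives in scripts/cleanup_datasets/cleanup_datasets.py and differs in detail):

    from datetime import datetime, timedelta

    def purge_library(library, days, purge_dataset):
        # Only libraries deleted longer ago than the cutoff are touched.
        cutoff = datetime.utcnow() - timedelta(days=days)
        if not (library.deleted and library.update_time < cutoff):
            return
        # First remove eligible data files from disk ...
        for dataset in library.datasets:
            purge_dataset(dataset)
        # ... then purge the library record itself, but only once all
        # of its contents have been purged.
        if all(d.purged for d in library.datasets):
            library.purged = True

In other words, the purge_libraries pass covers both the data files and the library record, not just the library as a whole.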
> Thanks for your help
>
> Shaun
Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu
Hi Greg,

I used the allow_library_path_paste = True option to upload a directory of files (several times). I also checked the box so that files would not be copied into Galaxy, although I may have missed the box on occasion; in those cases I deleted the library items straight away and restarted the upload with the box checked.

I had assumed that the cleanup datasets script would perform these checks, so it's reassuring to know this may be an isolated case. I have checked the logs, and it was definitely this script that purged the files.

Do you have any ideas what may have caused this? As use of our production server grows we will need to delete unused data, and I want to be sure this doesn't happen again.

Thanks for your help

Shaun
Hello Shaun,

On Jun 7, 2010, at 5:15 AM, SHAUN WEBB wrote:
> Hi Greg,
>
> I used the allow_library_path_paste = True option to upload a directory of files (several times). I also checked the box so that files would not be copied into Galaxy, although I may have missed the box on occasion; in those cases I deleted the library items straight away and restarted the upload with the box checked.
I believe what must have happened is that, using allow_library_path_paste, you uploaded the files to a Library *with the box checked*, then thought that you had not checked the box, so deleted them and uploaded them the same way again. This is the only way we can see that the files would ultimately have been removed from disk. This scenario brought to light a weakness in our code that has been fixed in changeset 3900:384137f8b5c6. From that changeset on, whenever the box is checked, Galaxy will never be able to purge the files, since they are not in Galaxy's default file location. This changeset will be available in the distribution very soon. Thanks for reporting this issue, Shaun!
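For the curious, the shape of the guard is roughly the following (my own sketch, not the literal code in changeset 3900:384137f8b5c6): a file that resolves to a path outside Galaxy's configured files directory is never eligible for purging.

    import os

    def safe_to_purge(dataset_file, galaxy_files_dir):
        # Resolve symlinks so a file merely linked into Galaxy's
        # directory still counts as external.
        real_file = os.path.realpath(dataset_file)
        real_dir = os.path.realpath(galaxy_files_dir)
        # Only files under Galaxy's own files directory may be purged.
        return real_file.startswith(real_dir + os.sep)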
Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu