Dear galaxy developers,
The question in short: How can Galaxy user data (e.g. file_path) be stored safely on a file system where files have a limited lifetime?
Galaxy will run on the head node of a cluster (~2000 cores), where data is stored in three locations:
/home (Individual User Homes) with 50 GB quota
/data (Research Group Directories) with 4 to 100 TB quota
/work (User Directories - Temporary File Area) with a 60-day lifetime
We plan to store galaxy related data as follows:
/work/galaxy/files <- file_path
/work/galaxy/tmp <- new_file_path
/work/galaxy/jobs <- job_working_directory
Note that storing these in /data/ would undermine our quota system, which our admins would not like.
/data/galaxy will contain the galaxy installation including tool data (and I hope that we can just set the quotas high enough to never run out of space).
Data libraries will be added using the "link mechanism" from /home/USER and /data/GROUP. I hope that I can automate the import and the appropriate setting of permissions via the API / bioblend. Are there already scripts for this?
Is this scheme reasonable?
If so, the main question is how I can make sure that the limited lifetime of data on /work/ and the Galaxy server play nicely together.
My idea consists of two parts:
1. Adapt cleanup_datasets.py (i.e. the function purge_histories) so that all histories that have reached the file system lifetime are purged, including those that have not been deleted. The modification seems to be to remove the test: app.model.History.table.c.deleted == true() The data sets they contain will be purged at the same time.
2. Using the API, I will get the update time of each history or the update time of the youngest data set it contains (or are these the same anyway?). For the files corresponding to these data sets I will update the access times in the file system. This way I can guarantee that only complete histories are purged.
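To make part 2 more concrete, here is a minimal sketch of what I have in mind for the access-time update. It assumes that the list of dataset file paths and the history's update time have already been obtained elsewhere (e.g. via the API / bioblend, not shown here); the function name set_atime is made up for illustration. Only the access time is changed, the modification time is kept.

```python
import os


def set_atime(paths, atime):
    """Set the access time of each existing file to `atime` (seconds since epoch).

    The modification time is left unchanged. Missing files are skipped.
    Returns the list of paths that were actually updated.
    """
    updated = []
    for path in paths:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            # Hypothetical case: the dataset file was already removed.
            continue
        os.utime(path, (atime, st.st_mtime))
        updated.append(path)
    return updated
```

Calling this once per history with the same timestamp for all of its files should make all files of that history reach the file system lifetime together.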
The script(s) can then be run via cron with a lifetime set to 1 day less than the file system lifetime (just to be sure).
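The cutoff logic could look roughly like the sketch below. It is independent of Galaxy's real data model: the histories are assumed to be plain dicts with an "id" and an "update_time" (a datetime), and the names expired_histories, FS_LIFETIME_DAYS and SAFETY_MARGIN_DAYS are made up for illustration.

```python
from datetime import datetime, timedelta

FS_LIFETIME_DAYS = 60    # life time of files on /work
SAFETY_MARGIN_DAYS = 1   # purge one day early, just to be sure


def expired_histories(histories, now,
                      lifetime_days=FS_LIFETIME_DAYS,
                      margin_days=SAFETY_MARGIN_DAYS):
    """Return the ids of histories last updated before the cutoff.

    `histories` is an iterable of dicts with the (assumed) keys
    "id" and "update_time"; `now` is a datetime.
    """
    cutoff = now - timedelta(days=lifetime_days - margin_days)
    return [h["id"] for h in histories if h["update_time"] < cutoff]
```

With the values above, a history that was last updated 59 days ago or earlier would be selected for purging.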
In theory jobs could run longer than 60 days. Therefore my idea would be to update access times of all files in job_working_directory daily.
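A daily cron job for this could be as simple as the following sketch. It walks a directory tree and sets the access time of every regular file to the current time while keeping the modification time. The function name touch_tree is made up; symbolic links are skipped, and files that disappear while a job is running are ignored.

```python
import os
import time


def touch_tree(root, now=None):
    """Refresh the access time of all regular files below `root`.

    Returns the number of files whose access time was updated.
    """
    now = time.time() if now is None else now
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not follow or modify symbolic links
            try:
                st = os.stat(path)
                os.utime(path, (now, st.st_mtime))
            except OSError:
                # The file may have been removed by a running job.
                continue
            count += 1
    return count
```

In our setup it would be called as touch_tree("/work/galaxy/jobs").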
Thank you very much for any help.
Best, Matthias