Galaxy user data on a file system with limited file lifetime
Dear Galaxy developers,

The question in short: How can Galaxy user data (e.g. file_path) be stored safely on a file system where files have a limited lifetime?

Galaxy will run on the head node of a cluster (~2000 cores) where data is stored in three places:

  /home (individual user homes): 50 GB quota
  /data (research group directories): 4 to 100 TB quota
  /work (user directories, temporary file area): 60-day lifetime

We plan to store Galaxy-related data as follows:

  /work/galaxy/files <- file_path
  /work/galaxy/tmp   <- new_file_path
  /work/galaxy/jobs  <- job_working_directory

As a note: storing these in /data would undermine our quota system, which our admins do not like. /data/galaxy will contain the Galaxy installation including tool data (and I hope that we can simply set the quotas high enough to never run out of space).

Data libraries will be added using the "link mechanism" from /home/USER and /data/GROUP. I hope that I can automate the import and the appropriate setting of permissions via the API / bioblend. Are there already scripts for this?

Is this scheme reasonable? If yes: the main question is how I can guarantee that the lifetime of data on /work and the Galaxy server play nicely together. My idea consists of two parts:

1. Adapt cleanup_datasets.py (i.e. the function purge_histories) such that all histories that have reached the file system lifetime are purged, including those that have not been deleted. The modification seems to be to remove the test app.model.History.table.c.deleted == true(). The datasets contained in these histories will be purged at the same time.

2. Using the API, I will get the update time of each history, or the update time of the youngest dataset it contains (or are these the same anyway?). For the files corresponding to the contained datasets I will update the access times in the file system. This way I can guarantee that only complete histories are purged.

The script(s) can then be run via cron with a lifetime set to 1 day less than the file system lifetime (just to be sure).
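To make part 2 and the cron cutoff more concrete, here is a minimal sketch in Python. It is only one possible reading of the idea, under several assumptions: that the file system expires files by access time, that a history's update time and the paths of its dataset files have already been obtained from Galaxy (how to fetch them via the API is left open here), and that all function names below are hypothetical helpers, not part of Galaxy or bioblend.

```python
import os
from datetime import datetime, timedelta, timezone

# Values taken from the description above.
FS_LIFETIME = timedelta(days=60)    # lifetime of files on /work
SAFETY_MARGIN = timedelta(days=1)   # purge one day early, just to be sure


def purge_cutoff(now):
    """Return the instant before which a history counts as expired."""
    return now - (FS_LIFETIME - SAFETY_MARGIN)


def is_due_for_purge(history_update_time, now):
    """True if the history was last updated before the purge cutoff."""
    return history_update_time < purge_cutoff(now)


def align_access_times(dataset_paths, history_update_time):
    """Raise the access time of each dataset file to the history's update time.

    Assumption: the file system expires files by access time, so giving all
    files of a history the same (youngest) access time makes them expire
    together. Access times are never moved backwards and modification times
    are preserved. Returns the list of paths that were changed.
    """
    target = history_update_time.timestamp()
    touched = []
    for path in dataset_paths:
        st = os.stat(path)
        if st.st_atime < target:
            os.utime(path, (target, st.st_mtime))
            touched.append(path)
    return touched
```

A daily cron job could call align_access_times for every history that is not yet due for purging, and hand the histories for which is_due_for_purge returns True to the adapted purge script. Whether the file system really looks at the access time (and not, say, the modification or change time) would have to be confirmed with the admins first.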
In theory, jobs could run longer than 60 days. My idea would therefore be to update the access times of all files in job_working_directory daily.

Thank you very much for any help.

Best,
Matthias

--
-------------------------------------------
Matthias Bernt
Bioinformatics Service
Molekulare Systembiologie (MOLSYB)
Helmholtz-Zentrum für Umweltforschung GmbH - UFZ /
Helmholtz Centre for Environmental Research GmbH - UFZ
Permoserstraße 15, 04318 Leipzig, Germany
Phone +49 341 235 482296, m.bernt@ufz.de, www.ufz.de

Sitz der Gesellschaft/Registered Office: Leipzig
Registergericht/Registration Office: Amtsgericht Leipzig
Handelsregister Nr./Trade Register Nr.: B 4703
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: MinDirig Wilfried Kraus
Wissenschaftlicher Geschäftsführer/Scientific Managing Director: Prof. Dr. Dr. h.c. Georg Teutsch
Administrative Geschäftsführerin/Administrative Managing Director: Prof. Dr. Heike Graßmann
-------------------------------------------