S3/Swift object store cache path and `extra_dir`s
Our local instance currently uses the traditional directories under `database/` for datasets, job working directories, and temporary files. Ultimately we wish to transition to using our Swift object store for storage. We've been doing some experimentation with Galaxy's Swift backend and have run into a few issues.

The first major issue we came across was Swift's 5 GB segment size limit, since the segmentation/multipart upload code is bypassed for instances of SwiftObjectStore [1]. SwiftStack support provided a patch enabling multipart uploads for Swift (PR #648) which has been working well for us so far. (Thanks, Charles!)

The next issue is that the path attribute of the cache tag in object_store_conf.xml appears to be ignored. The value does get stored to self.cache_path in _parse_config_xml, but elsewhere in the file self.staging_path is used instead.

Finally, adding extra_dir tags to the Swift object store config doesn't appear to do anything. Here's my object_store_conf.xml:

    <?xml version="1.0"?>
    <object_store type="hierarchical">
        <backends>
            <object_store type="swift" id="primary" order="0">
                <auth access_key="..." secret_key="..."/>
                <bucket name="galaxy_store"/>
                <connection host="tin.fhcrc.org" port="443"/>
                <cache path="database/object_store_cache" size="1000"/>
                <extra_dir type="temp" path="database/tmp"/>
                <extra_dir type="job_work" path="database/job_working_directory"/>
            </object_store>
            <object_store type="disk" id="secondary" order="1">
                <files_dir path="database/files"/>
            </object_store>
        </backends>
    </object_store>

The goal with the hierarchical setup above is for new datasets to be created in the primary (Swift) object store, caching to `database/object_store_cache`, while the job and temporary directories remain at `database/job_working_directory` and `database/tmp`, respectively. Existing (pre-Swift) datasets remain in `database/files` and are handled by the secondary disk store.
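As a way to sanity-check what the configuration above asks for, here is a minimal Python sketch that parses the same XML with the standard library and lists each backend's cache path and extra_dir entries. The name `summarize_backends` is made up for illustration and is not part of Galaxy; the sketch only reads the file and says nothing about how Galaxy itself interprets these tags.

```python
import xml.etree.ElementTree as ET

# The configuration from the message, credentials elided as in the original.
CONFIG = """<?xml version="1.0"?>
<object_store type="hierarchical">
    <backends>
        <object_store type="swift" id="primary" order="0">
            <auth access_key="..." secret_key="..."/>
            <bucket name="galaxy_store"/>
            <connection host="tin.fhcrc.org" port="443"/>
            <cache path="database/object_store_cache" size="1000"/>
            <extra_dir type="temp" path="database/tmp"/>
            <extra_dir type="job_work" path="database/job_working_directory"/>
        </object_store>
        <object_store type="disk" id="secondary" order="1">
            <files_dir path="database/files"/>
        </object_store>
    </backends>
</object_store>
"""


def summarize_backends(config_xml):
    """Hypothetical helper: list backends in 'order', with cache and extra_dir paths."""
    root = ET.fromstring(config_xml)
    backends = []
    for store in root.find("backends").findall("object_store"):
        cache = store.find("cache")
        backends.append({
            "id": store.get("id"),
            "type": store.get("type"),
            "order": int(store.get("order")),
            # Only the Swift backend declares a <cache> tag in this config.
            "cache_path": cache.get("path") if cache is not None else None,
            # Map each extra_dir type (temp, job_work) to its configured path.
            "extra_dirs": {e.get("type"): e.get("path")
                           for e in store.findall("extra_dir")},
        })
    return sorted(backends, key=lambda b: b["order"])


if __name__ == "__main__":
    for backend in summarize_backends(CONFIG):
        print(backend)
```

Running it against a real object_store_conf.xml (read from disk instead of the inline string) gives a quick view of which paths each backend is expected to use, which can then be compared with the directories Galaxy actually creates.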
What actually happens (after renaming self.cache_path to self.staging_path in _parse_config_xml to get the cache path working) is this:

    galaxy.jobs DEBUG 2015-02-06 16:07:26,615 (1) Working directory for job is: /home/bclaywel/workspace/galaxy-central/database/object_store_cache/000/1

That is, the job working directory is created directly under the cache path's hash directories. I assume temp files would probably end up there also.

We're quite excited to get Galaxy and Swift working well together, and I'm more than happy to help debug and test!

Cheers,
Brian

[1] https://bitbucket.org/galaxy/galaxy-central/src/54ed3adb6575addba47d627944ebd72f7547082d/lib/galaxy/objectstore/s3.py?at=default#cl-331

--
Brian Claywell | programmer/analyst
Matsen Group | http://matsen.fredhutch.org
Fred Hutchinson Cancer Research Center
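To make the difference between the intended and the observed layout concrete, here is a small Python sketch. It assumes, based only on the log line above, that ids 0-999 share a `000` hash directory and that a working directory is built as base/hash/id. `hash_dir` and `working_dir` are hypothetical helpers written for this illustration, not Galaxy functions.

```python
import posixpath


def hash_dir(obj_id, ids_per_dir=1000):
    """Hypothetical: bucket ids so that 0-999 land in '000', 1000-1999 in '001', etc."""
    return "%03d" % (obj_id // ids_per_dir)


def working_dir(obj_id, cache_path, extra_dirs, dir_type="job_work"):
    """Build base/hash/id, falling back to the cache path when no extra_dir is honored."""
    base = extra_dirs.get(dir_type, cache_path)
    return posixpath.join(base, hash_dir(obj_id), str(obj_id))


CACHE = "database/object_store_cache"
EXTRA = {"temp": "database/tmp", "job_work": "database/job_working_directory"}

# extra_dir entries ignored: matches the tail of the path reported in the log line.
print(working_dir(1, CACHE, {}))      # database/object_store_cache/000/1

# extra_dir entries honored: the layout the configuration was meant to produce.
print(working_dir(1, CACHE, EXTRA))   # database/job_working_directory/000/1
```

The only difference between the two calls is whether the `job_work` entry reaches the path-building step, which is consistent with the extra_dir tags being dropped somewhere between the config parser and the Swift backend.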
Hey Brian,

Thanks for the interest in Galaxy's Swift object store! I also tested Charles' PR and it looks like a nice improvement -- I'll go ahead and get that pulled into Galaxy shortly.

The HierarchicalObjectStore was written with exactly what you're trying to do in mind, so you're definitely on the right track here. I'll see if I can verify and fix the file location issues you point out and will get back to you.

-Dannon

On Mon Feb 09 2015 at 4:29:35 PM Brian Claywell <bclaywel@fredhutch.org> wrote:
Hi Dannon,

I'm stoked that Charles's PR made it into master -- thanks! Have you had a chance to look into the job_work and temp extra_dir issue? Please let me know if there's anything I can do to help out!

Cheers,
Brian

On Mon, Feb 9, 2015 at 1:39 PM, Dannon Baker <dannon.baker@gmail.com> wrote:
participants (2)
- Brian Claywell
- Dannon Baker