Galaxy CloudMan - How to analyse >1 TB of data?
Dear all,

I am currently investigating whether Galaxy CloudMan can help us in analyzing large NGS datasets. I was first impressed by the simple setup, the autoscaling and the usability of Galaxy CloudMan, but soon ran into the EBS limit of 1 TB.

I thought I would be clever: I unmounted the /mnt/galaxyData EBS volume, created a logical volume of 2 TB, and remounted this volume at /mnt/galaxyData. All is green, as you can see from the picture below, but running a tool is not possible; I assume Galaxy is not configured to work with a logical volume.

It is truly a waste to have this fine setup (autoscaling) and not be able to use it for lack of storage. Does anybody have experience with this? Tips, tricks...

Kind Regards,
Yves Wetzels
Contractor on behalf of Janssen
Turnhoutseweg 30, B-2340 Beerse, Belgium
Phone +32 (0)14 60 7181
ywetzel@its.jnj.com
Yves;
> I am currently investigating if Galaxy CloudMan can help us in analyzing
> large NGS datasets. I was first impressed by the simple setup, the
> autoscaling and usability of Galaxy CloudMan but soon ran into the EBS
> limit of 1 TB. I thought to be clever and unmounted the /mnt/galaxyData
> EBS volume, created a logical volume of 2 TB and remounted this volume
> to /mnt/galaxyData.
How did you create this volume? I know there are some tricks to get around the 1 TB limit: http://alestic.com/2009/06/ec2-ebs-raid

In the screenshot you sent, it looks like CloudMan is a bit confused about the disk size: the Disk Status lists 1.2 TB out of 668 GB, which might be the source of your problems.
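The trick Brad links to combines several EBS volumes into one larger block device with software RAID. A minimal sketch of that approach, assuming two freshly attached 1 TB volumes at /dev/sdf and /dev/sdg (the device and mount-point names are illustrative, not taken from the thread; run as root):

```shell
# Stripe two EBS volumes into a single ~2 TB RAID0 array (sketch).
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdf /dev/sdg
mkfs.ext4 /dev/md0            # put a filesystem on the combined device
mkdir -p /mnt/bigdata
mount /dev/md0 /mnt/bigdata   # one mount point larger than the 1 TB EBS cap
```

Note that RAID0 striping trades durability for size: losing any one member volume loses the whole filesystem.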
> All is green as you can see from the picture below but running a tool is
> not possible since Galaxy is not configured to work with a logical
> volume, I assume.
Can you describe what errors you are seeing?
> It is truly a waste having this fine setup (autoscaling) but this is not
> usable if there is not enough storage? Does anybody have experience with
> this? Tips, tricks...
The more general answer is that folks do not normally use EBS this way, since having large permanent EBS filesystems is expensive. S3 stores larger data (single objects up to 5 TB) at a more reasonable price. S3 files are then copied to a transient EBS store, processed, and uploaded back to S3. This isn't as automated, since it will be highly dependent on your workflow and what files you want to save, but it might be worth exploring in general when using EC2.

Hope this helps,
Brad
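The staging pattern Brad describes (S3 as the system of record, a transient store as scratch space) might look like the following with s3cmd, a common S3 CLI at the time. The bucket name, file names, and the `run_my_analysis` tool are made up for illustration:

```shell
# Sketch of the S3 <-> transient-store workflow (all names hypothetical).
s3cmd get s3://my-ngs-bucket/run42/reads.fastq /mnt/scratch/reads.fastq  # stage input down
run_my_analysis /mnt/scratch/reads.fastq > /mnt/scratch/results.bam      # process on local disk
s3cmd put /mnt/scratch/results.bam s3://my-ngs-bucket/run42/results.bam  # push results back
rm -f /mnt/scratch/reads.fastq /mnt/scratch/results.bam                  # free the transient store
```

Only the files explicitly uploaded back survive the instance, which is why this needs per-workflow scripting rather than a one-size-fits-all setup.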
Hi Brad,

I used LVM2 to create the logical volume. I re-launched a new Galaxy CloudMan instance, since I had already removed the previous one. So I have an LVM volume of 2 TB (1.8 TB net); you can see this in the picture below: 1.5 TB available + 336 GB used = 1.8 TB.

The error/warning is:

    Did not find a volume attached to instance i-xxxx as device 'None', file system 'galaxyData' (vols=[])

If I launch an extra node, /mnt/galaxyData is nicely mounted onto the node:

    ubuntu@ip-10-46-134-155:~$ df -h
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sda1              15G   12G  3.3G  79% /
    devtmpfs              3.7G  116K  3.7G   1% /dev
    none                  3.8G     0  3.8G   0% /dev/shm
    none                  3.8G   96K  3.8G   1% /var/run
    none                  3.8G     0  3.8G   0% /var/lock
    none                  3.8G     0  3.8G   0% /lib/init/rw
    /dev/sdb              414G  201M  393G   1% /mnt
    domU-12-31-39-0A-62-12.compute-1.internal:/mnt/galaxyData     1.9T  336G  1.5T  19% /mnt/galaxyData
    domU-12-31-39-0A-62-12.compute-1.internal:/mnt/galaxyTools     10G  1.7G  8.4G  17% /mnt/galaxyTools
    domU-12-31-39-0A-62-12.compute-1.internal:/mnt/galaxyIndices  700G  654G   47G  94% /mnt/galaxyIndices
    domU-12-31-39-0A-62-12.compute-1.internal:/opt/sge             15G   12G  3.3G  79% /opt/sge

Uploading a file is OK, but the "Grooming" results in the following error (BTW, this grooming succeeds in a "normal" Galaxy CloudMan setup on the same file with the same parameters):

    WARNING:galaxy.datatypes.registry:Overriding conflicting datatype with extension 'coverage', using datatype from /mnt/galaxyData/tmp/tmpGx9fsi.

I then moved /mnt/galaxyData/tmp/tmpGx9fsi to /mnt/galaxyData/tmp/tmpGx9fsi.old, but that didn't help. I restarted all services (Galaxy, SGE, PostgreSQL)...
SGE log:

    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|I|read job database with 0 entries in 0 seconds
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|E|error opening file "/opt/sge/default/common/./sched_configuration" for reading: No such file or directory
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|E|error opening file "/opt/sge/default/spool/qmaster/./sharetree" for reading: No such file or directory
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|I|qmaster hard descriptor limit is set to 8192
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|I|qmaster soft descriptor limit is set to 8192
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|I|qmaster will use max. 8172 file descriptors for communication
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|I|qmaster will accept max. 99 dynamic event clients
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|I|starting up GE 6.2u5 (lx24-amd64)
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|W|can't open job sequence number file "jobseqnum": for reading: No such file or directory -- guessing next number
    02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|W|can't open ar sequence number file "arseqnum": for reading: No such file or directory -- guessing next number
    02/15/2012 11:22:12|worker|domU-12-31-39-0A-62-12|E|adminhost "domU-12-31-39-0A-62-12.compute-1.internal" already exists
    02/15/2012 11:22:13|worker|domU-12-31-39-0A-62-12|E|adminhost "domU-12-31-39-0A-62-12.compute-1.internal" already exists

I uploaded my fastq file (OK) and tried to "Groom" it.

Galaxy log:

    galaxy.jobs.runners.drmaa DEBUG 2012-02-15 11:30:53,425 (30) submitting file /mnt/galaxyTools/galaxy-central/database/pbs/galaxy_30.sh
    galaxy.jobs.runners.drmaa DEBUG 2012-02-15 11:30:53,425 (30) command is: python /mnt/galaxyTools/galaxy-central/tools/fastq/fastq_groomer.py '/mnt/galaxyData/files/000/dataset_58.dat' 'illumina' '/mnt/galaxyData/files/000/dataset_59.dat' 'sanger' 'ascii' 'summarize_input'; cd /mnt/galaxyTools/galaxy-central; /mnt/galaxyTools/galaxy-central/set_metadata.sh /mnt/galaxyData/files /mnt/galaxyData/tmp/job_working_directory/000/30 . /mnt/galaxyTools/galaxy-central/universe_wsgi.ini /mnt/galaxyData/tmp/tmp2GBeCB /mnt/galaxyData/tmp/job_working_directory/000/30/galaxy.json /mnt/galaxyData/tmp/job_working_directory/000/30/metadata_in_HistoryDatasetAssociation_59_Q8oYiT,/mnt/galaxyData/tmp/job_working_directory/000/30/metadata_kwds_HistoryDatasetAssociation_59_UXjfqE,/mnt/galaxyData/tmp/job_working_directory/000/30/metadata_out_HistoryDatasetAssociation_59_qWHyc4,/mnt/galaxyData/tmp/job_working_directory/000/30/metadata_results_HistoryDatasetAssociation_59_zGJk7G,,/mnt/galaxyData/tmp/job_working_directory/000/30/metadata_override_HistoryDatasetAssociation_59_KjamX7
    galaxy.jobs.runners.drmaa ERROR 2012-02-15 11:30:53,427 Uncaught exception queueing job
    Traceback (most recent call last):
      File "/mnt/galaxyTools/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 133, in run_next
        self.queue_job( obj )
      File "/mnt/galaxyTools/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 213, in queue_job
        job_id = self.ds.runJob(jt)
      File "/mnt/galaxyTools/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
        _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
      File "/mnt/galaxyTools/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
        return f(*(args + (error_buffer, sizeof(error_buffer))))
      File "/mnt/galaxyTools/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
        raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
    DeniedByDrmException: code 17: error: no suitable queues
    148.177.129.210 - - [15/Feb/2012:11:30:56 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
    galaxy.web.framework DEBUG 2012-02-15 11:30:59,815 Error: this request returned None from get_history(): http://127.0.0.1:8080/
    127.0.0.1 - - [15/Feb/2012:11:30:59 +0000] "GET / HTTP/1.1" 200 - "-" "Python-urllib/2.6"
    148.177.129.210 - - [15/Feb/2012:11:31:00 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
    148.177.129.210 - - [15/Feb/2012:11:31:04 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
    148.177.129.210 - - [15/Feb/2012:11:31:08 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
    148.177.129.210 - - [15/Feb/2012:11:31:12 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
    148.177.129.210 - - [15/Feb/2012:11:31:17 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"
    galaxy.web.framework DEBUG 2012-02-15 11:31:19,186 Error: this request returned None from get_history(): http://127.0.0.1:8080/
    127.0.0.1 - - [15/Feb/2012:11:31:19 +0000] "GET / HTTP/1.1" 200 - "-" "Python-urllib/2.6"
    148.177.129.210 - - [15/Feb/2012:11:31:21 +0000] "POST /root/history_item_updates HTTP/1.0" 200 - "http://ec2-23-20-77-195.compute-1.amazonaws.com/history" "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"

Kind Regards,
Yves
Yves; I'm hoping Enis can jump in here since he is more familiar with the internals of CloudMan and may be able to offer better advice. I can tell you what I see from your error messages.
> I used LVM2 to create the logical volume.
Does this involve stopping and restarting the master CloudMan node? The error messages you are seeing look like SGE is missing or not properly configured on the master node:
> 02/15/2012 11:22:08| main|domU-12-31-39-0A-62-12|E|error opening file "/opt/sge/default/common/./sched_configuration" for reading: No such file or directory
> [...]
> DeniedByDrmException: code 17: error: no suitable queues
which is causing the job submission to fail, since it can't find an SGE cluster environment to submit to. The strange thing is that SGE lives in /opt on the main EBS store, so I wouldn't expect your modified /mnt/galaxyData volume to influence this. Since starting worker nodes appears to be fine, I'd focus on the manipulations you are doing on the main instance. Perhaps you could repeat those setup steps without creating the logical volume, to see whether one of them causes the problem on its own? This could help narrow down the issue and hopefully get you running again.

Hope this helps,
Brad
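A few quick checks on the master node can confirm whether SGE lost its configuration, as the log suggests. These are standard Grid Engine 6.2 commands, sketched here as a diagnostic to run on the CloudMan master (paths per the thread's /opt/sge layout):

```shell
# Sketch: verify the SGE qmaster state on the master node.
ls /opt/sge/default/common/   # sched_configuration and act_qmaster should exist here
qconf -sconf                  # dump the global cluster configuration
qconf -sql                    # list configured queues; "no suitable queues" suggests this is empty
qstat -f                      # show queue instances and their current states
```

If `qconf -sql` prints nothing, the "DeniedByDrmException: code 17" makes sense: DRMAA has nowhere to place the job.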
Hi Brad,

I did not restart the master CloudMan node; I only restarted the services (Galaxy, PostgreSQL and SGE). I do not have these problems without creating the logical volume.

Kind Regards,
Yves
Hi Yves,

When you create the LVM file system, are you composing it from the volume that already contains the data (i.e., the directory structure created by CloudMan) and then adding another volume into the LVM, or starting with two new, clean volumes?

Maybe trying again and not messing with SGE at all would at least resolve the SGE issue; SGE is on the root file system, so it should be fine as is. I'd suggest stopping the Galaxy and PostgreSQL services (from the CloudMan Admin page), then, from the CLI, unmounting the galaxyData file system and proceeding to create the LVM. Mount the file system and ensure the directories and the data that were there are still there. Then start the PostgreSQL and Galaxy services back up. See if it all comes up fine, and try adding a worker node if it does.

Currently, CloudMan does not support composing a file system from multiple volumes, but I would think that as long as you did not restart the cluster and created the file system manually, things would work fine. I've been thinking about why you're seeing the described behavior and am not really sure, so please let me know how the above process works out.
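Enis's suggested sequence could be sketched as the following shell steps. The device names and volume-group/logical-volume names are assumptions for illustration, not taken from the thread, and `mkfs` wipes its target, so this variant assumes two new, clean volumes with the data copied over afterwards:

```shell
# 1. Stop Galaxy and PostgreSQL from the CloudMan Admin page first, then on the master:
umount /mnt/galaxyData                        # release the existing galaxyData volume

# 2. Build the LVM from two freshly attached EBS volumes (assumed devices):
pvcreate /dev/sdf /dev/sdg                    # register both volumes with LVM
vgcreate galaxydata_vg /dev/sdf /dev/sdg      # pool them into one volume group
lvcreate -l 100%FREE -n galaxydata_lv galaxydata_vg  # one ~2 TB logical volume
mkfs.ext4 /dev/galaxydata_vg/galaxydata_lv    # destructive: only on clean volumes

# 3. Mount, copy the old galaxyData contents back, and verify the layout:
mount /dev/galaxydata_vg/galaxydata_lv /mnt/galaxyData
ls /mnt/galaxyData                            # expect files/, tmp/, etc. before restarting services
```

Only after the directory structure checks out should PostgreSQL and Galaxy be restarted, as Enis describes.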
Hi Enis,

1. I first created the LVM on two new volumes.
2. Mounted the LVM.
3. Stopped all services (SGE, PostgreSQL, Galaxy).
4. Copied all data on the /mnt/galaxyData file system to the LVM.
5. Unmounted /mnt/galaxyData.
6. Mounted the LVM at /mnt/galaxyData.
7. Started all services.

As I mentioned in my previous posts, all seems to be OK, but I received a

    WARNING:galaxy.datatypes.registry:Overriding conflicting datatype with extension 'coverage', using datatype from /mnt/galaxyData/tmp/tmpGx9fsi.

while running the "Groom" tool. I didn't know what to do at that time and started "messing" around, removing tmp files, restarting SGE, etc. I later received the same error on a newly created Galaxy CloudMan instance with a normal (<1 TB) galaxyData file system. Greg Von Kuster replied that I had to remove a duplicate value in the datatypes_conf.xml file:

    Hello Yves,
    You have one or more entries in your datatypes_conf.xml file for a datatype named "coverage". These should be eliminated from your datatypes_conf.xml file because they are not valid datatypes (unless you have added proprietary datatypes to your Galaxy instance with this extension). They were originally in the datatypes_conf.xml.sample file for datatype indexers, but datatype indexers have been eliminated from the Galaxy framework because datatype converters do the same thing.
    Greg Von Kuster

Currently I am running multiple Galaxy CloudMan instances to circumvent the 1 TB limit. If I find some time, I will redo the exercise with the LVM.

Kind Regards,
Yves
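Greg's fix amounts to finding and deleting the "coverage" registration lines. A hedged sketch of locating them (the sample file below is fabricated for illustration; on a CloudMan instance the real file is datatypes_conf.xml under the galaxy-central directory):

```shell
# Create a small illustrative datatypes_conf.xml fragment (NOT the real file).
cat > /tmp/datatypes_conf_sample.xml <<'EOF'
<datatypes>
  <registration>
    <datatype extension="bed" type="galaxy.datatypes.interval:Bed" display_in_upload="true"/>
    <datatype extension="coverage" type="galaxy.datatypes.coverage:LastzCoverage"/>
  </registration>
</datatypes>
EOF
# Show any lines registering the obsolete "coverage" extension; per Greg,
# these are the entries to delete (unless you use a proprietary datatype
# by that name).
grep -n 'extension="coverage"' /tmp/datatypes_conf_sample.xml
```

After deleting the offending line(s) from the real file, Galaxy needs a restart for the registry to be rebuilt.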
participants (3)
- Brad Chapman
- Enis Afgan
- Wetzels, Yves [JRDBE Extern]