Re: [galaxy-dev] Exporting histories fails: no space left on device
Please keep all replies on-list so that everyone can contribute. Someone more knowledgeable about systems than I am suggests that lsof(8) and/or /proc/<galaxy server pid>/fd should yield some clues as to which file is being written to.

Good luck,
J.
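[Editor's note: a minimal sketch of the inspection being suggested here; the PID is a placeholder, and the grep pattern is just an example, not specific to this setup.]

===
# Find the Galaxy server process ID
ps aux | grep -i galaxy

# List every file that process currently has open (including files still growing on disk)
lsof -p <galaxy server pid>

# Or inspect the open file descriptors directly via procfs
ls -l /proc/<galaxy server pid>/fd
===

Watching this output while an export is running should show which path the archive is actually being written to.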
On Mar 25, 2013, at 10:01 AM, Joachim Jacob | VIB | wrote:

Hi,
About the failing history export: 1. The preparation step seems to work fine: choosing 'Export this history' in the History menu leads to a URL that initially reports that the export is still in progress.
2. When the export is finished and I click the download link, the root partition fills up and the browser displays "Error reading from remote server". A folder ccpp-2013-03-25-14:51:15-27045.new is created in /var/spool/abrt, and this is what fills the root partition.
The handler reports in its log:
"""
galaxy.jobs DEBUG 2013-03-25 14:38:33,322 (8318) Working directory for job is: /mnt/galaxydb/job_working_directory/008/8318
galaxy.jobs.handler DEBUG 2013-03-25 14:38:33,322 dispatching job 8318 to local runner
galaxy.jobs.handler INFO 2013-03-25 14:38:33,368 (8318) Job dispatched
galaxy.jobs.runners.local DEBUG 2013-03-25 14:38:33,432 Local runner: starting job 8318
galaxy.jobs.runners.local DEBUG 2013-03-25 14:38:33,572 executing: python /home/galaxy/galaxy-dist/lib/galaxy/tools/imp_exp/export_history.py -G /mnt/galaxytemp/tmpHAEokb/tmpQM6g_R /mnt/galaxytemp/tmpHAEokb/tmpeg7bYF /mnt/galaxytemp/tmpHAEokb/tmpPXJ245 /mnt/galaxydb/files/013/dataset_13993.dat
galaxy.jobs.runners.local DEBUG 2013-03-25 14:41:29,420 execution finished: python /home/galaxy/galaxy-dist/lib/galaxy/tools/imp_exp/export_history.py -G /mnt/galaxytemp/tmpHAEokb/tmpQM6g_R /mnt/galaxytemp/tmpHAEokb/tmpeg7bYF /mnt/galaxytemp/tmpHAEokb/tmpPXJ245 /mnt/galaxydb/files/013/dataset_13993.dat
galaxy.jobs DEBUG 2013-03-25 14:41:29,476 Tool did not define exit code or stdio handling; checking stderr for success
galaxy.tools DEBUG 2013-03-25 14:41:29,530 Error opening galaxy.json file: [Errno 2] No such file or directory: '/mnt/galaxydb/job_working_directory/008/8318/galaxy.json'
galaxy.jobs DEBUG 2013-03-25 14:41:29,555 job 8318 ended
"""
The system reports:
"""
Mar 25 14:51:26 galaxy abrt[16805]: Write error: No space left on device
Mar 25 14:51:27 galaxy abrt[16805]: Error writing '/var/spool/abrt/ccpp-2013-03-25-14:51:15-27045.new/coredump'
"""
Thanks, Joachim
Joachim Jacob
Rijvisschestraat 120, 9052 Zwijnaarde Tel: +32 9 244.66.34 Bioinformatics Training and Services (BITS) http://www.bits.vib.be @bitsatvib
On Tue 19 Mar 2013 11:22:27 PM CET, Jeremy Goecks wrote:
I'm unable to reproduce this behavior using a clean version of galaxy-dist. The code (export_history.py) doesn't create any temporary files and appears to write directly to the output file, so it seems unlikely that Galaxy is writing anything to the root directory.
Can you provide the name of any file that Galaxy appears to be writing to outside of <galaxy-home>? What about watching the job output file/export file to see if it is increasing in size and causing the out-of-space error?
Best, J.
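[Editor's note: a simple sketch of that kind of watching; the dataset path below is the one from the job log later in this thread, substitute your own.]

===
# Re-list the export output file every 5 seconds to see whether it keeps growing
watch -n 5 ls -lh /mnt/galaxydb/files/013/dataset_13993.dat

# In a second terminal, keep an eye on free space on the root partition
watch -n 5 df -h /
===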
On Mar 19, 2013, at 10:56 AM, Joachim Jacob | VIB | wrote:
Hi all,
Exporting histories fails on our server with "Reason: Error reading from remote server".
When looking at the logs on the system:

$ tail /var/log/messages
Mar 19 15:52:47 galaxy abrt[25605]: Write error: No space left on device
Mar 19 15:52:49 galaxy abrt[25605]: Error writing '/var/spool/abrt/ccpp-2013-03-19-15:52:37-13394.new/coredump'
So I watched my system when I repeated the export, and saw that Galaxy fills up the root directory (/), instead of any temporary directory.
Does somebody have an idea where to adjust this setting, so that the export function uses a proper temporary directory?
Thanks, Joachim
-- Joachim Jacob
Rijvisschestraat 120, 9052 Zwijnaarde Tel: +32 9 244.66.34 Bioinformatics Training and Services (BITS) http://www.bits.vib.be @bitsatvib
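[Editor's note: for reference, the temporary and job directories Galaxy itself writes to are set in universe_wsgi.ini; a minimal sketch assuming the standard option names of galaxy-dist from that era (verify against your own config file), with values matching the mounts seen in this thread. As the rest of the thread shows, the root partition was actually being filled by abrt core dumps, not by these paths.]

===
# universe_wsgi.ini, [app:main] section
# Directory for temporary files created by Galaxy
new_file_path = /mnt/galaxytemp
# Directory for per-job working directories
job_working_directory = /mnt/galaxydb/job_working_directory
===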
Hello Joachim,

Couple of things to check:
Something in your export is likely not finishing cleanly but crashing instead (either the creation of the archive or the download). The folder "/var/spool/abrt/ccpp-XXXX" (and especially a file named "coredump") hints that a program crashed. "abrt" is a daemon (at least on Fedora) that monitors crashes and tries to keep all relevant information about the program that crashed (http://docs.fedoraproject.org/en-US/Fedora/13/html/Deployment_Guide/ch-abrt....). So what might have happened is that a program (Galaxy's export_history.py or something else) crashed during your export, and then "abrt" picked up the pieces (storing a memory dump, for example) and filled your disk.
One thing to try: if you have Galaxy keeping temporary files, try running the "export" command manually:
===
python /home/galaxy/galaxy-dist/lib/galaxy/tools/imp_exp/export_history.py -G /mnt/galaxytemp/tmpHAEokb/tmpQM6g_R /mnt/galaxytemp/tmpHAEokb/tmpeg7bYF /mnt/galaxytemp/tmpHAEokb/tmpPXJ245 /mnt/galaxydb/files/013/dataset_13993.dat
===

Another thing to try: modify "export_history.py", adding debug messages to track progress and whether it finishes or not.

And: check the "abrt" program's GUI; perhaps you'll see previous crashes that were stored successfully, providing more information about which program crashed.

As a general rule, it's best to keep the "/var" directory on a separate partition on production systems, exactly so that filling it up with junk doesn't interfere with other programs. Even better, put each sub-directory of "/var" on a dedicated partition, so that filling up "/var/log" or "/var/spool" would not fill up "/var/lib/pgsql" and stop Postgres from working.

-gordon
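[Editor's note: a minimal sketch of the kind of debug messages being suggested; the log file path and the placement of the calls are illustrative only, and the actual structure of export_history.py will differ.]

===
# Hypothetical progress logging added near the top of export_history.py
import logging
import sys

logging.basicConfig(
    filename="/mnt/galaxytemp/export_history_debug.log",  # example path on the temp mount
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("export_history_debug")

log.debug("export started with args: %s", sys.argv[1:])
# ... existing export code runs here ...
log.debug("export finished writing the archive")
===

If the "finished" line never appears in the log file, the script died somewhere in between, which narrows down where to look.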
Hi Gordon,

Thanks for your assistance and the recommendations. Freezing Postgres sounds like hell to me :-)

abrt was indeed filling the root directory, so I disabled it.

I have done some export tests, and the behaviour is not consistent:

1. Size: in general, exports worked for smaller histories and usually crashed on bigger ones (starting from about 3 GB). So size is key?
2. But I have now found several histories of 4.5 GB that I was able to export, so much for the size hypothesis.

Another observation: when the export crashes, the corresponding web handler process dies.

So now I suspect something is wrong with the datasets, but I am not able to trace anything meaningful in the logs. I am not yet confident about turning on logging in Python, but apparently this happens with the "logging" module, initialised like logging.getLogger( __name__ ).

Cheers,
Joachim
Hello Joachim,
A crashing Python process crosses the fine boundary between the Galaxy code and Python internals... perhaps the Galaxy developers can help with this problem.

It would be helpful to find a reproducible case with a specific history or a specific sequence of events; then someone can help you with the debugging. Once you find a history that causes a crash (every time, or sometimes, but in a reproducible way), try to pinpoint exactly when it happens: is it when you start preparing the export (while "export_history.py" is running as a job), or when you start downloading the exported file? (I'm a bit behind on the export mechanism, so perhaps there are other steps involved.)

Couple of things to try:

1. Set "cleanup_job = never" in your universe_wsgi.ini - this will keep the temporary files and will help you reproduce jobs later.

2. Enable "abrt" again - it is not the problem (just the symptom). You can clean up the "/var/spool/abrt/XXX" directory from previous crash logs, then reproduce a new crash and look at the collected files (assuming you have enough space to store at least one crash). In particular, look at the file called "coredump" - it will tell you which script crashed. Try running:

$ file /var/spool/abrt/XXXX/coredump
coredump: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python XXXXXX.py'

Instead of "XXXXXX.py" it will show the Python script that crashed (hopefully with its full command-line parameters). It won't show which Python statement caused the crash, but it will point in the right direction.
It could be a bad dataset (file on disk), or a problem in the database, or something completely different (a bug in the python archive module). No point guessing until there are more details. -gordon
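[Editor's note: to go one step further than `file`, gdb can usually show at least the C-level backtrace from such a core file; a rough sketch, assuming the system python is the interpreter that crashed (adjust the paths to your setup).]

===
# Open the core dump with the same interpreter that produced it
gdb /usr/bin/python /var/spool/abrt/XXXX/coredump

# Inside gdb: C-level backtrace of the crashing thread
(gdb) bt
# If the python gdb helper scripts / debuginfo are installed, a Python-level view:
(gdb) py-bt
===

Even just the C-level backtrace usually tells you whether the crash happened inside a C extension (zlib, the tar/archive helpers, the database driver, ...) rather than in pure Python code.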
Hi Assaf,

After all, the problem appears not to be the total size of the history, but the size of the individual datasets. Histories which contain big datasets (>1 GB) imported from Data Libraries cause the export process to crash. Can somebody confirm whether this is a bug? I uploaded the datasets to a directory, and they are then imported from that directory into a Data Library.

Downloading datasets >1 GB from a Data Library directly (as tar.gz) also crashes.

Note: I have re-enabled abrt, but I am waiting for some jobs to finish before restarting.

Cheers,
Joachim
OK, it seems to be a proxy error. When the proxy does not receive data from the server, it times out and closes the connection. I think the process that packs the datasets takes too long, so the connection is closed before the packaging is finished? Just a guess...

From the httpd logs:
=====
[Thu Mar 28 15:14:46 2013] [error] [client 157.193.10.52] (70007)The timeout specified has expired: proxy: error reading status line from remote server localhost, referer: http://galaxy.bits.vib.be/library_common/browse_library?sort=name&f-description=All&f-name=All&id=142184b92db50a63&cntrller=library&async=false&show_item_checkboxes=false&operation=browse&page=1
[Thu Mar 28 15:14:46 2013] [error] [client 157.193.10.52] proxy: Error reading from remote server returned by /library_common/act_on_multiple_datasets, referer: http://galaxy.bits.vib.be/library_common/browse_library?sort=name&f-description=All&f-name=All&id=142184b92db50a63&cntrller=library&async=false&show_item_checkboxes=false&operation=browse&page=1
=====

I will see if changing the timeout settings fixes this issue.

Cheers,
Joachim
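[Editor's note: one quick way to confirm the proxy-timeout theory is to time the same download directly against a Galaxy backend and through Apache; the dataset path below is a placeholder, copy a real download link from the Data Library page.]

===
# Direct to one Galaxy backend, bypassing Apache
curl -s -o /dev/null -w "direct:  %{http_code} in %{time_total}s\n" "http://localhost:8080/<dataset download path>"

# Same request through the Apache proxy
curl -s -o /dev/null -w "proxied: %{http_code} in %{time_total}s\n" "http://galaxy.bits.vib.be/<dataset download path>"
===

If the direct request succeeds and the proxied one fails after roughly the proxy timeout, the problem is the Apache timeout rather than Galaxy itself.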
I can confirm that the proxy settings are the reason for the failing export. When I go to localhost:8080 directly, I can export large files from the Data Library. When going via the proxy using the URL, downloading large files does not work. Here is a hint at what the solution might be (http://serverfault.com/questions/185894/proxy-error-502-reason-error-reading...).

*** The error in the browser:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request POST /library_common/act_on_multiple_datasets.
Reason: Error reading from remote server

*** The error in the httpd logs:

[Fri Mar 29 10:22:03 2013] [error] [client 157.193.10.20] (70007)The timeout specified has expired: proxy: error reading status line from remote server localhost, referer: http://galaxy.bits.vib.be/library_common/browse_library?sort=name&f-description=All&f-name=All&id=142184b92db50a63&cntrller=library&async=false&show_item_checkboxes=false&operation=browse&page=1
[Fri Mar 29 10:22:03 2013] [error] [client 157.193.10.20] proxy: Error reading from remote server returned by /library_common/act_on_multiple_datasets, referer: http://galaxy.bits.vib.be/library_common/browse_library?sort=name&f-description=All&f-name=All&id=142184b92db50a63&cntrller=library&async=false&show_item_checkboxes=false&operation=browse&page=1

*** Our proxy settings

I would really appreciate it if somebody could have a look at our current Apache proxy settings. Since I suspect the problem to be a timeout, I have tried modifying related parameters, with no luck:

=======
[root@galaxy conf.d]# cat galaxy_web.conf
NameVirtualHost 157.193.230.103:80
<VirtualHost 157.193.230.103:80>
    ServerName galaxy.bits.vib.be
    SetEnv force-proxy-request-1.0 1   # tried this, does not help
    SetEnv proxy-nokeepalive 1         # tried this, does not help
    KeepAliveTimeout 600               # tried this, does not help
    ProxyPass /library_common/act_on_multiple_datasets http://galaxy.bits.vib.be/library_common/act_on_multiple_datasets max=6 keepalive=On timeout=600 retry=10   # tried this, does not help

    <Proxy balancer://galaxy>
        BalancerMember http://localhost:8080
        BalancerMember http://localhost:8081
        BalancerMember http://localhost:8082
        BalancerMember http://localhost:8083
        BalancerMember http://localhost:8084
        BalancerMember http://localhost:8085
        BalancerMember http://localhost:8086
        BalancerMember http://localhost:8087
        BalancerMember http://localhost:8088
        BalancerMember http://localhost:8089
        BalancerMember http://localhost:8090
        BalancerMember http://localhost:8091
        BalancerMember http://localhost:8092
    </Proxy>

    RewriteEngine on
    RewriteLog "/tmp/apacheGalaxy.log"

#    <Location "/">
#        AuthType Basic
#        AuthBasicProvider ldap
#        AuthLDAPURL "ldap://smeagol.vib.be:389/DC=vib,DC=local?sAMAccountName
#        AuthLDAPBindDN vib\administrator
#        AuthLDAPBindPassword <tofillin>
#        AuthzLDAPAuthoritative off
#        Require valid-user
#        # Set the REMOTE_USER header to the contents of the LDAP query response's "uid" attribute
#        RequestHeader set REMOTE_USER %{AUTHENTICATE_sAMAccountName}
#    </Location>

    RewriteRule ^/static/style/(.*) /home/galaxy/galaxy-dist/static/june_2007_style/blue/$1 [L]
    RewriteRule ^/static/scripts/(.*) /home/galaxy/galaxy-dist/static/scripts/packed/$1 [L]
    RewriteRule ^/static/(.*) /home/galaxy/galaxy-dist/static/$1 [L]
    RewriteRule ^/favicon.ico /home/galaxy/galaxy-dist/static/favicon.ico [L]
    RewriteRule ^/robots.txt /home/galaxy/galaxy-dist/static/robots.txt [L]
    RewriteRule ^(.*) balancer://galaxy$1 [P]
</VirtualHost>
======

Thanks,
Joachim
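[Editor's note: for the timeout itself, the directives that usually matter with mod_proxy are the server-wide Timeout, ProxyTimeout, and the per-member timeout= parameter; a sketch of the kind of change to experiment with, assuming the balancer setup shown above (the values are examples, not recommendations).]

===
# httpd core: overall I/O timeout for requests (value is only an example)
Timeout 1200

# mod_proxy: how long Apache waits for the backend before returning a 502
ProxyTimeout 1200

# Or set it per backend, inside the <Proxy balancer://galaxy> block,
# so slow tar.gz packaging does not trip the proxy error
BalancerMember http://localhost:8080 timeout=1200
===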
participants (3):
- Assaf Gordon
- Jeremy Goecks
- Joachim Jacob | VIB |