Hi Assaf,

After all, the problem appears not to be the total size of the history but the size of the individual datasets. Histories which contain big datasets (>1GB) imported from Data Libraries cause the exporting process to crash. Can somebody confirm whether this is a bug?

I uploaded the datasets to a directory, from which they are then imported into a Data Library. Downloading datasets >1GB directly from a data library (as tar.gz) also crashes.

Note: I have re-enabled abrt, but I am waiting for some jobs to finish before restarting.

Cheers,
Joachim

Joachim Jacob
Rijvisschestraat 120, 9052 Zwijnaarde
Tel: +32 9 244.66.34
Bioinformatics Training and Services (BITS)
http://www.bits.vib.be
@bitsatvib

On Tue 26 Mar 2013 03:45:43 PM CET, Assaf Gordon wrote:
Hello Joachim,
Joachim Jacob | VIB | wrote, On 03/26/2013 10:01 AM:
abrt was indeed filling the root directory, so I disabled it.
I have done some exporting tests, and the behaviour is not consistent.
1. *Size*: in general, exporting worked for smaller datasets and usually crashed on bigger ones (starting from about 3 GB). So size is key?
2. But now I have found several histories of 4.5 GB that I was able to export... so much for the size hypothesis.
Another observation: when the export crashes, the corresponding webhandler process dies.
A crashing python process crosses the fine boundary between the Galaxy code and Python internals... perhaps the Galaxy developers can help with this problem.
It would be helpful to find a reproducible case with a specific history or a specific sequence of events; then someone can help you with the debugging.
Once you find a history that causes a crash (every time, or only sometimes but in a reproducible way), try to pinpoint exactly when it happens: is it while the export is being prepared (when "export_history.py" is running as a job), or when you start downloading the exported file? (I'm a bit behind on the export mechanism, so perhaps there are other steps involved.)
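One rough way to tell these apart is to watch the server log around the time of the crash. The log file name below is an assumption (it depends on how you start Galaxy; adjust the path to wherever your instance actually logs):

    $ grep -n "export_history" paster.log | tail
    $ tail -100 paster.log   # look for a Traceback, or for the log simply stopping

A Python-level error in the export job usually leaves a traceback; a hard crash (segfault) of the web handler tends to leave the log cut off with nothing after it, which in itself points at native code rather than the Python layer.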
Couple of things to try:
1. set "cleanup_job=never" in your universe_wsgi.ini - this will keep the temporary files, and will help you re-produce jobs later.
2. Enable "abrt" again - it is not the problem (just the symptom). You can clean up the "/var/spool/abrt/XXX" directory from previous crash logs, then reproduce a new crash and look at the collected files (assuming you have enough space to store at least one crash). In particular, look at the file called "coredump" - it will tell you which script crashed. Try running:

    $ file /var/spool/abrt/XXXX/coredump
    coredump: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'python XXXXXX.py'
Instead of "XXXX.py" it would show the python script that crashed (hopefully with full command-line parameters).
It won't show which python statement caused the crash, but it will point in the right direction.
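For item 1, it is literally a one-line change; a minimal sketch, assuming the usual "[app:main]" section of universe_wsgi.ini (adjust if your file is laid out differently):

    [app:main]
    # keep job working directories and temporary files after jobs finish,
    # so a crashing export job can be inspected and re-run later
    cleanup_job = never

For item 2, if you want more than the `file` output, you can also open the coredump with gdb and ask for a backtrace. This is a generic gdb recipe rather than anything Galaxy-specific; the paths are placeholders, and you may need the python debuginfo packages installed to get readable frames:

    $ gdb /usr/bin/python /var/spool/abrt/XXXX/coredump
    (gdb) bt            # C-level backtrace of the thread that crashed
    (gdb) info threads  # list all threads captured in the core

Even without debug symbols, the top frames usually show whether the crash happened inside a compiled extension (zlib, database driver, etc.) or in the interpreter itself.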
So now I suspect something is wrong with the datasets, but I am not able to find anything meaningful in the logs. I am not yet confident about turning logging on in Python, but apparently it is set up with the "logging" module, initialized like logging.getLogger( __name__ ).
It could be a bad dataset (file on disk), or a problem in the database, or something completely different (a bug in the python archive module). No point guessing until there are more details.
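If you want to rule out the archiving step by itself, here is a minimal standalone sanity check, assuming you point it at one of the suspect dataset files on disk (the paths and names below are placeholders, and this is not Galaxy's own export code):

    import tarfile

    # Placeholder: one of the >1GB dataset files from a crashing history.
    dataset = "/path/to/suspect_dataset.dat"

    # Pack the single file into a gzipped tar, the same kind of archive
    # an export produces, to see whether archiving alone survives the file.
    archive = tarfile.open("test_export.tar.gz", "w:gz")
    archive.add(dataset, arcname="dataset.dat")
    archive.close()
    print("archive written without crashing")

If this small script crashes on the same file, the problem is below Galaxy (the python/zlib build on your machine); if it completes, the crash is more likely in Galaxy's export code or in how the job is run.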
-gordon