We upgraded from 1.0-beta7 to 1.0-rc17 on Tuesday evening and looking at today's logs, just four imap processes out of 90,000+ died. I managed to get core files (the server is running Solaris 8 on Sparc, and Dovecot was compiled with gcc 3.3.2). I hope they're useful!
(I should also say that performance between the two versions is roughly similar according to Solaris' process accounting stats, looking at CPU usage and disk blocks accessed.)
Best Wishes, Chris
- Here's the last log entry for each
dovecot: Jan 11 11:39:59 Info: IMAP 5751 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (18 msgs) dovecot: Jan 11 11:39:59 Error: child 5751 (imap) killed with signal 11
dovecot: Jan 11 12:28:49 Info: IMAP 12765 xxxxxxxx xxx.xxx.xxx.xxx : Disconnected: Internal error occurred. Refer to server log for more information. [2007-01-11 12:28:49] dovecot: Jan 11 12:28:49 Error: child 12765 (imap) killed with signal 10
dovecot: Jan 11 13:01:49 Info: IMAP 14234 xxxxxxxx xxx.xxx.xxx.xxx : Opened List Reading (124 msgs) dovecot: Jan 11 13:01:54 Error: child 14234 (imap) killed with signal 10
dovecot: Jan 11 19:13:55 Info: IMAP 18786 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (45 msgs) dovecot: Jan 11 19:13:57 Error: child 18786 (imap) killed with signal 11
(I have extra logging lines in cmd-select.c, hence the "Opened" lines)
- backtraces!
pid 5751
pid 12765
pid 14234
pid 18786
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Thu, 2007-01-11 at 21:32 +0000, Chris Wakelin wrote:
Didn't it log the internal error?
These are all from malloc/free, which means there's memory corruption, which is bad. :( It can't really be debugged from the backtraces, since the actual bug could have occurred long before it crashed.
Do these happen for different users or only for a couple of users? Do the crashed users have unusually large mailboxes?
Could you run Dovecot with this: http://dovecot.org/patches/debug/memdebug-bof.diff I don't think it'll affect performance much. Only the imap binary needs to be compiled with it.
There's also this patch: http://dovecot.org/patches/debug/memdebug-delayed.diff But it does hurt performance and probably makes Dovecot use a lot more memory, so it's useful only if you can run it only for a couple of users who have crashing problems (make mail_executable to be a script which checks the username and based on that executes different imap binary).
On Tue, 2007-01-16 at 17:23 +0200, Timo Sirainen wrote:
http://dovecot.org/patches/debug/memdebug-bof.diff http://dovecot.org/patches/debug/memdebug-delayed.diff
BTW. Those check different things. The first checks only buffer under/overflows, and the latter checks only double-free()s. So if one doesn't show anything, the other probably then will.
Also could you show me your Dovecot settings (dovecot -n)?
Timo Sirainen wrote:
Sorry, yes, it was: Corrupted Index file <path>/.imap/Gloria/dovecot.index: Record [2].uid=0
I've had four of these between 11th-15th January, all for different folders/users, not always "Record [2]" but all of them had "uid=0".
...
They seem to happen only the once for 3 or 4 different users each day. I guess they're caused by corrupt indexes left over from the earlier version of Dovecot. I don't think the folders are particularly large. I'll look at restoring the relevant index files and folders from backup for some of the users (i.e. before they were fixed by the new Dovecot) and seeing if I can reproduce the crashes. We only backup some of the indexes though (those not stored on a separate partition).
I'd rather not do these to the live server, but if I manage to reproduce the crashes as above, I'll certainly do it.
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Tue, 2007-01-16 at 16:24 +0000, Chris Wakelin wrote:
I'm beginning to think that this has something to do with creating index files and a race condition somewhere in it. I've done a couple of fixes that should change this error message to something else:
http://dovecot.org/list/dovecot-cvs/2007-January/007375.html http://dovecot.org/list/dovecot-cvs/2007-January/007377.html http://dovecot.org/list/dovecot-cvs/2007-January/007385.html
The last patch is the most important one, if my guess is correct. You should then start seeing "Record field cache points outside record size". I'm still not sure why that happens though.
Hmm. Actually I think that would also explain the memory corruption. Great, one less thing to worry about. Now please just tell me that my fix works? ;)
Timo Sirainen wrote:
I've been foraging among our backup indexes and finding tapes that haven't gone off site yet. I've had three cases from today of the "Corrupted Index file" but only two crashed (one with segfault, one with bus error). I'll need to pick up tapes tomorrow morning from our other on-site tape store to get the index files. Some of the folders themselves don't appear to have changed recently, but I'll get the backup copies of them too, to be sure. I think I'll have three cases to play with as a beginning.
If I'm right, and it's the processing of these index files causing the memory corruption, I should be able to reproduce it using the recovered files on my test server. I can then try your debugging patches and the latest fixes.
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
Just thought I'd update you with what I've managed!
Well I've got one folder plus index directory that causes a segfault crash both times I've tried on the test machine (just "SELECT"/"UID FETCH * BODY[]"/"LOGOUT"):
but no "Corrupted Index" error message like there was on the original.
I've tried adding the memdebug*.diff patches, but hit problems with the lack of a walkcontext function (after originally having problems with Makefile.am and a broken autoreconf on the Solaris box - I ran autoreconf on a Linux box in the end!)
I tried the same folder on a Linux installation of 1.0-rc17, but that was 64-bit, whereas my Solaris version is 32-bit, and so the index was just invalid. I'm now trying to compile a 32-bit Linux version to try instead.
I've tried to apply the patches you suggested from CVS, but they fail. You've not been building nightly CVS snapshots, so I guess I'll have to use cvs and autoreconf on the Linux box to make my own!
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
...
Well the 32-bit version doesn't crash on Linux (looks the same as the 64-bit version), but the memdebug-delayed.diff patched version on Linux gives a warning:
dovecot: Jan 17 18:55:27 Warning: IMAP 15442 xxxxxxxx xxx.xxx.xxx.xxx : Timeout leak: 0x80b1ff0
Right, dinnertime! Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Wed, 2007-01-17 at 18:59 +0000, Chris Wakelin wrote:
Was your Solaris a Sparc or x86? If it was Sparc, the index files use different endianess and I think Dovecot just silently overwrites the index without using it.
I'm pretty sure anyway that this fixes it: http://dovecot.org/list/dovecot-cvs/2007-January/007395.html
dovecot: Jan 17 18:55:27 Warning: IMAP 15442 xxxxxxxx xxx.xxx.xxx.xxx : Timeout leak: 0x80b1ff0
Yes, that's normal.
Timo Sirainen wrote:
:lightbulb: that explains it :)
...
dovecot-20070117.tar.gz seems to have fixed the reproducible crash, but I did get a couple of extra compiler warnings (gcc 3.3.2):-
mail-index.c: In function mail_index_parse_extensions': mail-index.c:342: warning: comparison between signed and unsigned mail-index.c: In function
mail_index_map_clone':
mail-index.c:1242: warning: comparison between signed and unsigned
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
Hi Timo,
I've got three reproducible rc17 crashes, all fixed in dovecot-20070117.tar.gz, and I've managed to get the same crashes in Solaris 10 on Sparc. However, Solaris 10 has walkcontext() so I've been able to use versions with the memdebug-* patches. The memdebug-delayed.diff version doesn't crash or log anything interesting, but the memdebug-bof.diff version does:
for two of them and:
for the other.
Is this what you expected? Each case was LOGIN, SELECT, UID FETCH * BODY[], LOGOUT.
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Thu, 2007-01-18 at 17:42 +0000, Chris Wakelin wrote:
memdebug-bof.diff had the possibility to notice it, but it should have logged an error in that case. Your backtraces show that it crashed so it didn't detect the overflow. Anyway, I think that's still somewhat expected because it wasn't exactly ordinary buffer overflow.
On Wed, 2007-01-17 at 22:21 +0000, Chris Wakelin wrote:
mail-index.c: In function
mail_index_parse_extensions': mail-index.c:342: warning: comparison between signed and unsigned mail-index.c: In function
mail_index_map_clone': mail-index.c:1242: warning: comparison between signed and unsigned
gcc 4.1.2 doesn't give these anymore and since the warnings are kind of wrong anyway, I'll leave them as they are.
it did not take long:
dovecot: Jan 16 23:26:08 Info: imap-login: Login: user=<rainer>, method=PLAIN, rip=192.168.1.232, lip=192.168.1.1, TLS dovecot: Jan 16 23:32:16 Info: imap-login: Login: user=<rainer>, method=PLAIN, rip=192.168.1.232, lip=192.168.1.1, TLS dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Fixed index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: first_recent_uid_lowwater 330 -> 329 dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: child 1239 (imap) killed with signal 11 dovecot: Jan 16 23:39:15 Info: imap-login: Login: user=<rainer>, method=PLAIN, rip=192.168.1.232, lip=192.168.1.1, TLS dovecot: Jan 16 23:39:15 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:39:15 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2)
my setup: linux 2.4.34, postfix, amavisd, dovecot, thunderbird-1.5.0.9 on XP, with rules moving incoming mails depending on "subject" and "from" into various subfolders.
Timo Sirainen wrote:
-- in diesem Sinne, Rainer
On Tue, 2007-01-16 at 23:48 +0100, Rainer Ophoven wrote:
dovecot: Jan 16 23:37:14 Error: child 1239 (imap) killed with signal 11
Can you get a gdb backtrace from this?
Do you use Dovecot's deliver? Does anything ever delete the index files? Do you have mmap_disable=yes? If not, try if setting it helps? Do you have any keywords set in your maildir (and does dovecot-keywords file contain anything)?
I've been doing pretty heavy stress testing in my own computer the whole day, but I haven't managed to cause this bug to happen with me. Luckily I've found and fixed a few other bugs. :)
participants (3)
-
Chris Wakelin
-
Rainer Ophoven
-
Timo Sirainen