[Dovecot] POP3 error

Tue Mar 8 18:26:51 EET 2011

Hi Thierry,

On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
> >
> >> top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 14.55
> >> Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
> >> Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
> >> Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
> >> Swap:  4192956k total,      140k used,  4192816k free,  8228744k cached
>
> As you can see from the numbers (55.04, 29.13, 14.55), the load was still 
> climbing when I took this snapshot, and this was not a normal 
> situation. Usually this machine's load is only between 1 and 4, which is 
> quite ok for a quad core. It only happens when dovecot starts generating 
> errors, and pop3, imap and http get stuck.  It went up to 200, and I was 
> still able to stop web and mail daemons, then restart them, and 
> everything was back to normal.

I don't have a definite answer, but I remember that there has been a 
long-running bug in the Linux kernel where the I/O scheduler behaves badly 
under heavy writes:

"One of the problems commonly talked about in our forums and elsewhere is 
the poor responsiveness of the Linux desktop when dealing with significant 
disk activity on systems where there is insufficient RAM or the disks are 
slow. The GUI basically drops to its knees when there is too much disk 
activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] 
(Note that it's not just the GUI: all other tasks can starve when a disk 
I/O queue builds up.)

"There are a few options to tune the linux IO scheduler that can help a 
bunch... Typically CFQ stalls too long under heavy writes, especially if 
your disk subsystem sucks, so particularly if you have several spindles 
deadline is worth a try." [http://communities.vmware.com/thread/82544]

"I run Ubuntu on a moderately powerful quad-core x86-64 system and the 
desktop response is basically crippled whenever something is reading or 
writing large files as fast as it can (at normal priority)... For example, 
cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely 
unusable because of the I/O latency."
[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]

"I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O 
^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result 
was, well, absolutely, positively _DEVASTATING_. The entire system became 
_FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 
worked (took 20 seconds for even this key to be respected)." 
[http://lkml.org/lkml/2010/4/4/86]

"This regression has been around since about the 2.6.18 timeframe and has 
eluded a lot of testing to isolate the root cause. The most promising fix 
is in the VM subsystem (mm) where the LRU scan has been changed to favor 
keeping executable pages active longer. Most of these symptoms come down 
to VM thrashing to make room for I/O pages. The key change/commit is 
ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable 
pages the first class citizen"... This change was merged into the 2.6.31-rc1 
kernel." 
[https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]

One possible cause is that writing to a slow device can block the write 
queue for other devices, causing the machine to come to a standstill when 
there's plenty of useful work that it could be doing.

This could cause a cascading failure in your server as soon as disk 
I/O write load goes over a certain point, a bit like a swap death. I'm not 
sure if the fact that you're using NFS makes a difference; perhaps only if 
you memory-map files?
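
To check whether an I/O queue really is building up on one device when the 
load climbs, you could look at the "I/Os currently in progress" counter in 
/proc/diskstats. Here is a minimal sketch, assuming the usual layout of that 
file (major, minor, device name, then the counters); the function name and 
the sample figures are made up for illustration:

```python
import os


def in_flight_by_device(diskstats_text):
    """Map each device name to its number of I/Os currently in progress."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Expected layout: major, minor, name, then at least 11 counters.
        # The ninth counter (index 11 overall) is "I/Os currently in progress".
        if len(fields) < 14:
            continue  # skip short or unexpected lines
        result[fields[2]] = int(fields[11])
    return result


# Made-up sample in the /proc/diskstats layout, for illustration only.
SAMPLE = """\
   8       0 sda 1000 10 20000 500 2000 20 40000 900 3 1200 1400
   8      16 sdb 50 0 400 30 70 5 600 9000 42 8000 99000
"""

if __name__ == "__main__":
    path = "/proc/diskstats"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE  # not on Linux: fall back to the sample
    for name, count in sorted(in_flight_by_device(text).items()):
        print(name, count)
```

A device whose count stays high while the others sit near zero would be 
the first place I'd look. This only shows local block devices, so it 
won't say anything about the NFS side.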

You could test this by booting with the noop, deadline or anticipatory 
scheduler (via the elevator= kernel parameter) instead of the default CFQ 
to see if it makes any difference.
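
Before rebooting, it may be worth confirming which scheduler each disk is 
actually using. The kernel lists the available schedulers in 
/sys/block/<device>/queue/scheduler, with the active one in square 
brackets, and root can switch at runtime by writing another name to the 
same file. A rough sketch of reading it follows; the device name "sda" and 
the function name are only assumptions for the example:

```python
import os


def parse_scheduler(text):
    """Return (active, available) from e.g. 'noop anticipatory deadline [cfq]'."""
    active = None
    available = []
    for name in text.split():
        # The active scheduler is the one wrapped in square brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


if __name__ == "__main__":
    device = "sda"  # assumption: replace with your own disk
    path = "/sys/block/%s/queue/scheduler" % device
    if os.path.exists(path):
        with open(path) as handle:
            active, available = parse_scheduler(handle.read())
        print("active:", active)
        print("available:", ", ".join(available))
    else:
        print("no scheduler file for", device)
```

If the disks turn out not to be on CFQ already, the scheduler is probably 
not the culprit.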

Cheers, Chris.
-- 
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES

Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.