[Dovecot] POP3 error
chris-dovecot-110112 at aptivate.org
Tue Mar 8 18:26:51 EET 2011
On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
> >> top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
> >> Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
> >> Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
> >> Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
> >> Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
> As you can see from the numbers (55.04, 29.13, 14.55), the load was
> still climbing when I took this snapshot, and this was not a normal
> situation. Usually this machine's load is only between 1 and 4, which is
> quite OK for a quad core. It only happens when Dovecot starts generating
> errors, and POP3, IMAP and HTTP get stuck. It went up to 200, and I was
> still able to stop the web and mail daemons, then restart them, and
> everything was back to normal.
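Thierry's reading of the snapshot can be made concrete. Dividing each load average by the number of cores gives runnable tasks per core (on Linux this count also includes tasks blocked in uninterruptible disk waits), and a 1-minute average above the 5- and 15-minute ones means the load is still rising. What follows is a minimal Python sketch that assumes the quad-core machine described above. `load_per_core` and `is_rising` are hypothetical helpers, not part of Dovecot or top:

```python
import os

# Hypothetical helpers for a "load average" triple (1, 5 and 15 minutes).

def load_per_core(loads, cores):
    """Divide each load average by the core count (runnable tasks per core)."""
    return tuple(x / cores for x in loads)

def is_rising(loads):
    """True if the 1-minute average exceeds the 5-minute one, which exceeds the 15-minute one."""
    one, five, fifteen = loads
    return one > five > fifteen

if __name__ == "__main__":
    # Figures from the quoted top snapshot, on the quad core Thierry describes.
    snapshot = (55.04, 29.13, 14.55)
    print(load_per_core(snapshot, 4))  # roughly 13.8, 7.3 and 3.6 tasks per core
    print(is_rising(snapshot))         # True: the load was still climbing
    # On a live Unix host, the same check can use the kernel's own figures.
    if hasattr(os, "getloadavg"):
        print(load_per_core(os.getloadavg(), os.cpu_count() or 1))
```

With the quoted figures that comes to roughly 14 runnable tasks per core. At the normal load of 1 to 4, the figure would be at most about 1 per core.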
I don't have a definite answer, but I remember that there has been a
long-running bug in the Linux kernel with schedulers behaving badly under
heavy disk I/O:
"One of the problems commonly talked about in our forums and elsewhere is
the poor responsiveness of the Linux desktop when dealing with significant
disk activity on systems where there is insufficient RAM or the disks are
slow. The GUI basically drops to its knees when there is too much disk
I/O (note, it's not just the GUI, all other tasks can starve when a disk
I/O queue builds up)."
"There are a few options to tune the linux IO scheduler that can help a
bunch... Typically CFQ stalls too long under heavy writes, especially if
your disk subsystem sucks, so particularly if you have several spindles
deadline is worth a try." [http://communities.vmware.com/thread/82544]
"I run Ubuntu on a moderately powerful quad-core x86-64 system and the
desktop response is basically crippled whenever something is reading or
writing large files as fast as it can (at normal priority)... For example,
cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely
unusable because of the I/O latency."
"I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O
^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result
was, well, absolutely, positively _DEVASTATING_. The entire system became
_FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1
worked (took 20 seconds for even this key to be respected)."
"This regression has been around since about the 2.6.18 timeframe and has
eluded a lot of testing to isolate the root cause. The most promising fix
is in the VM subsystem (mm) where the LRU scan has been changed to favor
keeping executable pages active longer. Most of these symptoms come down
to VM thrashing to make room for I/O pages. The key change/commit is
ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable
pages the first class citizen"... This change was merged into the 2.6.31-rc1..."
One possible cause is that writing to a slow device can block the write
queue for other devices, bringing the machine to a standstill even though
there's plenty of useful work it could be doing.
This could cause a cascading failure in your server as soon as the disk
write load passes a certain point, a bit like a swap death. I'm not sure
whether the fact that you're using NFS makes a difference; perhaps only if
you memory-map files?
You could test this by booting with the NOOP or anticipatory scheduler
instead of the default CFQ to see if it makes any difference.
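On 2.6 kernels this test does not strictly need a reboot. Each block device exposes its scheduler in sysfs at /sys/block/<device>/queue/scheduler, with the active one shown in brackets (e.g. `noop anticipatory deadline [cfq]`). Writing a name to that file as root switches the scheduler at runtime, and the change does not persist across reboots. The boot-time route is the `elevator=` kernel parameter (e.g. `elevator=noop`). What follows is a minimal Python sketch. It assumes a device named `sda` (hypothetical; use the disk that actually backs the mail store) and a kernel that offers the scheduler you ask for. The function names are illustrative only:

```python
from pathlib import Path

def parse_scheduler(text):
    """Parse e.g. 'noop anticipatory deadline [cfq]' into (active, available)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the one in use
            active = name
        available.append(name)
    return active, available

def scheduler_path(device, sysfs_root="/sys/block"):
    return Path(sysfs_root) / device / "queue" / "scheduler"

def get_scheduler(device, sysfs_root="/sys/block"):
    """Return (active, available) for one block device."""
    return parse_scheduler(scheduler_path(device, sysfs_root).read_text())

def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Switch the device's I/O scheduler at runtime (needs root on a real system)."""
    _, available = get_scheduler(device, sysfs_root)
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    scheduler_path(device, sysfs_root).write_text(name)

if __name__ == "__main__":
    print(parse_scheduler("noop anticipatory deadline [cfq]"))
    # prints ('cfq', ['noop', 'anticipatory', 'deadline', 'cfq'])
```

For example, calling `set_scheduler("sda", "noop")` as root while the load is climbing, then comparing behaviour with `"cfq"` restored, would show whether the scheduler is involved. Changing only the spool disk makes the effect easier to isolate.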
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES
Aptivate is a not-for-profit company registered in England and Wales
with company number 04980791.