[Dovecot] POP3 error

Tue Mar 8 19:00:54 EET 2011

On 08 Mar 2011, at 18:26, Chris Wilson wrote:

> Hi Thierry,
> 
> On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
>> On 08 Mar 2011, at 13:24, Chris Wilson wrote:
>>> 
>>>> top - 11:10:14 up 14 days, 12:04,  2 users,  load average: 55.04, 29.13, 14.55
>>>> Tasks: 474 total,  60 running, 414 sleeping,   0 stopped,   0 zombie
>>>> Cpu(s): 99.6%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
>>>> Mem:  16439812k total, 16353268k used,    86544k free,    33268k buffers
>>>> Swap:  4192956k total,      140k used,  4192816k free,  8228744k cached
>> 
>> As you can see from the numbers (55.04, 29.13, 14.55), the load was still 
>> climbing when I took this snapshot, and this was not a normal 
>> situation. Usually this machine's load is only between 1 and 4, which is 
>> quite ok for a quad core. It only happens when dovecot starts generating 
>> errors, and pop3, imap and http get stuck.  It went up to 200, and I was 
>> still able to stop web and mail daemons, then restart them, and 
>> everything was back to normal.
> 
> I don't have a definite answer, but I remember that there has been a 
> long-running bug in the Linux kernel with schedulers behaving badly under 
> heavy writes:
> 
> "One of the problems commonly talked about in our forums and elsewhere is 
> the poor responsiveness of the Linux desktop when dealing with significant 
> disk activity on systems where there is insufficient RAM or the disks are 
> slow. The GUI basically drops to its knees when there is too much disk 
> activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] 
> (note, it's not just the GUI, all other tasks can starve when a disk I/O 
> queue builds up).
> 
> "There are a few options to tune the linux IO scheduler that can help a 
> bunch... Typically CFQ stalls too long under heavy writes, especially if 
> your disk subsystem sucks, so particularly if you have several spindles 
> deadline is worth a try." [http://communities.vmware.com/thread/82544]
> 
> "I run Ubuntu on a moderately powerful quad-core x86-64 system and the 
> desktop response is basically crippled whenever something is reading or 
> writing large files as fast as it can (at normal priority)... For example, 
> cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely 
> unusable because of the I/O latency."
> [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]
> 
> "I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O 
> ^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result 
> was, well, absolutely, positively _DEVASTATING_. The entire system became 
> _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 
> worked (took 20 seconds for even this key to be respected)." 
> [http://lkml.org/lkml/2010/4/4/86]
> 
> "This regression has been around since about the 2.6.18 timeframe and has 
> eluded a lot of testing to isolate the root cause. The most promising fix 
> is in the VM subsystem (mm) where the LRU scan has been changed to favor 
> keeping executable pages active longer. Most of these symptoms come down 
> to VM thrashing to make room for I/O pages. The key change/commit is 
> ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable 
> pages the first class citizen"... This change was merged into the 2.6.31r1 
> kernel." 
> [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]
> 
> One possible cause is that writing to a slow device can block the write 
> queue for other devices, causing the machine to come to a standstill when 
> there's plenty of useful work that it could be doing.
> 
> This could cause a cascading failure in your server as soon as disk 
> I/O write load goes over a certain point, a bit like a swap death. I'm not 
> sure if the fact that you're using NFS makes a difference; perhaps only if 
> you memory-map files?
> 
> You could test this by booting with the NOOP or anticipatory scheduler 
> instead of the default CFQ to see if it makes any difference.
> 
> Cheers, Chris.

Hi Chris,

Thanks for your (long) comment and the technical details. However, nothing has changed on the 7 machines except the move from dovecot 1.10.13 to 2.0.9, and our traffic has not increased, so I don't want to start changing tricky stuff in a system that has worked fine for almost 2 years. Also, all mail is stored on multiple NFS servers and every machine has 16G of RAM, which makes me think that it's not an IO problem.
I thought it might be the system running out of resources, but there is nothing about it in the logs...
For now, we might consider reverting to 1.10.13... but that would mean losing the new features that made us upgrade, so not good.
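
For anyone who wants to look at the scheduler question without changing anything, here is a minimal Python sketch. It assumes the usual Linux sysfs layout, where /sys/block/<device>/queue/scheduler lists the available I/O schedulers with the active one in square brackets (for example "noop anticipatory deadline [cfq]"). The function names are made up for illustration, and the script only reads; it never switches the scheduler.

```python
import re
from pathlib import Path


def parse_active_scheduler(text):
    """Return the scheduler name shown in square brackets, or None if absent."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def list_schedulers(sys_block="/sys/block"):
    """Map each block device to its active scheduler (assumes Linux sysfs)."""
    result = {}
    root = Path(sys_block)
    if not root.is_dir():
        # Not Linux, or sysfs is not mounted: nothing to report.
        return result
    for dev in sorted(root.iterdir()):
        sched_file = dev / "queue" / "scheduler"
        try:
            result[dev.name] = parse_active_scheduler(sched_file.read_text())
        except OSError:
            # Some devices expose no scheduler file; skip them.
            continue
    return result


if __name__ == "__main__":
    for name, sched in list_schedulers().items():
        print(f"{name}: {sched}")
```

Run as a normal user, it prints one line per block device. Since the mail spools here live on NFS, the local schedulers would only matter for local disk traffic (logs, indexes, swap), so this is a sanity check rather than a diagnosis.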