[Dovecot] Ongoing performance issues with 2.0.x

Mon Nov 8 20:39:32 EET 2010

Udo Wolter put forth on 11/8/2010 4:45 AM:
> * Ralf Hildebrandt <Ralf.Hildebrandt at charite.de>:
>>> And I'm guessing you're running a 32bit PAE kernel because VMWare ESX
>>> still doesn't officially support 64bit guests, correct?
>>
>> No, it's supported, but I don't want to change the whole system.
> 
> That's right, we cannot switch without having several hours downtime. This is
> not acceptable. I'm thinking of a way for switching to 64 bit with exchanging
> disks etc. But I don't know if this will work, I have to test it first.

Does this machine have more than 4GB of RAM?  You do realize that merely
utilizing PAE will cause an increase in context switching, whether on
bare metal or in a VM guest.  It will probably actually be much higher
with a VM guest running a PAE kernel.  Also, please tell me the ESX
kernel you're running is native 64 bit, not 32 bit.  If the VMWare
kernel itself is doing PAE, as well as the guest Linux kernel, this may
fully explain the performance disaster you have on your hands, if it is
indeed due to context switching.
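
To put a number on the context switching, you can sample the kernel's
cumulative counter twice and take the difference.  A rough Python
sketch, assuming a Linux guest with the usual "ctxt" line in /proc/stat.
The function names and the 5 second interval are my own choices for
illustration, nothing from Dovecot or VMWare:

```python
import time


def parse_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before, after, seconds):
    """Average context switches per second between two samples."""
    return (after - before) / seconds


def sample_rate(path="/proc/stat", interval=5.0):
    """Read the counter twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return switches_per_second(before, after, interval)
```

Calling sample_rate() on the guest under 1.2.x and again under 2.0.x at
a similar load would give two figures to compare.  The "cs" column of
vmstat reports the same thing if you'd rather not script it.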

The bigger question is, why does this problem surface so readily while
running Dovecot 2.0.x and not while running Dovecot 1.2.x?  Is 1.2.x
merely tickling the dragon's chin, whereas 2.0.x is sticking its head
into the dragon's mouth?

>>> Is this the only guest on this host or do you have others?
>>
>> only guest
> 
> Yes, the VM-system has 8 CPUs and that's all the ESX has. Of course, there are
> times, when the ESX doesn't have that much stress so the DRS moves 1 or 2 other
> machines onto it. But since we got that high load, the rest of the machines all
> had been moved off the ESX.
> 
>>> If this is the only guest, you have 2 dual core dies in that Xeon CPU,
>>> 4 cores total.  I assume you've assigned 4 virtual CPUs to this Debian
>>> VM?
>>
>> Yes, something like that
> 
> 8.

Ralf gave me the model number of that server and said it was a single
CPU machine.  I looked up the specs, and if that is the case, there are
4 cores total in that Xeon.  And, IIRC, that Xeon does not have the
HyperThreading circuitry.  So, are there two physical CPUs in the
machine with 4 cores each, or 1 CPU with 4 cores and HT, appearing as 8
cores?  If it's one 4 core CPU with HT enabled, reboot the machine and
disable HT in the BIOS.  HT itself also contributes to high context
switching.  HT is more of a hindrance to ESX performance than a benefit.

www.vmware.com/pdf/vi_performance_tuning.pdf
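
One way to answer the HT question from a Linux prompt is to compare the
"siblings" and "cpu cores" fields of /proc/cpuinfo: on x86, more
siblings than cores per package means HT is active.  A minimal Python
sketch, with helper names of my own invention.  Note the assumption:
inside the guest these fields describe the virtual topology, so this
only tells you about the physical CPU when run on Linux booted on the
bare hardware.  Otherwise check the BIOS or the ESX host itself.

```python
def parse_topology(cpuinfo_text):
    """Return (siblings, cpu_cores) from the first processor entry.

    Either value is None if the field is absent from the text.
    """
    values = {}
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence of each field.
        if key in ("siblings", "cpu cores") and key not in values:
            values[key] = int(value.strip())
    return values.get("siblings"), values.get("cpu cores")


def hyperthreading_active(cpuinfo_text):
    """True/False if it can be determined, None if fields are missing."""
    siblings, cores = parse_topology(cpuinfo_text)
    if siblings is None or cores is None:
        return None
    return siblings > cores
```

Feed it the output of "cat /proc/cpuinfo", e.g.
hyperthreading_active(open("/proc/cpuinfo").read()).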

>>> You may want to run top in the hypervisor console itself (or an SSH
>>> session into the hypervisor) and watch the %CPU of the hypervisor's
>>> kernel threads.  That might tell us something as well.
>>
>> Udo has to answer that, but from what he told me it was fully using
>> all cpus with 2.0, and now it's idling with 1.2
>>
>> More details to follow (from him)
> 
> As I said in the other mail: as long as the load isn't high enough we cannot
> see any problems in the ESX. Only, if we step over some kind of specific
> barrier. I think, it's when even the ESX runs out of possibilities to handle so
> many interrupts.

This very well may be the case.  You need to also look at the CONFIG_HZ=
value of the Linux kernel of the guest.  If it's a tickless kernel you
should be fine.  If tickless, IIRC, you should see CONFIG_NO_HZ=y.
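
If you want to check both settings in one go, here is a small Python
sketch.  It assumes the Debian convention of keeping the build config
at /boot/config-<kernel release>; some kernels expose /proc/config.gz
instead, which this does not handle.  The function names are mine.

```python
import os


def parse_timer_config(config_text):
    """Return (hz, tickless) from the text of a kernel build config.

    hz is None when no CONFIG_HZ= line is present.
    """
    hz = None
    tickless = False
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_HZ="):
            # Note: CONFIG_HZ_250=y and friends do not match this prefix.
            hz = int(line.split("=", 1)[1])
        elif line == "CONFIG_NO_HZ=y":
            tickless = True
    return hz, tickless


def running_kernel_timer_config():
    """Parse the config of the running kernel (Debian-style path)."""
    path = "/boot/config-" + os.uname().release
    with open(path) as f:
        return parse_timer_config(f.read())
```

On the guest, running_kernel_timer_config() should hand back something
like a (hz, tickless) pair you can read straight off.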

However, if CONFIG_HZ=1000 you're generating WAY too many timer
interrupts/sec, ESPECIALLY on an 8 core machine.  This will exacerbate the
high context switching problem.  On an 8 vCPU (and physical CPU) machine
you should have CONFIG_HZ=100 or a tickless kernel.  You may get by
using 250, but anything higher than that is trouble.
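
To see why the CPU count matters, here is the back-of-the-envelope
arithmetic as a Python sketch.  It assumes a periodic (non-tickless)
kernel delivering one timer tick per CONFIG_HZ period to every vCPU,
and it ignores all other interrupt sources, so treat the results as a
rough floor rather than a measurement.

```python
def timer_interrupts_per_second(hz, cpus):
    """Approximate timer ticks per second across all CPUs."""
    return hz * cpus


if __name__ == "__main__":
    # Compare the common CONFIG_HZ choices on an 8 vCPU guest.
    for hz in (100, 250, 1000):
        print(hz, timer_interrupts_per_second(hz, 8))
```

With 8 vCPUs that works out to 800 ticks/sec at HZ=100, 2000 at HZ=250,
and 8000 at HZ=1000, each of which the hypervisor has to deliver.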

-- 
Stan