[Dovecot] Architecture for large Dovecot cluster

Stan Hoeppner stan at hardwarefreak.com
Fri Jan 24 14:10:01 EET 2014


Sven, why didn't you chime in?  Your setup is similar scale and I think
your insights would be valuable here.  Or maybe you could repost your
last post on this topic.  Or was that discussion off list?  I can't recall.

Anyway, I missed this post, Murray.  Thanks Ed for dredging this up.
Maybe this will give you some insight, or possibly confuse you. :)

On 1/5/2014 7:06 AM, Murray Trainer wrote:
> Hi All,
> 
> I am trying to determine whether a mail server cluster based on Dovecot
> will be capable of supporting 500,000+ mailboxes with about 50,000 IMAP
> and 5000 active POP3 connections.  I have looked at the Dovecot
> clustering suggestions here:
> 
> http://blog.dovecot.org/2012/02/dovecot-clustering-with-dsync-based.html
> 
> and some other Dovecot mailing list threads but I am not sure how many
> users such a setup will handle.  I have a concern about the I/O
> performance of NFS in the suggested architecture above.  One possible
> option available to us is to split up the mailboxes over multiple
> clusters with subsets of domains.  Is there anyone out there currently
> running this many users on a Dovecot based mail cluster?  Some
> suggestions or advice on the best way to go would be greatly appreciated.

As with MTAs, Dovecot requires minuscule CPU power for most tasks.  Body
searches are the only operations that eat meaningful CPU, and only when
indexes aren't up to date.

As with MTAs, mailbox server performance is limited by disk IO, but it
is also limited by memory capacity, as IMAP connections are long lived,
unlike an MTA, where each connection lasts only a few seconds.

Thus, very similar to the advice I gave you WRT MTAs, you can do this
with as few as two hosts in the cluster, or as many as you want.  You
simply need sufficient memory for concurrent user connections, and
sufficient disk IO.
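
To put a rough number on the memory side, here's a back-of-envelope
sketch.  The ~1MB resident per idle imap process is my assumption, not
a measurement; check the RSS of your own imap processes first:

  # assumed: ~1 MB RSS per idle imap process (measure yours first)
  imap_conns=50000
  mb_per_conn=1
  echo "~$(( imap_conns * mb_per_conn / 1024 )) GB for imap processes alone"
  # -> ~48 GB, before the OS, Dovecot indexes, and page cache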

The architecture of the IO subsystem depends greatly on which mailbox
format you plan to use.  Maildir is extremely metadata heavy and thus
does not perform all that well with cluster filesystems such as OCFS or
GFS, no matter how fast the SAN array controller and disks may be.  It
can work well with NFS.  Mdbox isn't metadata heavy and works much
better with cluster filesystems.
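
For reference, the format is picked via mail_location in dovecot.conf.
A quick sketch, with typical example paths (adjust to your layout):

  mail_location = maildir:~/Maildir   # one file per message, metadata heavy
  #mail_location = mdbox:~/mdbox      # many messages per file, far fewer
                                      # metadata ops, kinder to cluster FS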

Neither NFS nor a cluster filesystem setup can match the performance of
a standalone filesystem on direct attached disk or a SAN LUN.  But
standalone filesystems make less efficient use of total storage
capacity.  And if using DAS, failover, resiliency, etc. are far less
than optimal.

With correct mail routing from your MTAs to your Dovecot servers, and
with Dovecot director (sketched below, after this list), you can use
any of these architectures.  Which one you choose boils down to:

1.  Ease of management
2.  Budget
3.  Storage efficiency
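
For the director piece, here's a minimal dovecot.conf sketch for the
director hosts, trimmed from the Dovecot wiki example.  The IPs are
placeholders for your own directors and backends:

  director_servers = 10.0.0.10 10.0.0.11        # the director ring
  director_mail_servers = 10.0.1.10 10.0.1.11   # the Dovecot backends

  service director {
    unix_listener login/director {
      mode = 0666
    }
    fifo_listener login/proxy-notify {
      mode = 0666
    }
    inet_listener {
      port = 9090    # director-to-director ring traffic
    }
  }
  service imap-login {
    executable = imap-login director
  }
  service pop3-login {
    executable = pop3-login director
  }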

The NFS and cluster filesystem solutions are generally significantly
more expensive than filesystem on DAS, because the NFS server and SAN
array required for 500,000 mailboxes are costly.  If you go NFS you
better get a NetApp filer.  Not just for the hardware, snapshots, etc,
but for the engineering support expertise.  They know NFS better than
the Pope knows Jesus and can get you tuned for max performance.
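
That said, if you do put mailboxes on NFS, Dovecot itself wants a few
settings.  These are from the Dovecot NFS documentation; verify them
against your version before deploying:

  mmap_disable = yes        # don't mmap index files over NFS
  mail_fsync = always       # flush writes so other servers see them
  mail_nfs_storage = yes    # multiple servers access the same mails
  mail_nfs_index = yes      # indexes also live on NFS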

Standalone servers/filesystems with local disk give you dramatically
more bang for the buck.  You can handle the same load with fewer servers
and with quicker response times.  You can use SAN storage instead of
direct attach, but at a cost equivalent to the cluster filesystem
architecture.  You'll then benefit from storage efficiency, PIT
snapshots, etc.

Again, random disk IOPS is the most important factor with mailbox
storage.  With 50K logged in IMAP users and 5K POP3 users, we simply
have to guesstimate IOPS if you don't already have this data.  I assume
you don't as you didn't provide it.  It is the KEY information required
to size your architecture properly, and in the most cost effective manner.

Let's assume for argument's sake that your 50K concurrent IMAP users and
your 5K POP users generate 8,000 IOPS, which is probably a high guess.
10K SAS drives do ~225 random IOPS each.

8000 / 225 = 35.6, round up to 36 disks, * 2 for RAID10 = 72 drives

So as a wild ass guesstimate you'd need approximately 72 SAS drives at
10K spindle speed, spread across multiple arrays, for this workload.  If
you need high capacity 7.2K SATA or SAS drives to meet your offered
mailbox capacity, you'll need 144 drives, since 7.2K drives deliver
roughly half the random IOPS of their 10K counterparts.
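
Scripted, the same guesstimate looks like this, so you can swap in
measured numbers when you have them:

  # spindle count from guessed IOPS; all inputs are guesses, not specs
  iops_needed=8000       # the high guess above
  iops_per_disk=225      # ~10K SAS; use roughly half for 7.2K drives
  data_disks=$(( (iops_needed + iops_per_disk - 1) / iops_per_disk ))
  echo "$(( data_disks * 2 )) drives"   # x2 because RAID10 mirrors all
  # -> 72 for 10K SAS, 144 with 7.2K drives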

Whether you go NFS, cluster on SAN, or standalone filesystems on SAN,
VMware with HA, Vmotion, etc. is a must, as it gives you instant host
failover and far easier management than KVM, Xen, etc.

One possible hardware solution consists of:

Qty 1.  HP 4730 SAN controller with 25x 600GB 10K SAS drives
Qty 3.  Expansion chassis for 75 drives, 45TB raw capacity, 21.6TB
        net after one spare per chassis and RAID10, 8100 IOPS.
Qty 2.  Dell PowerEdge 320, 4 core Xeon and 96GB RAM, Dovecot
Qty 1.  HP ProLiant DL320e with 8GB RAM running Dovecot Director

You'd run ESX on each Dell with one Linux guest per physical box.  Each
guest would be allocated 46GB of RAM to facilitate failover.  This much
RAM is rather costly, but Vmware licenses are far more, so it saves
money using a beefy 2 box cluster vs a 3/4 box cluster of weaker
machines.  You'd create multiple RAID10 arrays using a 32KB strip size
on the 4730 of equal numbers of disks, and span the RAID sets into 2
volumes.  You'd export each volume as a LUN to both ESX hosts.  You'd
create an RDM of each LUN and assign one RDM to each of your guests.
Each guest would format its RDM with

~# mkfs.xfs -d agcount=24 /dev/[device]

giving you 24 allocation groups for parallelism.  Do -not- align XFS
(sunit/swidth) with a small file random IO workload.  It will murder
performance.  You get two 10TB filesystems, each for 250,000 mailboxes,
or ~44MB average per mailbox.  If that's not enough storage, buy the
900GB drives for 66MB/mailbox.  If that's still not enough, use more
expansion chassis and more RAID sets per volume, or switch to a large
cap SAS/SATA model.  With 50K concurrent users, don't even think about
using RAID5/6.  The read-modify-write (RMW) penalty will murder
performance and then urinate on its grave.
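
After mkfs, sanity check the geometry before loading mail; the mount
point here is just a placeholder:

  ~# xfs_info /srv/mail
  (look for agcount=24, and sunit=0 swidth=0 blks, i.e. no alignment)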

With HA configured, if one box or one guest dies, the guest will
automatically be restarted on the remaining host.  Since both hosts see
both LUNs and RDMs, the guest boots up and has its filesystem.  This is
an infinitely better solution than a single shared cluster filesystem.
The dual XFS filesystems will be much faster.  If the CFS gets corrupted
all your users are down--with two local filesystems only half the users
are down.  Check/repair of a 20TB GFS2/OCFS2 filesystem will take -much-
longer than xfs_repair on a 10TB FS, possibly hours once you have all
500K mailboxes on it.  Etc, etc.
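
And when you do need to check one of those 10TB XFS filesystems, a dry
run is cheap.  Unmount it first; the device name is a placeholder:

  ~# xfs_repair -n /dev/[device]    # -n inspects and reports, fixes nothing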

-- 
Stan

