[Dovecot] Please help to make decision

Stan Hoeppner stan at hardwarefreak.com
Mon Mar 25 03:45:43 EET 2013


On 3/24/2013 1:45 PM, Timo Sirainen wrote:
> On 24.3.2013, at 18.12, Tigran Petrosyan <tpetrosy at gmail.com> wrote:
> 
>> We are going to implement Dovecot for 1 million users, using more than
>> 100T of storage space. We are currently evaluating 2 solutions: NFS or
>> GFS2 (via Fibre Channel storage).
>> Can someone help us make a decision? What kind of storage solution can
>> we use to achieve good performance and scalability?

This greatly depends upon whose cluster NFS storage product we're
talking about.

> I remember people complaining about GFS2 (and other cluster filesystems) having bad performance. But in any case, whatever you use, be sure to also use http://wiki2.dovecot.org/Director  Even if it's not strictly needed, it improves performance with GFS2.

GFS2 and OCFS2 performance suffers when using maildir, because filesystem
metadata is broadcast amongst all nodes, creating high latency and low
metadata IOPS.  The more nodes, the worse this problem becomes.  If you
use old-fashioned UNIX mbox, or Dovecot mdbox with a good number of
emails per file, this isn't as much of an issue, as metadata changes are
few.  If you use maildir with a cluster filesystem, a small number of fat
nodes is mandatory to minimize metadata traffic.
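
To put rough numbers on that, here is a back-of-envelope sketch; the
delivery rate and per-message operation counts below are purely assumed
for illustration, not measurements:

  # Illustrative only: assumed delivery rate and per-message op counts.
  DELIVERIES_PER_SEC = 1000      # assumed cluster-wide delivery rate
  MAILDIR_OPS_PER_MSG = 3        # assumed: create in tmp/, rename to new/,
                                 # later rename into cur/ on a flag change
  MDBOX_MSGS_PER_FILE = 50       # assumed messages appended per m.* file

  maildir_meta_ops = DELIVERIES_PER_SEC * MAILDIR_OPS_PER_MSG
  mdbox_meta_ops = DELIVERIES_PER_SEC / MDBOX_MSGS_PER_FILE  # new file only on rotation

  print("maildir: ~%d directory metadata ops/s coordinated cluster-wide" % maildir_meta_ops)
  print("mdbox:   ~%d directory metadata ops/s coordinated cluster-wide" % mdbox_meta_ops)

The exact figures don't matter; the point is that maildir turns every
delivery and flag change into directory metadata the cluster must
coordinate, while mdbox amortizes that cost over many messages per file.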

If using Fibre Channel SAN storage with your cluster filesystem, keep in
mind that a single-port 8Gb HBA and its SFP transceiver cost
significantly more than a 1U server.  Given that a single 8Gb FC port
can carry (800MB/s / 32KB) = 25,600 emails per second, or roughly 2.2
billion emails/day, fat nodes make more sense from a financial
standpoint as well.  Mail workloads don't require much CPU, but they do
need low-latency disk and network IO, and lots of memory.  Four
dual-socket 8-core Opteron servers (16 cores per server) with 128GB RAM,
two single-port 8Gb FC HBAs with SCSI multipath, dual GbE ports for user
traffic, and dual GbE for GFS2 metadata should fit the bill nicely.
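
If you want to sanity-check that arithmetic, a trivial sketch using the
same assumed figures (800MB/s usable per port, 32KB average message):

  # Assumptions carried over from above: ~800MB/s usable per 8Gb FC port
  # and a 32KB average message size.
  PORT_MB_PER_SEC = 800
  AVG_MSG_KB = 32

  msgs_per_sec = PORT_MB_PER_SEC * 1024 // AVG_MSG_KB   # 25,600
  msgs_per_day = msgs_per_sec * 86400                   # ~2.2 billion

  print("%d emails/s, %.1f billion emails/day per 8Gb FC port"
        % (msgs_per_sec, msgs_per_day / 1e9))

Obviously you'll never sustain anywhere near that rate, but it shows how
much headroom a single FC port has for this workload.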

Any quality high-performance SAN head with 2-4 ports per dual
controller, or multiple SAN heads, that can expand to 480 or more
drives is suitable.  If the head has only 4 ports total, you will need
an FC switch with at least 8 ports, preferably two switches with a
minimum of 4 ports each (8 is the smallest typically available); this
provides maximum redundancy, as you can survive a switch failure.

For transactional workloads you never want to use parity RAID, as the
read-modify-write cycles caused by writes smaller than the stripe width
degrade write throughput by a factor of 5:1 or more compared to
non-parity RAID.  So RAID10 is the only game in town, and thus you need
lots of spindles.  With 480x 600GB 15K SAS drives (4x 60-bay 4U chassis)
and 16 spares, you have 464 drives configured as 29 RAID10 arrays of 16
drives each, 4.8TB usable per array, yielding an optimal stripe width of
8x 32KB = 256KB.  You would format each 4.8TB exported LUN with GFS2,
yielding 29 cluster filesystems with ~35K user mail directories on each.
If you have a filesystem problem and must run a check/repair, or, even
worse, restore from tape or D2D, you're only affecting up to 1/29th of
your 1M users, i.e. about 35K.  If you feel this is too many filesystems
to manage, you can span arrays with the controller firmware or with
mdraid/lvm2.  And of course you will need a box dedicated to Dovecot
Director, which will spread connections across your 4 server nodes.
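
A quick sketch to double-check the sizing math above, using only the
figures already stated (drive count, spares, array width, drive size,
strip size):

  # Figures as stated above: 480 drives, 16 spares, 16-drive RAID10 arrays,
  # 600GB drives, 32KB strip size, 1 million users.
  TOTAL_DRIVES, SPARES = 480, 16
  DRIVES_PER_ARRAY = 16          # RAID10: 8 mirrored pairs per array
  DRIVE_GB, STRIP_KB = 600, 32
  USERS = 1000000

  data_drives = TOTAL_DRIVES - SPARES                    # 464
  arrays = data_drives // DRIVES_PER_ARRAY               # 29
  usable_tb = (DRIVES_PER_ARRAY // 2) * DRIVE_GB / 1000  # 4.8 per array
  stripe_kb = (DRIVES_PER_ARRAY // 2) * STRIP_KB         # 256
  users_per_fs = USERS // arrays                         # ~34,500

  print("%d RAID10 arrays, %.1fTB usable each (%.0fTB total), %dKB stripe width"
        % (arrays, usable_tb, arrays * usable_tb, stripe_kb))
  print("~%d users per GFS2 filesystem" % users_per_fs)

It also shows total usable capacity landing around 139TB, comfortably
above the 100T mentioned in the original question.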

This is obviously not a complete "how-to", but it should give you some
pointers and ideas on overall architecture options and best practices.

-- 
Stan


