[Dovecot] Maildir over NFS

Stan Hoeppner stan at hardwarefreak.com
Sun Aug 8 13:36:23 EEST 2010


Noel Butler put forth on 8/7/2010 5:34 PM:

>> Bold statement there sir :-)   From a price performance ratio, I'd argue NAS
>> is far superior and scalable, and generally there is far less management
> 
> 
> and with large mail systems, scalability is what it is all about

True large mailbox count scalability requires a "shared nothing" storage
architecture and an ultra cheap hardware footprint.  The big 3 commercial
database vendors all adopted this shared nothing storage strategy a decade ago
for scaling OLAP, and then for OLTP.  This shared nothing architecture
actually works very well for almost any scalable small data transaction
application, which includes email.

In a nutshell, you divide the aggregate application data equally across a
number of nodes with local storage, and each node is responsible for handling
only a specific subset of the total data.  I'm guessing this is exactly what
Google has done with Gmail, but I've yet to see a white paper detailing the
hardware design of gmail, hotmail, or yahoo mail.  I'd make a very educated
guess that not one of them uses globally shared storage for user mailboxes,
like the shared storage we've been discussing.
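The partitioning idea can be sketched in a few lines of Python. The modulo-hash scheme here is purely illustrative (the smart director described below would use a metrics database, and none of these providers has published their actual method):

```python
# Illustrative sketch of shared-nothing partitioning: map each mailbox
# to a fixed node via a stable hash, so each node owns only its own
# subset of the data and no storage is shared between nodes.
import hashlib

NUM_NODES = 128  # hypothetical cluster count, matching the example below

def node_for(mailbox: str, num_nodes: int = NUM_NODES) -> int:
    """Return the node index responsible for this mailbox."""
    digest = hashlib.sha1(mailbox.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# The same mailbox always resolves to the same node, so each node only
# ever sees its own subset of the aggregate data.
assert node_for("alice@example.com") == node_for("alice@example.com")
```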

I would venture to guess that, due to performance scalability needs into the
tens of millions of mailboxen, the equally important need for geographically
distributed scalability, and, just as importantly, cost, they probably do
something like this mostly shared nothing model.


web server    1                              imap server cluster 1
web server    2                               ------------------
web server    3    \             /    host 1 | 2 disks mirrored |
web server    4     \           /             ------------------  \ DRBD + GFS
  ...                \  smart  /              ------------------  /
  ...                  director       host 2 | 2 disks mirrored |
  ...                /  IMAP   \              ------------------
web server  509     /   proxy   \             ------------------
web server  510    /             \    host 1 | 2 disks mirrored |
web server  511                               ------------------  \ DRBD + GFS
web server  512                               ------------------  /
                                      host 2 | 2 disks mirrored |
                                              ------------------
                                             imap server cluster 128


An http balancer (not shown) would route requests to any free web server.  The
smart director behind the web servers contains a database with many metrics
and routes new account creation to the proper IMAP server cluster.  After the
account is established, the user can log into any web server, but that user's
mailbox data transactions are now forever routed to that particular cluster.
Each cluster has 1 level of host redundancy and 2 levels of storage
redundancy.  Each IMAP cluster member would have a relatively low end low
power dual core processor, 4GB RAM, 2 x 7.2k RPM disks, and dual GigE ports--a
pretty standard base configuration 1U server--and cheap.  The target service
level is 100-400 concurrent logged in users per IMAP server cluster, or
around 50,000 concurrent users across the 256 IMAP servers.
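A hedged sketch of what the smart director's routing logic might look like. The text only says the director holds "many metrics," so the least-loaded placement policy below is an assumption, and the class and method names are invented for illustration:

```python
# Hypothetical "smart director": a persistent mapping from account to
# IMAP cluster, set once at account creation and consulted on every
# subsequent request. Not a Dovecot API -- purely a sketch.

class SmartDirector:
    def __init__(self, num_clusters: int):
        self.num_clusters = num_clusters
        self.assignment = {}             # account -> cluster id
        self.load = [0] * num_clusters   # crude per-cluster metric

    def create_account(self, account: str) -> int:
        # Route new account creation to the least-loaded cluster
        # (assumed policy; the text just says "the proper cluster").
        cluster = self.load.index(min(self.load))
        self.assignment[account] = cluster
        self.load[cluster] += 1
        return cluster

    def route(self, account: str) -> int:
        # Existing accounts are forever pinned to their cluster.
        return self.assignment[account]

director = SmartDirector(num_clusters=128)
home = director.create_account("alice")
assert director.route("alice") == home
```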

This is not a truly shared nothing architecture, as we have an IMAP service
based on a 2 node cluster.  However, given the total size of these
organizations' user bases, in the multiple 10s of millions of mailboxen, in
practical terms, this is a shared nothing design, as only a few dozen to a
hundred user mailboxen exist on each server.  One host in each cluster pair
resides in a different physical datacenter close to the user, so a catastrophic
network or facility failure doesn't prevent the user from accessing his/her
mailbox.

Depending on how much redundancy, and thus money, the provider wishes to pony
up, each two node cluster above could be expanded to a node count sufficient
to put one member of each cluster in each and every datacenter the provider
has.  The upside to this is massive redundancy and an enhanced user experience
when an outage at one center occurs, or a backbone segment goes down.  The
downside is data synchronization across WAN links, with an n+1 increase in
synchronization overhead for each cluster member added.
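That synchronization overhead can be made concrete. Assuming every write must be fully replicated to each other cluster member (the replication model is my assumption), the per-write WAN fan-out grows linearly with cluster size:

```python
# Sketch of the WAN synchronization cost: with one cluster member per
# datacenter and full replication, every write to a mailbox must be
# shipped to each of the other members.

def wan_copies_per_write(members: int) -> int:
    """Replica copies sent over the WAN for each write."""
    return members - 1  # one copy to every other cluster member

assert wan_copies_per_write(2) == 1   # the base 2-node DRBD pair
assert wan_copies_per_write(5) == 4   # one node in each of 5 datacenters
```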

Having central shared mailbox storage for this size user count is impossible
due to the geographically distributed datacenters these outfits operate.  The
shared nothing 2 node cluster approach I've suggested is probably pretty close
to what these guys are using.  If a mailbox server goes down, its cluster
partner carries the load for both until the failed node is repaired/replaced.
If both nodes go down, a very limited subset of the user base is affected.

If one centralized FC SAN or NFS/NAS array was used per datacenter in place of
the local disks in these cheap clusters, costs would go through the roof.  To
duplicate the performance of the 256 x 7.2k local SATA disks (512 total but
mirrors don't add to performance), you'd need an array controller with big
cache (8-32GB), 40k random IO/s at the spindle level and 7.6GB/s of random IO
spindle throughput.  This would require an array controller with a minimum of
10 x 8Gb FC ports, or 8 x 10GbE NAS ports, and 128 x 15k SAS disks.  Depending
on whose unit meeting these specs that you buy, you're looking at somewhere in
the neighborhood of $250-500k.
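As a sanity check on those figures: only the totals (40k IO/s, 7.6GB/s, 128 SAS disks, 10 FC ports) appear above, so the per-spindle numbers below are my assumptions, chosen to be plausible for 7.2k SATA and 15k SAS:

```python
# Back-of-the-envelope check of the array sizing above.
DATA_SPINDLES = 256    # 512 disks total, but mirrors add no performance
IOPS_PER_SATA = 156    # assumed random IO/s for a 7.2k RPM SATA disk
MB_S_PER_SATA = 30     # assumed per-spindle throughput, MB/s

total_iops = DATA_SPINDLES * IOPS_PER_SATA         # ~40,000 IO/s
total_gb_s = DATA_SPINDLES * MB_S_PER_SATA / 1000  # ~7.7 GB/s

# Matching that behind one array controller with 15k SAS disks:
SAS_IOPS = 312                        # assumed random IO/s per 15k disk
sas_disks = total_iops / SAS_IOPS     # ~128 disks

# Port-side bandwidth, assuming ~0.8 GB/s usable per 8Gb FC port:
fc_ports = total_gb_s / 0.8           # ~9.6, i.e. 10 ports
```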

And given the cost of the switch and HBA infrastructure required in this
central storage scenario, those 256 single socket cheap IMAP cluster machines
are going to rapidly turn into 8 rather expensive dual socket 12 core
processor nodes (24 cores per node, 192 total cores) with 128GB RAM each, 1TB
total, same as the 256 el cheapo node aggregate.  Each node will have an 8Gb
FC HBA or 10GbE HBA, and a single connection to the SAN/NAS array controller,
eliminating the need/cost for a dedicated switch.  As configured, each of
these servers will run ~$20k USD due to the 128GB of RAM, the ~$1,000 HBA, and
due to the fact that vendors selling such boxen gouge customers on big memory
configurations.  Base price for the box with 2 x 12 core Opteron and 16GB RAM
is ~$6k USD.  Anyway, figure 8 x $20k = ~$160,000 for the IMAP cluster nodes.
Add in $250-$500k for the SAN/NAS array, and you're looking at ~$410k to
~$660k.

A quantity buy of 256 of the aforementioned cheap single socket boxen will get
the price down to well less than $1,000 each, probably more like $800,
yielding a total cluster cost of about $200k USD for 256 cluster hosts--less
than half that of the big smp SAN/NAS solution.
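The cost comparison works out like this (rough USD figures from the discussion above, not vendor quotes):

```python
# Arithmetic behind the cheap-cluster vs. big-SMP-plus-SAN comparison.
cheap_nodes = 256
cheap_unit = 800                     # quantity price per 1U single-socket box
cheap_total = cheap_nodes * cheap_unit              # ~$205k

big_nodes = 8
big_unit = 20_000                    # dual socket, 128GB RAM, HBA included
array_low, array_high = 250_000, 500_000            # SAN/NAS array range
big_total_low = big_nodes * big_unit + array_low    # ~$410k
big_total_high = big_nodes * big_unit + array_high  # ~$660k

# The cheap cluster comes in at roughly half the low-end SAN figure.
assert cheap_total * 2 <= big_total_low
```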

The cluster host numbers I'm using are merely examples.  Google for example
probably has a larger IMAP cluster server count per datacenter than the 256
nodes in my example--that's only about 6 racks packed with 42 x 1U servers.
Given the number of gmail accounts in the US, and the fact they have less than
2 dozen datacenters here, we're probably looking at thousands of 1U IMAP
servers per datacenter.

-- 
Stan

