[Dovecot] remote hot site, IMAP replication or cluster over WAN

Wed Nov 3 21:51:31 EET 2010

Quoting Stan Hoeppner <stan at hardwarefreak.com>:

> Johan Hendriks put forth on 11/3/2010 3:32 AM:
>
>> Hello, I am working primarily with FreeBSD, and the latest release has a
>> service called HAST.
>> See it as a mirrored disk over the network.

This is similar to the DRBD solution.

>> With CARP in the mix, when the master machine fails, it starts dovecot
>> on the slave.
>> This way you have a failover without user interference.

This is similar to heartbeat, or RHCS, etc.

> 1.  How do you automatically redirect clients to the IP address of the
> slave when the master goes down?  Is this seamless?  What is the
> duration of "server down" seen by clients?  Seconds, minutes?

Usually there is a "floating IP" that the clients use.  Whichever
server is active has this IP assigned (usually in addition to another
IP used for management and such).

The transition time depends on how the master goes down.  If you do
an administrative shutdown or transfer, it is usually just a fraction
of a second for the change to take effect, and maybe a bit longer for
the router/switch to get the new MAC address for the IP and route things
correctly.

If the primary crashes/dies, then it is usually several seconds before
the secondary confirms the primary is in trouble, makes sure it is really
down (STONITH), and takes over the IP, mounts any needed filesystems,
and starts any needed services...  In this case, the ARP/MAC issue isn't
really a problem because the transition takes a longer time.
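
A rough sketch of that crash-takeover sequence, as a toy Python model.
Everything in it (the Secondary class, missed_limit, the fence callback)
is made up for illustration; heartbeat, RHCS and friends do this through
their own configuration, not through code like this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Secondary:
    """Toy model of the standby node; nothing here touches real hardware."""
    missed_limit: int = 3   # assumed: heartbeats missed before acting
    missed: int = 0
    active: bool = False
    log: List[str] = field(default_factory=list)

    def on_heartbeat(self, received: bool, fence: Callable[[], bool]) -> None:
        """Process one heartbeat interval; fence() returns True on success."""
        if self.active:
            return
        if received:
            self.missed = 0
            return
        self.missed += 1
        if self.missed < self.missed_limit:
            return  # not yet sure the primary is in trouble
        # Make sure the primary is really down before touching shared data.
        if not fence():
            self.log.append("fencing failed, staying passive")
            return
        self.log.extend([
            "fence primary",
            "take over floating IP",
            "mount filesystems",
            "start services",
        ])
        self.active = True
```

The point of the ordering is that the IP, filesystems and services are
only taken over after fencing has succeeded, which is where the
"several seconds" goes.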

> 2.  When you bring the master back up after repairing the cause of the
> failure, does it automatically and correctly resume mirroring of the
> HAST device so it obtains the new emails that were saved to the slave
> while it was offline?  How do you then put the master back into service
> and make the slave offline again?

DRBD does (or at least can, it is configurable).  Sometimes you might
just do role reversal (old primary becomes secondary, old secondary stays
the primary).  Other times you might have the original primary become
primary again (say, if the original primary has "better" hardware, etc).

So, these things really depend on the use case, and the failure case...
And are usually configurable. :)
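
To make the two failback choices concrete, here is a minimal Python
sketch.  The function and node names are hypothetical, and it assumes the
recovered node has already resynced its data; it is only meant to show
the decision, not how DRBD or any cluster manager implements it.

```python
from typing import Optional, Tuple


def roles_after_recovery(acting_primary: str, recovered: str,
                         preferred: Optional[str] = None) -> Tuple[str, str]:
    """Return (primary, secondary) once the failed node is back and in sync.

    preferred=None means plain role reversal: whoever is primary stays so.
    """
    if preferred == recovered:
        # Fail back, e.g. because the original primary has better hardware.
        return recovered, acting_primary
    return acting_primary, recovered
```

With no preferred node, roles_after_recovery("nodeB", "nodeA") keeps
nodeB as primary; with preferred="nodeA" the roles swap back.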

I can give two personal examples.  First, I have a file server, which is
an active-passive cluster.  Since the hardware is identical, when the
primary fails, the secondary is promoted to primary.  When the dead one
comes back, it stays as secondary.  It is all automatic via RHCS and DRBD
using ext3.  Always feels like I'm wasting a machine, but it is rock solid...

Second I have a mail cluster which is active-active (still RHCS but with
DRBD+GFS2).  When both nodes are up, one does the pop/imap, mailing list
web/cli/email interface, and slave LDAP services, while the other node
does the mailing list processing, SMTP processing, anti-virus/spam
processing, etc.  When one machine goes down, the services on that
machine migrate automatically to the other machine.  When the machine
comes back up, the services migrate back to their "home" machine.
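
The "home machine" idea can be sketched in a few lines of Python.  The
node names and the HOME table below are invented for illustration, and
the fallback rule (pick the first surviving node by name) is an
assumption; RHCS expresses this with failover domains in its own
configuration rather than code.

```python
from typing import Dict, Set

# Hypothetical mapping of each service to its "home" node.
HOME: Dict[str, str] = {
    "imap": "node1",
    "ldap-slave": "node1",
    "smtp": "node2",
    "antispam": "node2",
}


def place(home: Dict[str, str], up: Set[str]) -> Dict[str, str]:
    """Decide where each service runs, given the set of nodes that are up."""
    placement: Dict[str, str] = {}
    for service, node in home.items():
        if node in up:
            placement[service] = node            # run at home when possible
        elif up:
            placement[service] = sorted(up)[0]   # otherwise migrate
        # with no nodes up, the service is simply not placed
    return placement
```

Calling place(HOME, {"node2"}) puts everything on node2, and calling it
again with both nodes up sends each service back to its home.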

Time for failover is a second or two for an admin failover, and for a
crash/etc maybe 15-30 seconds max for the fileserver, and 10-15 seconds
for the mail server.  During the failover, connections may hang or fail,
but most clients just retry the connection and get the new machine without
user intervention (or in the case of email clients, sometimes they
annoyingly ask for the password again, but that is not too bad).  I've never
had anyone contact me during either type of failover, which makes me
think they either don't notice, or they write it off as a "normal network
hiccup" kind of thing (well, they did contact me once, when the failover
failed, and the service went completely down, but that was my fault).
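
The client side of this is just "try again in a moment".  A minimal
Python sketch, assuming a caller-supplied connect() function and made-up
defaults (5 attempts, 5 seconds apart, chosen only so the total wait
covers the failover times mentioned above); real mail clients have their
own retry logic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T], attempts: int = 5,
                       delay: float = 5.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as exc:   # refused, reset, timed out, etc.
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)     # give the other node time to take over
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

Because the floating IP moves with the service, the retry goes to the
same address and simply lands on whichever node is now active.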

So, again, the answer is, as always, "it depends..."

> --
> Stan

-- 
Eric Rostetter
The Department of Physics
The University of Texas at Austin

This message is provided "AS IS" without warranty of any kind,
either expressed or implied.  Use this message at your own risk.