[Dovecot] Replication protocol design #2

Thu May 1 14:36:17 EEST 2008

Changes:

 - Added goal 8 and rewrote mailbox synchronization plan.
 - Added new SELECT command to change active mailbox and removed mailbox
ID from command parameters

Goals
-----

1. If a single (or configurable number of) server dies, no mails must be
lost that have been reported to IMAP/SMTP clients as being successfully
stored.

2. Must be able to automatically recover from a server desynchronization,
such as:
 - server has been offline for a long time
 - some mail files have been manually added/deleted
 - corrupted data/mail files if they're noticed

3. In multi-master setup if the link between servers die, the servers must
be able to proceed autonomously (kind of conflicts with goal 1 though).
When the link comes back, the changes are replicated as soon as possible.

4. Normal IMAP commands must not be able to cause desynchronization between
servers. For example making conflicting flag changes simultaneously in two
servers must not result in the servers having different flags.

5. Must perform well with at least 3 servers in a multi-master setup,
preferrably still with tens of servers.

6. Latency shouldn't be increased noticeably when using servers distributed
into 3 or more data centers. Must be usable over high-latency links (in
optimistic async mode).

7. In normal operation send minimal incremental changes.

8. Make non-incremental synchronization fast enough to allow replication
with users' untrusted Dovecot servers which may connect only rarely.
Support also super high latency replication (e.g. using USB sticks).

Protocol
--------

There are two major parts in the protocol: Handling the normal incremental
changes and fixing desynchronization.

Originally I thought that maybe the changes could be sent using the same
format as Dovecot's transaction logs, but now I'm beginning to think that
it's probably not that good idea. The code reuse potential is quite minimal
and the format would still have to be extended in several ways so that it
won't be directly compatible anyway.

So I'm thinking the protocol could be something text-based. The main
benefit is that text-based protocols are easier to debug. Stream
compression should drop most of the extra overhead if bandwidth is a
problem.

The exact on-wire protocol anyway doesn't matter in the design that's
discussed below.

The commands have tags similar to IMAP commands, because some commands may
have to be forwarded to other servers and it may take a while to get a
reply. During the wait the server may process other commands.

Mailbox master
--------------

Each mailbox has a single master server selected. In multi-master setups
the master server may be moved by having the destination server simply
request it from the current master. The current master must then give it
up. If link is lost to the current master, one of the remaining servers
will become the new master within the remaining servers.

Each server must be connected to the current master server. Since each
mailbox can have a different master server, this typically means all
servers are connected to each others. However it's possible to create
setups where server connects to only one other server, which in turn
connect to more servers. This is useful if there are bandwidth bottlenecks
between some servers. This kind of a situation can also happen if in a
network A-B-C the link A-C dies, but A-B and B-C continues to work. Because
of this all commands must be able to function in a way that the server
proxies them to the current master, instead of failing the command and
trying to make it the caller's problem to resend the command to the actual
master.

When the cluster starts up, a single server is selected as the root master
for all mailboxes. If a server doesn't know who the current mailbox master
is, it asks from the root. All servers cache the currently known mailbox
masters to avoid constant requests to the root.

If the root server dies, another server is selected as the root. Because
the new root doesn't know what masters have been requested (and asking all
of them from all servers would just waste bandwidth), all the servers are
expected to flush their master caches and drop their own master status. The
new root doesn't respond to any requests before all servers have notified
that they've dropped being a master.

The master status doesn't have to be at mailbox level granularity. It could
just as well be configured to move at user, domain or even global level.
Perhaps this could be done dynamically, so that higher granularity is used
when the master is beginning to change too often between servers.

Mailbox ID
----------

Mailbox IDs are session-specific numbers dynamically assigned for
user+mailbox+UIDVALIDITY combinations. All connections have different
mailbox IDs. Also send and receive directions have different IDs. This
allows the sender to easily replace existing IDs to point to new mailboxes
without causing any confusion.

MBOX:
 - Mailbox ID
 - User name
 - Mailbox name
 - Mailbox UIDVALIDITY
Reply:
 - [Mailbox UIDVALIDITY] (if changed - command failed)
 - Mailbox UIDNEXT
 - Highest modseq

If the receiver finds out it has a different UIDVALIDITY, the mailbox
requires a full resync. UIDNEXT and modseq can be used to determine if
replication servers are out of sync.

After mailbox ID has been assigned, the active mailbox can be changed using
a command:

SELECT:
 - Mailbox ID

Requesting master status
------------------------

MASTER-MOVE:
 - [Destination SID] (if forwarding)

The command is sent to the last known master for the selected mailbox. The
server will keep forwarding the command until it reaches the current
master. During the forwarding other servers may want to request something
from the master. These requests must be delayed by the forwarding servers
until the move is finished.

Saving messages
---------------

SAVE:
 - Received date
 - [IMAP UID] (only if we're the master)
 - Global UID (stays the same when copying the message)
 - Message text
Reply:
 - [IMAP UID] (only if not specified in parameters)
 - [Current mailbox master SID] (if it was moved)

If current server is not the master, the SAVE is sent to the master which
gives the message its IMAP UID. The master server then replicates the
message to other servers with the IMAP UID parameter set.

The mailbox master may have already changed by the time server receives a
save request. If server receives a SAVE without IMAP UID parameter, it's
responsible for finding out the new mailbox master and sending a new SAVE
request to it. Once the new master replies with the IMAP UID, the server
can reply to the original SAVE request, also providing the new master SID
so the future requests can be sent there directly.

To be sure the message doesn't get lost, the server should not reply OK to
the IMAP/SMTP client until it has received a reply from SAVE.

Copying messages
----------------

FIXME: Source or destination mailbox ID parameter should probably be
removed and the selected mailbox's ID used instead.

COPY:
 - Source mailbox ID
 - Destination mailbox ID
 - Source IMAP UID
 - Global UID
 - [Destination IMAP UID] (only if we're the master)
 - Destination received date
Reply:
 - [Destination IMAP UID] (only if not specified in parameters)
 - [Current mailbox master SID] (if it was moved)

Source mailbox ID + source IMAP UID identifies the message to be copied.
It's expected to contain the given global UID (which is just an extra
sanity check). Otherwise it works the same way as SAVE.

Since the message already exists, it's probably not necessary to wait for a
reply before replying OK to originating IMAP client.

Expunging messages
------------------

EXPUNGE:
 - IMAP UIDs 
(No reply)

Expunges also have to be sent via master server (the same way as SAVE) to
avoid COPY command failing in some servers because it was just expunged.

Changing message flags
----------------------

STORE:
 - IMAP UIDs
 - Added flags/keywords
 - Removed flags/keywords
 - [Current modseq] (master sends)
 - [Highest modseq of the messages before this change] (non-master sends)
 - [flag: this is a CONDSTORE STORE UNCHANGEDSINCE] (non-master may send)
[Reply:
 - IMAP UIDs where STORE was rejected to (if CONDSTORE flag was used)
]

Stores also have to be sent via master server to avoid flag
desynchronization. Master first checks if it has higher modseqs in the
messages. Then it applies all the changes and forwards the changes to other
servers. For messages that had higher initial modseqs their flags are sent
to the server sending the STORE to fix a potential desync.

If CONDSTORE flag is set, the change is rejected for messages that had a
higher modseq. Non-masters shouldn't reply to a STORE UNCHANGEDSINCE
command before the change has been replicated to master and the rejections
have been processed.

Mailbox synchronization
-----------------------

Network connections to other servers are always initiated by the same
servers. From replication point of view they could be thought of as
client/server.

The clients keep track of the last seen server state for all mailboxes:
 - UIDVALIDITY
 - UIDNEXT
 - Highest modseq

Client requests mailbox ID with MBOX command. If UIDVALIDITY changed, state
is reset by setting UIDNEXT=1 and highest modseq=0. MBOX command is then
resent using the updated UIDVALIDITY.

If server's UIDNEXT (server-UIDNEXT) is higher than last seen UIDNEXT
(old-UIDNEXT), it means the server has added new messages and we want to
fetch them. If the client's UIDNEXT (client-UIDNEXT) is higher than the
old-UIDNEXT, it means the client has added new messages and they need to be
sent to the server. All messages with UID >= old-UIDNEXT and UID <
min(client-UIDNEXT, server-UIDNEXT) have an UID conflict and they must be
given new unused UIDs.

The synchronization is done in two batches of commands. Client first
selects the mailbox and then starts the sync:

SYNC-BEGIN:
 - server's last highest modseq
 - client's current highest modseq
 - QRESYNC-like sequence + IMAP UID lists for optimizing EXPUNGE replies

Client then proceeds to send all its changes using SAVE, EXPUNGE and STORE
commands. The sync is finished with SYNC-END command.

Server replies with:
 - List of expunged IMAP UIDs (EXPUNGE command):
   - IMAP UIDs
 - Added messages (SAVE command?):
   - IMAP UID
   - Global UID
   - Modseq
   - Flags and keywords
   - Received date
   - [Message text]
 - Changed flags/keywords/modseqs (STORE commands):
   - IMAP UID
   - Modseq
   - Changed flags and keywords (or if not easily determined,
     their current values)
 - IMAP UID conflict fixes (UIDFIX command):
   - Source IMAP UIDs
   - Destination IMAP UIDs
 - Sync finish:
   - Updated UIDNEXT
   - Updated highest modseq

Server may also reply with STORE commands that change nothing except
modseqs. This is needed to get the modseqs synchronized.

If in high-latency + high-bandwidth mode the message texts are always sent.
The message texts are also sent if the server can determine that the client
couldn't have previously seen the message (i.e. the message was saved, not
copied). Otherwise they have to be requested for the messages with
previously unseen global UIDs:

FETCH:
 - IMAP UIDs
Reply:
 - Message texts

By the time client receives the reply from server, it may already have done
further changes. This means it may have to fix modseqs and IMAP UIDs
internally as well.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20080501/3b09d75b/attachment.bin