[Dovecot] How to get rid of locks

Timo Sirainen tss at iki.fi
Sat Apr 7 22:30:25 EEST 2007

Although Dovecot is already read-lockless and uses only short-lived
write locks, it'd be really nice to just get rid of the locking
completely. :)

I just figured out that O_APPEND is pretty great. If the operating  
system updates seek position after writing to a file opened with  
O_APPEND, writes to Dovecot's transaction log file can be made  
lockless. I see that this works with Linux and Solaris, but not with
OS X. Could you BSD people try whether it works there? Run
http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad)
or non-zero (yay). O_APPEND at least doesn't work with NFS, so it'll
have to be optional anyway.
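
For reference, the check could look roughly like the sketch below. This
is not the actual append.c, just my guess at an equivalent test: the
function name offset_after_append() is made up for illustration. It
appends through an O_APPEND descriptor and reports the descriptor's seek
position afterwards.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not from append.c): append `len` bytes to `path`
 * through an O_APPEND descriptor and return the descriptor's seek
 * position afterwards, or -1 on error. A result of 0 after a successful
 * non-empty write means the OS did not update the seek position. */
off_t offset_after_append(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    off_t offset = -1;
    if (write(fd, data, len) == (ssize_t)len)
        offset = lseek(fd, 0, SEEK_CUR);

    close(fd);
    return offset;
}
```

On a system where lockless appending can work, calling this twice on the
same file should return the file size after each write, not 0.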

Currently Dovecot always updates the dovecot.index file after it has
made any changes. This isn't really necessary, because the changes are
already in the transaction log, so the dovecot.index file can be read
into memory and the new changes applied on top of it from the
transaction log (this is pretty much how mmap_disable=yes works). So
I'm going to change this so that dovecot.index is updated only if a)
there are enough changes in the transaction log (eg. 8kB or so) and b)
it can be write-locked without waiting.
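
A minimal sketch of that decision, assuming fcntl() locking is used for
dovecot.index. The function names and the threshold constant are
hypothetical, not Dovecot's actual code. F_SETLK fails immediately
instead of waiting if someone else holds a conflicting lock.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

/* "8kB or so"; the exact value and the name are assumptions. */
#define INDEX_REWRITE_THRESHOLD (8 * 1024)

/* Set or clear a whole-file fcntl() lock without blocking. */
static bool set_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* 0 = lock to end of file */
    return fcntl(fd, F_SETLK, &fl) == 0;
}

/* Returns true with the index write-locked if it should be rewritten
 * now. Returns false if there are too few unsynced changes in the log
 * or if the lock could not be taken without waiting. */
bool index_try_begin_rewrite(int index_fd, size_t unsynced_log_bytes)
{
    if (unsynced_log_bytes < INDEX_REWRITE_THRESHOLD)
        return false;
    return set_lock(index_fd, F_WRLCK);
}

/* Release the lock taken by index_try_begin_rewrite(). */
void index_end_rewrite(int index_fd)
{
    (void)set_lock(index_fd, F_UNLCK);
}
```

If index_try_begin_rewrite() returns false, the caller just keeps
applying the transaction log on top of the in-memory index and tries
again later.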

Maildir then. It has this annoying problem that readdir() can skip  
files if another process is rename()ing them, causing Dovecot to  
think that the message was expunged. The only way I could avoid this
was by locking the maildir while synchronizing it. Today I noticed that
this doesn't happen with OS X. I'm not sure if I was just lucky or if
there really is something special implemented in it, because it
doesn't work anywhere else. I'm not sure if this is tied to HFS+, or
if it will work with ZFS also (Solaris+ZFS didn't work). So perhaps
the locking could be disabled when running on OS X.

More importantly I figured out that it can also be avoided with
Linux+inotify. As long as the inotify event buffer doesn't overflow, the
full list of files can be read by combining the readdir() output and  
files listed by inotify events. If the inotify buffer overflows  
(highly unlikely), the operation can just be retried and it most  
likely works the next time.

So with these changes in place, changing a message flag or expunging  
a message would usually result in:

  - lockless write() call to dovecot.index.log
  - lockless read()ing of (or looking into the mmap()ed)
    dovecot.index.log to see if there's some new data besides what we
    just wrote that needs to be synchronized to maildir
  - rename() or unlink() calls to maildir. If a call returns ENOENT,
    the maildir needs to be readdir()ed with inotify enabled to find
    the new filename.

Not a single lock in the operation, assuming that dovecot.index file  
wasn't updated.

Assigning UIDs to newly delivered mails would require locking though.  
dovecot-uidlist needs to be locked, and the UIDs need to be written  
to dovecot.index.log file in the correct order, which can also be  
done with dovecot-uidlist locking.

Actually a single write() to dovecot.index.log isn't enough. I think  
there needs to be some kind of a flag written to the beginning of the  
transaction which marks the transaction as truly finished. If the  
flag isn't there, any reader knows to stop and wait until the flag is  
set. So this means that the writer needs to:

1. Do a single O_APPENDed write() call writing the whole transaction
2. Get the current offset with lseek(fd, 0, SEEK_CUR) (this is what
   the append.c tester checks)
3. pwrite() the finished-flag to the beginning of the transaction

Except at least with Linux, pwrite() doesn't work if O_APPEND is
enabled: the data gets appended to the end of the file regardless of
the given offset. There are two ways to work around this:

  a) fcntl(disable O_APPEND) + pwrite() + fcntl(enable O_APPEND)
  b) Keep two file descriptors open for the transaction log: the first
     with the O_APPEND flag and the second without. pwrite() to the
     second one.

a) is probably better because it doesn't waste file descriptors.
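
A sketch of the writer using workaround a). The record layout is only an
assumption for illustration: the first byte of each transaction is the
finished-flag, written as 0 and then flipped to 1. The function name is
hypothetical too.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Append one transaction to an fd opened with O_APPEND and then mark it
 * finished. Assumed layout: buf[0] is the finished-flag and must be 0
 * on entry. Returns 0 on success, -1 on error.
 *
 * Note: O_APPEND is a property of the open file description, so while
 * it is switched off, nothing else (threads, dup()ed descriptors) may
 * write through the same description. */
int log_append_transaction(int fd, const void *buf, size_t len)
{
    /* 1. single appended write() of the whole transaction */
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;

    /* 2. the seek position now points to the end of our transaction,
     * provided the OS updates it for O_APPEND writes */
    off_t end = lseek(fd, 0, SEEK_CUR);
    if (end < (off_t)len)
        return -1;
    off_t start = end - (off_t)len;

    /* 3. temporarily drop O_APPEND so pwrite() honors the offset */
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags & ~O_APPEND) < 0)
        return -1;

    const uint8_t finished = 1;
    ssize_t ret = pwrite(fd, &finished, 1, start);

    /* restore the original flags, including O_APPEND */
    if (fcntl(fd, F_SETFL, flags) < 0)
        return -1;
    return ret == 1 ? 0 : -1;
}
```

A reader that sees a 0 flag at the start of a transaction would stop
there and retry later, as described above.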