[Dovecot] How to get rid of locks
Daniel L. Miller
dmiller at amfes.com
Sun Apr 8 10:29:09 EEST 2007
Timo Sirainen wrote:
> Although Dovecot is already read-lockless and it uses only short-lived
> write locks, it'd be really nice to just get rid of the locking
> completely. :)
> I just figured out that O_APPEND is pretty great. If the operating
> system updates seek position after writing to a file opened with
> O_APPEND, writes to Dovecot's transaction log file can be made
> lockless. I see that this works with Linux and Solaris, but not with
> OS X. Could you BSD people check whether it works there? Try
> http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad)
> or non-zero (yay). The O_APPEND at least doesn't work with NFS, so
> it'll have to be optional anyway.
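
A minimal sketch of the kind of check described above, for anyone who cannot fetch append.c. This is not the original tester; the helper name append_and_tell is made up for illustration. It appends through an O_APPEND descriptor and reports the descriptor's offset afterwards.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not taken from append.c): append `data` to `path`
 * through an O_APPEND descriptor and return the descriptor's offset
 * afterwards, or -1 on error. A non-zero result means the OS moved the
 * seek position past the data just written; 0 means it did not. */
off_t append_and_tell(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd == -1)
        return -1;

    size_t len = strlen(data);
    off_t offset = -1;
    if (write(fd, data, len) == (ssize_t)len)
        offset = lseek(fd, 0, SEEK_CUR);

    close(fd);
    return offset;
}
```

Calling it twice on the same file shows whether the second call's offset reflects the data already in the file, which is the property a lockless writer would rely on.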
> Currently Dovecot always updates dovecot.index file after it has done
> any changes. This isn't really necessary, because the changes are
> already in transaction log, so the dovecot.index file can be read to
> memory and the new changes applied on top of it from transaction log
> (this is pretty much how mmap_disable=yes works). So I'm going to
> change this to work so that the dovecot.index is updated only if a)
> there are enough changes in transaction log (e.g. 8kB or so) and b) it
> can be write-locked without waiting.
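
The two conditions could be sketched roughly as below. The function names, the exact threshold and the choice of fcntl() record locks are assumptions for illustration, not Dovecot's actual code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed threshold; the text only says "8kB or so". */
#define INDEX_REWRITE_THRESHOLD (8 * 1024)

/* Try to take a whole-file write lock; fail instead of waiting. */
bool try_write_lock(int fd)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0, /* 0 means "until end of file" */
    };
    /* F_SETLK returns -1 immediately if another process holds a lock. */
    return fcntl(fd, F_SETLK, &fl) == 0;
}

void write_unlock(int fd)
{
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET };
    (void)fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical policy check: returns true, with the lock held, only if
 * a) enough unsynced log data has accumulated and b) the index could be
 * write-locked without waiting. Otherwise the caller keeps applying the
 * log on top of the in-memory index. */
bool should_rewrite_index(int index_fd, off_t log_size, off_t synced_offset)
{
    assert(log_size >= synced_offset);

    if (log_size - synced_offset < INDEX_REWRITE_THRESHOLD)
        return false;
    return try_write_lock(index_fd);
}
```

On a true return the caller would rewrite dovecot.index and then call write_unlock().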
> Maildir then. It has this annoying problem that readdir() can skip
> files if another process is rename()ing them, causing Dovecot to think
> that the message was expunged. The only way I could avoid this was by
> locking the maildir while synchronizing it. Today I noticed that this
> doesn't happen with OS X. I'm not sure if I was just lucky or if there
> really is something special implemented in it, because it doesn't work
> anywhere else. I'm not sure if this is tied to HFS+, or if it will
> work with zfs also (Solaris+zfs didn't work). So perhaps the locking
> could be disabled while running with OS X.
> More importantly I figured out that it can also be avoided with
> Linux+inotify. As long as the inotify event buffer doesn't overflow,
> the full list of files can be read by combining the readdir() output
> and files listed by inotify events. If the inotify buffer overflows
> (highly unlikely), the operation can just be retried and it most
> likely works the next time.
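
A rough, Linux-only sketch of combining readdir() output with inotify events. The function and callback names are invented for this example, and the caller is assumed to deduplicate names, since a file can be reported by both sources.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

typedef void name_cb(const char *name, void *ctx);

/* Example callback: just counts the names it is given. */
void count_name(const char *name, void *ctx)
{
    (void)name;
    ++*(size_t *)ctx;
}

/* Hypothetical helper: report every name seen by readdir() plus every
 * name that was created in or renamed into the directory during the scan.
 * Returns 0 on success, 1 if the inotify queue overflowed (the caller
 * should retry), -1 on error. */
int scan_dir_with_inotify(const char *path, name_cb *cb, void *ctx)
{
    assert(path != NULL && cb != NULL);

    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd == -1)
        return -1;
    /* Start watching before the scan so no rename() is missed. */
    if (inotify_add_watch(ifd, path, IN_MOVED_TO | IN_CREATE) == -1) {
        close(ifd);
        return -1;
    }

    DIR *dir = opendir(path);
    if (dir == NULL) {
        close(ifd);
        return -1;
    }
    struct dirent *d;
    while ((d = readdir(dir)) != NULL) {
        if (strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0)
            cb(d->d_name, ctx);
    }
    closedir(dir);

    /* Drain the events that were queued while readdir() was running. */
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    int ret = 0;
    for (;;) {
        ssize_t n = read(ifd, buf, sizeof(buf));
        if (n <= 0) {
            if (n == -1 && errno != EAGAIN)
                ret = -1;
            break; /* EAGAIN: the queue is empty */
        }
        for (char *p = buf; p < buf + n;) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_Q_OVERFLOW)
                ret = 1;
            else if (ev->len > 0)
                cb(ev->name, ctx);
            p += sizeof(*ev) + ev->len;
        }
    }
    close(ifd);
    return ret;
}
```

A return value of 1 corresponds to the overflow case, where the whole scan is simply repeated.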
> So with these changes in place, changing a message flag or expunging a
> message would usually result in:
> - lockless write() call to dovecot.index.log
> - lockless read()ing (or looking into mmaped) dovecot.index.log to
> see if there's some new data besides what we just wrote that needs to
> be synchronized to maildir
> - rename() or unlink() calls to maildir. If a call returns ENOENT, the
> maildir needs to be readdir()ed with inotify enabled to find the new
> filename.
> Not a single lock in the operation, assuming that dovecot.index file
> wasn't updated.
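
The ENOENT case in the last step might look like this. The enum and function name are placeholders, and what the caller does on a rescan is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

enum flag_result { FLAG_OK, FLAG_RESCAN, FLAG_ERROR };

/* Hypothetical helper: change a message's flags by renaming its maildir
 * file, without taking any lock. ENOENT means another process renamed or
 * removed the file first, so the caller has to rescan the maildir to
 * learn the current filename (or that the message is gone). */
enum flag_result maildir_rename_flags(const char *old_path,
                                      const char *new_path)
{
    assert(old_path != NULL && new_path != NULL);

    if (rename(old_path, new_path) == 0)
        return FLAG_OK;
    return errno == ENOENT ? FLAG_RESCAN : FLAG_ERROR;
}
```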
> Assigning UIDs to newly delivered mails would require locking though.
> dovecot-uidlist needs to be locked, and the UIDs need to be written to
> dovecot.index.log file in the correct order, which can also be done
> with dovecot-uidlist locking.
> Actually a single write() to dovecot.index.log isn't enough. I think
> there needs to be some kind of a flag written to the beginning of the
> transaction which marks the transaction as truly finished. If the flag
> isn't there, any reader knows to stop and wait until the flag is set.
> So this means that the writer needs to:
> 1. Do a single O_APPENDed write() call writing the whole transaction
> 2. Get the current offset with lseek(fd, 0, SEEK_CUR) (this is what
> the append.c tester checks)
> 3. pwrite() the finished-flag to the beginning of the transaction
> Except that, at least with Linux, pwrite() doesn't work if O_APPEND is
> enabled: the data is appended regardless of the given offset.
> There are two ways to work around this:
> a) fcntl(disable O_APPEND) + pwrite() + fcntl(enable O_APPEND)
> b) Keep two file descriptors open for the transaction log. First with
> O_APPEND flag and second without. pwrite() to the second one.
> a) is probably better because it doesn't waste file descriptors.
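
The three writer steps with workaround a) could be sketched as follows. The record layout (one flag byte followed by the payload) and all names are assumptions made up for the example, not Dovecot's real transaction log format.

```c
#define _XOPEN_SOURCE 700 /* for pwrite() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { TXN_PENDING = 0, TXN_FINISHED = 1 };

/* Hypothetical writer. `fd` must have been opened with O_APPEND.
 * Returns 0 on success, -1 on error. */
int log_append_transaction(int fd, const void *payload, size_t len)
{
    assert(fd >= 0 && payload != NULL);

    size_t total = len + 1;
    unsigned char *buf = malloc(total);
    if (buf == NULL)
        return -1;
    buf[0] = TXN_PENDING; /* readers stop here until it is flipped */
    memcpy(buf + 1, payload, len);

    /* 1. One O_APPENDed write() for the whole transaction. */
    ssize_t written = write(fd, buf, total);
    free(buf);
    if (written != (ssize_t)total)
        return -1;

    /* 2. The offset after the write tells us where our data ended. */
    off_t end = lseek(fd, 0, SEEK_CUR);
    if (end < (off_t)total)
        return -1; /* OS did not update the offset; cannot go lockless */
    off_t start = end - (off_t)total;

    /* 3. Workaround a): drop O_APPEND, pwrite() the flag, restore it. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags & ~O_APPEND) == -1)
        return -1;
    unsigned char done = TXN_FINISHED;
    ssize_t n = pwrite(fd, &done, 1, start);
    if (fcntl(fd, F_SETFL, flags) == -1)
        return -1;
    return n == 1 ? 0 : -1;
}
```

One caveat with a): F_SETFL changes the open file description, so if the descriptor is shared (dup(), fork() or threads), another write() could slip in while O_APPEND is off. Workaround b) avoids that at the cost of a second descriptor.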
This is probably a scary thought, but . . . what would it take for the
indexing part of Dovecot to be implemented via an API/plug-in model?
I'm curious about the effect of using an external SQL engine (my vote
would be Firebird) for processing these, and using an open plug-in method
would allow for that without binding Dovecot to a particular implementation.