[Dovecot] mail spool filesystem

Fri Aug 19 11:48:00 EEST 2011

On 8/17/2011 9:42 AM, Adrian Ulrich wrote:
>> I read that XFS is a good choice, but is not 
>> too reliable...
> 
> Are you using Maildir or MBOX?
> 
> In any case: XFS would be my last choice:
> 
> XFS is nice if you are working with large files (> 2GB), but
> for E-Mail i'd stick with ext3 (or maybe even reiser3)
> as it works very well with small files.

XFS was designed for parallelism, whether with large files or small,
though it has been optimized a bit more for large file throughput.  In
yet another attempt to dispel the XFS "small file problem" myth, XFS has
never had a performance problem with "small" files.  In the past XFS did
have a performance problem with heavy metadata operations due to the way
its journal logging had been designed.  The perennial example of this
was the horrible unlink performance when whacking a kernel tree with 'rm
-rf'.  It used to take forever, tens of times slower than Reiser or
EXT.  This metadata bottleneck in the logging path was largely resolved
by Dave Chinner's delayed logging patch, which was experimental in
2.6.35 and is enabled by default in 2.6.39 and later.
XFS metadata performance is now on par with that of EXT3/4.

Because of this, and because of XFS' use of allocation groups, one of
the highest performance storage stacks today for a busy IMAP server with
lots of maildir mailboxen is the following:

1.  A dozen or more hardware or software RAID1 mirrors
2.  A linear concat over the mirrors
3.  XFS with 2*num_mirrors allocation groups, mounted with 'inode64'
4.  maildir mailboxes
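
As a rough sketch of steps 2 and 3, assuming the RAID1 mirrors already
exist as equally sized md devices (so that the AG boundaries line up
with the mirror boundaries), the small Python script below only prints
the commands one might run; it executes nothing.  The device names
/dev/md1 through /dev/md12, the concat device /dev/md0 and the mount
point /srv/mail are placeholders I made up for illustration.

```python
def build_stack_commands(mirrors, md_device="/dev/md0", mount_point="/srv/mail"):
    """Return the shell commands for a linear concat + XFS as strings.

    Nothing is executed. All device names and the mount point are
    hypothetical placeholders.
    """
    ag_count = 2 * len(mirrors)  # step 3: two allocation groups per mirror
    return [
        # Step 2: linear (concatenated) md array over the existing mirrors.
        "mdadm --create {} --level=linear --raid-devices={} {}".format(
            md_device, len(mirrors), " ".join(mirrors)),
        # Step 3: make the filesystem with an explicit allocation group count.
        "mkfs.xfs -d agcount={} {}".format(ag_count, md_device),
        # Step 3: mount with inode64 so inodes can live in every AG.
        "mount -o inode64 {} {}".format(md_device, mount_point),
    ]


# Twelve hypothetical RAID1 mirrors: /dev/md1 .. /dev/md12
mirrors = ["/dev/md{}".format(i) for i in range(1, 13)]
for cmd in build_stack_commands(mirrors):
    print(cmd)
```

Review the printed commands against your own device layout before
running any of them; mdadm --create and mkfs.xfs are destructive.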

This setup will give you significantly higher real IOPS than any striped
array setup with any filesystem atop, for a couple of reasons:

1.  No partial stripe width writes, and no unnecessary full stripe
reads.  All reads and writes match the page size and filesystem block
size of 4KB.

2.  In the example above, you have two AGs per mirror pair, 24 total AGs
on 12 mirrors.  The first two maildir directories will be created in AGs
1 and 2 on the first mirror pair, the next two in AGs 3 and 4 on the
second mirror pair, and so on.  The 25th and 26th directories will
'wrap' back to AGs 1 and 2, and the directory creation pattern will
continue from there.
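
The rotation described in point 2 can be modelled in a few lines of
Python.  This is only a toy model of the placement pattern, assuming a
strict round-robin over equally sized AGs; it is not the real XFS
allocator, and the function name is made up.  It keeps the 1-based
counting used above, whereas XFS itself numbers AGs from 0.

```python
def place_directory(dir_index, num_mirrors=12, ags_per_mirror=2):
    """Map the n-th new directory (1-based) to a 1-based (AG, mirror) pair.

    Toy model: strict round-robin over the AGs, with consecutive AGs
    sharing one mirror of the concat.
    """
    total_ags = num_mirrors * ags_per_mirror
    ag = (dir_index - 1) % total_ags + 1      # wrap after the last AG
    mirror = (ag - 1) // ags_per_mirror + 1   # two consecutive AGs per mirror
    return ag, mirror


for n in (1, 2, 3, 4, 24, 25, 26):
    ag, mirror = place_directory(n)
    print("directory {:2d} -> AG {:2d} on mirror {:2d}".format(n, ag, mirror))
```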

Because of its allocation group design XFS is the only filesystem that
can accomplish this level of parallelism with a concatenated array and
small email files.  All others must rely on striped arrays, either
RAID10 or 5/6.  These come with the inefficiencies of writing/reading
files as small as 2KB on a stripe ranging from 256KB-1MB or larger,
depending on the number of disks in the array and the chosen stripe
size.  If you have a high write load, the Linux allocator will pack
multiple files into a single stripe, but one rarely sees 100% efficiency
here.  Even at 100% on writes, at low read rates, you end up reading a
lot of full 256KB-1MB stripes just to get a 2KB file, wasting bandwidth
and filling up the buffer cache with unneeded data, not to mention any
read cache on your hardware RAID controller or SAN head.
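
To put rough numbers on that waste, here is a back-of-the-envelope
calculation in Python.  It assumes the worst case described above, where
fetching a file means reading whole I/O units (a full stripe on the
striped array, a 4KB block on the concat), and it ignores caching and
readahead.  The function name is my own.

```python
def read_amplification(file_bytes, io_unit_bytes):
    """Bytes read from disk per byte of file data, reading whole I/O units."""
    units = -(-file_bytes // io_unit_bytes)  # ceiling division
    return units * io_unit_bytes / file_bytes


KB = 1024
file_size = 2 * KB  # the 2KB email from the text
for label, unit in (("4KB block", 4 * KB),
                    ("256KB stripe", 256 * KB),
                    ("1MB stripe", 1024 * KB)):
    print("{:>12}: {:.0f}x".format(label, read_amplification(file_size, unit)))
```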

The only potential downside to this setup is the rare situation where
all of your currently logged-in users have their mailboxes in the same
AG, or in two AGs on the same spindle.  This is a theoretical
possibility, but the probability is extremely low and I have yet to see
it happen.
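
For a feel of how low, here is a small Python estimate.  It assumes each
active user's mailbox is equally likely to sit on any mirror,
independently of the others, which is a simplification of real
placement; the function name is hypothetical.

```python
from fractions import Fraction


def prob_all_on_one_mirror(active_users, num_mirrors=12):
    """Chance that every active mailbox sits on the same mirror.

    Assumes uniform, independent placement: the first user can be
    anywhere, and each further user must match that mirror.
    """
    if active_users < 1:
        raise ValueError("need at least one active user")
    return Fraction(1, num_mirrors) ** (active_users - 1)


for users in (2, 5, 10):
    print("{:2d} active users: {:.3g}".format(
        users, float(prob_all_on_one_mirror(users))))
```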

-- 
Stan