[Dovecot] resilience suggestion

Bart Smaalders barts at smaalders.net
Fri Feb 9 16:42:50 UTC 2007


David Lee wrote:
> On the whole we are pleased with our trials of dovecot to replace UW-IMAP.
> 
> But (ah!) we have hit one particular problem, in which we think dovecot
> could probably benefit from a resilience improvement.
> 
> We're running dovecot on Fedora Core 5 (FC5), with passwd map details
> supplied by NIS.  We have found that "nscd" sometimes thinks that a
> username is invalid, even though it is valid.  So when "deliver" attempts
> a delivery to the INBOX of that username, it receives "user unknown" from
> the name service, and then does a 5xx permanent failure of valid email.
> From the user perspective "The System" has incorrectly rejected perfectly
> valid incoming email.  It is rare, but it does occasionally happen on
> large, busy systems.
> 
> Clearly it is fundamentally an "nscd" bug.  But that bug is nevertheless
> out there, in the wild, on such systems, potentially affecting dovecot's
> delivery of valid user email.
> 
> We have had a source code hack since October (in "deliver.c", simply
> replacing a "return ret" occurrence with "return EX_TEMPFAIL") and it has
> worked nicely (ported forward from rc8 towards rc22).  Mail re-queues and
> a later delivery attempt then succeeds.
> 
> So it would be both helpful, and good for resilience against such real
> OS/nscd bugs (and similar), if there were a configuration option in
> dovecot to allow a site admin to tell deliver to use a temporary, 4xx,
> failure instead (if the circumstances were appropriate for the site).
> 
> Could this be considered please, Timo?
> 


I wrote the nscd that's used on Solaris back in 1995.  If the Fedora
release's nscd is just bungling the lookup, no work-around is possible
and you need to disable at least the passwd cache in the nscd if that's 
possible.  On the other hand, are you sure this isn't an intermittent
NIS server issue?

What a program should do when the name service isn't actually
responding, on the other hand, is a tricky problem, whether that
program is the nscd or postfix or dovecot.  The right answer depends
on the consequences of failure and on what info you can get back from
the name service.

Obviously, if getpwnam_r() could be convinced to return EAGAIN if one of
the name services was not responding, this would be a GOOD THING, since
this would map directly to a TEMPFAIL.  However, there are other system
services that fail miserably when the user's account info isn't
available, so for those, hanging until the NIS server recovers is a
better choice.

[The hard thing about distributed systems is always failure semantics.]

Absent tunable nscd failure semantics, I suggest that the following
may be useful alternatives for intermittent NIS server problems:


1) construct a redundant NIS architecture with additional slave NIS
    servers that fail over... this is what we use internally at Sun
    w/ varying degrees of success.

2) ypcat the passwd map periodically and map it into a local passwd
    file.  Some script smarts are required to avoid hideous problems
    if you get a truncated passwd map... this is quite robust if done
    correctly.  I'm one of the odd folks who has their mail delivered
    to their desktop; I keep a copy of my passwd entry on the local
    machine so I don't lose mail if the NIS server craps out again.
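
For option 2, the "smarts" amount to refusing to install a dump that
looks damaged.  A rough sketch in C, assuming the ypcat output has
already been written to a temporary file; the function names, the
minimum-entry threshold and the paths are all hypothetical, and how
the result is merged with the local passwd file is left out.

```c
#include <stdio.h>

/* Count entries in a passwd-format stream.  Returns the number of
 * well-formed lines (exactly 7 colon-separated fields), or -1 if any
 * line is malformed or the last line has no trailing newline, which
 * is what a truncated dump usually looks like. */
long count_passwd_entries(FILE *fp)
{
    long entries = 0;
    int colons = 0, len = 0, c;

    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            if (len == 0 || colons != 6)
                return -1;          /* empty or malformed line */
            entries++;
            colons = 0;
            len = 0;
        } else {
            if (c == ':')
                colons++;
            len++;
        }
    }
    return len == 0 ? entries : -1; /* leftover bytes: truncated */
}

/* Install tmp_path as dest_path only if the dump passes the checks
 * and holds at least min_entries entries.  rename() replaces the
 * destination atomically when both paths are on one filesystem.
 * Returns 0 on success, -1 if the dump was rejected or on error. */
int install_passwd_dump(const char *tmp_path, const char *dest_path,
                        long min_entries)
{
    FILE *fp = fopen(tmp_path, "r");
    if (fp == NULL)
        return -1;

    long n = count_passwd_entries(fp);
    fclose(fp);

    if (n < min_entries)
        return -1;                  /* keep the previous good copy */
    return rename(tmp_path, dest_path);
}
```

Choosing min_entries close to the expected size of the map means a
half-transferred dump is rejected and the last good copy stays put.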

- Bart

