[Dovecot] some errors on 1.1.3

Karsten Künne kuenne at rentec.com
Mon Oct 6 19:12:51 EEST 2008


On Sunday 05 October 2008 07:16:23 Timo Sirainen imposed structure on a
stream of electrons, yielding:
> On Wed, 2008-10-01 at 21:15 -0400, Karsten Künne wrote:
> > we recently upgraded dovecot from 1.0.5 to 1.1.3+latest patches. After
> > the upgrade I saw some errors in the logs and I'm not sure whether
> > they're a cause for concern. Our setup is 2 servers running openSUSE 11.0
> > and a NetApp Filer as storage backend. Following is the config:
>
> So it's possible that two users are accessing the same mailbox from two
> different servers at the same time?
>

It's not impossible but it should be very rare. Users connect to a generic 
server name which resolves to two different IP numbers via round-robin DNS. 
It's not perfect but it has worked well for us in the past.

> It was working the same way with v1.0 too? Without errors? That sounds
> strange, since that kind of a setup should have been much more broken
> with v1.0.
>

Well, I didn't really look for errors with 1.0 if no user complained so I 
can't say for sure that there were no errors. At least, no user 
complained :-) , and that's the most important. I haven't gotten any real 
complaints with 1.1 either but I'm looking a bit more now because of the new 
version.

> > mail_location:
> > maildir:~/.Maildir:CONTROL=/var/mailctl/%n:INDEX=/var/mailctl/%n
>
> I guess /var/mailctl is also on netapp?
>

Yes, but it has no quotas set.

> > mail_nfs_storage: yes
> > mail_nfs_index: yes
>
> These should have prevented these errors, but I guess they don't fully
> work everywhere then..
>
> > Oct  1 07:39:27 host1 dovecot: IMAP(user1): Invalid transaction log size
> > (83856 vs 83160): /var/mailctl/corey/.INBOX/dovecot.index.log
>
> ..
>
> None of these should ever happen.
>

After further examination the sequence of errors seems to be like this:

First:

Oct  5 23:10:14 host1 deliver(usera): Corrupted transaction log 
file /var/mailctl/usera/.INBOX/dovecot.index.log: Unexpected garbage at EOF

Second:

Oct  5 23:10:14 host1 deliver(usera): Fixed index 
file /var/mailctl/usera/.INBOX/dovecot.index: log_file_seq 222 -> 223

Third:

Oct  5 23:28:07 host1 deliver(usera): Transaction log 
file /var/mailctl/usera/.INBOX/dovecot.index.log: marked corrupted

Fourth:

Oct  5 23:28:07 host1 dovecot: IMAP(usera): Invalid transaction log size (608 
vs 240): /var/mailctl/usera/.INBOX/dovecot.index.log
Oct  5 23:28:07 host1 dovecot: IMAP(usera): Invalid transaction log size (608 
vs 240): /var/mailctl/usera/.INBOX/dovecot.index.log
....

and so on.

I observed this pattern at least 3 times. It IS possible that the index files 
were already corrupted by 1.0.5. Hard to tell now.

> > Oct  1 09:14:55 host1 dovecot: IMAP(user2): Broken
> > file /var/mailctl/user2/.INBOX/dovecot-uidlist line 15299: UIDs not
> > ordered (206464 > 206464)
> >
> > Oct  1 09:16:43 host1 dovecot: IMAP(user3): Broken
> > file /var/mailctl/user3/.Trash/dovecot-uidlist line 22: UID larger than
> > next_uid (153866 >= 153866)
>
> These especially could cause visible problems to clients.
>

That's not good news. It probably doesn't matter for the Trash folder but the 
INBOX is definitely more critical. So far, I haven't heard any complaints.

After we upgraded our second pair of mailservers I believe I found an 
explanation for at least some of these errors. We did all of the upgrades by 
upgrading just one of the servers and then the second one the following day. 
So, for one day we had a mixture of 1.1.3 and 1.0.5 running, accessing the 
same mailstore and the same control and index files. I guess that wasn't such 
a great setup and could have lead to some corruption.

> > Oct  1 07:29:05 host1 dovecot: IMAP(user4): nfs_flush_fcntl:
> > fcntl(/var/mailctl/user4/.INBOX/dovecot.index.cache, F_RDLCK) failed:
> > Interrupted system call
>
> Do you get a lot of these? This means it was trying to flush NFS read
> cache by read-locking the file, but it failed after waiting for 60
> seconds. All of these locks should be really short, so if it's timing
> out things aren't working very well.

Fortunately, these errors disappeared in the last 3 days. I'll keep an eye out 
for them and I'll also check the NetApp logs.


Karsten.
-- 
Paul's Law:
	In America, it's not how much an item costs, it's how much you
save.


More information about the dovecot mailing list