[Dovecot] error in 1.1.2
Ive been getting a LOT of these errors since 1.1.2: So far ive seen this with 1800 customers, and we're getting active complaints about errors in imap clients. When I check the users log I always see this error.. This always happens with COPY commands in Squirrelmail.
Sep 2 20:09:35 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): rename(/var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache) failed: No such file or directory Sep 2 20:09:35 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): rename(/var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache) failed: No such file or directory Sep 2 20:09:35 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache: No such file or directory Sep 2 20:09:35 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache: No such file or directory Sep 1 20:14:06 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): rename(/var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache) failed: No such file or directory Sep 1 20:14:06 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): rename(/var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache) failed: No such file or directory Sep 1 20:14:06 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache: No such file or directory Sep 1 20:14:06 userimap4.xs4all.nl dovecot: IMAP(xxxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/j/jd/xxxxxxx/.Inveztor/dovecot.index.cache: No such file or directory
FreeBSD 6.2, NFS storage for everything. Any idea?
Cor
On Mon, 2008-09-08 at 17:40 +0200, Cor Bosman wrote:
Ive been getting a LOT of these errors since 1.1.2:
What version did you use before?
*always*? Does the patch here help: http://dovecot.org/list/dovecot/2006-December/018145.html
But even then, errors in cache file handling are ignored and they shouldn't be visible to clients. What error exactly do the clients see? Does the COPY command fail with NO? Does anything else get logged besides these rename/file_dotlock_replace lines?
Ive been getting a LOT of these errors since 1.1.2:
What version did you use before?
I used 1.1-rc4 -> 1.1.1 -> 1.1.2
I phrased it wrong. All complaints are exactly the same, but it of course doesnt happen with every COPY, else our customers would be unable to use imap :)
They all paste some squirrelmail error with a dutch message basically saying: connection dropped by imap server: Query: COPY 983,987 "Trash" It's always a COPY command they're pasting.
They often say it took a while to get the error, and some people are suggesting it's only with large emails. So it could be a PHP timeout related bug, although im not positive.
I dont think it's a coincidence that every single person that complaints has that set of errors in the dovecot error log. Over the past week ive found 1800 seperate users, and thats still only a fraction of our total user base.
I dont know exactly what the clients see. It looks like php loses the connection to the imap server. Ive been trying to find a customer that can repeat this consistently so I can turn rawlog on. I dont see any dovecot crashes though.
I'll add that patch to 1 or 2 servers and see if the errors disappear on those servers.
Cor
On Sep 8, 2008, at 11:28 PM, Cor Bosman wrote:
If copying a single mail takes longer than the dotlocking timeout,
another process may have overridden the lock file and caused errors.
And since it took so long, maybe PHP or something timed out.
This should help figuring out if the problem is due to timeouts: http://hg.dovecot.org/dovecot-1.1/rev/241097889792
FYI, im getting the same problem on local FS if i use dotlocking. Ive switched one server back to local FS indexes and im still getting this error. So far it looks like the error disappears if I switch to fcntl (and mmap/fsync back to normal)
Sep 9 10:54:17 userimap1 dovecot: IMAP(xxxxxx): rename(/usr/local/var/dovecot-index/m/ms/xxxxxx/.INBOX/dovecot.index.cache.lock, /usr/local/var/dovecot-index/m/ms/xxxxxx/.INBOX/dovecot.index.cache) failed: No such file or directory Sep 9 10:54:17 userimap1 dovecot: IMAP(xxxxxx): file_dotlock_replace() failed with index cache file /usr/local/var/dovecot-index/m/ms/xxxxxx/.INBOX/dovecot.index.cache: No such file or directory
It's still possible that 2 processes hit the same lockfile, but at least it's not possibe that 2 different machines do.
I'll do the following, unless you say different.
- keep one server on local indexes with fcntl
- set one server up with the previous patch you sent me
- set one server up with timeout patch
- upgrade one server to 1.1.3.
Where is this dotlocking timeout configured anyways?
Cor
Ive got this patch running on a server, but im not seeing any additional log entries. Here's one that just came by on the server that I added your last patch to.
Sep 9 15:39:07 userimap3.xs4all.nl dovecot: IMAP(xxxxxx): rename(/var/spool/mail/dovecot-control/indexes/a/an/xxxxxx/.Vergadering Plenair/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/a/an/xxxxxx/.Vergadering Plenair/dovecot.index.cache) failed: No such file or directory Sep 9 15:39:07 userimap3.xs4all.nl dovecot: IMAP(xxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/a/an/xxxxxx/.Vergadering Plenair/dovecot.index.cache: No such file or directory
I see no difference on 1.1.3, i still get errors. And the patch you sent me in your first reply also doesnt change things.
Only thing that seems to work is just moving away from dotlock/nfs, but im not sure that fixes the actual error people are reporting. I just dont see the error log entry anymore (obviously).
Cor
Oops, I do see a logline, just not in the error log. Weird.
Sep 9 16:42:54 userimap3.xs4all.nl dovecot: IMAP(xxxxxxx): rename(/var/spool/mail/dovecot-control/indexes/d/dr/xxxxxxx/.Trash/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/d/dr/xxxxxxx/.Trash/dovecot.index.cache) failed: No such file or directory Sep 9 16:42:54 userimap3.xs4all.nl dovecot: IMAP(xxxxxxx): Our dotlock file /var/spool/mail/dovecot-control/indexes/d/dr/xxxxxxx/.Trash/dovecot.index.cache.lock was deleted (kept it 61 secs) Sep 9 16:42:54 userimap3.xs4all.nl dovecot: IMAP(xxxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/d/dr/xxxxxxx/.Trash/dovecot.index.cache: No such file or directory
This is from the full log file...
Cor
On Tue, 2008-09-09 at 17:16 +0200, Cor Bosman wrote:
60 seconds is also when Dovecot decides the dotlock file is stale. So I guess the cache file compression is taking longer than that. Hmm. I hadn't thought about that before, since it was supposed to happen rarely enough. But I guess during the compression other processes shouldn't be stuck waiting for it. I'll have to think about something - probably make the lock timeout only a couple of seconds and after that just fail to update it.
But the cache compression really shouldn't take that long unless you're really really running out of disk I/O. I wonder if there's some problems with the locking / NFS caching. You do have mail_nfs_*=yes, right?
grep nfs dovecot.conf
mail_nfs_storage = yes mail_nfs_index = yes
What exactly is being compressed? We have 500MB mailspace, and plenty of users have that kind of space actually in use.
I dont think the storage servers are slow in any way. Else we'd be seeing this much more. As i said, so far it's less than 1% of our users that I can find with that cache error, but most of those do say they have a lot of large files. So maybe it's something with lots of emails, and the need to update a large portion of the cache? It probably needs to go over the wire twice right? Once to read, once to write?
One thing is, we dont use deliver, so whenever dovecot hits the user's email it most likely will have to re-index all new email.
Cor
On Tue, 2008-09-09 at 17:31 +0200, Cor Bosman wrote:
Did a couple of changes:
http://hg.dovecot.org/dovecot-1.1/rev/898e3810c014 http://hg.dovecot.org/dovecot-1.1/rev/e3c5acf92b53
Basically deleted space is removed from dovecot.index.cache by recreating the file and leaving out the deleted parts.
You could check how large the dovecot.index.cache file is for those users. Normally it's something like 10-20% of the mailbox size, but it really depends on the emails.
One thing is, we dont use deliver, so whenever dovecot hits the user's email it most likely will have to re-index all new email.
There's no "reindexing". It just sees a new mail and adds it to the index. Not a problem in any way.
Oh, now im getting assert failed with those 2 patches applied..
Sep 9 18:33:59 userimap3 dovecot: Panic: IMAP(xxxxx): file mail-cache.c: line 572 (mail_cache_lock): assertion failed: ((ret <= 0 && !cache->locked) || (ret > 0 && cache->locked))
Cor
- Cor Bosman <cor@xs4all.nl>:
Oh, now im getting assert failed with those 2 patches applied..
Sep 9 18:33:59 userimap3 dovecot: Panic: IMAP(xxxxx): file mail-cache.c: line 572 (mail_cache_lock): assertion failed: ((ret <= 0 && !cache->locked) || (ret > 0 && cache->locked))
Same here.
-- Ralf Hildebrandt (Ralf.Hildebrandt@charite.de) snickebo@charite.de Postfix - Einrichtung, Betrieb und Wartung Tel. +49 (0)30-450 570-155 http://www.arschkrebs.de I'm looking for a job PowerPoint & Windows. Office and NT - I might like'em more after my lobotomy
After applying the recent patches:
- root <root@postamt.charite.de>:
- root <root@postamt.charite.de>:
- root <root@postamt.charite.de>:
- root <root@postamt.charite.de>:
- Cor Bosman <cor@xs4all.nl>:
-- Ralf Hildebrandt (i.A. des IT-Zentrums) Ralf.Hildebrandt@charite.de Charite - Universitätsmedizin Berlin Tel. +49 (0)30-450 570-155 Gemeinsame Einrichtung von FU- und HU-Berlin Fax. +49 (0)30-450 570-962 IT-Zentrum Standort CBF I'm looking for a job!
On Sep 9, 2008, at 7:35 PM, Cor Bosman wrote:
Sorry, this helps: http://hg.dovecot.org/dovecot-1.1/rev/5835adeb6359
Now im seeing these:
Sep 9 19:28:25 userimap3 dovecot: IMAP(xxxxx): file_dotlock_create() failed with index cache file /var/spool/mail/dovecot-control/indexes/u/ul/xxxxx/.Collega Coaching/dovecot.index.cache: Resource temporarily unavailable
Is that any better?
Cor
On Tue, 2008-09-09 at 20:20 +0200, Cor Bosman wrote:
Yes, although the error message could be changed to "locking timed out". But at least now the error shouldn't be visible to clients (other than small slowdowns due to the 2 second lock wait).
Anyway, the real problem is one of:
a) Dovecot is really locking dovecot.index.cache file for a long time for some reason and other processes are timing out because of it.
b) Some process is crashing and leaving stale dovecot.index.cache.lock files lying around. But that'd have to be a .lock from another server, because on the same server Dovecot checks to see if the PID exists and if not it'll just override the lock immediately.
c) NFS caching problems: the .lock file was deleted by server1 but server2 didn't see that, so it keeps assuming that the file exists long after it was really gone.
On Tue, 2008-09-09 at 21:26 +0300, Timo Sirainen wrote:
Oh, right, local fs has the same problem, but not with fcntl. That's kind of weird. Then pretty much none of my ideas are valid. It rather sounds like the lock_method=dotlock handling code would be broken. But I just tried and can't reproduce this problem..
Can you reproduce it with imaptest (using a test account so it won't break your mails :)? http://imapwiki.org/ImapTest
Almost all cache files are very small. There is no reason this should take a long time. Unless there's something weird in the cache building code that keeps it in a never ending state.
That could be more likely. We have 30 servers operating on this spool, so if some of them have crashing processes that keep a .lock on a different server, that may cause issues right? Could even be from some old dovecot version? I checked last weeks logs, and i had almost no crashes. About 100 'killed with signal' log lines, out of a few zillion log entries.
im doing a find now on dovecot.index.cache.lock files on our nfs indexes dir.
But what about this... im also seeing the same problem if I keep nfs=yes and dotlock on a local filesystem instead of NFS. That should exclude any multiple-nfs server issues right? Or will doing nfs=yes on a local FS give weird results?
I should just move everything to Linux..
Cor
On Tue, 2008-09-09 at 20:39 +0200, Cor Bosman wrote:
The old code was overriding the .cache.lock files if their mtime was older than 1 minute. The new code is overriding them after 5 minutes. So unless the processes keep crasing all the time it shouldn't be the problem.
Should work fine.
I should just move everything to Linux..
I just tested on my kvm FreeBSD 7.0 installation. I can't reproduce it there either with "imaptest logout=0".
But yes, Linux's NFS client implementation works better with Dovecot than FreeBSD's, since FreeBSD's NFS caches can't be flushed reliably.
I just tested on my kvm FreeBSD 7.0 installation. I can't reproduce it there either with "imaptest logout=0".
Neither can i..
As a test I set 1 server up with local FS again, but with NFS=yes and mmap/fsync etc as if it's nfs index, and im getting the same errors.
So it doesnt seem related to our NSF server.
Cor
No, i cant reproduce it at all with imaptest, thats why i said 'neither can i'. I get enough of em from customers that i can wait for it though. But i can never predict who gets one. Ive been trying to find a customer that can reproduce it, but so far no luck. Our helpdesk has called some people and they say it usually works fine again if they try again, but then sometimes a day later they get the same problem on a different mail. It usually happens when they delete a file and it gets copied to the trash.
Any specific commands I should try with imaptest?
Still need dovecot -n?
Cor
On Wed, 2008-09-10 at 09:42 +0200, Cor Bosman wrote:
Yes, but the next sentence said you got the errors with a test server. :)
Since it's copying related, try running two instances of imaptest where one copies messages and another runs on the destination box:
./imaptest logout=0 copy=100 copybox=Trash ./imaptest box=Trash append=0
Still need dovecot -n?
I guess it could still show something useful. :)
./imaptest logout=0 copy=100 copybox=Trash ./imaptest box=Trash append=0
Ok, running those 2 now.
Still need dovecot -n?
I guess it could still show something useful. :)
1.1.2: /usr/local/etc/dovecot.conf
base_dir: /var/run/dovecot/ ssl_ca_file: /etc/ssl/certs/verisign.pem ssl_cert_file: /etc/ssl/certs/userimap.pem ssl_key_file: /etc/ssl/private/userimap.pem disable_plaintext_auth: no login_dir: /var/run/dovecot//login login_executable: /usr/local/libexec/dovecot/imap-login login_greeting: XS4ALL User-IMAP ready. login_process_size: 512 login_processes_count: 32 login_max_processes_count: 1024 max_mail_processes: 1024 mail_max_userip_connections: 32 first_valid_uid: 20 mail_location: maildir:%Nu:INBOX=%Nu:INDEX=/var/spool/mail/dovecot-control/indexes/%1u/%2u/%u:CONTROL=/var/spool/mail/dovecot-control/%1u/%2u/%u/INBOX mailbox_idle_check_interval: 60 mmap_disable: yes mail_nfs_storage: yes mail_nfs_index: yes lock_method: dotlock maildir_stat_dirs: yes mbox_read_locks: dotlock_try mbox_write_locks: dotlock_try mail_executable: /usr/local/libexec/dovecot/rawlog /usr/local/libexec/dovecot/imap mail_process_size: 512 mail_plugins: imap_quota quota imap_capability: IMAP4rev1 SORT THREAD=REFERENCES MULTIAPPEND UNSELECT LITERAL+ IDLE CHILDREN NAMESPACE LOGIN-REFERRALS QUOTA imap_client_workarounds: outlook-idle delay-newmail namespace: type: private separator: / location: maildir:%Nu:INDEX=/var/spool/mail/dovecot-control/indexes/%1u/%2u/%u:INBOX=%Nu:CONTROL=/var/spool/mail/dovecot-control/%1u/%2u/%u/INBOX inbox: yes list: yes subscriptions: yes namespace: type: private separator: / prefix: home/ location: mbox:~/mail/:INDEX=/var/spool/mail/dovecot-control/indexes/%1u/%2u/%u hidden: yes subscriptions: yes auth default: cache_size: 1024 passdb: driver: passwd-file args: /usr/local/etc/dovecot.masterusers passdb: driver: pam args: cache_key=%u%r dovecot userdb: driver: passwd plugin: quota: fs
It took about 1.5 hours, but I got 1..running the above 2 tests at the same time.
Sep 10 22:06:41 imap dovecot: IMAP(scorpio): rename(/var/spool/mail/dovecot-control/indexes/s/sc/scorpio/.Trash/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/s/sc/scorpio/.Trash/dovecot.index.cache) failed: No such file or directory Sep 10 22:06:41 imap dovecot: IMAP(scorpio): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/s/sc/scorpio/.Trash/dovecot.index.cache: No such file or directory
Thats on an unpatched server.. default 1.1.2 code. It seems to be a very rare occurance.
Cor
On Wed, 2008-09-10 at 22:49 +0200, Cor Bosman wrote:
Looks like I managed to reproduce it in about 5 minutes with the original cache timeouts. imaptest reported that several commands were hanging for over 60 seconds then. I'll try to figure out if this can be prevented.
On Thu, 2008-09-11 at 11:11 +0300, Timo Sirainen wrote:
This should fix it: http://hg.dovecot.org/dovecot-1.1/rev/7772018e917c
I'll still do some small tweaks for the cache locking code.
I've been following this thread: the bug is related with copying mail across NFS in combination with cache locking, right? Cor uses FBSD; but is this a bug that might impact other platforms as well?
Cheers,
Jan
----- Original Message ----- From: "Cor Bosman" <cor@xs4all.nl> To: "Dovecot Mailing List" <dovecot@dovecot.org> Cc: "Cor Bosman" <cor@xs4all.nl> Sent: Thursday, September 11, 2008 5:48 PM Subject: Re: [Dovecot] error in 1.1.2
On Sep 11, 2008, at 9:00 PM, Jan van den Berg wrote:
The main bug here was that with lock_method=dotlock when cache file
was being compressed, it could have tried to lock the cache file
again, seeing that the .lock file already exists (created by itself)
and waiting for it until timeout and then overriding it and causing
the compression to fail. The OS or filesystem didn't matter.
Ive been able to grab the dovecot.rawlog output of a few people with this problem.. basically this is what I see with all of them:
Sep 10 13:43:13 userimap7.xs4all.nl dovecot: IMAP(xxxxxx): rename(/var/spool/mail/dovecot-control/indexes/c/co/xxxxxx/.Junk E-mail/dovecot.index.cache.lock, /var/spool/mail/dovecot-control/indexes/c/co/xxxxxx/.Junk E-mail/dovecot.index.cache) failed: No such file or directory Sep 10 13:43:13 userimap7.xs4all.nl dovecot: IMAP(xxxxxx): file_dotlock_replace() failed with index cache file /var/spool/mail/dovecot-control/indexes/c/co/xxxxxx/.Junk E-mail/dovecot.index.cache: No such file or directory
da56 UID COPY 11078 "Junk E-mail"
And I see no result in the .out file, so the client waits for a result, doesnt get it, and times out.
I can also absolutely verify that if I take the NFS server out of the equation completely I still get this problem.
Cor
On 9/10/2008 8:18 AM, Cor Bosman wrote:
da56 UID COPY 11078 "Junk E-mail"
So this is Outlook and its Junk Email Filters auto-copying these messages...
Nah, thats coincidental, the target folder doesnt seem to matter.
Ok, but what about the client - is this outlook specific?
--
Best regards,
Charles
participants (6)
-
Charles Marcus
-
Cor Bosman
-
Jan van den Berg
-
Ralf Hildebrandt
-
Scott Silva
-
Timo Sirainen