[Dovecot] Spam filtering (was: Re: Sieve mails with decoded subject)

Fri Dec 11 08:06:29 EET 2009

On 12/10/2009 2:28 PM, Johannes Bauer wrote:
> Eduardo M KALINOWSKI schrieb:
>> On Qui, 10 Dez 2009, Johannes Bauer wrote:
>>> I'm thinking about filtering all such encoded subjects (as there's no
>>> reason to encode them US-ASCII), but suppose it were UTF-8 or something:
>>> how can I filter on the actual content, not the encoded subject? Surely
>>> someone has solved that problem already?
>>
>> Yes, such as the guys behind SpamAssassin, or dspam, or any of the many
>> spam filtering programs that exist. Actually, they make much more
>> complicated decisions instead of only looking for bad words in the
>> subject field. I'd suggest you try installing one of them.
>
> I had SpamAssassin running once and was pretty disappointed. All those
> complicated rules and scoring and "smart" bayesian filtering did not
> work very well, although I taught it in around 50k mails right from
> wrong. I had both lots of false-positives and lots of false-negatives,
> which was kind of annoying.
>
> However, analyzing 274 spam mails I deleted in the last 5 months I can
> conclude that by using that extremely simple filter list I'd catch 258
> of them (that's 94%). So I'd like to stick to KISS in this case.
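
On the original question of matching the decoded text rather than the raw header, here is a minimal Python sketch. It assumes the check runs outside Sieve, for example in a delivery-time script. `decode_header` and `make_header` come from the standard library's `email.header`; the word list and the function names are made up for illustration, since the actual filter list isn't shown in this thread.

```python
from email.header import decode_header, make_header

# Hypothetical word list; the real filter list is not shown in this thread.
BAD_WORDS = ("pills", "casino")


def decoded_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words (=?charset?Q|B?...?=) into plain text."""
    return str(make_header(decode_header(raw_subject)))


def subject_is_suspicious(raw_subject: str) -> bool:
    """Match the word list against the decoded subject, case-insensitively."""
    text = decoded_subject(raw_subject).lower()
    return any(word in text for word in BAD_WORDS)


if __name__ == "__main__":
    for raw in ("=?US-ASCII?Q?Cheap_pills?=", "Meeting at 10"):
        print(repr(decoded_subject(raw)), subject_is_suspicious(raw))
```

Decoding first means the same word list works whether the subject arrived as plain text, quoted-printable or base64, in whatever charset.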

From what I've seen, SA has been extremely good and accurate for us.
We use amavisd-new as the interface to it, but SA sits at the end of a
long chain of checks.

Between the three HELO checks, clamav-milter, and an SPF policy daemon,
we're rejecting ~60% of all connections at SMTP time.  (I analyzed that
in November: instead of 65 spams/day hitting my inbox, I would've seen 6x
that amount if it weren't for those checks.  So ~80% of all spam was
getting blocked at SMTP time.)  If we were to pay for the Spamhaus Zen
list, we could probably boost that percentage to 90%.
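
For anyone checking the arithmetic behind that ~80%, here is a rough sketch using only the two figures above (65 spams/day reaching the inbox, and 6x that without the SMTP-time checks). The function name is just for illustration.

```python
def blocked_fraction(delivered_per_day: float, multiplier: float) -> float:
    """Fraction of spam stopped before delivery.

    delivered_per_day: spam that still reached the inbox (65 in the post).
    multiplier: how many times more would have arrived without the
                SMTP-time checks (6 in the post).
    """
    would_have_arrived = delivered_per_day * multiplier
    blocked = would_have_arrived - delivered_per_day
    return blocked / would_have_arrived


if __name__ == "__main__":
    fraction = blocked_fraction(65, 6)
    print(f"{fraction:.1%} of spam blocked at SMTP time")
```

With those inputs that is 325 of 390 messages per day, or 5/6, which rounds to the "~80%" quoted. Note this is a fraction of spam only; the ~60% figure is a fraction of all connections, legitimate ones included.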

All of the domains we do business with get a -2 or -4 score adjustment
in amavisd-new.  Specific addresses get a larger negative score.  I
trained the SA Bayes filter on a few thousand spam and ham messages,
then turned it on.  We tag messages with a [spam] flag at 5.0 and
quarantine at 9.0.  Tagged messages go to the user's Inbox; quarantined
messages get sieve'd into a sub-folder in the user's mailbox.
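
To make the thresholds concrete, here is a small Python model of the decision described above. It is only a sketch of the logic, not amavisd-new or SA configuration: the domain table, the address table, and the function names are hypothetical, and the adjustment values are examples in the range mentioned.

```python
TAG_THRESHOLD = 5.0         # at or above: subject gets the [spam] flag
QUARANTINE_THRESHOLD = 9.0  # at or above: filed into the sub-folder

# Hypothetical sender adjustments (assumed values, for illustration only).
DOMAIN_ADJUSTMENTS = {"partner.example": -2.0, "customer.example": -4.0}
ADDRESS_ADJUSTMENTS = {"boss@partner.example": -8.0}


def adjusted_score(raw_score: float, sender: str) -> float:
    """Apply the per-address adjustment if present, else the per-domain one."""
    sender = sender.lower()
    if sender in ADDRESS_ADJUSTMENTS:
        return raw_score + ADDRESS_ADJUSTMENTS[sender]
    domain = sender.rpartition("@")[2]
    return raw_score + DOMAIN_ADJUSTMENTS.get(domain, 0.0)


def disposition(score: float) -> str:
    """Map a final score to what happens to the message."""
    if score >= QUARANTINE_THRESHOLD:
        return "quarantine"  # sieve'd into the user's sub-folder
    if score >= TAG_THRESHOLD:
        return "tag"         # delivered to Inbox with [spam] in the subject
    return "deliver"         # delivered untouched


if __name__ == "__main__":
    for raw, sender in ((6.5, "someone@unknown.example"),
                        (6.5, "sales@partner.example"),
                        (12.0, "someone@unknown.example")):
        print(sender, raw, disposition(adjusted_score(raw, sender)))
```

The point of the negative adjustments is that a borderline message from a known business contact drops back below the tag level, while the same score from an unknown sender still gets flagged.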

So far (in a month), no false positives.  Or at least none that people 
have complained were quarantined when they should not have been.  I'm 
considering lowering the quarantine threshold next month.

It's been nice to have my Inbox back, without 65 spams/day cluttering it
up.  Now I might see 2-5 per day that slip through without getting
tagged as borderline spam (at 5.0 or higher).  Those are mostly zero-day
spam that hasn't made it onto the URIBLs or DNSBLs yet.

I'm still debating grey-listing, Razor, DCC or paying for the Spamhaus 
Zen list.

Compared to another, commercial, product that we were using a few years
ago, SA is very good.  It's not perfect, but it does a good job of
classifying things with decent accuracy.