Who Here Likes Spam?

Is it really necessary to start a post like this with “I hate spam”, or “I really hate spam”, or “I think all spammers should die horribly in a plane crash”? Is there anyone who relishes the latest home mortgage or PediPaws offer?

This site has been using SpamAssassin for quite a while now. It works great and lets all the mail users on the system filter their mail for spam with no effort on their part. But it has been falling down on the job lately. No one else complains; they all get very few spam messages coming through. But I am getting around 20-30 per day now, up from 2 or 3 just a few months ago. I attribute this to all the wonderful vendors I’ve used over the years, purchasing or downloading things I want, who have given me the gift of unwanted mail by selling my e-mail address to, well, pretty much everyone (thanks for that).

How many spam filters fit on the head of a nuclear elephant?

So, I came across DSpam. You’ve gotta love any open source project hosted at a site named NuclearElephant.com. DSpam seems to hold a lot of promise. It takes a different approach from SpamAssassin. It uses adaptive filtering, as opposed to SpamAssassin’s cocktail approach. This means it has to see lots of what you consider to be spam as well as non-spam so it can learn to tell the difference. The problem with that approach, in my opinion, is that it is just as likely to produce a false positive (something you want that gets misclassified as spam) as a false negative (spam that makes it into your inbox). SpamAssassin, as it is set up on marrin.com, pretty much never produces false positives, but lets a bit of spam through because of it. A little bit is ok, but 20-30 a day is above my threshold of pain, so I had to do something.

So I decided to keep SpamAssassin and add DSpam as a second filter. This is easy with procmail. Here is my recipe:

DROPPRIVS=yes
:0fw: spamassassin.lock
* < 256000
| /usr/bin/spamc

:0
* ^X-Spam-Status: Yes
* ^X-Spam-Level: \*\*\*\*\*\*
{
:0
{ RULE="SPAM" }
:0:
/dev/null # toss it
}

:0
* ^X-Spam-Status: Yes
{
:0
{ RULE="SPAM" }
:0:
$HOME/mail/spam
}

:0fw
| /usr/local/bin/dspam --stdout --deliver=innocent,spam --mode=toe

:0
* ^X-DSPAM-Result: spam
$HOME/mail/dspam

This is basically 5 rules. The first simply runs SpamAssassin (via spamc) on each incoming message smaller than 256000 bytes and marks it with headers which score the message. I have SpamAssassin consider anything with a score of 4 or higher to be spam, so those messages get marked as such. Next I look at any message which SpamAssassin has marked as spam AND which has a score of 6 or more, and just toss it outright. SpamAssassin adds an X-Spam-Level line with one star for each point it scores, so if that line has 6 or more stars, the message gets the axe. Then I place any remaining message marked as spam, meaning a score between 4.0 and 5.9, into a spam folder for further review. I’ve never seen false positives here, but I can check it nonetheless.
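If you want to check where a saved message would land without running it through procmail, the three SpamAssassin outcomes can be sketched in plain shell. This is only an illustration, not part of my setup: the function name classify_message is made up, and it assumes the message already carries the X-Spam-Status and X-Spam-Level headers that spamc adds.

```shell
#!/bin/sh
# Hypothetical helper (not part of the procmail recipe): reads one message
# on stdin and prints where the SpamAssassin rules would send it.
#   toss   = marked as spam with 6 or more stars
#   review = marked as spam with fewer than 6 stars
#   pass   = not marked as spam, so it goes on to the next filter
classify_message() {
    # Keep only the header block: everything up to the first blank line.
    headers=$(sed '/^$/q')
    # -i because procmail conditions are case-insensitive by default.
    if ! printf '%s\n' "$headers" | grep -qi '^X-Spam-Status: Yes'; then
        echo pass
    elif printf '%s\n' "$headers" | grep -qi '^X-Spam-Level: \*\*\*\*\*\*'; then
        echo toss
    else
        echo review
    fi
}
```

For example, classify_message < some-saved-message prints one of the three words. The star pattern is the same one the recipe uses, so changing the toss threshold is a matter of adding or removing a \* in both places.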

Next come the DSpam rules. First, I pass the message through DSpam, which marks the message with headers similar to those that SpamAssassin adds. Finally, I take anything DSpam considers to be spam and place it into a dspam folder for review.

Setting up DSpam

Setting up DSpam is pretty much a pain in the ass. It assumes a much higher level of understanding than installing SpamAssassin does. It more or less assumes you will be installing it directly into the MTA (your mail server), which I am not about to do. Procmail works just fine, but the only tutorial offered for Procmail assumes you will be doing a per-user install. Still, I was able to cobble together enough information to figure out how to do it. The result is above. But first I needed to set up DSpam itself. That mostly just involves downloading and installing. I had to install the mysql developer tools and point the configure script at the /usr/local/lib64 folder (since my server is 64 bit), but it went pretty painlessly.

Next I had to set up a global database so that when I trained DSpam, all my users would be able to use that training. I added a /var/dspam/group file with the following:

dspam:shared:user1,user2,user3

where user1 etc. are the actual users on my system. This apparently allows me to train the dspam pseudo-user and have all the real users use that training. But DSpam is pretty inscrutable about what it’s doing, so I’m not sure this is working. Next I downloaded a corpus of good and bad mail messages and used dspam_train to train the dspam pseudo-user with all these messages.

Unleashing the DSpam

With everything set up, I started looking at messages coming in. Lo and behold, DSpam was catching some spam. It has also caught a couple of good messages. I have a couple of mailboxes (one for false positives and one for false negatives) where I put the misclassified messages, and a couple of cron jobs check these once an hour and retrain DSpam on them. For instance, here is the script to retrain on a good message misclassified as spam:

/bin/cat /home/cmarrin/mail/dspam-LearnAsHam | /usr/bin/formail -s /usr/local/bin/dspam --user dspam --mode=teft --class=innocent --source=error

Once the mail has been reclassified, I toss it:

/bin/cat /dev/null > ~cmarrin/mail/dspam-LearnAsHam # empty the mailbox once its messages have been retrained
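
The two steps can also be folded into one small function that a cron job calls for each mailbox. This is a rough sketch rather than what I actually run: the name retrain_mailbox and the FORMAIL and DSPAM override variables are my own inventions, and it assumes formail returns a non-zero exit status when something goes wrong, which I have not verified.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands above.
# Usage: retrain_mailbox MAILBOX CLASS   (CLASS is "innocent" or "spam")
# FORMAIL and DSPAM are assumed override variables; they default to the
# paths used in this post.
retrain_mailbox() {
    mbox=$1
    class=$2
    # Nothing to do if the mailbox is missing or empty.
    [ -s "$mbox" ] || return 0
    # Split the mailbox into messages and feed each one to dspam.
    if "${FORMAIL:-/usr/bin/formail}" -s "${DSPAM:-/usr/local/bin/dspam}" \
        --user dspam --mode=teft --class="$class" --source=error < "$mbox"
    then
        # Empty the mailbox only after retraining reported success.
        : > "$mbox"
    fi
}
```

Calling retrain_mailbox /home/cmarrin/mail/dspam-LearnAsHam innocent covers the false positives, and the same call with the false-negative mailbox and the class spam covers the other direction. Unlike the two separate commands, this leaves the mailbox alone if the retraining step fails, so nothing is lost before DSpam has seen it.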

And that’s about it. I can supply more details if anyone is interested…