The conversation was short, and unsuccessful.
"It's really a simple request: each of the employees needs an additional paid half week each year," I pointed out.
"For what?" my boss asked. "Vacation? Community service? An office retreat?"
I shuffled uncomfortably. "Well, no. Filtering spam out of their email."
It sounds ridiculous, right? It isn't. Handling spam is a major expense for corporations, even if it doesn't show up as a line item in the budget.
I get about 120 spams a day (would anyone care to bet that number will drop? Didn't think so. :-) ). That's 43,800 spams a year. Let's assume I do nothing but identify and discard the spam, and can do so in 3 seconds per message. That's pretty fast, but I'm trying to come up with a conservative estimate. Even so, I'm spending 36.5 hours a year doing nothing but filtering spam.
*sigh*
Since my email address is visible on a lot of web sites and articles, I may well get more spam than most; let's assume I get twice as much as the average email address holder. That means that the average email recipient is spending roughly a half week a year filtering spam. If that's done on work hours, the equivalent of 1 salary for every hundred employees is wasted on filtering spam.
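The arithmetic behind those figures can be checked with a few lines of shell. This is only a sketch: the function name spam_hours is made up for illustration, and because shell arithmetic is integer-only the result is truncated to one decimal place.

```shell
#!/bin/sh
# spam_hours PER_DAY SECONDS_EACH
# Print the hours per year spent deleting spam by hand, to one decimal place.
# (Hypothetical helper; the inputs below are the figures from the text.)
spam_hours() {
    per_day=$1
    secs_each=$2
    # Total seconds per year.
    total_secs=$(( per_day * 365 * secs_each ))
    # Work in tenths of an hour (360 seconds) to get one decimal place
    # out of integer-only arithmetic; the result is truncated, not rounded.
    tenths=$(( total_secs / 360 ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

spam_hours 120 3    # the author's mailbox: prints 36.5
spam_hours 60 3     # half that volume: prints 18.2 (18.25 truncated)
```

Changing the two arguments makes it easy to plug in the numbers for your own mailbox.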
This is why my blood pressure goes up when I see a spammer in an interview saying, "What's the problem? If you don't like the message, just delete it."
In addition to the time spent handling it, we also have to account for the time spent reporting illegal spams, the steady interruptions, and the time and equipment spent on filtering it out. I'll try to minimize this latter cost by leading you through setting up a very powerful spam filter - Spamassassin - with the world's most common mail server - Sendmail.
Your current mail setup probably acts something like this. Incoming mail arrives at port 25 of your mail server, where Sendmail is listening for incoming connections. Sendmail verifies that the mail is destined for someone with a local account. Sendmail hands the mail off to the procmail program. If you don't have a .procmailrc in your home directory on the mail server, the message is placed directly in /var/spool/mail/{your_user_id}. If you do have a .procmailrc, procmail consults the rules (such as "place all mail from [email protected] in the family folder") in that file to decide what to do with the message.
We're going to add a few rules to .procmailrc that first run a program that scores each message on how likely it is to be spam, and then, based on this score, file it into an almost-certainly-spam or probably-spam folder. Mail that doesn't hit either cutoff gets sent through to the original email folder.
Why Spamassassin? I've used 6 different spam filtering approaches over the past 6 years (junkfilter, spambouncer, a home-grown filter, razor1, razor2, and spamassassin). All of them helped, to some degree, but none of the others have as comprehensive a list of tests as Spamassassin. Here are the tests Spamassassin can use:
Note that none of the above criteria, by themselves, is enough to say a message is definitely spam; I can come up with examples for any of the above where a given rule will misfire and incorrectly increase or decrease the spam score. However, when taken together, the collection is marvelously strong and accurate at identifying spam and ham.
There are a number of steps to take, but many of them only need to be done once by a mail server administrator.
First off, make sure that your mail server is working correctly, accepting and delivering mail. I'm assuming you're using Sendmail on an rpm-based distribution. It's not a problem if you're not; for Debian users the install may be as simple as "apt-get install {packagename}", and other non-rpm distribution users are probably comfortable installing these programs from source.
Instructions and hyperlinks for a number of distributions are at the download page. I'm going off the RPM approach, so I pull down the perl-Mail-Spamassassin, spamassassin, and spamassassin-tools i386 rpms from Theo Van Dinter's site.
Most of the commands in this section should be performed by the root user.
cd ~
mkdir spamassassin
cd spamassassin
wget http://spamassassin.kluge.net/perl-Mail-SpamAssassin-2.51-2.i386.rpm
wget http://spamassassin.kluge.net/spamassassin-2.51-2.i386.rpm
wget http://spamassassin.kluge.net/spamassassin-tools-2.51-2.i386.rpm
wget ftp://ftp.kluge.net/pub/felicity/RPMS/perl-Net-DNS-0.33-0tvd.noarch.rpm
Before we can install these, we need to get some perl modules. Many of these will be right on your vendor's CD; the remainder should be at Theo's supplementary RPM site.
#For Redhat 7.2
rsync -av zaphod.stearns.org::redhatmirror/pub/redhat/linux/7.2/en/os/i386/RedHat/RPMS/perl-HTML-Parser-3.25-2.i386.rpm .
rsync -av zaphod.stearns.org::redhatmirror/pub/redhat/linux/7.2/en/os/i386/RedHat/RPMS/perl-HTML-Tagset-3.03-3.i386.rpm .
#For Redhat 7.3
rsync -av zaphod.stearns.org::redhatmirror/pub/redhat/linux/7.3/en/os/i386/RedHat/RPMS/perl-HTML-Parser-3.26-2.i386.rpm .
rsync -av zaphod.stearns.org::redhatmirror/pub/redhat/linux/7.3/en/os/i386/RedHat/RPMS/perl-HTML-Tagset-3.03-14.i386.rpm .
Now we install the Spamassassin RPMs:
rpm -Uvh perl-Mail-SpamAssassin-*.i386.rpm spamassassin-*.i386.rpm perl-HTML-Parser-*.i386.rpm perl-HTML-Tagset-*.i386.rpm perl-Net-DNS-*.noarch.rpm
On RedHat 7.2, you may need to add --nodeps if rpm complains of a missing perl(HTML::Parser); the perl-HTML-Parser obviously provides this resource but doesn't appear to correctly declare so.
To avoid the overhead of starting a fresh copy of perl each time a new mail message comes in, there's a background daemon called spamd that holds most of the spam scoring code. Let's start that up:
/etc/rc.d/init.d/spamassassin start
To check that it's running, try:
[root@slartibartfast spamassassin-kit]# netstat -anp | grep spamd
tcp 0 0 127.0.0.1:783 0.0.0.0:* LISTEN 4753/spamd -d -c -a
unix 2 [ ] DGRAM 5988777 4753/spamd -d -c -a
This says that spamd is running under PID 4753 - your PID will differ. It's listening on a Unix socket and TCP port 783 but only for connections coming from localhost.
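If you'd rather have a script make that check, something like the following could work. It is a minimal sketch: the function name spamd_listening is made up, and it assumes the netstat column layout shown above and the TCP port 783 seen in that output.

```shell
#!/bin/sh
# spamd_listening: read "netstat -anp" output on stdin and succeed (exit 0)
# only if some TCP socket is in LISTEN state on port 783.
# Hypothetical helper; assumes the column layout of the sample output
# (protocol in column 1, local address in column 4, state in column 6).
spamd_listening() {
    awk '$1 == "tcp" && $4 ~ /:783$/ && $6 == "LISTEN" { found = 1 }
         END { exit !found }'
}

# Usage (as root, so netstat can show program names):
#   netstat -anp | spamd_listening && echo "spamd is listening"
```

Because the function only reads standard input, it can be tried against saved netstat output before being wired into anything that runs on a schedule.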
Note that we don't have to do anything about making it start on next boot as the RPM has done that for us. If you're not using rpms, use whatever approach is appropriate for starting a given service in your default runlevel; you may need to run tools like ntsysv or chkconfig or may need to rename a file in /etc/rc3.d or /etc/rc5.d .
To demonstrate, I'll do all the following with a bogus user called "spamtest", which I'll add now as root.
adduser spamtest
The following steps will need to be taken for each person that would like their mail filtered; substitute the correct username everywhere you see spamtest. Everything from this point on is done as the user for whom we're filtering mail.
su - spamtest
cd ~
mkdir .spamassassin
cd .spamassassin
cp -p /usr/share/spamassassin/user_prefs.template user_prefs
cat <<EOF >>user_prefs
rewrite_subject 1
report_header 1
use_terse_report 1
defang_mime 0
report_safe 0
use_razor2 0
use_bayes 1
auto_learn 1
ok_locales en
EOF
All of the configuration options between cat and EOF will be added to user_prefs by the cat command. See "perldoc Mail::SpamAssassin::Conf" for more info on these settings.
Let's see if spamassassin's working before we go on:
spamc -R </usr/share/doc/spamassassin-*/sample-nonspam.txt
-6.3/5.0
* -6.3 -- Contains a PGP-signed message

spamc -R </usr/share/doc/spamassassin-*/sample-spam.txt
8.4/5.0
* 0.7 -- From: does not include a real name
* 0.6 -- Invalid Date: header (not RFC 2822)
* 1.4 -- Valid-looking To "undisclosed-recipients"
* 1.5 -- BODY: Information on how to work at home (2)
* 1.5 -- BODY: Drastically Reduced
* 0.8 -- BODY: List removal information
* 0.7 -- BODY: Once in a lifetime, apparently
* 0.2 -- Date: is 12 to 24 hours before Received: date
* 0.6 -- RBL: Received via a relay in relays.osirusoft.com [RBL check: found 142.249.10.63.relays.osirusoft.com., type: 127.0.0.3]
* 0.4 -- Message-Id is not valid, according to RFC 2822
When the nonspam message is fed into spamc, spamc hands the text off to spamd, which calculates the actual spam score. Because the message has no spam characteristics, it earns no positive points, and it gets -6.3 because it's a PGP-signed message. The final score is -6.3, far below the 5.0 needed to categorize it as spam.
The second message has a bunch of spam characteristics. None, by themselves, are enough to categorize it as spam, but together they give it a score of 8.4.
If you don't see something like this output (format and exact score may vary, that's OK) for the two test messages, you should take the time to figure out why before going on.
Now that spamc is correctly identifying mail, let's set up the spamtest user to actually use it and filter mail.
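To turn that report into a simple yes/no answer in a script, a small filter along these lines may help. It is a sketch only: verdict_from_report is a made-up name, and it assumes the first line of the "spamc -R" output has the score/threshold form seen in the two samples (for example 8.4/5.0).

```shell
#!/bin/sh
# verdict_from_report: read "spamc -R" style output on stdin and print
# "spam" or "ham". Hypothetical helper; it assumes the first line looks
# like SCORE/THRESHOLD, as in the sample output.
verdict_from_report() {
    awk -F/ 'NR == 1 {
        # Adding 0 forces a numeric comparison rather than a string one.
        if ($1 + 0 >= $2 + 0) print "spam"; else print "ham"
        exit
    }'
}

# Usage against the sample messages shipped with the package:
#   spamc -R </usr/share/doc/spamassassin-*/sample-spam.txt | verdict_from_report
```

If the report format on your version differs, the first-line assumption is the part to adjust.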
cd ~
mkdir .procmail
touch .procmail/proclog
mkdir mail
touch mail/mbox
We need to create a .procmailrc file in /home/spamtest . If the user doesn't already have one, here are some suggested starting points. If the user does have one, add the lines between "#Spamassassin start" and "#Spamassassin end" from the appropriate example to their existing file.
If the user gets their mail from this machine via IMAP or POP:
SHELL=/bin/sh
PATH=/bin:/usr/bin
PMDIR=$HOME/.procmail
LOGABSTRACT=all
MAILDIR=$HOME/mail #you'd better make sure it exists
LOGFILE=$PMDIR/proclog #recommended
VERBOSE=off

#Spamassassin start
:0fw: spamassassin.lock
| /usr/bin/spamc
#Spamassassin end
In the above example, procmail sends the message through spamc, which scores it and, if it is spam, changes headers such as Subject. The message is then allowed to pass straight through to its original destination (usually /var/spool/mail/spamtest). This one folder is the IMAP INBOX.
The spam messages can now be moved from folder to folder inside the user's mailreader; this job is made easier by the modified Subject line and the X-Spam-Status and X-Spam-Level headers - read on for more detail.
If the user reads their mail right on this machine (say, with pine) use this .procmailrc instead:
SHELL=/bin/sh
PATH=/bin:/usr/bin
PMDIR=$HOME/.procmail
LOGABSTRACT=all
MAILDIR=$HOME/mail #you'd better make sure it exists
LOGFILE=$PMDIR/proclog #recommended
VERBOSE=off
DEFAULT=$MAILDIR/mbox

#Mailing list start
#If you subscribe to any mailing lists, you might want to filter them off first:
:0:
* ^X-BeenThere: [email protected]
dshield

:0:
* ^X-BeenThere: [email protected]
uml-devel
#Mailing list end

#Spamassassin start
:0fw: spamassassin.lock
| /usr/bin/spamc

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*
spam10

:0:
* ^X-Spam-Status: Yes
spamassassin-spam
#Spamassassin end
A little explanation is needed. The first block sets up environment variables for use in the rest of the file.
The "mailing list" block is optional. If you're subscribed to mailing lists and want to have them sent off to different folders automatically, find a line in the header that identifies the list. "X-BeenThere:" is most commonly used, but "Errors-To:", "List-Id:", "Mailing-List:", "Reply-To:", "Sender:", and "X-Loop:" are other good ones to look for.
The spamassassin block is where we finally get some useful spam filtering. The first rule (the ":0fw: spamassassin.lock" and "| /usr/bin/spamc" lines) feeds the entire message off to spamc, which hands it off to spamd for spam score calculation. spamd also adds and/or modifies headers to indicate that the message is spam; in particular, it adds an "X-Spam-Status: Yes" header for any message with a spam score of 5.0 or above, and it also adds an "X-Spam-Level: *******" header with the number of asterisks equal to the integer part of the score. In other words, a spam with a score of 8.3 would get an "X-Spam-Level: ********" header.
We use the X-Spam-Level header to in effect give us two cutoffs: 5.0 and 10.0. Messages with a score of 10.0 or higher get relegated to a different folder, one that we don't need to check as often or, depending on your needs, at all. These are the ones that are almost certainly spam, with very little chance of false positives. With those out of the way, we can send all the remaining spams (with scores between 5.0 and 9.9) to the spamassassin-spam folder. This one should probably be checked from time to time, as some messages may be incorrectly identified as spam.
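The relationship between a score, its asterisks, and the folder it lands in can be mimicked in a few lines of shell. This is an illustration of the logic only, not something procmail or spamd runs; the function names spam_level and spam_folder are made up, and the scores are assumed to be well-formed numbers such as 8.3 or -6.3.

```shell
#!/bin/sh
# spam_level SCORE: print one asterisk per whole point of SCORE,
# mirroring the X-Spam-Level header as described in the text.
# (Hypothetical helper for illustration.)
spam_level() {
    whole=${1%%.*}              # integer part, e.g. 8.3 -> 8
    stars=""
    i=0
    while [ "$i" -lt "$whole" ]; do
        stars="$stars*"
        i=$(( i + 1 ))
    done
    echo "$stars"
}

# spam_folder SCORE: name the folder the two filing recipes would choose,
# using the 5.0 and 10.0 cutoffs from the text.
spam_folder() {
    whole=${1%%.*}
    if [ "$whole" -ge 10 ]; then
        echo spam10               # ten or more asterisks
    elif [ "$whole" -ge 5 ]; then
        echo spamassassin-spam    # flagged as spam, but under 10
    else
        echo mbox                 # falls through to DEFAULT
    fi
}

spam_level 8.3      # prints ********
spam_folder 8.3     # prints spamassassin-spam
```

The same idea explains why the order of the recipes matters: the ten-asterisk test has to come first, because every message that scores 10 or more also carries "X-Spam-Status: Yes".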
As a side note, if the message doesn't match either test, we have no more procmail rules in this file to specify what to do with it. In that case, the message is sent to the mail folder specified in the DEFAULT variable; in this case, /home/spamtest/mail/mbox .
With this file in place, send a test message to [email protected]. The message should show up in /home/spamtest/mail/mbox with minimal headers saying that the message score is less than 5, and should not include the "X-Spam-Status: Yes" or "X-Spam-Level: **..." headers.
Now bounce a spam message you've received to [email protected] . If the spam score is 10 or higher, it should show up in /home/spamtest/mail/spam10 . If the score is between 5 and 9.9, it should show up in /home/spamtest/mail/spamassassin-spam .
If this didn't work, go back and find out why. That's why we do this with a test user that won't get angry if mail is lost. :-) Some files that may help in the investigation are /var/log/maillog and /home/spamtest/.procmail/proclog ; these may tell you where the mail was sent and possibly even why. With the three test messages I sent, proclog shows where they went:
From [email protected] Sun Mar 23 21:55:00 2003
 Subject: quick check
  Folder: /home/spamtest/mail/mbox 1569
From [email protected] Sun Mar 23 21:56:02 2003
 Subject: *****SPAM***** wow em this summer...start now
  Folder: spamassassin-spam 3299
From [email protected] Sun Mar 23 21:58:07 2003
 Subject: *****SPAM***** Earn great money from home! DVCMXNI
  Folder: spam10 10863
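Once more than a handful of messages have gone through, a quick per-folder tally of proclog can be easier to read than the raw log. The following is a sketch: folder_counts is a made-up name, and it assumes the "Folder:" lines look like the ones above, with folder names that contain no spaces.

```shell
#!/bin/sh
# folder_counts LOGFILE: count how many messages procmail delivered to
# each folder, based on the "Folder:" lines in the log.
# Hypothetical helper; assumes folder names without spaces.
folder_counts() {
    awk '$1 == "Folder:" { count[$2]++ }
         END { for (f in count) print count[f], f }' "$1" | sort -k2
}

# Usage:
#   folder_counts /home/spamtest/.procmail/proclog
```

Each output line is a count followed by the folder name as procmail logged it.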
While you're doing your investigation, you may wish to temporarily restore the user's original .procmailrc or simply rename this new one to something else so that more incoming mail is not misdirected.
At this point, I'm going to leave you to set up your users using this approach. Many of the checks are automatically working, including the Auto-Whitelist and Bayesian filtering, although both could be helped if you explicitly feed them known ham and spam. Razor2 is an excellent addition, and will probably show up in a future version of this article. The RBL checks should be working as well; some of your messages should include lines like:
* 0.6 -- RBL: Received via a relay in relays.osirusoft.com [RBL check: found 142.249.10.63.relays.osirusoft.com., type: 127.0.0.3]
With spamassassin in place, your users' mail should now be automatically filtered into folders. They still get just as much, but it's not a constant interruption. By filtering into "almost certainly spam" and "probably spam" folders, you actually have cut down on the number of messages to which your users have to actively pay attention. I've spent quite a bit of time teaching my spamassassin installation about known spam and ham, so I feel comfortable making the second cutoff at 8.0 (spams with a score higher than this I won't look at at all, but they're still available if I later find out that something was misclassified). This means I never look at 78% of the incoming spam, changing my yearly spam week into a yearly spam day.
And I figure I can spend 2 of my 4 free days this year writing a spam filtering article for you. :-)
By adding lines such as:
blacklist_from [email protected]
blacklist_from *@bonanzaoffers.com
blacklist_from *@deal-seeker.com
blacklist_from *@hispeedmediaoffers.com
blacklist_from *@jumpjive.com
blacklist_from *@*.ew01.com
to /etc/mail/spamassassin/local.cf or ~/.spamassassin/user_prefs , you tell spamassassin that mail from any of these domains gets a +100 spam score, effectively blocking them. I'm compiling a list of spammer domains in a format you can cut and paste right into the spamassassin config files. The list, and a script that can be run from cron to automatically pull down and install the latest version, can be found at http://www.stearns.org/sa-blacklist/ . The current version of the list is at http://www.stearns.org/sa-blacklist/sa-blacklist.current . This list works for me, but you may wish to at least briefly look it over to see if it works for you.
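If you keep your own short list of unwanted sender domains, one per line, a small filter can turn it into lines in the format shown above. This is only a sketch: domains_to_blacklist and the file name my-domains.txt are made up for illustration, and it is not the script offered at the sa-blacklist site.

```shell
#!/bin/sh
# domains_to_blacklist: read one domain per line on stdin and print a
# matching "blacklist_from *@domain" line for each.
# Hypothetical helper; blank lines and lines starting with # are skipped.
domains_to_blacklist() {
    while read -r domain; do
        case $domain in
            ''|'#'*) continue ;;    # skip blank lines and comments
        esac
        echo "blacklist_from *@$domain"
    done
}

# Usage (my-domains.txt is a hypothetical file name):
#   domains_to_blacklist <my-domains.txt >>~/.spamassassin/user_prefs
```

Review the generated lines before appending them, since a typo in a domain can blacklist mail you actually want.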
Many thanks to everyone that has contributed blacklist entries.
If you want to make any user configuration options (like the above lines) apply to all spamassassin users, place them in /etc/mail/spamassassin/local.cf .
Warning - currently untested, use at your own risk.
By the way, did I mention this hasn't been tested even once?
The obvious next question is, "Can I just get spamassassin to process everyone's mail without having to mess with everyone's .procmailrc?"
Obviously the answer's yes, or I wouldn't have asked the question. *grin*
Please make sure you have spamassassin working correctly for at least a test user before going for the big Kahuna. Screwing up for all your mail users tends to hurt your chances for long-term employment.
With the warnings out of the way, here's what we'll do. We're going to run spamc out of the system-wide procmail configuration file, /etc/procmailrc . Requests made in here are performed for all locally delivered messages.
Set up a conservative set of configuration choices in /etc/mail/spamassassin/local.cf . In particular, it might be a good idea to raise the default cutoff for spam to 8.0, at least initially. If everything works and you're not getting any false positives in a week, lower it to 7, and then 6.5 or 6 a week later. This gives the AWL and bayes databases a chance to learn a bit before they're really crucial.
required_hits 8
rewrite_subject 1
report_header 1
use_terse_report 1
defang_mime 0
report_safe 0
use_razor2 0
use_bayes 1
auto_learn 1
ok_locales en
Now tell procmail to run spamc on everyone's mail. Add these to /etc/procmailrc :
DROPPRIVS=yes

:0fw
| /usr/bin/spamc
Finally, remove the two-line spamc recipe (the ":0fw: spamassassin.lock" and "| /usr/bin/spamc" lines) from everyone's individual /home/{user}/.procmailrc files. The scoring and header changes are already done when the message first passes through /etc/procmailrc; the custom requests like "filter all messages with a score of 10 or higher into this folder" still need to be done locally in /home/{user}/.procmailrc .
If you have a small number of users that specifically don't want their spam filtered, add a line like the following for each user or domain to /etc/mail/spamassassin/local.cf :
all_spam_to [email protected]
all_spam_to *@masochists.org
The above approach works fine for new mail, but what about mail that a user has already received?
If that mail is in an imap folder, you're in luck. Roger Binns has written IMAP Spam Begone to reprocess mail that is already in an IMAP folder.
There are a couple of assumptions:
./isbg.py --imaphost your-mailserver.com --imapinbox /path/to/your/inbox --spaminbox /path/to/your/spam/folder --delete --expunge
./isbg.py --imaphost mail.goober.com --imapinbox /var/spool/mail/joeblow --spaminbox /home/joeblow/mail/ilovespam --delete --expunge
It will prompt you for your IMAP password, then it will go through the inbox you specified (which may take a while), and then report what it found -- something like:
4 spams found in 6 messages
The --delete option marks the messages for deletion from your inbox. The --expunge option, used in conjunction with --delete, will cause the marked messages to be actually removed from the inbox (they will still be in ilovespam though).
isbg is smart, and will only look through messages it hasn't seen before (i.e., it won't go through your ENTIRE inbox each time -- only new messages will be scanned.) It does this via imap's use of unique message ids.
If you run isbg with the --savepw option the first time, it will remember your imap password (saved on disk in an obfuscated way) such that you can then make a cron job to run the script automatically.
Because the isbg script and spamassassin can run on a machine other than the mailserver, this approach can be used to filter mail on a remote IMAP mailserver that may not be able to run Spamassassin directly.
The Spamassassin team - and that includes anyone that has contributed to it - gets 2 thumbs up from me for an excellent tool! This is the first spam filtering tool I've used that I feel confident is doing an accurate job of identifying and filtering the spam onslaught.
Bill Stearns wrote the main text of the article. Marion Bates contributed the section on ISBG. Both Marion and Drew Como were kind enough to review an early draft of this article.
The spammers were kind enough to contribute the spam.