Bogofilter
SpamAssassin has been letting through more false negatives than when I started using it. I typically get between one and five spams per day in my inbox rather than in the spam folder where SpamAssassin-detected stuff goes. So after hearing some people on an unnamed IRC network talk about Bogofilter successes, I decided to give it a whirl.
Bogofilter itself comes in an RPM, but it requires a library called
Judy, which does not. Judy’s
build process seems really complex, and I question the need for that.
I’m in no position to question it, having hardly inspected it, but I
still question it. So there. Anyway, here is my RPM .spec file for
Judy. You should be
able to grab the Judy sources and run rpmbuild -ba on it (remember,
it's rpmbuild -ba, not rpm -ba, under RH 8.0) to your heart’s
content. I also note that the spec file was done up quickly and
doesn’t do things it probably should, like patching their install
scripts rather than half-assed writing my own. There is a path
containing ia32 in the spec file, too, so it will only build on
x86 without some very minor changes, though I question whether the
library can be built on other architectures at all.
Anyway, with that out of the way, I decided to hook Bogofilter into my
.mailfilter file. Since it needs to be trained, and I’m not
sure I entirely trust it, I just set it up to add a header to the
message. I also set it up so that spams identified by SpamAssassin
are reported to Bogofilter as spam. Here are the relevant portions of
my .mailfilter:
# This is my Bogofilter test code. Adds a header based on the result
# of Bogofilter. Note that this header is strictly informational right
# now; it is not used for any filtering.
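# bogofilter exits with status 0 when it thinks the message is spam.
trash=`bogofilter`
if ($RETURNCODE == 0)
{
    BOGOFILTER="yes"
}
else
{
    BOGOFILTER="no"
}
# Single quotes inside the double-quoted command keep the header together.
xfilter "formail -I 'X-Bogofilter: $BOGOFILTER'"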
# Filter the message through SpamAssassin. If it's spam, hand it to
# bogofilter as spam (we can always take it back with "bogofilter -H")
# and then put it in the spam folder (a Maildir folder).
xfilter "spamc"
if (/^X-Spam-Status: Yes/:hD)
{
    trash=`bogofilter -s`
    to mail/spam
}
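For anyone not using maildrop, the exit-code logic in the first block can be sketched in Python. This is only a sketch, not part of the setup described here: `tag_message` and its `command` parameter are made-up names, and the one thing it assumes about Bogofilter is what the .mailfilter already relies on, namely that it exits with status 0 when it considers the message spam.

```python
import subprocess
from email import message_from_bytes, policy


def tag_message(raw: bytes, command=("bogofilter",)) -> bytes:
    """Return raw with an X-Bogofilter header set from the command's exit status.

    `command` is a hypothetical parameter so the classifier can be swapped out.
    """
    # Feed the message on stdin. Exit status 0 is taken to mean "spam",
    # the same assumption the .mailfilter rule makes; anything else is "no".
    result = subprocess.run(command, input=raw, stdout=subprocess.DEVNULL)
    verdict = "yes" if result.returncode == 0 else "no"

    msg = message_from_bytes(raw, policy=policy.default)
    # Mimic formail -I: replace any existing header instead of appending.
    del msg["X-Bogofilter"]
    msg["X-Bogofilter"] = verdict
    return msg.as_bytes()
```

As in the .mailfilter, the header is informational only; nothing here files or deletes the message.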
I also added some macros to my .muttrc, based on the ones
recommended with Bogofilter (manually wrapped to fit on the page):
macro index \cd "<enter-command>unset wait_key\n<pipe-entry> bogofilter -s\n<enter-command>set wait_key\n<delete-message>"
macro index \cr "<enter-command>unset wait_key\n<pipe-entry> bogofilter -s\n<pipe-entry>spamassassin -r\n<enter-command> set wait_key\n<delete-message>"
macro index \cf "<enter-command>unset wait_key\n<pipe-entry> bogofilter -h\n<enter-command>set wait_key\n"
macro index \ef "<enter-command>unset wait_key\n<pipe-entry> spamassassin -d | bogofilter -H\n<enter-command>set wait_key\n"
These commands are all from the index, as you may have noticed. A brief summary, with keys in Emacs notation:
C-d: report the message to Bogofilter as spam and delete the message.
C-r: report the message as spam to both Bogofilter and SpamAssassin (my install should report to Razor), and then delete the message.
C-f: report the message to Bogofilter as ham (non-spam). This is used when training Bogofilter. Use it on any message you receive in your inbox that is non-spam.
M-f: change the disposition of a message from spam to ham in Bogofilter. This is used when a valid message is mistakenly marked as spam by SpamAssassin, put in the spam folder, and (via my .mailfilter) automatically reported to Bogofilter as spam. This calls spamassassin -d to remove any “SpamAssassin markup” from the message before feeding it to Bogofilter. (Actually, this key is Escape-f; in Emacs that’d be equivalent to M-f though, AFAIK.)
I found a flaw in my .mailfilter. The message is first filtered through
SpamAssassin. If the message is spam, it is then sent to
Bogofilter. However, because the xfilter command has already run
on the message, the SpamAssassin markup is now in the message. Not
good. I’ll try changing the rule to read
trash=`spamassassin -d | bogofilter -s` and hopefully it’ll interpret the | properly.
(Need to test this.)
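The ordering matters because whatever SpamAssassin adds to the message becomes part of what Bogofilter learns as "spam". As a toy illustration in Python, here is what stripping the most obvious markup looks like. It is not a substitute for spamassassin -d, which is meant to remove the rest of SpamAssassin's markup as well, and `strip_spam_headers` is a made-up helper name.

```python
from email import message_from_bytes, policy


def strip_spam_headers(raw: bytes) -> bytes:
    """Return raw with every X-Spam-* header removed (headers only)."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Collect the names first so the header list is not modified while iterating.
    marked = {name for name in msg.keys() if name.lower().startswith("x-spam-")}
    for name in marked:
        del msg[name]  # removes every occurrence, case-insensitively
    return msg.as_bytes()
```

Feeding the stripped message to the trainer, rather than the marked-up one, keeps tokens like the X-Spam-Status line out of the spam word list.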
As far back as I can remember, Bogofilter has correctly identified as spam the messages that get erroneously dropped into my inbox. It hasn’t flagged every message that SpamAssassin does, though. It seemed to have special trouble with Base64-encoded spam, of which there is a surprising amount. Apparently spammers, the wily bunch they are, have been encoding entire text messages in Base64 or quoted-printable to get past simple text filters. SpamAssassin decodes these for analysis (and scores something like 3.9 points against them for doing it), but I’m not certain whether Bogofilter does. I think it does, based on some web searches, but its results don’t necessarily reflect this. In any case, my results are kind of fubar because of the flaw I found above. I’ll probably have to kill the entire database I’ve built and rebuild from scratch, or at least the spam messages.
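To see why the encoding trick defeats a simple text filter, here is a small Python sketch using the standard email module. It says nothing about how Bogofilter or SpamAssassin handle this internally; `decoded_text` and the sample message are invented for illustration.

```python
import base64
from email import message_from_bytes, policy


def decoded_text(raw: bytes) -> str:
    """Return the text parts of a message with transfer encodings undone."""
    msg = message_from_bytes(raw, policy=policy.default)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() reverses Base64 / quoted-printable and decodes
            # the bytes using the part's declared charset.
            chunks.append(part.get_content())
    return "\n".join(chunks)


# A made-up message whose whole body is Base64-encoded plain text.
sample = (
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=us-ascii\n"
    b"Content-Transfer-Encoding: base64\n"
    b"\n"
    + base64.b64encode(b"Buy cheap pills now!")
    + b"\n"
)

print(decoded_text(sample))
```

A filter that tokenizes the raw bytes never sees the words "cheap" or "pills" in this message; one that decodes first does.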
BTW, there is information on using SpamAssassin and Bogofilter together. Haven’t tried it, but it looks pretty simple. I may very well end up using this after I’ve got Bogofilter trained better.