Catch the spam, keep your data - PHP spam filtering
Spam is annoying. Anytime you open up comments to the world, you must contend with all the crazies who want to place spammy links on your site, or who are just enough of a general nuisance to make you want to block comments entirely.
Then there are the constant offers of SEO (Search Engine Optimization) services in my inbox from the contact form, dressed up to look as legitimate as possible.
I don't really care if my site is not at the top of Google for many of my key terms, random person (or robot) from India. It's a personal site.
Besides, I do know a little bit about SEO myself and I've already tried to work as much of that into the site as I can, but the main way to get noticed by search engines is to first get noticed by actual people. This requires that you have interesting content, so that if people like what they see they'll link to you and your search engine rank will grow organically. You can't fake that: either people like the site enough to pass it on or they don't. For a personal site such as this one, it doesn't much matter either way.
Unfortunately, it does appear to matter to the spammers of the world. So, if you are going to accept comments, spammers are foaming at the mouth with desire to get their junk on the page. Something certainly needs to be done to address the issue.
When it came time to redevelop my site, I wanted to find a way to handle spam detection that met two conditions:
- Kept my data local, so that comments and contact details did not need to be submitted to an external spam detection service like Akismet, or controlled completely by a third party like Disqus. (Both are excellent services and a great solution for many websites.)
- Avoided the "simple" yet annoying solution of putting a CAPTCHA field on all comment forms. This puts up an unnecessary barrier to anybody who would potentially comment.
I'm a programmer and don't tend to like using services that provide functionality that I should be able to handle myself for personal projects. It certainly adds to my development time, but it also adds to my personal satisfaction and overall sense of accomplishment.
That said, I wouldn't want to sit down and write a Bayesian spam filter from scratch. It's a very interesting project, but one that I unfortunately would never have time to complete.
Enter b8, an open-source statistical spam filter by Tobias Leupold. I ran across it way back, maybe a couple of years ago, and have always had it on my list of things to try out. When it came time to redevelop my site, I finally had the platform to play around with it.
I was pleasantly surprised at how easily I could work this into the site, and it didn't take much training for the filter to start working its magic. I now have b8 integrated into my comment and contact forms and it's blocking the vast majority of the spam coming into the site, with very few false positives and negatives. For the few that slip through, I developed a simple little tool to manage incoming comments and reclassify them as necessary.
Currently I'm sitting at 367 spam comments and 49 spam contact emails. Satisfying indeed.