Contact Form Spam Detection with Machine Learning (Part 2)
In this blog post, I'll walk through how I implemented a Naive Bayes classifier that labels contact form submissions as either "ham" (legitimate) or "spam," and how the classifier can be retrained dynamically based on feedback.
A Naive Bayes classifier is a simple yet powerful probabilistic classifier based on Bayes' theorem, combined with the "naive" assumption that the words in a message are independent of each other given its class. In the context of spam detection, it uses the words in a message to estimate how likely the message is to be spam versus legitimate, and picks the more likely label.
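Written out, the decision rule looks roughly like the sketch below. Here N_c is the number of training messages with label c, N is the total number of training messages, w_1 ... w_n are the words of the incoming message, and |V| is the number of distinct words seen during training. The add-one (Laplace) smoothing in the second formula is an assumption: it is one common way to keep a word that was never seen with a label from producing a zero probability.

```latex
\hat{c} = \arg\max_{c \in \{\text{spam},\,\text{ham}\}}
  \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right],
\qquad P(c) = \frac{N_c}{N}

P(w \mid c) = \frac{\operatorname{count}(w, c) + 1}
                   {\sum_{w'} \operatorname{count}(w', c) + |V|}
```

Summing logarithms instead of multiplying raw probabilities keeps the score from underflowing to zero on longer messages.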
Project Overview
- Initial Training: The classifier is initially trained using an open-source dataset.
- Classification: Incoming messages are classified as either spam or ham.
- Reclassification: Messages can be reclassified if they are falsely marked.
- Updating the Classifier: The classifier is retrained with the new data and stored on disk as a serialized object for quick retrieval.
Step-by-Step Implementation
Setting Up the Initial Training Dataset
To start, I need a dataset of pre-labeled messages to train the classifier. There are many open-source datasets online, such as the Enron email dataset or the UCI SMS Spam Collection.
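As a minimal sketch of getting such a dataset into the [label, text] rows used for training, the helper below reads one message per line in the form "label, separator, text". The function name `loadLabeledMessages` and the tab default are assumptions for illustration. As far as I know the UCI SMS Spam Collection uses this tab-separated layout, but check the file you download and adjust the separator if needed.

```php
<?php
// Hypothetical helper (not part of the classifier itself): reads
// "label<separator>text" lines and returns rows of [label, text].
function loadLabeledMessages(string $filePath, string $separator = "\t"): array
{
    $lines = file($filePath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Cannot read dataset file: $filePath");
    }

    $rows = [];
    foreach ($lines as $line) {
        // Split only on the first separator so the message text stays intact.
        $parts = explode($separator, $line, 2);
        if (count($parts) !== 2) {
            continue; // skip lines without a label/text pair
        }
        $label = strtolower(trim($parts[0]));
        if ($label !== 'spam' && $label !== 'ham') {
            continue; // skip unknown labels
        }
        $rows[] = [$label, trim($parts[1])];
    }
    return $rows;
}
```

Once a classifier instance exists, the result can be passed straight to training, for example `$classifier->trainFromArray(loadLabeledMessages('path/to/dataset.tsv'));`.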
Building the Classifier in PHP
A barebones implementation of the Naive Bayes classifier:
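class NaiveBayesClassifier
{
    // Set of every distinct word seen during training (word => true).
    private array $vocabulary = [];
    // Word frequencies per class: label => [word => count].
    private array $wordCount = ['spam' => [], 'ham' => []];
    // Number of training messages per class.
    private array $classCount = ['spam' => 0, 'ham' => 0];
    // Prior probability of each class, refreshed after every training run.
    private array $classProbabilities = ['spam' => 0, 'ham' => 0];
    private int $totalDocuments = 0;

    // Each row is [label, text], where label is 'spam' or 'ham'.
    // Calling this again on a loaded model adds to the existing counts.
    public function trainFromArray(array $dataset): void {
        foreach ($dataset as $data) {
            $this->addWords($data);
        }
        $this->calculateClassProbabilities();
    }

    public function trainFromFile(string $filePath): void {
        //code to load dataset from csv file
    }

    private function addWords(array $data): void {
        $label = $data[0];
        $text = $data[1];
        $this->totalDocuments++;
        $this->classCount[$label]++;
        $words = $this->tokenize($text);
        foreach ($words as $word) {
            if (!isset($this->vocabulary[$word])) {
                $this->vocabulary[$word] = true;
            }
            if (!isset($this->wordCount[$label][$word])) {
                $this->wordCount[$label][$word] = 0;
            }
            $this->wordCount[$label][$word]++;
        }
    }

    private function tokenize(string $text): array {
        $text = strtolower($text);
        $text = preg_replace('/[^a-z0-9\s]/', '', $text);
        // Split on any run of whitespace. PREG_SPLIT_NO_EMPTY drops empty tokens
        // but keeps "0", which array_filter() would have discarded.
        return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    }

    private function calculateClassProbabilities(): void {
        if ($this->totalDocuments === 0) {
            return; // nothing trained yet, avoid dividing by zero
        }
        foreach ($this->classCount as $label => $count) {
            $this->classProbabilities[$label] = $count / $this->totalDocuments;
        }
    }

    // Assumption: add-one (Laplace) smoothing, so a word never seen with a
    // label still gets a non-zero probability and log() stays finite.
    private function wordProbability(string $word, string $label): float {
        $count = $this->wordCount[$label][$word] ?? 0;
        $totalWords = array_sum($this->wordCount[$label]);
        return ($count + 1) / ($totalWords + count($this->vocabulary));
    }

    public function classify(string $text): string {
        if ($this->totalDocuments === 0) {
            throw new \LogicException('The classifier has not been trained yet.');
        }
        $words = $this->tokenize($text);
        // Start from the log prior of each class, then add the log likelihood of each word.
        $scores = [
            'spam' => log($this->classProbabilities['spam']),
            'ham' => log($this->classProbabilities['ham']),
        ];
        foreach (['spam', 'ham'] as $label) {
            foreach ($words as $word) {
                $wordProbability = $this->wordProbability($word, $label);
                $scores[$label] += log($wordProbability);
            }
        }
        // Ties fall back to 'ham' so a legitimate message is not dropped on a coin flip.
        return $scores['spam'] > $scores['ham'] ? 'spam' : 'ham';
    }

    public function storeModel(string $modelPath): void {
        file_put_contents($modelPath, serialize($this));
    }

    public static function loadModel(string $modelPath): self {
        // Only allow this class to be instantiated from the serialized file.
        $model = unserialize(file_get_contents($modelPath), ['allowed_classes' => [self::class]]);
        if (!$model instanceof self) {
            throw new \RuntimeException("Could not load a classifier from $modelPath");
        }
        return $model;
    }
}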
Training the Classifier
I initialize the classifier and train it. To keep the example short, a handful of inline messages stands in for the open-source dataset here:
$classifier = new NaiveBayesClassifier();
$aMessages = [
    ['spam', 'Buy cheap products'],
    ['spam', 'Limited time offer'],
    ['spam', 'Click here for free money'],
    ['ham', 'Hello, I would like to inquire about your services'],
    ['ham', 'Can we schedule a meeting?'],
    ['ham', 'Thank you for your assistance'],
];
$classifier->trainFromArray($aMessages);
$classifier->storeModel('classifier.ser');
Classifying New Messages
For every new message received through the contact form, I can classify it:
$classifier = NaiveBayesClassifier::loadModel('classifier.ser');
$newMessage = 'Schedule a meeting now to get free products';
$result = $classifier->classify($newMessage);
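What happens with the result is up to the application. One possible sketch, assuming spam is held for review instead of being deleted so that false positives can still be corrected later: the function name `routeSubmission` and the returned keys are hypothetical, and `$classify` stands for any callable that returns 'spam' or 'ham'.

```php
<?php
// Hypothetical routing step around the classifier's verdict.
// $classify is any callable that maps a message to 'spam' or 'ham'.
function routeSubmission(callable $classify, string $message): array
{
    $label = $classify($message);

    return [
        'label' => $label,
        // Legitimate messages are delivered right away.
        'deliver' => $label === 'ham',
        // Suspected spam is kept for manual review, not discarded.
        'needsReview' => $label === 'spam',
    ];
}
```

With the classifier loaded above, this could be called as `routeSubmission([$classifier, 'classify'], $newMessage)`.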
Reclassifying Messages
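If a message is incorrectly classified, I can add it back with the correct label and retrain the classifier. Because trainFromArray adds to the existing counts, only the corrected messages need to be passed in: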
$aMessages = [
    ['ham', 'Thank you for your assistance'],
];
$classifier->trainFromArray($aMessages);
$classifier->storeModel('classifier.ser');
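When the feedback comes in as "this verdict was wrong", the corrected training row can be derived from the original prediction. The helper below is a small sketch of that idea; `correctionRow` is a hypothetical name, and it assumes the only two labels are 'spam' and 'ham'.

```php
<?php
// Hypothetical helper: turns a wrong prediction into a [label, text] row
// carrying the opposite label, ready to be passed to trainFromArray().
function correctionRow(string $message, string $predictedLabel): array
{
    if ($predictedLabel !== 'spam' && $predictedLabel !== 'ham') {
        throw new InvalidArgumentException("Unknown label: $predictedLabel");
    }

    $correctLabel = $predictedLabel === 'spam' ? 'ham' : 'spam';

    return [$correctLabel, $message];
}
```

For a message that was wrongly flagged as spam, the retraining call would then look like `$classifier->trainFromArray([correctionRow($message, 'spam')]);` followed by `storeModel()` as above.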