Contact Form Spam Detection with Machine Learning (Part 2)

In this blog post, I'll walk through how I implemented a Naive Bayes classifier that labels contact form submissions as either "ham" (legitimate) or "spam," and how the classifier can be retrained dynamically based on feedback.

A Naive Bayes classifier is a simple yet powerful probabilistic classifier based on Bayes' theorem. In the context of spam detection, it uses the words in a message to estimate the probability that the message is spam or legitimate.
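
Written out, the decision rule looks like this. It is the standard Naive Bayes formulation; the notation is introduced here for illustration, with w_1, ..., w_n standing for the words of a message and c for a class:

```latex
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}}
\Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big]
```

The "naive" part is the assumption that words are conditionally independent given the class, which is what lets the product factor word by word.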

Project Overview

  1. Initial Training: The classifier is initially trained using an open-source dataset.
  2. Classification: Incoming messages are classified as either spam or ham.
  3. Reclassification: Messages can be reclassified if they were labeled incorrectly.
  4. Updating the Classifier: The classifier is retrained with the new data and stored as a serialized class to disk for quick retrieval.

Step-by-Step Implementation

Setting Up the Initial Training Dataset

To start, I need a dataset of pre-labeled messages to train the classifier. There are many open-source datasets online, such as the Enron email dataset or the UCI SMS Spam Collection.
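
As a rough sketch of getting such a dataset into the [label, text] shape used in the rest of this post: the helper below assumes one "label<TAB>text" record per line, which is, to my knowledge, how the UCI SMS Spam Collection is laid out. The function name parseLabeledLines() is hypothetical, and the inline sample stands in for the real file contents.

```php
<?php
/**
 * Parse tab-separated "label<TAB>text" lines into [label, text] pairs.
 * parseLabeledLines() is a hypothetical helper, not part of the classifier class.
 */
function parseLabeledLines(string $contents): array
{
    $dataset = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        // Split on the first tab only, so tabs inside the message survive.
        $parts = explode("\t", $line, 2);
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }
        [$label, $text] = $parts;
        $label = strtolower(trim($label));
        // Keep only the two labels the classifier knows about.
        if ($label === 'spam' || $label === 'ham') {
            $dataset[] = [$label, trim($text)];
        }
    }
    return $dataset;
}

// Inline sample standing in for the real file contents.
$sample = "ham\tCan we schedule a meeting?\nspam\tClick here for free money\n\nbroken line";
$dataset = parseLabeledLines($sample);
```

Rows with an unknown label are dropped rather than trained on, since the classifier only keeps counters for 'spam' and 'ham'.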

Building the Classifier in PHP

A barebones implementation of the Naive Bayes classifier:


class NaiveBayesClassifier
{
     private array $vocabulary = [];
     private array $wordCount = ['spam' => [], 'ham' => []];
     private array $classCount = ['spam' => 0, 'ham' => 0];
     private array $classProbabilities = ['spam' => 0, 'ham' => 0];
     private int $totalDocuments = 0;

     public function trainFromArray(array $dataset) :void {
         foreach ($dataset as $data){
             $this->addWords($data);
         }
         $this->calculateClassProbabilities();
     }
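     public function trainFromFile(string $filePath) :void {
         // Assumption: a comma-separated file with one "label,text" row per message.
         $handle = fopen($filePath, 'r');
         if ($handle === false) {
             throw new RuntimeException("Cannot open dataset: $filePath");
         }
         $dataset = [];
         while (($row = fgetcsv($handle, null, ',', '"', '')) !== false) {
             // Skip blank lines and rows with a label the classifier does not know.
             if (count($row) >= 2 && isset($this->classCount[$row[0]])) {
                 $dataset[] = [$row[0], $row[1]];
             }
         }
         fclose($handle);
         $this->trainFromArray($dataset);
     }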
     private function addWords(array $data) :void {
         $label = $data[0];
         $text = $data[1];
         $this->totalDocuments++;
         $this->classCount[$label]++;
         $words = $this->tokenize($text);
         foreach ($words as $word) {
             if (!isset($this->vocabulary[$word])) {
                 $this->vocabulary[$word] = true;
             }
             if (!isset($this->wordCount[$label][$word])) {
                 $this->wordCount[$label][$word] = 0;
             }
             $this->wordCount[$label][$word]++;
         }
     }
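     private function tokenize(string $text) :array {
         $text = strtolower($text);
         $text = preg_replace('/[^a-z0-9\s]/', '', $text);
         // Split on any run of whitespace (spaces, tabs, newlines) and drop empty tokens.
         // array_filter() without a callback would also discard the token "0".
         return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
     }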
     private function calculateClassProbabilities() :void {
         foreach ($this->classCount as $label => $count) {
             $this->classProbabilities[$label] = $count / $this->totalDocuments;
         }
     }
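     public function classify(string $text) :string {
         $words = $this->tokenize($text);
         // Work in log space: summing logs avoids underflow from multiplying many small probabilities.
         $scores = ['spam' => log($this->classProbabilities['spam']), 'ham' => log($this->classProbabilities['ham'])];

         foreach (['spam', 'ham'] as $label) {
             foreach ($words as $word) {
                 $wordProbability = $this->wordProbability($word, $label);
                 $scores[$label] += log($wordProbability);
             }
         }

         // Ties go to 'ham'.
         return $scores['spam'] > $scores['ham'] ? 'spam' : 'ham';
     }
     // Assumption: the listing calls wordProbability() without showing it.
     // This version uses add-one (Laplace) smoothing so unseen words never yield log(0).
     private function wordProbability(string $word, string $label) :float {
         $count = $this->wordCount[$label][$word] ?? 0;
         $totalWords = array_sum($this->wordCount[$label]);
         return ($count + 1) / ($totalWords + count($this->vocabulary));
     }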
     public function storeModel(string $modelPath) :void{
         file_put_contents($modelPath, serialize($this));
     }
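     public static function loadModel(string $modelPath) :self {
         $data = file_get_contents($modelPath);
         if ($data === false) {
             throw new RuntimeException("Cannot read model file: $modelPath");
         }
         // Only allow this class to be instantiated during unserialization.
         $model = unserialize($data, ['allowed_classes' => [self::class]]);
         if (!$model instanceof self) {
             throw new RuntimeException("Invalid model file: $modelPath");
         }
         return $model;
     }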
}
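
classify() calls wordProbability() for every word and takes the log of the result, so that value must never be zero. One common way to guarantee this is add-one (Laplace) smoothing. The standalone sketch below illustrates the idea; the function name, parameter names, and the example counts are all made up for illustration.

```php
<?php
/**
 * Add-one (Laplace) smoothed estimate of P(word | class).
 * Function and parameter names are hypothetical.
 *
 * @param array<string,int> $wordCounts     occurrences of each word within one class
 * @param int               $vocabularySize number of distinct words across all classes
 */
function smoothedWordProbability(string $word, array $wordCounts, int $vocabularySize): float
{
    $count = $wordCounts[$word] ?? 0;  // 0 for words never seen in this class
    $total = array_sum($wordCounts);   // all word occurrences in this class
    return ($count + 1) / ($total + $vocabularySize);
}

$spamCounts = ['free' => 1, 'money' => 1, 'cheap' => 1];
$vocabularySize = 5; // made-up value for illustration

$seen   = smoothedWordProbability('free', $spamCounts, $vocabularySize);    // (1 + 1) / (3 + 5) = 0.25
$unseen = smoothedWordProbability('meeting', $spamCounts, $vocabularySize); // (0 + 1) / (3 + 5) = 0.125
```

Without the added 1, a single word that never appeared in the spam training data would give a probability of 0, and its log would drag the whole spam score to negative infinity.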

Training the Classifier

I initialize the classifier and train it. In practice the training data comes from the open-source dataset; a few inline sample messages stand in for it here:


$classifier = new NaiveBayesClassifier();
$aMessages = [
    ['spam','Buy cheap products'],
    ['spam','Limited time offer'],
    ['spam','Click here for free money'],
    ['ham','Hello, I would like to inquire about your services'],
    ['ham','Can we schedule a meeting?'],
    ['ham','Thank you for your assistance'],
];

$classifier->trainFromArray($aMessages);
$classifier->storeModel('classifier.ser');
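
To make the effect of training more concrete, here is a small standalone sketch of the class priors that calculateClassProbabilities() derives from the sample messages. classPriors() is a hypothetical helper that only mirrors that calculation outside the class.

```php
<?php
/**
 * Fraction of training messages per label.
 * classPriors() is a hypothetical helper mirroring calculateClassProbabilities().
 */
function classPriors(array $dataset): array
{
    // Column 0 of each row holds the label.
    $counts = array_count_values(array_column($dataset, 0));
    $total = count($dataset);
    return array_map(fn (int $n): float => $n / $total, $counts);
}

$priors = classPriors([
    ['spam', 'Buy cheap products'],
    ['spam', 'Limited time offer'],
    ['spam', 'Click here for free money'],
    ['ham', 'Hello, I would like to inquire about your services'],
    ['ham', 'Can we schedule a meeting?'],
    ['ham', 'Thank you for your assistance'],
]);
// Three of the six messages fall in each class, so both priors are 0.5.
```

With equal priors, the decision depends entirely on the word probabilities. A dataset with many more ham than spam messages would tilt the starting score toward ham.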

Classifying New Messages

For every new message received through the contact form, I can classify it:


$classifier = NaiveBayesClassifier::loadModel('classifier.ser');
$newMessage = 'Schedule a meeting now to get free products';
$result = $classifier->classify($newMessage);
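
classify() returns only a label. If a confidence value would be useful, for example to send borderline messages to manual review, the two log scores can be converted into a probability. The sketch below assumes the scores are made available by some variant of classify(), which the class above does not do; spamProbability() is a hypothetical name.

```php
<?php
/**
 * Convert the two log scores into P(spam | message).
 * spamProbability() is hypothetical; classify() as written returns only a label.
 */
function spamProbability(float $spamLogScore, float $hamLogScore): float
{
    // 1 / (1 + e^(ham - spam)) equals e^spam / (e^spam + e^ham),
    // but never exponentiates the large negative scores directly.
    return 1.0 / (1.0 + exp($hamLogScore - $spamLogScore));
}

$even   = spamProbability(-42.0, -42.0); // equal scores give 0.5
$likely = spamProbability(-40.0, -45.0); // spam score is higher, so the result is above 0.5
```

Because Naive Bayes treats words as independent, these probabilities tend to be pushed toward 0 or 1, so they are better read as a ranking signal than as calibrated odds.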

Reclassifying Messages

If a message is incorrectly classified, I can feed it back with the correct label and retrain the classifier:


$aMessages = [
    ['ham','Thank you for your assistance'],
];

$classifier->trainFromArray($aMessages);
$classifier->storeModel('classifier.ser');
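
Since retraining is triggered by feedback, two requests might try to write the model file at the same moment. One possible safeguard, sketched here as an assumption rather than something the classifier above does, is to write to a temporary file and then rename it over the old one. storeAtomically() and the demo file name are hypothetical.

```php
<?php
/**
 * Write a payload to disk by way of a temporary file and rename().
 * storeAtomically() is a hypothetical helper, not part of the classifier class.
 */
function storeAtomically(string $path, string $payload): void
{
    $tmp = $path . '.' . uniqid('tmp', true);
    if (file_put_contents($tmp, $payload, LOCK_EX) === false) {
        throw new RuntimeException("Could not write $tmp");
    }
    // On the same filesystem, rename() swaps the file in a single step on most POSIX systems.
    if (!rename($tmp, $path)) {
        unlink($tmp);
        throw new RuntimeException("Could not replace $path");
    }
}

// Round trip with a plain array standing in for the serialized classifier.
$path = sys_get_temp_dir() . '/classifier-demo.ser';
storeAtomically($path, serialize(['spam' => 3, 'ham' => 4]));
$restored = unserialize(file_get_contents($path), ['allowed_classes' => false]);
unlink($path);
```

A reader that loads the model while a write is in progress then sees either the old file or the new one, never a half-written one.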