Contact Form Spam Detection with Machine Learning (Part 2)

This post describes the implementation of a Naive Bayes classifier for detecting spam in contact form submissions. The classifier labels messages as either ham (legitimate) or spam, and it can be updated over time using newly labeled data.

Overview

The classifier is initially trained on a labeled dataset.
Incoming messages are classified as ham or spam.
Misclassified messages can be corrected.
The classifier is retrained with updated data and stored as a serialized model.

Initial Training

Training begins with a set of pre-labeled messages. Public datasets such as the Enron email corpus or the UCI SMS Spam Collection can be used. During training, each message is tokenized, word frequencies are counted per class, and Laplace smoothing is applied:

`P(word | class) = (count(word, class) + 1) / (totalWordsInClass + vocabularySize)`

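This ensures that a word never seen in a class during training still receives a small nonzero probability, so a single unseen word cannot reduce the probability of the whole message to zero.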
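
To make the training step concrete, here is a minimal sketch of tokenizing, counting, and smoothing. The function names (`tokenize`, `countWords`, `vocabularySize`, `smoothedProbability`) and the ASCII-only tokenizer are assumptions made for illustration; the classifier in this post may do these steps differently.

```php
<?php
declare(strict_types=1);

// Hypothetical tokenizer: lowercase, then split on anything that is not
// an ASCII letter, digit, or apostrophe.
function tokenize(string $text): array
{
    $tokens = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

// Count word occurrences per class from [label, text] pairs.
function countWords(array $messages): array
{
    $counts = [];
    foreach ($messages as [$label, $text]) {
        foreach (tokenize($text) as $word) {
            $counts[$label][$word] = ($counts[$label][$word] ?? 0) + 1;
        }
    }
    return $counts;
}

// Number of distinct words seen across all classes.
function vocabularySize(array $counts): int
{
    $vocabulary = [];
    foreach ($counts as $wordCounts) {
        $vocabulary += $wordCounts; // array union keeps one entry per word
    }
    return count($vocabulary);
}

// Laplace-smoothed P(word | class), following the formula above.
function smoothedProbability(array $counts, string $word, string $label, int $vocabularySize): float
{
    $count = $counts[$label][$word] ?? 0;
    $totalWordsInClass = array_sum($counts[$label] ?? []);
    return ($count + 1) / ($totalWordsInClass + $vocabularySize);
}

$counts = countWords([
    ['spam', 'Buy cheap products'],
    ['spam', 'Buy now'],
    ['ham', 'Thank you'],
]);
$size = vocabularySize($counts); // 6 distinct words

// "buy" occurs twice among the 5 spam words: (2 + 1) / (5 + 6) = 3/11
echo smoothedProbability($counts, 'buy', 'spam', $size), "\n";
// "buy" never occurs in ham, yet stays nonzero: (0 + 1) / (2 + 6) = 0.125
echo smoothedProbability($counts, 'buy', 'ham', $size), "\n";
```

Note that this sketch divides by zero if it is called before any training data has been counted, so a real implementation would want to guard against an empty model.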

Classifier Structure

The classifier tracks:

Document counts per class
Word counts per class
Vocabulary size
Class prior probabilities
Smoothed conditional probabilities for each word
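
One way to hold this state is sketched below. `ClassifierState` and its members are hypothetical names and not the internals of `NaiveBayesClassifier`; the sketch assumes only raw counts are stored, with priors and smoothed probabilities derived from them on demand.

```php
<?php
declare(strict_types=1);

// Hypothetical container for the tracked statistics.
final class ClassifierState
{
    /** @var array<string,int> number of training documents per class */
    public array $documentCounts = [];
    /** @var array<string,array<string,int>> word counts per class */
    public array $wordCounts = [];
    /** @var array<string,bool> set of all distinct words seen */
    public array $vocabulary = [];

    // Add one tokenized, labeled document to the counts.
    public function addDocument(string $label, array $tokens): void
    {
        $this->documentCounts[$label] = ($this->documentCounts[$label] ?? 0) + 1;
        foreach ($tokens as $word) {
            $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
            $this->vocabulary[$word] = true;
        }
    }

    public function vocabularySize(): int
    {
        return count($this->vocabulary);
    }

    // Class prior: share of training documents carrying this label.
    public function prior(string $label): float
    {
        $total = array_sum($this->documentCounts);
        return $total === 0 ? 0.0 : ($this->documentCounts[$label] ?? 0) / $total;
    }

    // Laplace-smoothed conditional probability of a word given a class.
    public function wordProbability(string $word, string $label): float
    {
        $count = $this->wordCounts[$label][$word] ?? 0;
        $totalWordsInClass = array_sum($this->wordCounts[$label] ?? []);
        return ($count + 1) / ($totalWordsInClass + $this->vocabularySize());
    }
}

$state = new ClassifierState();
$state->addDocument('spam', ['buy', 'now']);
$state->addDocument('ham', ['thank', 'you']);
$state->addDocument('ham', ['buy', 'thank']);

echo $state->prior('ham'), "\n";                  // 2 of 3 documents are ham
echo $state->wordProbability('buy', 'spam'), "\n"; // (1 + 1) / (2 + 4)
```

Storing only counts keeps later updates cheap, since new labeled messages just increment counters and every probability reflects them the next time it is computed.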

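Classification sums log-probabilities instead of multiplying raw probabilities, because the product of many small probabilities quickly underflows to zero in floating-point arithmetic.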
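
The scoring step can be sketched as follows. `classifyTokens` and its parameters are hypothetical; the function assumes the per-class document counts, per-class word counts, and vocabulary size are already available, and it scores each class as the log prior plus the sum of log smoothed word probabilities.

```php
<?php
declare(strict_types=1);

// Return the label with the highest log-space score for the given tokens.
// $documentCounts: label => number of training documents
// $wordCounts:     label => [word => count]
function classifyTokens(array $tokens, array $documentCounts, array $wordCounts, int $vocabularySize): string
{
    $totalDocuments = array_sum($documentCounts);
    $bestLabel = '';
    $bestScore = -INF;

    foreach ($documentCounts as $label => $documentCount) {
        $totalWordsInClass = array_sum($wordCounts[$label] ?? []);

        // Start from the log prior, then add one log term per token.
        $score = log($documentCount / $totalDocuments);
        foreach ($tokens as $word) {
            $count = $wordCounts[$label][$word] ?? 0;
            $score += log(($count + 1) / ($totalWordsInClass + $vocabularySize));
        }

        if ($score > $bestScore) {
            $bestScore = $score;
            $bestLabel = (string) $label;
        }
    }

    return $bestLabel; // empty string if there are no classes at all
}

$documentCounts = ['spam' => 2, 'ham' => 2];
$wordCounts = [
    'spam' => ['free' => 2, 'money' => 1, 'offer' => 1],
    'ham'  => ['meeting' => 2, 'schedule' => 1, 'thanks' => 1],
];
$vocabularySize = 7;

echo classifyTokens(['free', 'offer'], $documentCounts, $wordCounts, $vocabularySize), "\n";       // spam
echo classifyTokens(['schedule', 'meeting'], $documentCounts, $wordCounts, $vocabularySize), "\n"; // ham
```

The underflow itself is easy to reproduce: `pow(1e-5, 1000)` evaluates to `0.0` in double precision, while the equivalent sum `1000 * log(1e-5)` is a perfectly ordinary number near -11513.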

Example Training

$classifier = new NaiveBayesClassifier();
// Each entry is a [label, message text] pair.
$aMessages = [
    ['spam','Buy cheap products'],
    ['spam','Limited time offer'],
    ['spam','Click here for free money'],
    ['ham','Hello, I would like to inquire about your services'],
    ['ham','Can we schedule a meeting?'],
    ['ham','Thank you for your assistance'],
];

$classifier->trainFromArray($aMessages);
$classifier->storeModel('classifier.ser');
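
How `storeModel()` and `loadModel()` work internally is not shown in this post. A minimal sketch of one possible approach, using PHP's built-in `serialize()` and `unserialize()` on a hypothetical stand-in class `TinyModel`, could look like this:

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the classifier; it only holds counts.
final class TinyModel
{
    public array $documentCounts = [];
    public array $wordCounts = [];

    // Write the serialized object to disk, locking the file while writing.
    public function storeModel(string $path): void
    {
        if (file_put_contents($path, serialize($this), LOCK_EX) === false) {
            throw new RuntimeException("Could not write model to $path");
        }
    }

    // Read the file back and rebuild the object.
    public static function loadModel(string $path): self
    {
        $data = file_get_contents($path);
        if ($data === false) {
            throw new RuntimeException("Could not read model from $path");
        }

        // Only allow this class to be instantiated during unserialization.
        $model = unserialize($data, ['allowed_classes' => [self::class]]);
        if (!$model instanceof self) {
            throw new RuntimeException("File $path does not contain a TinyModel");
        }
        return $model;
    }
}

$path = tempnam(sys_get_temp_dir(), 'nb');
$model = new TinyModel();
$model->documentCounts = ['spam' => 3, 'ham' => 3];
$model->storeModel($path);

$restored = TinyModel::loadModel($path);
var_dump($restored->documentCounts === $model->documentCounts); // bool(true)
unlink($path);
```

Restricting `allowed_classes` is a precaution worth keeping, because unserializing a file that someone else can modify may otherwise instantiate arbitrary classes.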

Classifying New Messages

$classifier = NaiveBayesClassifier::loadModel('classifier.ser');
$newMessage = 'Schedule a meeting now to get free products';
// $result holds the predicted label for the message (ham or spam).
$result = $classifier->classify($newMessage);

Reclassifying and Updating

If a message is misclassified, corrected data can be added and the classifier retrained:

// Corrected [label, message text] pairs for messages that were misclassified.
$aMessages = [
    ['ham','Thank you for your assistance'],
];

$classifier->trainFromArray($aMessages);
$classifier->storeModel('classifier.ser');

Repeating this step as corrections arrive lets the model improve incrementally from real-world feedback.
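
In practice the corrections have to be gathered somewhere first. The sketch below assumes a hypothetical review log in which each entry records the classifier's prediction next to the reviewer's label; the `predicted`, `actual`, and `text` keys and the `collectCorrections` helper are invented for illustration. It keeps only the disagreements and returns them in the `[label, text]` format used for training above.

```php
<?php
declare(strict_types=1);

// Keep only the messages where the reviewer disagreed with the classifier,
// and return them as [correct label, text] pairs.
function collectCorrections(array $reviewed): array
{
    $corrections = [];
    foreach ($reviewed as $item) {
        if ($item['predicted'] !== $item['actual']) {
            $corrections[] = [$item['actual'], $item['text']];
        }
    }
    return $corrections;
}

$reviewed = [
    ['predicted' => 'spam', 'actual' => 'ham',  'text' => 'Can we schedule a meeting about the offer?'],
    ['predicted' => 'ham',  'actual' => 'ham',  'text' => 'Thank you for your assistance'],
    ['predicted' => 'ham',  'actual' => 'spam', 'text' => 'Free money, click here'],
];

$aMessages = collectCorrections($reviewed);
print_r($aMessages); // two pairs: the first and third messages
```

The resulting `$aMessages` array can then be handed to `trainFromArray()` and the model stored again, as in the snippet above.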