Spam Be Gone: Technical

Okay, I've got a few minutes. Teach me something funky.

Spam Be Gone! was developed as an example application for one of the most technically sound machine learning systems. The system that has been developed is known as K* (short for Kolmogorov *) and is classified as an instance-based, complexity theoretic, machine learning system. Initial versions of K* have been included in a state-of-the-art machine learning toolkit.

K* strengths include:

The ability to predict both enumerated and numeric classes

Strong theoretical grounding in complexity theory

It is wickedly good

K* seems to be good at almost everything!

The Spam Be Gone! seemed like a good application for the K* algorithm. If you have downloaded and installed Spam Be Gone! you would have noticed how easy it was to use, and how well it performed. As you teach it, it learns.

About K*

K* is an instance-based classifier, that is the class of a test instance is based upon the class of those training instances similar to it, as determined by some similarity function. The underlying assumption of instance-based classifiers such as K*, IB1, PEBLS, etc, is that similar instances will have similar classes. The following description is an overview, consult the definitive K* paper for more details. This document is in .PDF format only. But the K* thesis will be available soon :-)

There are two important parts to any instance-based classifier: the similarity function that determines how alike two instances are, and the classification function that determines how to use the class of similar instances in predicting the class of the test instance.

Similarity

The K* classifier assumes that if two instances are similar then there is a high probability of one instance randomly "transforming" into the other by some sequence of small mutations. A transformation between instances may be envisaged as a series of smaller, basic transformations. For example a numeric value may be mutated by adding one or subtracting one. Assigning probabilities to the basic transformations allows us to calculate the probability of the overall sequence. However there are many possible sequences to transform instance A into instance B. K* incorporates all possible transformation paths into its similarity function and takes the probability of A transforming into B in an unspecified manner (i.e the sum of the probabilities of all possible transformation programs). The beauty of this approach is that different attribute types (such as integer, symbolic) can be modelled within the same framework. Adding support for a new attribute type consists of defining an appropriate set of basic transformations and their probabilities.

Classification

For a simple instance-based classifier, such as a k-nearest-neighbour algorithm, the k closest neighbours to the test instance are found and the predicted class is determined by majority vote. If 4 of the 5 nearest neighbours have class B, then class B is predicted as the class of the test instance.

K* can be thought of as a k-nearest-neighbour classifier with votes weighted according to similarity. The probability that a test instance is of class C is the probability of the instance randomly transforming into one of the training instances of class C (i.e the sum of the probabilities of transforming into each training instance of that class). Because similar neighbours have a higher transformation probability than dissimilar neighbours, these instances have a higher weighting in the final classification. The predicted class is the one in which the test instance is most likely to randomly transform into.

Blend parameter

The probabilities assigned to the basic transformations determine what is considered a small or large difference. This may be controlled through a blend parameter. The blend parameter sets the size of a "sphere of influence" around the test instance. A blend setting of 100% sets transformation probabilities so that all differences are considered small, and thus all of the training instances appear to be similar to the test instance. A blend setting of 5% will result in approximately 5% of the training instances being considered "close".

Advanced Symbolic metric

K* optionally includes an advanced symbolic similarity function that incorporates ideas from Stanfill & Waltz's Value Difference Metric.