In the aftermath of the Censorware Affair, SC promised to write about classification algorithms and the various ways one might go about banning a website. Since the situation seems to have been amicably resolved, I want to make it clear at the outset that everything that follows is *not* about Websense, not the company, not the technology -- the classification methods of which remain undisclosed. I've delayed starting this series so that it will also be quite clear that it's not intended as a revanchist screed.

We'll start off today with something about one of the simplest techniques for classifying data, "naive" Bayesian classifiers. These are interesting to talk about not only because of their applications and the neat math behind them, but also because there's a whole philosophy of how to reason about the world that this just barely scratches the surface of.

First, a little terminology. A classifier is a piece of software that takes in a piece of data and gives back a label. People sometimes call this "partitioning", on the grounds that, as in formal logic, a classifier splits the data into sets, where membership in one entails non-membership in all of the others. That kind of exclusivity is called a "partition". However, there are cases where you don't want to assign just one label to each piece of data, so we'll stick with "classifying".

Now, what makes a classifier Bayesian is the application of a statistical rule called Bayes' theorem. There are a lot of formulations of Bayes' theorem for different sciences -- SC had a fascinating conversation with a medical-student friend not too long ago, on the subject of "evidence-based medicine", which sounds like what you hope they were always doing, but which refers specifically to a Bayesian approach to diagnostics. For our purposes, we'll cast Bayes' theorem in terms of events (the data you actually come across) and models (the statistics used to classify events). We also need to add a notion of "conditional probability", which we'll write as P(x|y). Read that as: "the probability of x given y". In these terms, Bayes' rule is:

P(e|m) * P(m) = P(m|e) * P(e)

This reads as "the probability of an event e given a model m, multiplied by the probability of the model, is equal to the probability of the model given the event, times the probability of the event". Of course, this leads us to a problem almost immediately -- assuming that we've got some way of deriving a statistical model for some process, how do we calculate the probability that it's correct? And how do we calculate the probability of the model given the event?

The good news is that we get to ignore one of these terms. For a given event, the P(e) term is the same no matter which model we're comparing, and so we get to disregard it as long as we stick to classifying the same event. Rearranging Bayes' rule then tells us that P(m|e) is proportional to P(e|m) * P(m), and if we further assume that all the models are equally likely to start with, the P(m) term drops out too. The tag we pick as the classification is just the one whose model gives the highest probability to the event -- that is, the highest P(e|m).
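In code, that decision rule is just a comparison across models. Here's a minimal sketch; the probability tables are invented numbers purely for illustration, not from any real classifier:

```python
# P(event | model) for two candidate models. The numbers are made up
# solely to illustrate the decision rule.
likelihoods = {
    "linguistics": {"morpheme": 0.6, "benzene": 0.1, "bank": 0.3},
    "chemistry":   {"morpheme": 0.1, "benzene": 0.7, "bank": 0.2},
}

def classify(event):
    # Pick the model m with the highest P(event | m).
    return max(likelihoods, key=lambda m: likelihoods[m][event])

classify("benzene")   # -> "chemistry"
classify("morpheme")  # -> "linguistics"
```

Note that "bank" comes out as "linguistics" here (0.3 vs. 0.2) -- with only two models and one event, the winner is simply whichever column is larger.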

What makes a classifier using this rule "naive" is a simplifying assumption. Everybody knows that the order of the words in the previous sentence is nonrandom, and that each word is chosen based on what came before. In the naive Bayesian approach, we simply give up on the hard task of modeling that nonrandomness, and assume that the documents being classified are "bags of words", in which order is irrelevant. This hurts classification performance a lot less than you would think.
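Under the bag-of-words assumption, the probability of a whole document given a model is just the product of the per-word probabilities, and in practice one sums logarithms instead of multiplying, so that long documents don't underflow to zero. A sketch, with made-up word probabilities:

```python
import math

# Invented per-word probabilities P(word | model) for a single model.
word_probs = {"the": 0.05, "bank": 0.01, "river": 0.002}

def log_likelihood(words, probs):
    # Bag of words: order is ignored, so P(doc | m) is the product of
    # P(w | m) over the words -- computed as a sum of logs for stability.
    return sum(math.log(probs[w]) for w in words)

doc = ["the", "bank", "the", "river"]
ll = log_likelihood(doc, word_probs)
# Any reordering of doc -- "the river the bank", say -- gets the same score.
```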

The naive Bayesian scheme works best when you're just trying to classify things in an either-or scenario. For example, that this blog is about linguistics, as opposed to chemistry. Or that the word "bank" is being used to mean "side of a river", instead of "financial institution". The more categories you have available, the worse the performance of a Bayesian classifier gets, so it's advisable to train a few of them on the pairs of categories you're most interested in deciding between.

Training these things is also pretty easy. Get a couple thousand examples, label them with whatever categories you want, and then count up the results. There are more sophisticated ways of revising your estimates of the probabilities, but we'll leave them alone for now.
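Training really is just counting: estimate P(w|m) as the number of times w appears in documents labeled m, divided by the total word count for that label (with a standard add-one fudge so that unseen words don't zero out the whole product). A toy sketch, with invented training data:

```python
from collections import Counter

def train(examples):
    # examples: list of (label, list_of_words) pairs.
    # Returns per-label word counts and per-label total word counts.
    counts, totals = {}, Counter()
    for label, words in examples:
        counts.setdefault(label, Counter()).update(words)
        totals[label] += len(words)
    return counts, totals

def prob(word, label, counts, totals, vocab_size):
    # Add-one (Laplace) smoothing: never assign zero probability.
    return (counts[label][word] + 1) / (totals[label] + vocab_size)

examples = [
    ("linguistics", ["word", "syntax", "word", "phoneme"]),
    ("chemistry",   ["acid", "base", "acid", "word"]),
]
counts, totals = train(examples)
vocab = {w for _, ws in examples for w in ws}
prob("syntax", "linguistics", counts, totals, len(vocab))  # (1+1)/(4+5) = 2/9
```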

Although SC is sticking to the promise not to target the makers of any Web filtering software, we're still going to wrap this up by talking about the failure modes that might lead this kind of classifier to ban Language Log. One imagines that in creating the category "sex", somebody training a Bayesian classifier would give it hundreds of examples of web pages containing various dirty words. They would also give it plenty of other examples of pages not in the sex category, which would contain those words much less often. So words like "sex" and "porn" might become very good predictors of a page being likely to fall into the "sex" category...and a couple of mentions of them might therefore lead a page to be banned. The moral of the story? If you don't want your site being classified badly, make sure to change topics early and often. Which, fortunately, is excellent advice for writing a blog anyways.
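To make that failure mode concrete, here's a toy run of the whole pipeline: train on invented word counts in which dirty words cluster in the "sex" category, then score a hypothetical linguistics page that merely mentions the word "sex" twice. Every word list below is fabricated for illustration; no real filtering product works from data this small.

```python
import math
from collections import Counter

# Invented training data: dirty words dominate the "sex" category.
training = {
    "sex":   ["sex"] * 6 + ["porn"] * 4,
    "other": ["language", "verb", "noun", "grammar",
              "syntax", "word", "bank", "river"],
}
counts = {label: Counter(words) for label, words in training.items()}
totals = {label: len(words) for label, words in training.items()}
vocab = {w for words in training.values() for w in words}

def score(doc, label):
    # Smoothed log-likelihood of the document under the label's model.
    return sum(math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
               for w in doc)

# A linguistics post that happens to discuss the word "sex" twice:
page = ["noun", "verb", "sex", "grammar", "sex"]
best = max(counts, key=lambda label: score(page, label))  # -> "sex"
```

Two occurrences of one strong predictor outweigh three innocuous words, and the page gets filed under "sex" -- exactly the mechanism that could get a linguistics blog banned.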

"Which, fortunately, is excellent advice for writing a blog anyways."

Had I written the sentence quoted above, I would have used "anyway" as the last word. Do you have any preference for the one or the other?

Posted by: Henry IX | February 20, 2004 at 10:58 AM

I think I've used both "anyway" and "anyways" in different cases. I'm not sure if my alternations between the two are predictable, but they both sound fine to me.

Posted by: Semantic Compositions | February 20, 2004 at 01:48 PM