As SC has alluded to before, he works for a commercial firm, doing artificial intelligence research (though it's far from the company's main line of business). Like many large companies, his employer uses an automated web content filter to prevent employees from reading inappropriate material.
Today, SC was shocked to discover that Language Log is now blocked as a page in the category "sex". Was it because Mark Liberman and Geoff Pullum have been setting up the Linguistic Beefcake Pinup Calendar? That list of contributors sure looks like it could be describing who their models are...
Nope. It's because of the deficiencies in how automated content filters work. The particular filtering software used at Employer of SC is made by a company called Websense. Websense goes to great lengths to avoid saying how their software actually works, but based on their obsession with categorization of pages, SC will hazard a guess.
The software is configured to block pages that fall into certain categories, which requires that somebody assign each page to a category. This can be done in one of two ways: 1) by hand, where someone has to inspect each page and make a decision, or 2) automatically, either with a classifier trained on sample data or, more ominously, by just matching a few key words thought up by the filter software designer.
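To make the "few key words" variant concrete, here is a minimal sketch in Python. It is purely illustrative: the keyword list, the threshold, and the function name are invented for this example, and nothing here is known to reflect how Websense or any other real filter works.

```python
import re

# Hypothetical keyword list, invented for illustration only.
BLOCKED_KEYWORDS = {"sex": {"sex", "beefcake", "pinup"}}


def categorize(text, keywords=BLOCKED_KEYWORDS, threshold=2):
    """Return {category: hit_count} for every category whose keyword
    hits in `text` reach `threshold` (an assumed cutoff)."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = {}
    for category, vocab in keywords.items():
        count = sum(1 for word in words if word in vocab)
        if count >= threshold:
            hits[category] = count
    return hits


# A joke about a pinup calendar is enough to trip the filter.
print(categorize("The Linguistic Beefcake Pinup Calendar"))
```

A filter like this has no notion of context, so a linguistics blog joking about a calendar looks the same to it as the real thing.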
Automated classification algorithms come in a variety of flavors, but in order to have categories with useful names, you have to define some examples to train them with. The easiest way to do this is to simply label a whole bunch of web pages with terms: "XXX", "jobs available at a more successful competitor", "sports", etc. Then, allow the algorithm to learn statistical correlations between the words in each document and the category. The danger of algorithms like this is that unless you use a really sophisticated one, a few words with strong associations can be enough to erroneously tag a document. And judging by the fact that Language Log is now a banned sex site, and the filter people won't say how they do it, SC can only conclude that the software is so naive, he could have written it as an undergrad. SC suggests that the Language Log people might want to get together with Brian Weatherson, and figure out how to better disguise their perverse cheesecake site.
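As a rough illustration of the kind of naive statistical approach described above, here is a tiny word-count classifier (naive Bayes with add-one smoothing) in Python. This is a sketch under stated assumptions, not anyone's actual product: the training pages and the function names are made up for the example, and SC has no evidence that the filter in question works this way.

```python
import math
import re
from collections import Counter, defaultdict

# Made-up training pages, invented for illustration only.
TRAINING = [
    ("hot sex pics xxx", "sex"),
    ("sex videos xxx adult", "sex"),
    ("football scores league match", "sports"),
    ("basketball game scores playoffs", "sports"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Pick the category with the highest log-probability score."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing so unseen (word, category) pairs are not zero.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
page = "a linguistics post on how the word sex differs from gender in grammar"
print(classify(page, *model))
```

With this toy data, the only word on the linguistics page that the classifier has ever seen is "sex", so that single strongly associated word decides the category. A more careful system would need far more training data, and some way of weighing the rest of the page, to avoid this.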
Inspired by this, SC is going to spend some time over the next week talking about various classification algorithms, and how they might have led his employer to ban Language Log.
Have you ever been to Peacefire? They talk about a lot of the weirdness of censorware. (Having gone to a public high school, I have a few arbitrary banning stories of my own... but I'm sure everyone who's interacted with censorware has. My mom's a second grade teacher, and the censorware at her school bans *my* site - and the worst things I've ever said there are "darn" and "heck"!)
Posted by: Rachel | February 10, 2004 at 07:10 PM
Sorry, I inadvertently left out a word - that should read "everyone who's interacted with censorware has *some*."
Posted by: Rachel | February 10, 2004 at 08:56 PM
I'm reading your blog right now on a Websense-infested computer at work. Ask your sysadmin to make your company's Websense settings a little bit less paranoid.
Posted by: speedwell | February 11, 2004 at 12:26 PM
Unfortunately, I work for a company which defines paranoia. Web-based e-mail is blocked, on the theory that it will bring in viruses, but the M$ Exchange system has gone down hard as a result of every virus that you've seen make it to the news in the last 2 years. Just to give an idea of how bad it is, the IT staff refused to do anything when this site was classified as "inappropriate". I work at a company with 40,000+ people, and the policies are set way above anybody I can talk to.
Posted by: Semantic Compositions | February 11, 2004 at 08:29 PM