May 23, 2007

How much is a typo worth?

A friend directed SC to this article in Business 2.0 about a man named Kevin Ham, who built a fortune on trading in Internet domain names. Mr. Ham engages in many of the usual practices employed by people who buy domain names by the ton -- ads go up on any page he's got parked until he receives an offer to buy the domain name -- but this is a gentleman who is far smarter than the typical search engine spammer. Indeed, the article opens with a retelling of an auction he attended, where the reporter put himself inside Ham's head to ask the question, "If it's a typo, is it a mistake a lot of people would make?". The fact that Kevin Ham (and his partner Robert Seeman, whose name is on the patent application to be discussed) came up with an answer to that question that nobody else did makes him a very smart man indeed.

The traditional way of making money on other people's typos has been to register domain names which are a letter or two off from sites that people want to visit. The idea is that you'll grab all the "direct navigators" who just idly type in a name and hope something sensible pops up (sort of like putting your car in drive and hoping there's a road ahead of you, and that it goes in the same direction you had in mind -- which is not to say that there's anything wrong with typing an address if you actually know it). For example, newyorktime.com takes you to a site which has nothing to do with the New York Times except that the owner has thoughtfully placed a subscription link for the newspaper near the top of the page (and the rest of it is pretty much a conventional ad farm). "Typo-squatting", as the practice is known, has attracted the hostility of legitimate businesses for both diluting their trademarks and damaging their reputations by hosting malware (interesting how Google is on both sides of this one). The latter article mentions some interesting research done by Microsoft to systematically trace typo-squatting, and you can read the technical report from Microsoft Research here.

What Kevin Ham and his partners have done is not to focus on typos in the names themselves, but to focus on the domain codes that end every Internet address. The trick isn't to buy up domain names, but to get the domain administrator to redirect traffic for unregistered names to a site you own. And there might not be a richer prize for this typo than the country code for Cameroon -- .cm. The story actually broke almost a year ago, but until the Business 2.0 article, nobody knew who was behind it. As the story notes, this is a nifty legal gimmick that gets around the question of violating trademarks, because you never purchase any misspellings which are arguably the property of someone else (although it should be pointed out that the property rights aren't in the misspelling itself, but in the clearing of a certain edit distance around strings like your own). It's working so well, Ham is working on adding .co (Colombia) and .et (Ethiopia).
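The "edit distance" idea mentioned above can be made concrete with a short sketch. This is the standard Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions), not anything taken from the article or from the Microsoft report; the domain `example.com` is a placeholder.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# A classic typo-squat and a dropped "o" in the domain code are each
# a single edit away from the intended address.
print(edit_distance("newyorktime.com", "newyorktimes.com"))  # 1
print(edit_distance("example.cm", "example.com"))            # 1
```

Both kinds of typo sit at distance 1 from the real address; the difference is that the second never requires registering a name that resembles anyone's trademark.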

It's hard for SC to imagine being the sort of person who mistypes an address, gets a link-farm instead of what he was expecting, and says, "Oh, gee, this looks just as good. Let's click through random ads instead of trying again." Clearly, though, enough people must do this to make it worth Mr. Ham's while, which is not to say that this is the sort of business your host would want to build if he were running a company, especially if he had training in something like medicine. We can get at least one useful piece of information out of the article, though; we're told that he pulls in about $70 million a year, that he gets about 30 million unique visitors a month, and that he has about 300,000 domain names (we'll count .cm as a typo-domain for our purposes, since it doesn't matter what came before that). While the $70M figure includes unspecified "other ventures", we'll assume they're all typo-related, which at least gives us an upper bound on the value of each typo-derived hit. That $70M/year works out to $5.83M/month, and so with 30M visitors each month, that means he's looking at about 19 cents per unique visitor. If you spend a little time playing with the Google AdWord price estimator, you'll find that will buy you first-page Google search placement for an awful lot of words -- like, say, "wedding shoe" (actually, only 18 cents to appear in the 4th-6th positions for that phrase -- such a deal!). According to the article, Kevin Ham is paying about $7/year in overhead to maintain each of these sites. That suggests that the average Kevin Ham "typo tax" pays off after just 35-40 hits. It's not obvious to SC quite how you would fix this problem from the advertiser's perspective, since the initial error and subsequent judgments are made by the customer, but if you're paying 19 cents for every visitor who would have happily come to your site at no cost to you if only they could type straight, that's got to be worth money to someone.
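The back-of-the-envelope arithmetic above can be checked in a few lines. The three inputs are the figures quoted from the article; the variable names are just labels for this sketch, and the $7 overhead is treated as a per-domain annual cost, which is an assumption.

```python
# Figures quoted from the article (revenue is treated as an upper bound).
annual_revenue = 70_000_000       # dollars per year
monthly_visitors = 30_000_000     # unique visitors per month
annual_cost_per_domain = 7.00     # dollars per year, assumed to be per domain

monthly_revenue = annual_revenue / 12
revenue_per_visitor = monthly_revenue / monthly_visitors
break_even_hits = annual_cost_per_domain / revenue_per_visitor

print(f"${monthly_revenue / 1e6:.2f}M per month")           # $5.83M per month
print(f"{revenue_per_visitor * 100:.1f} cents per visitor")  # 19.4 cents per visitor
print(f"{break_even_hits:.0f} hits to cover one domain")     # 36 hits to cover one domain
```

The break-even of 36 hits falls inside the 35-40 range given above.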

September 27, 2005

Take a walk with an orc

SC readers might have thought that, based on your host's past in the video gaming industry, he would've had the good sense to walk away and just be happy as an occasional player. But what else gives you an opportunity to have so much fun while doing research?  Hence, for the just-concluded Fall 2005 Simulation Interoperability Workshop, a position paper coauthored by yours truly and two coworkers on the need for people in the simulation field to quit working on realistic physics and behavior models, and get back to what really matters -- big explosions! In all seriousness, it's actually about using techniques from the gaming industry to improve simulations as teaching tools. Go have a look.

BTW, yes, there are a couple of errata regarding dropped references to graphics. Don't bother e-mailing or commenting about them; they will all be corrected in the final conference proceedings.

January 19, 2005

The triumph of Lou Pearlmanism

Crap, crap. Crap, crap, craaap...Ahhh. Now here's good work. --The Joker, Batman

George Orwell imagined in 1984 that novels would be produced mechanically, subject to orders from Planning Committees, edited by Rewrite Committees, every new one as dull and designed to specifications as the ones before. The process would go like so:

'Oh, ghastly rubbish. They're boring, really. They only have six plots, but they swap them round a bit. Of course I was only on the kaleidoscopes. I was never in the Rewrite Squad. I'm not literary, dear -- not even enough for that.'

SC used to think this was satire, or at least an overly pessimistic view of the future. But then one thinks of Lou Pearlman, who made a fortune cranking out boy bands by formula ([technically, it was his second fortune -- ed.]), or the paint-by-numbers approach to cranking out books under the V.C. Andrews (who wrote at most 7 novels in her lifetime; "her" 40th comes out in March) or James Patterson (the subject of an admiring profile in the Harvard Business Review) names, and it's hard not to think that music and publishing executives see Orwell's vision as a beautiful dream.

So your host was very interested when Mrs. SC relayed news that she heard on the radio yesterday, of a company specializing in computer-based analysis of music to determine its "hit" potential. That company, Polyphonic Human Media Interface, promises to ensure that our future will be an unbroken chain of Britneys and Backstreet Boys for eternity.

Actually, the PHMI folks would take considerable umbrage at that statement. Here's their view of what they're up to, extracted from various parts of their FAQ:

"[E]very new style of music that has come into being: country, rock, punk, grunge etc. have all had similar mathematical patterns and the hits in those genres have all come from the same hit clusters that exist today...artistic integrity and creativity are the lifeblood of the music industry and are of paramount importance to our business...Our computers have not invented anything, rather they've only detected patterns and parameters that already existed...Hopefully by identifying them [the patterns] musicians can become better composers and more insightful."

But all this hemming and hawing about the importance of the human element is pretty quickly laid to rest by examining their explanation of the technology. While PHMI is fairly cagey about the exact clustering algorithms used, they acknowledge attempting to categorize song features like "melody, harmony, tempo, pitch, octave, beat, rhythm, fullness of sound, noise, brilliance, and chord progression" (SC doesn't claim to get all of this; what exactly could they mean by "brilliance"?). The associated graphs indicate that their analyses attempt to find how similar new songs are to previous commercial successes. Aside from a point regarding the inability of their system to account for lyrics, it is very difficult to see how this analysis is anything but an attempt to prove that new music is more of the same. If two songs have similar melodies, harmonies, tempos, etc., then it stands to reason that they'll sound alike.

Looking at a sample report provided by PHMI gives some room for optimism, albeit of a very limited sort. Assuming the data is not entirely fictional (although your host couldn't find any song called "Wild Party" by an artist named "Hal"), their software claims that a tested song is very similar to both a raucous rap hit by 50 Cent, and a rather more laid back R&B hit by R. Kelly. Admittedly, most of what sounds similar about them are the intonations of Messrs. Cent and Kelly; the background music sounds very different to SC's ears. But then, he had to hear the Black-Eyed Peas' Where Is the Love? dozens of times before coming to the stunning realization that it was in fact the old Nestle Crunch jingle (all clips courtesy of Tower Records' website). So the common features don't necessarily have to beat you over the head. But there's also no denying that their list of hits all fall very strongly into the hip-hop/R&B genre and are very much examples of conventional genre thinking.

The PHMI folks are of course attuned to the concern that their work will be used to crank out "soulless digi-hits". But they also promise to cut back on the number of songs on CDs that consumers won't like. And they're sure that they're onto "a highly accurate and scientific tool", which means nothing if not repeatable. SC sees no reason not to take them at their word on this point, nor on their claim that major labels are already trusting them to help make decisions. Anything that cuts down on risk is music to an executive's ears. The question is whether it will be to ours.

December 24, 2004

What do you want?

In the previous post, your host happened to mention that he's been thinking about most-frequently-visited sites and what this has to do with advertising. The issue was prompted a few days ago while browsing Amazon, and discovering a feature called "Improve Your Recommendations". The link only exists for users with accounts, when they're signed in; if you've got an Amazon account, it can be found under the "Your Store" tab at the top of the page.

Now, it has to be said that SC has rarely found Amazon's recommendations to be useful, largely because they appear to be based on the notion that people are basically closed-minded, and want to stuff their brains with nothing but more of the same. They might even be right about that, but a few experiments will suffice to demonstrate the shallowness of whatever algorithm they're using.

First, though, it might be instructive to review what Amazon thinks your host is interested in based on previous purchasing habits. Their records go back to 2000, at least in SC's case. This is an exhaustive list of everything SC has bought from them, BTW; many other things that would seem like obvious candidates (e.g., the Star Wars trilogy) were purchased in real-world stores:

You're welcome, Mr. Bezos.

In all seriousness, however, this is just one of four things which Amazon attempts to get you to help them out with; they also keep track of "items on your wish list", "items you've rated", and "items you've marked 'not interested'". That last item is all they've got by way of negative feedback, and it's not much, since SC hasn't marked any items that way.

Given this, and with no recent searches, Amazon assumes that SC must be obsessed with: Fox TV shows, comedies, science fiction, and horror. Considering that your host bought Arthur Machen's little horror novel in 2001, when he was going through an H.P. Lovecraft-related binge, this is an awfully thin reed to hang recommendations on, but the data is sparse. What did Amazon come up with?

The second, third and fourth seasons of The Simpsons (good guess), the first season of 24 (bad guess), the first season of King of the Hill (very bad guess), the first three seasons of Seinfeld (extraordinarily bad guess), a collection of the first three Harry Potter movies (good thing their programmer's life didn't depend on this), a new collection of Machen's short stories, the "Ultimate Matrix collection" (safe guess), and Spider-Man 2 (reasonably likely, but only when a larger boxed set comes out, as SC doesn't like buying the same movie twice).

Since the data is relatively sparse, it's easy to make Amazon's user-profiling software get way off-target with relatively little effort. Click on the Harry Potter link, search for George Lakoff, and voila! New links on the front page, recommending:

Moral Politics by George Lakoff
Metaphors We Live By by George Lakoff, Mark Johnson
How the Democrats and Progressives Can Win DVD
Women, Fire, and Dangerous Things by George Lakoff

The Page You Made
An H. P. Lovecraft Encyclopedia by S. T. Joshi, David E. Schultz
The White People and Other Stories by Arthur MacHen, S. T. Joshi
Harry Potter and the Prisoner of Azkaban (Widescreen Edition) DVD ~ Daniel Radcliffe
H.P. Lovecraft's Magazine of Horror #1 by Marvin Kaye, et al
Moral Politics by George Lakoff

Now, "The Page You Made" is designed to be very short-term, based on recent searches, and so it's not the best way to examine what's going on. Fortunately, it's not hard to tweak the underlying recommendation model. Simply removing Dilbert apparently makes SC more political, changing the main recommendation list to:

24 - Season One DVD ~ Kiefer Sutherland
King of the Hill - The Complete First Season DVD ~ Mike Judge
Moral Politics by George Lakoff
What's the Matter with Kansas? How Conservatives Won the Heart of America by Thomas Frank

Also listed is Futurama's Season 2 (which gets a much more lavish recommendation that SC doesn't feel like pasting in). Dumping all Fox shows (but reinserting Dilbert), changes the list like so:

Foundations of Statistical Natural Language Processing by Christopher D. Manning, Hinrich Schtze
Futurama, Vol. 3 DVD ~ Matt Groening
Moral Politics by George Lakoff
What's the Matter with Kansas? How Conservatives Won the Heart of America by Thomas Frank

Ah, suddenly I'm a linguist again, albeit apparently quite a political one. Let's take everyone but George out of the picture:

How the Democrats and Progressives Can Win DVD
Moral Politics by George Lakoff
Metaphors We Live By by George Lakoff, Mark Johnson
MoveOn's 50 Ways to Love Your Country by Moveon

With nothing otherwise linguistic to go on, the profiling system isn't sure if SC's interest in Lakoff is because of his other work as an author, or because of his political affiliations. So it hedges. But the system definitely has a bias for authors over subjects. How does SC know? By putting Jurafsky and Martin's book back into the mix:

Foundations of Statistical Natural Language Processing by Christopher D. Manning, Hinrich Schtze
How the Democrats and Progressives Can Win DVD
Moral Politics by George Lakoff
Metaphors We Live By by George Lakoff, Mark Johnson

For the record, the typo in Prof. Schutze's name belongs to Amazon ([but the one about the umlaut belongs to you -- ed.]). That said, with one NLP book and one of Lakoff's political books to go by, the system recommends the above plus Thomas Frank's book (noted in several of the above lists). Clearly, the bias is towards works of Lakoff (the DVD is him, too) and towards an assumption that the user is more interested in reading things sharing his opinions than in particular subjects (seeing as Frank's book gets play over another linguistics volume, despite the all-linguistics nature of the input).

Of course, the system could be further tweaked by giving it actual ratings of the items SC has purchased. To this point, it's all based on an assumption that your host likes/dislikes everything he's bought equally. This isn't true, but the possible permutations are exponentially larger with each item examined, seeing as there are 6 choices (1-5 stars, or no rating) for each one. One simple experiment illustrates that Amazon expects you to want to hear more of the same over any other consideration. Here's what happens when SC assigns one star to his Lakoff purchase (don't worry, folks, this doesn't affect its overall rating), and makes no other changes:

The Savage Nation by Michael Savage
Slander by Ann Coulter
Let Freedom Ring by Sean Hannity
Shut Up and Sing by Laura Ingraham

SC will guess that not too many of these are on the average Lakoff fan's bookshelf; none of them are actually on his own, either. Life's too short to read this many polemics. Give Lakoff a 5-star rating, though, and the politics are back -- but oddly, not as strongly:

Futurama, Vol. 3 DVD ~ Matt Groening
Futurama, Vol. 4 DVD ~ Matt Groening
Moral Politics by George Lakoff
What's the Matter with Kansas? How Conservatives Won the Heart of America by Thomas Frank

SC has no special explanation to offer as to why there are still DVD suggestions with a 5-star Lakoff review while there are none with a 1-star review. Perhaps Lakoff consulted on later episodes of Futurama? (I'm kidding.)

We've talked a bit before about this application, known as "collaborative filtering". The general problem with this kind of recommendation system is that when you have huge numbers of "votes" (or rather, correlations between particular selections), the data in favor of pushing season 2 of a show that you bought season 1 of will tend to trump just about any other concern. Or maybe it'll be the fact that people who buy one book agreeing with them are likely to buy another book agreeing with them (as nicely demonstrated some months ago by Valdis Krebs). It may not actually be the case that any of the people buying multiple DVDs or books in a series is really so narrow-minded as to not be interested in anything else, but with enough data, all the random, more individual preferences inevitably get canceled out. Going back to the "ABC game" that started this post, though, we find it interesting because it tells us something about a person to know what sites they've visited. If we knew which sites they visited most frequently, we might be able to get even more insight into what they were interested in. But in the limiting case, if we continue to aggregate data -- say, getting all linguists or all computer scientists to contribute -- the exercise starts becoming less useful again, a snapshot of what everyone's into that doesn't really tell us anything we couldn't guess at the outset.
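To see why "season 2 follows season 1" tends to drown out everything else, here is a minimal sketch of item-to-item collaborative filtering by raw co-purchase counts. It is not Amazon's actual algorithm, which isn't public in this detail; the baskets and item names are invented, and the functions `build_cooccurrence` and `recommend` are hypothetical helpers for illustration only.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(baskets):
    """Count, for every item, how often each other item was bought with it."""
    co = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            co.setdefault(a, Counter())[b] += 1
    return co


def recommend(co, owned, k=3):
    """Rank items the user doesn't own by total co-purchases with owned items."""
    scores = Counter()
    for item in owned:
        for other, count in co.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]


# Invented purchase histories: most season 1 buyers also bought season 2,
# while the more individual tastes each show up only once.
baskets = [
    ["Show S1", "Show S2"],
    ["Show S1", "Show S2"],
    ["Show S1", "Show S2", "Linguistics text"],
    ["Show S1", "Cookbook"],
]
co = build_cooccurrence(baskets)
print(recommend(co, {"Show S1"}, k=1))  # ['Show S2']
```

With only four baskets the sequel already outvotes the one-off purchases three to one; with millions of baskets the gap only widens, which is the cancellation effect described above.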

November 13, 2004

I can see clearly now

During SC's time at USC, he was fortunate to take a class on Brain Theory from a legend, Michael Arbib. Prof. Arbib's research focuses on "mirror neurons", a part of the brain which helps us imitate actions that we see performed by other people. It happens that they may also play a role in explaining how human speech evolved, which we might discuss some other time.

Because the action of mirror neurons depends crucially on interaction with the human visual system, we spent a lot of time learning about approaches to simulating vision. One key insight of computational research on vision is the idea that we don't perform spatial computations on the entire glorious mosaic of visual information, but rather pick out particular surfaces and edges which provide opportunities for interaction, called affordances. While not all neuroscientists accept the existence of affordances, theorists who work with them argue that they provide a solid explanation of how we manage to sometimes do very mistaken things in our interactions with objects in the world. For example, a product liability expert suggests that affordances explain how a person might misuse a child's play table as a stepping stool, erroneously extracting the affordance for stepping from the fact that it was a low, flat object. Considerations of mass and rigidity, not to mention actual identity, sometimes fail to enter into the computations that precede a decision to act.

At least, that better be this woman's defense.

November 11, 2004

A new search engine

Prominent in the news today is a new search engine from Microsoft, which can be found here.

Naturally, my first impulse was to see if it would do better than Google on searching for my very most favorite of all possible topics. So I'm ambiguously pleased to report that the top hit for my name is this very page, which can't be said for Google; in fact, the new engine is far better than Google on that point (on Google, Languagehat's mention shows up on the first page, but there's not a single SC hit to be found). It's only ambiguously good because as I said before, I'm not really trying to proliferate hits for my name at the moment.

More interesting is the "near me" feature, which semi-correctly deduced that I was in Los Angeles -- for the millionth time, I'm not! -- and restricted itself to this page and a hit from the Information Sciences Institute in Marina del Rey, which is an eminently reasonable result. Google's local search can't do that. "Near Me" isn't perfect; trying my name in the New York City area still returns the ISI hit, but no Semantic Compositions. Other, generally intermediate, cities return no hits for any of the people sharing my name, which is a good result to the extent that they're not out there, either. Before you ask, San Diego is one of the cities that returns no hits for me, demonstrating that this is not based too heavily on actual text content. The AP news story suggests that it's based on IP addresses, which makes sense but doesn't explain the New York result for me.

Despite the claim in the AP article to pull up answers to natural-language questions, I didn't see much evidence of that. "Who is Mark Liberman?" pulled up Language Log first, as does searching for his name alone. "Who is Bill Poser?" pulls up Prof. Poser's home page and UPenn Linguistics, as does just searching for his name. The natural language processing involved seems limited to throwing out "irrelevant" terms in favor of keywords. I also tried to push the engine to answer more encyclopedic questions, since the article suggests Encarta is used for answers, but "Where is Mount Fuji?" and "Who was the first president of the United States?" produced results no better than one would expect from raw keyword lookup -- and no Encarta articles, either. Maybe that's still coming, since they only advertise it as a beta release.

So the jury's out on whether Microsoft has come up with a better search engine than Google. Certainly, they have if you're looking for me, but they aren't obviously better on other topics. And the "Near Me" feature is interesting, if somewhat more gimmicky than genuinely useful. There are a lot of promises they've got to catch up on, but we'll give them the opportunity to correct that when the "official" release finally comes out. In the meantime, however, SC will be sticking with Google.

August 11, 2004

A painful lesson

For the last few weeks, your host has been engaged in a bit of defense work with rather little direct bearing on linguistics. Actually, none. But there's still something interesting to talk about.

Without saying anything about the customer beyond the fact that, like all of SC's clients, they came to a defense contractor to get their work done, the project is intended to deliver an environment for training people to use simulations. The simulations are wholly separated from the training task itself, but it's nice to have some way to bring these things together, and to be able to bounce back and forth between reading about something and actually doing it.

As usual when you have two tasks which are clearly best handled by separate pieces of software, there is an obvious need for standards to enable the software to communicate. The standard preferred by SC's customer is unsurprisingly a DoD product, called the Sharable Content Object Reference Model. It's generally referred to by its acronym, with usage like so: "I have nothing but SCORM and contempt for the people who put together such a poorly designed standard".

In all seriousness, though, working with the "learning management systems" (the type of software that usually implements SCORM) has been an eye-opening experience about how teaching can be done online. SCORM allows for lessons to be constructed as HTML pages -- preferably linked to each other through the SCORM interface rather than through links embedded within the pages -- and for that content to call other software to run once you've read about it.

It's easy to imagine all sorts of uses for this, especially in teaching things about linguistics or computer science that involve algorithms or processes. For example, anyone who has written a chart parser knows that there are three primary operations: scanning, completing, and predicting. This comes across rather awkwardly on printed pages, where multiple copies of graphs have to be shown, with each new addition to the graph taking up more and more space, so that it's not practical to show them operating with anything more than toy examples. Integrating a parser with graphical output into a SCORM-based environment would allow students to watch the whole thing in operation on real data. While you can get much the same effect now with presentation programs like PowerPoint, the demos still have to be 100% built in advance. Online tutorials built with SCORM could allow a lot more flexibility. It's not hard to imagine doing the same thing in teaching AI search strategies, the operation of circuits in the brain, or even (gasp!) the operations of various linguistic theories in moving phrases around trees or building attribute-value matrices.
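For readers who haven't written one, the three chart-parser operations named above are those of an Earley-style parser, and they fit in a short sketch. This is a bare recognizer (it answers yes or no rather than building trees), it assumes a grammar with no empty productions, and the toy grammar and lexicon are invented for illustration.

```python
from typing import NamedTuple


class State(NamedTuple):
    lhs: str      # left-hand side of the rule
    rhs: tuple    # right-hand side symbols
    dot: int      # how much of rhs has been recognized
    origin: int   # chart position where this rule started

    def next_symbol(self):
        return self.rhs[self.dot] if self.dot < len(self.rhs) else None


def earley_recognize(grammar, lexicon, start, words):
    """grammar: nonterminal -> list of RHS tuples; lexicon: word -> set of tags."""
    chart = [[] for _ in range(len(words) + 1)]

    def add(state, i):
        if state not in chart[i]:
            chart[i].append(state)

    add(State("GAMMA", (start,), 0, 0), 0)
    for i in range(len(words) + 1):
        for state in chart[i]:  # chart[i] may grow while we iterate over it
            sym = state.next_symbol()
            if sym is None:
                # Completing: advance every state that was waiting for this lhs.
                for waiting in chart[state.origin]:
                    if waiting.next_symbol() == state.lhs:
                        add(waiting._replace(dot=waiting.dot + 1), i)
            elif sym in grammar:
                # Predicting: expand the expected nonterminal at position i.
                for rhs in grammar[sym]:
                    add(State(sym, rhs, 0, i), i)
            elif i < len(words) and sym in lexicon.get(words[i], ()):
                # Scanning: the next word has the expected part of speech.
                add(State(sym, (words[i],), 1, i), i + 1)
    return State("GAMMA", (start,), 1, 0) in chart[len(words)]


grammar = {"S": [("NP", "VP")], "NP": [("Det", "N")], "VP": [("V", "NP")]}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

print(earley_recognize(grammar, lexicon, "S", "the dog saw the cat".split()))  # True
print(earley_recognize(grammar, lexicon, "S", "the dog saw".split()))          # False
```

Printing each `chart[i]` as it fills in is the text-mode version of the animated display imagined above, and it makes the space problem of the printed page obvious even for a five-word sentence.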

The major obstacle to making much use of SCORM is that, as SC has spent the past three weeks learning, much of the freely-available software for playing SCORM-compliant content is badly written and even more poorly documented ([how's your glass house construction coming, pal? -- ed.]). It doesn't help that the specification itself often demands what software engineers would consider rather suboptimal methods for doing things. But once you get one of these systems properly configured, building lessons is no harder than building any other web page, and tying in other software is much easier than it would be if you had to build all the integration code yourself.

June 04, 2004

Fake dialogues with fake people

SC prides himself on a certain amount of Internet savvy. More than a few books and recordings in the SC collection have been bought entirely through Internet-based transactions. The wedding trip? Booked online. Ditto for a cruise and trip to Jamaica last summer. Not all SC commerce is conducted online (he'll never buy a car that way), but enough of it is to recognize some patterns. This one is especially annoying:

Some companies doing business with SC feel obliged to create human faces for themselves, "signing" e-mails with names and job titles of people who are completely fictional.

For example, since 1999, SC has regularly taken online market surveys administered by this company. Although some very nominal compensation is involved, your host is mostly just interested in taking them to get insight into how other people think they can manipulate opinions. Truly, some of the polls are a garden of linguistic delights, trying (for example) to get the subject to endorse the idea that a cardboard box for beer bottles will not only perform a useful function, but will make the beer somehow taste better, and the carrier look more hip and with it ([a phrasing which shows that you're not -- ed.]).

In order to coax SC to take these polls, communications from the company are all wrapped in the guise of originating from a mildly attractive woman in her late 30s named "Lauren", a set of biographical/physical facts themselves likely to have been selected through focus group testing. In the 6 years that your host's been at these things, her picture has never changed once (or rather pictures; they show several different poses on different pages of the site). Even the lowest-budget online magazines that feature pictures of contributors update every year or two. Not having ever had any customer-service issues, SC can't say if e-mails requesting help are answered by "Lauren", but he strongly doubts that she's anything more than some clip art and a few templates for e-mail messages.

Another person SC is convinced doesn't exist goes by the name "Chris Monroe", purported to be an "online travel advisor" working for Travelocity. Every 2-3 weeks -- and substantially more often if SC's got an upcoming trip -- e-mail arrives with some offer or other "signed" in a vague, cursive-like squiggly line which is theoretically Mr. Monroe's signature. Oddly for a real human being, Mr. Monroe never answers replies to his e-mail, suggesting further that he occupies that ontological realm where Pegasus and the present King of France have been dwelling for quite some time now.

Assuming that neither of these people are anything more than friendly faces designed to make advertising look personal, the question of why this is necessary arises. SC will take a wild leap of faith and guess that the web browser the reader used to get to this page does not include such a chatty, fake-person user interface, and yet it was no obstacle to having a look around. Despite the fact that Macintosh computers smile at the user when turned on (at least in OS 8, still used by Mom SC), neither those machines (nor Windows) go to such great lengths to act as though they had human personalities, the laughable Microsoft Office Assistants excepted. For that matter, when Microsoft tried to bolt a friendly, human-like interface onto Windows, it was an unmitigated disaster.

SC has done some work on user interfaces, specifically in the context of designing a natural-language search engine ("building" would be too strong a description for the status of the project when it was snuffed out by management). While the amount of resources available to study user preferences for interacting with the system was limited, it doesn't take a genius -- or even your host -- to figure out that people like Google's interface, which is the opposite of chatty, human-centered, and anything else that describes dialogue. A coworker whom SC esteems quite highly argues that we don't treat computers anthropomorphically because they don't look the part, but that we'll want to deal that way with robots if they ever get at least to the C-3PO stage, or at least Number 5. Or even ASIMO. Which makes all of this work by the companies discussed above to pretend that you're dealing with humans quite silly in the here and now.

May 26, 2004

SC can't make this stuff up

Today, the first day on which SC could have received an e-mail from Site Meter regarding his statistics after writing about how they're always dumped in the spam folder, something funny happened.

For the first time since your host started using Site Meter...the e-mail showed up in his inbox.

SC can't help but suspect that this was actually the inevitable result of some universal law, but he hesitates to give it a name.

May 25, 2004

Some insight into Yahoo's spam filtering

Today, while checking the Official E-mail Address of Semantic Compositions, your host noticed something unusual. Each day, he gets an e-mail from the good folks at Site Meter, notifying him of the daily web statistics. Even though he subscribed to this service intentionally -- voluntarily, even -- Yahoo dutifully deposits the messages in SC's "bulk mail" folder, treating them as spam. For the first three months of Site Meter usage, your host was in the habit of flagging each message as "not spam", but has recently given up on trying to persuade Yahoo's filtering software otherwise.

This does not mean that SC doesn't visit his "bulk mail" folder, if only to see what the latest statistics are (and lately, they've been down -- it's like all the ".edu" readers went away with the end of the school year). Courtesy of a glitch in the Yahoo spam filter, this was the subject line of yesterday's statistics:

***SPAM*** Score/Req: 07.52/05.00 - xxxxxxx traffic report for Monday, May 24, 2004

Normally, only the part starting with "xxxxxxx" (edited out by SC) shows up in the subject line; therefore, we may assume that the filter has added the part about "***SPAM***" and the apparent score annotation, which presumably reads as a score of 7.52 against a required threshold of 5.00. Given this, we can determine that Yahoo's filter operates on some kind of formula where various features of the message are assigned points that count towards being spam.

So what might some of the features be that incur Yahoo's wrath? Well, the sending e-mail address is generic (reports AT sitemeter dot com), and so it probably sends mail to enough addresses that Yahoo considers it to be a spam originator. Strike one. Today's date in the subject line is likely to be a tactic also used by other spammers -- think mortgage companies pushing "today's rate is X% (plus thousands in hidden fees)". Strike two. And at the bottom of the message, we find the string "click here to unsubscribe", presumably also a frequent string in spam messages. SC feels especially confident on this last point, because his mail from the L.A. Music Center always contains this string, and also always gets dumped into the bulk mail folder. Strike three.

Your host is pretty sure that it has to be a rule-based system, largely designed like what's proposed above, because the text of the message otherwise defies anything we know about natural language processing. The body of Site Meter e-mails otherwise consists of two large tables full of numbers, describing your traffic over the past 7 days, as well as a few lines containing some aggregate statistics covering the history of the site in question. There are no creative misspellings, and no words falling into the usual spam categories (Viagra, mortgages, penny stocks).
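For readers who would rather see the guess than read it, here is a minimal Python sketch of a point-scoring filter of the kind proposed above. Everything in it is an assumption made for illustration: the three rules, their point values, the field names and the example.com address are invented, and nothing here is known about Yahoo's actual features or weights. Only the 5.00 threshold and the subject-line format come from the message quoted earlier.

```python
# Hypothetical rules and point values; the real filter's features and weights are unknown.
RULES = [
    ("bulk-looking sender", 3.0, lambda m: m["sender"].startswith("reports@")),
    ("date in subject line", 2.0, lambda m: "2004" in m["subject"]),
    ("unsubscribe boilerplate", 2.5,
     lambda m: "click here to unsubscribe" in m["body"].lower()),
]
THRESHOLD = 5.0  # the "Req" half of "Score/Req: 07.52/05.00"


def spam_score(message):
    """Add up the points of every rule that fires on the message."""
    return sum(points for _, points, test in RULES if test(message))


def tag_subject(message):
    """Prefix the subject with a score annotation once the threshold is reached."""
    score = spam_score(message)
    if score >= THRESHOLD:
        return (f"***SPAM*** Score/Req: {score:05.2f}/{THRESHOLD:05.2f}"
                f" - {message['subject']}")
    return message["subject"]


report = {
    "sender": "reports@example.com",  # placeholder address, not the real one
    "subject": "traffic report for Monday, May 24, 2004",
    "body": "Visits: 42\nClick here to unsubscribe",
}
print(tag_subject(report))
```

With these made-up weights, no single rule is enough to cross the threshold, but any two together are. That is one possible reading of why a hand-typed test message carrying two of the features could still land in the inbox if the real weights are smaller than guessed.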

Of course, those with accounts at frequently abused mail-service providers, like Yahoo, Hotmail, or MSN, might try to test these claims by sending themselves -- or SC -- messages containing today's date in the subject line and "click here to unsubscribe". While these features most likely count for something, we haven't said a word about how they're weighted, because one message gives us little to no insight into that question. SC will hazard a guess that the volume of mail coming from the sending address is actually the #1 flag for Yahoo. As a test, SC did in fact try this stunt, and it went straight to his normal inbox. Of course, it could be that it did so with a score of 4.99/5.00; there's just no way to be sure.

May 14, 2004

And you thought you had privacy concerns before

Today, SC wishes to direct his readers' attention to something they wouldn't be able to find out about otherwise.

This month's issue of Reason magazine is tailored to subscribers in a way that no other has ever been before. The headline on the cover says "SEMANTIC COMPOSITIONS -- They Know Where You Are!" ([technically, it features his real name -- ed.]). Above that is a satellite photo of Chez SC. On the back cover is an ad for the Institute for Justice, threatening the tearing down of said house under eminent domain laws.

As is explained by the editor's note inside, this is what's already doable as a result of us becoming "Database Nation". The note includes customized demographics about SC's neighbors, like the percentage of them with college degrees.

Newsstand buyers see a very generic cover this month. So you won't see it at the store. But you might want to know.

May 11, 2004

Get a new "mor tgage"

A little while ago, Mark Liberman linked to an interesting exercise in computing all of the possible ways to spell "Viagra" without triggering spam filters (or maybe violating Pfizer's trademark).

Your host was reminded of it this morning when an interesting bit of spam showed up in his office e-mail. Now, it has to be said that for an account which is 5 years old, the address in question has been remarkably spam-free. Lately, this has started to change. Below, the text of the latest message:

This is a courtesy offer for our team of fina ncial experts to lower your Mor tgage rate and sa ve

you thousands.

Our consul tants are at your disposal to assist you in reaching optimal savin gs & your goals.

We guar antee the low est rate s in the country.

You will be contacted by a fi nancial specialist promptly. Your satisfaction is our primary goal.

Our specialists will do everything they can to help you sa ve money starting today.

*We are a member of the BBB. All information is confidential.

**Ra tes as low as 3 . 05 %.

The use of spaces within words is too frequent to be merely accidental. So even though SC has no idea how most people configure their spam-filtering software, it's pretty easy to guess at the sender's strategy for avoiding being blocked by keywords.

If the words "financial", "mortgage", "save", "consultants" and "rates" were blocked from SC's e-mail, though, then plenty of other messages wouldn't get through. So he's not sure whether or not this really is a good strategy for avoiding keyword-based filters.
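A small Python sketch of both halves of this worry, under stated assumptions: the blocked-word list is hypothetical, and the "squeeze out the spaces" countermeasure is only one guess at how a filter writer might respond. The first function shows the spaced-out words slipping past an exact keyword match; the second catches them, but also flags perfectly innocent text.

```python
import re

BLOCKED = {"mortgage", "financial", "rates"}  # hypothetical keyword list


def naive_hits(text):
    """Exact whole-word matching, which inserted spaces defeat."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w in BLOCKED}


def despaced_hits(text):
    """Drop everything but letters, then look for keywords as substrings."""
    squeezed = re.sub(r"[^a-z]", "", text.lower())
    return {k for k in BLOCKED if k in squeezed}


spam = "our team of fina ncial experts to lower your Mor tgage rate"
print(naive_hits(spam))      # nothing matches
print(despaced_hits(spam))   # "financial" and "mortgage" are found

# The countermeasure has a cost: "separate sentences" squeezes down to
# "separatesentences", which happens to contain "rates".
print(despaced_hits("separate sentences"))
```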

Oh wait, I forgot something:

puckish magnuson absentee excerpt byrd zucchini execute kissing madras confront iodide dirac apprentice angora accentuate muddy confectionery gunmen tantalus angel aghast drub hamper sketchbook goat phobic

A line from a William S. Burroughs novel? Nope -- a line of text included over 100 lines below the end of the text reprinted previously in this post. This brought a smile to SC's face, if only out of appreciation for a clever adversary.

It would be foolish to design an e-mail filter that simply looked for a couple of keywords and dumped anything that included them. Nope, to provide a safe guess, you've got to have some way of estimating the probability that a message is actually spam. The simplest way to do this is with a Bayesian classifier; if a high enough proportion of the words in the text relate to mortgages, or Viagra, the message will be flagged as spam. Including all this additional text, none of which is likely to correlate with the usual spam topics, decreases the percentage of potential filter-triggering terms. It's a good defense, even if the filter writers manage to defeat the insertion of spaces to break up the mortgage-related terms.
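The dilution effect is easy to see with a toy calculation in Python. This is a deliberately crude stand-in, a plain "fraction of suspicious words" score rather than a real Bayesian classifier (which would weigh each word by how often it shows up in spam versus legitimate mail), and the word list is hypothetical. The padding words are the first ten from the line quoted above.

```python
SPAMMY = {"mortgage", "rate", "rates", "save", "financial"}  # hypothetical list


def spammy_fraction(text):
    """Share of the words in the text that appear on the suspicious list."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in SPAMMY for w in words) / len(words)


pitch = "lower your mortgage rate and save"
padding = ("puckish magnuson absentee excerpt byrd zucchini "
           "execute kissing madras confront")

print(spammy_fraction(pitch))                  # 3 of 6 words
print(spammy_fraction(pitch + " " + padding))  # 3 of 16 words
```

Ten words of nonsense are enough to drag the score from one half to under a fifth; a hundred lines of it would push the pitch itself into the noise.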

Of course, the question that this doesn't answer is: who in their right mind would send down payment-sized amounts of money to someone using a fake e-mail address to send out messages full of spelling errors?

May 03, 2004

Unacceptable symbols

This morning, your host updated his Windows installation under orders from Corporate IT that all "critical service updates" currently available must be installed immediately. Among them was one with this ominous warning:

This item updates the Bookshelf Symbol 7 font included in some Microsoft products. The font has been found to contain unacceptable symbols. After you install this item, you may have to restart your computer.

At first, it struck SC as possible that some symbol, however improbably, was erroneously formatted, and the result allowed a buffer overflow somewhere, and maybe there was a resulting opportunity for a bit of malicious code...but this was so improbable that he dismissed it immediately.

Naturally, the next thing to do was to look for a pre-rendered version of the font on the web, to see how it looked and what could possibly be at issue. Here it is. Doesn't take long to figure out what's going on, but the symbol in question is in the third row from the bottom. And the second.

Now, the two versions of the swastika in the font are not identical; one is the Nazi version, and one is the Hindu version. There's more about that here. The Hindu one ought not be offensive to anyone, although SC admits that the 45-degree tilt that distinguishes them is perhaps not as salient to most people as the other features of the hakenkreuz.

At first glance, this seems to be a pernicious technique for censorship. By labeling it a "critical service update", and obfuscating the exact nature of the issue, Microsoft has moved to treat a matter of speech as being exactly like a software bug -- if people find it offensive, simply erase it ([since when have you thought that Microsoft is in the business of erasing bugs? -- ed.]).

On the other hand, the font is one included by Microsoft in their own installation package to begin with. Nothing in the software used to alter the font actually stops the user from finding some other font which contains the same symbols, and installing that as a replacement. If Microsoft doesn't want to propagate what they feel is unacceptable speech, so long as they don't censor anybody else's ability to do so, it's hard to quarrel with their behavior.

Ultimately, SC is more bothered by the way they went about it than the fact of what they did. I suppose they could argue that they need to take every available step to keep their fonts standardized, and if they remove a character from some installations, they need to remove it from as many as possible, in order to ensure that the font displays properly across all computers. However, once any piece of code is out there -- fonts or otherwise -- there's no way to guarantee that all users will really keep it patched to the same version. It's not exactly unheard of for people to avoid "updates" that they view as undesirable, either. Aside from these facts, there is considerable irony in the fact that the Nazis made a habit of burning "unacceptable" books publicly, for much the same reason that Microsoft wants to delete their memory. Given these facts, SC thinks it would have been a lot less heavy-handed to place this package in the non-critical update section, and to be more forthcoming about the specific symbols involved. It would accomplish the same goals, and carry much less of the whiff of totalitarianism emanating from their present strategy.

April 13, 2004

All Consuming

The SC referral logs turned up another unusual website this evening, All Consuming. I'll let them explain what they're up to:

All Consuming is a website that visits recently updated weblogs every hour, checking them for links to books on Amazon, Barnes & Noble, Book Sense, and other book sites. Every book on this site has a list of all the weblogs that have mentioned it, and every weblog that has mentioned books in the past also has a page here listing which books it has mentioned. If you have a weblog, search for it here to see if we've picked anything up from it yet.

It's an interesting idea, basically a blog version of a technique known as "collaborative filtering". Amazon runs perhaps the best-known collaborative filtering application -- you see it in action every time you see the lines "customers who bought this book also bought", "so you'd like to...", and "listmania!". The idea of collaborative filtering is simple -- people who have one thing in common are likely to have more, so if you're interested in "How To Eat Fried Worms" (the book SC linked to that attracted All Consuming's attention), you might also be interested in other children's books. Or if you're interested in "Children of Cthulhu", another recent SC mention, maybe Amazon can also sell you a copy of "At the Mountains of Madness" (although by the time you hit CoC, you've probably already read all of Lovecraft's original works).
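The "also bought" idea can be sketched in a few lines of Python by counting how often two titles turn up in the same basket. This is only the simplest co-occurrence version of collaborative filtering, not a description of what Amazon or All Consuming actually run, and the purchase histories (including the placeholder third title) are invented for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories, one set of titles per customer.
baskets = [
    {"Children of Cthulhu", "At the Mountains of Madness"},
    {"Children of Cthulhu", "At the Mountains of Madness",
     "How To Eat Fried Worms"},
    {"How To Eat Fried Worms", "Some Other Children's Book"},
]


def also_bought(baskets):
    """For each title, count the other titles that share a basket with it."""
    co = {}
    for basket in baskets:
        for a, b in permutations(basket, 2):
            co.setdefault(a, Counter())[b] += 1
    return co


def recommend(co, title, n=3):
    """Titles most often bought alongside the given one, most frequent first."""
    return [t for t, _ in co.get(title, Counter()).most_common(n)]


co = also_bought(baskets)
print(recommend(co, "Children of Cthulhu", n=1))
```

With only three baskets, one stray customer is enough to link Lovecraft to fried worms, which is the sparse-data effect in miniature.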

Collaborative filtering is really just an application of the same sort of relationship networks that have been used to defend -- successfully -- claims that people with serious linguistics interests don't have much overlap with readers of popular prescriptivism, and that people tend to buy political polemics that agree with each other. You're never going to find out about anything too surprising -- unless the data is really sparse, as here (the odds of the prototypical Jessye Norman listener being a Rod Stewart fan seem low to SC). By way of comparison, this obscure Orson Welles film, a favorite of SC's, is apparently only of interest to Welles enthusiasts -- the data may be just as sparse, but the buyers are more single-minded. More typical? Unrelieved sameness.

March 24, 2004

Search me

Mark Liberman has lately been displaying a certain Philadelphia-area idiom (or so a diet of Stallone movies leads your host to believe). Had it just happened once, it would have escaped SC's notice altogether, but hey, as Goldfinger said, "The first time is happenstance, the second coincidence, the third time it's enemy action." (For more such lines, see here.)

Anyways, the construction in question:

"Human Social Dynamics, yo."

"Construction grammar, yo. (And have you noticed that idiom creeping into general usage?)"

Since Prof. Liberman's usage had at least reached the level of coincidence, your host was curious about where else he might have used the line. Attempting to search Language Log with the string "yo" and the search engine provided on the site yielded hundreds of hits -- anywhere that "you" turned up, basically. Given that it's being used as an end-of-sentence tag, though, the Semantic Compositions research staff hit upon the brilliant idea of searching for "yo." instead. The results?

4 hits, but only the above two examples actually included the string "yo."; this one and this one are returned by the search engine as well, but don't appear to include any instances of the three-character sequence that was searched for. Google returned the older of the two valid examples (the newer one is too new to have been indexed as SC writes this), as well as two unrelated posts where "yo" (without the period) occurred.

As a check on the Language Log search function, just to be sure that it wasn't just stripping out punctuation, your host also tried searching for "commentators", with and without a period at the end. 6 hits without; one with. So it's not the case that the search engine is trying to be clever; it just seems to genuinely be malfunctioning. But what an interesting bug it is...

March 21, 2004

Teaching character (sets)

Today, Mark Liberman commented about the problems of getting less-frequently used characters to display properly in web browsers, because of unfortunate compromises in Unicode. Your host really likes this line:

Since there are no economically important languages that use vowels with underdots, the Unicode Consortium in its wisdom has determined that such characters must be handled in the virtuous fashion, by composition of character features, rather than in the convenient and workable fashion, using pre-composed characters such as those provided for the major European and East Asian languages.

For readers who have not spent a whole lot of time programming, "virtuous" is a nice way of saying "pain-in-the-keister".

This brought to mind one of the things that has long frustrated your host about the pedagogy of computer science. In theory, the C++ standard libraries include full support for the Unicode character set. In practice, what this means is that for characters recognized by your compiler/editor, you're OK, but otherwise, you're on your own. This latter fact is something which is not generally mentioned in introductory classes, and by the time you have to write anything more sophisticated, it's assumed you already know how to deal with it.

SC has written programs to deal with text in Japanese and French in the past. Programming is not, despite the patents, a strength of his, and he has frequently resorted to ugly hacks involving reading the characters byte-by-byte so that they will be properly read and displayed by the programs that he wrote. Not a good way to go about things. Unfortunately, despite dozens of hours of searching for better references or tools, the SC technical staff has largely failed to find a better way to handle non-English text. Sites like this provide plenty of applications to extend, but not the tools for doing basic development.
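For the curious, both problems can be seen in a few lines of Python using only the standard library (Python rather than C++ simply because it keeps the illustration short). The first half shows the precomposed-versus-composed distinction from the quote above: an "e" with an acute accent collapses to a single code point, while an "e" with an underdot and an acute accent, picked here as an example of an underdotted vowel carrying a tone mark, has no single precomposed form. The second half shows why reading byte-by-byte is fragile: one accented character is two bytes in UTF-8, and half of it is not valid text.

```python
import unicodedata

# "e" + combining acute accent: NFC finds the precomposed character U+00E9.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# "e" + combining dot below + combining acute: NFC can fold in the underdot
# (U+1EB9), but the acute accent has to stay a separate combining character.
e_dot_acute = unicodedata.normalize("NFC", "e\u0323\u0301")

print(len(e_acute))      # one code point
print(len(e_dot_acute))  # two code points

# Byte-by-byte reading: one visible character, two UTF-8 bytes.
raw = e_acute.encode("utf-8")
print(len(raw))

try:
    raw[:1].decode("utf-8")  # half a character is not decodable
except UnicodeDecodeError:
    print("a lone first byte is not a character")

print(raw.decode("utf-8") == e_acute)  # decoding the whole sequence works
```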

Admittedly, your host originally approached computational linguistics from the linguistics side, not the computational one. However, he has long been frustrated by the way that the necessary computer science skills are taught. CL students either are already computer scientists accustomed to learning new skills the CS way, or else they're very frustrated. So here's what your host would like to see:

Syllabus of a semester-long "Tools for Computational Linguistics" course -- assumes basic knowledge of Java, C++ and Perl syntax, but not significant application development experience:

Weeks 1, 2 and 3: Anatomy of a chart parser written in C++ or Java. Take a reasonably fast one, and go through the implementation of the data structures and important functions. Assignment: Turn it into a probabilistic parser, starting from the existing source code.

Weeks 4, 5 and 6: Practical machine learning. Anatomy of an implementation of a Bayesian classifier, or better yet, C4.5. Same discussions as before. Assignment: If Bayesian classifier, train it to classify Penn Treebank articles into "Mergers & Acquisitions", "Earnings Announcements" and "Other". If C4.5, make students derive features for doing this experimentally.

Weeks 7, 8, and 9: Basic databases. Demo real code using ODBC, showing students how to define a database table and populate it with something mildly complicated, like a list of HTML pages and classifications of the sort from the last section of the class. Assignment: Make students locally recreate a copy of ESPN's baseball player statistics from last year, by downloading pages, parsing out relevant info, loading it into DB.

Weeks 10, 11 and 12: Larger-scale NLP projects. Using a predefined grammar for the parser from the beginning of the course, and a corpus of documents prepared by the professor ahead of time, demonstrate integration of above software into a slightly useful search engine. Take questions in restricted natural language about baseball players and statistics, convert parse tree into SQL query, get info from database, and bring back some documents about the player(s), sorted by whether or not the paper falls into a category which is relevant to the question. Assignment is to complete skeleton integration code from a reasonably self-explanatory set of header files provided by professor.

If your semester runs longer, then add a final paper or something. The important thing is to have a class that demonstrates examples of real working code, and helps students who haven't been raised as programmers to understand the sorts of coding practices that make NLP software run well, and not merely conform to the very abstract specifications in present-day textbooks. Your host isn't really aware of courses like this in other branches of computer science, but at least those students get a lot more practice programming than the average linguist who decides this branch of the field is interesting.

March 03, 2004

SC is totally unimportant

This morning, D.F. Moore managed to get SC worked up with a single line:

No, Google, I rank the importance of your page to be 5/10!

SC assumes that, like him, Mr. Moore (Dr.-to-be, but the style guide says Mr. for now) has installed the insanely useful Google toolbar in his web browser. If you have surrendered to the giant of Redmond, and are running IE, then you might as well console yourself with the one piece of add-on software genuinely worth having.

For present purposes, the salient feature of the toolbar is that it will display, on a scale of 0-10, the importance of your page as measured by Google's proprietary PageRank algorithm. Frankly, your host hadn't been paying the slightest bit of attention to that tidbit of information on his very own blog (which he usually only accesses through the TypePad interface). However, as readers are aware from previous postings, your host lives for gratuitous ego-boosters. So he was at least mildly annoyed to see that Semantic Compositions gets a big, fat 0. The notoriously overpaid SC research staff was then tasked to go dig up some "oppo" on other language blogs (hey, it's politics season; at least one professional political consultant has told SC that this is the slang among his peer group for 'opposition research'). The major language blogs, 'Log and 'Hat, both rank a prestigious 6/10, as does the X-Bar. The Audhumlan Conspiracy pulls in a solid 5. A Tear in the Fabric of Spacetime is good for 2.

So how does it work? Google's official explanation is here. The important thing to understand about PageRank is that it says nothing whatever about the quality of your content ([that's what you have to tell yourself to feel better, at any rate -- ed.]). It's basically a measurement of how many links there are to your site, with weights assigned to those links in turn by considering the ranking of the pages doing the linking. So if someone was to miraculously find all of the contents of the ancient library of Alexandria and post it all on the web, their page would get a 0 until other people started noticing it.
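A rough Python sketch of the "links weighted by the rank of the linker" idea, using the textbook power-iteration form of PageRank. This is an illustration under assumptions, not Google's production system: the 0.85 damping factor is the commonly cited value, the handling of pages with no outgoing links is one common convention, the tiny link graph is invented, and the raw scores here are fractions that sum to 1 rather than the toolbar's 0-10 scale.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # No outgoing links: spread the rank evenly over every page.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


# Hypothetical web: "alexandria" has wonderful content and no inbound links.
links = {
    "fan1": ["popular"],
    "fan2": ["popular", "hub"],
    "hub": ["popular"],
    "popular": ["hub"],
    "alexandria": [],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page:12s} {score:.3f}")
```

In this toy graph the unlinked library of Alexandria ties for last place with the other pages nobody points to, whatever its contents.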

Of course, since this scheme only tells you about the quantity and quality of links as measured by other people who are presumably interested in the same topics covered by a given page, it doesn't actually solve the problem of which pages you want in response to a given search. If one asks Google to return all of the 10/10 pages, even if there are only several thousand of them, the topics covered will probably be so diverse that you couldn't hope to find anything on a particular topic except by going through the pages one at a time. It might be, though, that Google reserves 10/10 for itself -- go poke around there for a few minutes, and see if you notice a trend, even on pages that aren't obviously of interest to most people.

It's hard to say exactly how Google uses this information. SC's first guess was that it was used as a final sorting criterion -- after finding the pages with a particular string, and perhaps sorting them by how recently they were indexed, how many times the string appears, etc., that it would be used as a tiebreaker. On the other hand, if one page drastically outranks another, you might expect that it would come up first. So the research staff had a look at a favorite of this blog, "cracker barrel philosophy of science", and was shocked to discover that the 5/10-ranked page put up by Geoff Pullum's publisher is outranked by 0/10 SC.

Regardless of how Google uses the PageRank data, the original concern implied by Mr. Moore's retort to Google is "How do I get a higher ranking?". Obviously, SC needs to work harder to attract links, and behave even more outrageously than he already does. Perhaps he'll have to stoop to this level, but that's hardly the worst thing SC could do for attention. Of course, your host has to hope that nobody else is reaching the conclusion that he has about PageRank's mechanism -- namely, that linking to less-reputable pages must lower your own rank, or over time, everyone would be ranked highly...

March 01, 2004

Computing musician similarity

This morning, Mark Liberman links to a study out of MIT entitled "The Quest for Ground Truth in Musical Artist Similarity". The paper contrasts two different techniques for computing the similarity of musical groups/acts. One is based on the Erdős number, a somewhat tongue-in-cheek game of bragging rights played by mathematicians (the number counts the coauthorship links separating a given mathematician from Paul Erdős). The other infers similarity from collections of user data taken from an online file-sharing service.
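For anyone who hasn't played the game, the number is just a shortest-path length in a graph of who has written with whom, which a breadth-first search computes directly. A minimal Python sketch follows; the coauthor graph and the names in it are invented, and how the paper adapts the idea to musical acts is not reproduced here.

```python
from collections import deque


def collaboration_distance(coauthors, start, target):
    """Fewest coauthorship links from start to target, or None if unconnected."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == target:
            return seen[person]
        for other in coauthors.get(person, ()):
            if other not in seen:
                seen[other] = seen[person] + 1
                queue.append(other)
    return None


# Hypothetical graph: A wrote with Erdős, B wrote only with A, C with nobody.
coauthors = {
    "Erdős": ["A"],
    "A": ["Erdős", "B"],
    "B": ["A"],
    "C": [],
}
print(collaboration_distance(coauthors, "B", "Erdős"))  # B's Erdős number
print(collaboration_distance(coauthors, "C", "Erdős"))  # no path at all
```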

The article only weakly addresses the following concern, which SC thinks makes the latter technique pretty valueless. Namely, the fact that a person listens to the music of multiple artists does not imply that those artists are similar, although the exact meaning of similarity is left carefully undefined. Basically, a collection of songs can be seen as a "bag of artists"; the paper makes the assumption that if two artists cooccur in someone's list, their musical works are somehow related, which is the same as the naive Bayesian assumption. The authors defend this choice on the grounds that "even if users are striving for variety in their collections, it is significant if they find variety in the same artists". Maybe, but...

SC owns two multi-disc CD players; one is a 6-disc changer in his car, and the other is a 5-disc changer at home. Two of the discs presently in the car are primarily spoken-word recordings of comedians, and ought to be excluded in a comparison of musician similarity. Of the other four, there are 2 CDs of soundtracks from games (SC warned you he was a geek), a Rush album, the Sex Pistols' lone album, and the most recent Steely Dan album. At home, we can add a CD by British metal group Motorhead, 2 CDs from a compilation of Duke Ellington's big band recordings (spanning a period of 40 years), a recording by the Minnesota Orchestra of infernally-themed classical music, and a CD from Scott Ross' definitive 34-disc recording of the harpsichord sonatas of Domenico Scarlatti (no, SC has not finished listening to the entire set, a multi-year undertaking). All of these are just what's in the players at present; there is an additional library of another 70-80 discs (counting Mr. Ross' set as 1) available in Chez SC, but the reader may safely assume that it expands along these lines.

While this list has a few obvious similarities -- a taste for metal is clearly evident, and perhaps some argument could be made for a connection between that and the specific orchestral recording cited -- there are some wholly incompatible genre differences. The paper's authors would argue that if other people also combined some of these apparently disparate tastes, this would be an important finding, and SC is prone to agree -- but that strikes your host as a different sort of knowledge than calling music from plainly different genres "similar".

One area that SC hopes the authors revisit is their attempt to model the Erdős-style data as a network of resistors. This assigns weights to the connections by assuming that if two artists are linked by a large group of the same people, they are in fact closer than artists who are linked through just one connection. This parallels (no pun intended) the fact that when multiple resistors join the same two points in a circuit, their combined resistance is less than any one of them alone. The authors note that this is biased towards making the most popular artists appear similar to everyone else, and that compensating for it by adjusting the values of those links didn't work very well. SC thinks the trick isn't increasing the resistances of the most popular artists, it's adding other circuit components. Make the top artist a voltage source, treat the next 20 most-popular artists as diodes, and everything will fall into place. Or not. (Note to people whose EE credentials are more current than SC's: I know better than that. I might be willing to take a serious crack at a complicated model, but only if there's actual reader interest.)
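The parallel-resistor arithmetic behind this is short enough to show in Python. The sketch below covers only the textbook formula for resistors joining the same two points; treating each shared connection as a unit resistor is an assumption made for illustration, and the paper's full network model is not reproduced.

```python
def parallel_resistance(resistances):
    """Combined resistance of resistors wired between the same two points."""
    return 1.0 / sum(1.0 / r for r in resistances)


# Two artists joined by one shared connection versus three, 1 ohm each.
print(parallel_resistance([1.0]))            # a single link
print(parallel_resistance([1.0, 1.0, 1.0]))  # three links: a third of an ohm
```

Every added link lowers the total, so an artist connected to almost everybody ends up "close" to almost everybody, which is the popularity bias the authors ran into.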

Readers might be wondering what this could possibly have to do with linguistics. On the surface, nothing. However, computational linguists are always on the lookout for better similarity metrics -- compare Google's link-based measurements to Excite's latent semantic indexing-based measurements to the old term-frequency measurements used by just about everyone back in 1996-7, and you can see how far we've come by looking for new similarity measurements. Any new models that show improvement over what we're used to -- or can do as well without requiring clustered supercomputers -- are worth trying to duplicate, even if they can only function in limited domains.

February 06, 2004

Do you know what I'm thinking? -- Part II

In Part I of this series, I laid out the case for the semantic web, albeit with rather less enthusiasm than its backers. Today, the #1 issue with a bullet: agreement.

SC spent most of his paid employment time during 2003 working on building an ontology. The rest of that time, he was devoted to working on software to map terms across ontologies, a task itself so nebulous, so poorly defined, that it probably shouldn't be attempted right now. Any academics thinking they can identify SC from publications now: you're wrong. SC's team didn't publish anything about it. Never mind that, though.

To catch up any readers who don't know what an ontology is, the use of the term "ontology" without a determiner refers to the branch of philosophy that studies what exists, and how we classify it. When a linguist or computer scientist says "an ontology", it means a structured classification of the things that we know, generally sorted into a hierarchy.

Building an ontology using a framework you didn't define should be a mandatory experience for anyone presuming to tell the world how to represent meaning. Take a second to consider the meaning of the word "address", and then continue reading.

Did you think of a post office? e-mail? computer memory? giving a speech? OK, all that proves is that unambiguous terms are a good thing. So maybe our ontology needs to have 4 kinds of address, each with unique names. Let's standardize on one example to keep going, though. What terms do you need to represent a postal mail address?

ZIP code? street address? city? state? name? apartment number? salutation? Did you think to separate salutation from the rest of the addressee's name? 5 digit ZIP, or 9?

The problem is that different people store this differently in their minds, and produce very different representations as a result. Lazy programmers will want to store everything in a few strings; more conscientious or anal-retentive types will split the strings by parts, will check the format of each string for acceptability and so forth. It might not even be laziness; if you're keeping a directory of addresses for people in your school, you don't really need the same sort of elaborate validation procedures you might want for administering a criminal database. So two developers who both need to represent an address might come up with completely different data structures for the job, even though they both know all of the things we listed in the previous paragraph. Remember that slow-loading web page? It includes 14 separate definitions for the word "action", which should give you some idea about how hard it is for people to agree on the meaning of a single term.
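Here is roughly what those two developers might produce, as a Python sketch. Both classes, their field names, the ZIP-code check and the sample address are hypothetical; the point is only that each is a reasonable "address" and that they don't line up.

```python
import re
from dataclasses import dataclass


@dataclass
class LooseAddress:
    """The school-directory version: a name and a blob of text."""
    name: str
    lines: str


@dataclass
class StrictAddress:
    """The criminal-database version: every part split out and checked."""
    salutation: str
    name: str
    street: str
    apartment: str
    city: str
    state: str
    zip_code: str

    def __post_init__(self):
        # Accept either a 5-digit ZIP or the 9-digit ZIP+4 form.
        if not re.fullmatch(r"\d{5}(-\d{4})?", self.zip_code):
            raise ValueError(f"bad ZIP code: {self.zip_code!r}")


def to_loose(addr):
    """Strict to loose is easy: just glue the parts together."""
    lines = f"{addr.street} {addr.apartment}\n{addr.city}, {addr.state} {addr.zip_code}"
    return LooseAddress(name=f"{addr.salutation} {addr.name}", lines=lines)


strict = StrictAddress("Dr.", "Pat Example", "1 Main St", "Apt 2",
                       "Springfield", "IL", "62701")
print(to_loose(strict))
```

Going the other way, from the blob back to the seven validated fields, means guessing where the salutation ends and whether "Apt 2" is part of the street, and that guessing is exactly where the disagreement lives.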

Agreeing on the format of data isn't the only challenge for the semantic web, though. Once you've got a lot of terms, you'll probably want to organize them into a hierarchy like we discussed before. Now, the question is how you want to organize that hierarchy. Here, we present a fairly serious disagreement between two ontologies prepared by teams of people with Ph.D.s in CS, linguistics, philosophy, and other related disciplines.

Consider the word "communication". For SUMO, a publicly available ontology, it's an event where information is transferred. It inherits from other concepts like so:

entity -> physical -> process -> intentionalProcess -> socialInteraction -> communication

The inheritance chain simply means that each term is held to have all of the properties of whatever comes above it, as well as having some specific facts that distinguish it from everything above.
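
For readers who think in code, the chain above behaves much like class inheritance. The sketch below mirrors it in Python. The class names follow the chain as written, but the one-line "facts" attached to each class are placeholders invented for this example and are not SUMO's actual axioms.

```python
class Entity:
    own_fact = "is something"  # placeholder fact, not a SUMO axiom


class Physical(Entity):
    own_fact = "exists in space and time"


class Process(Physical):
    own_fact = "happens over time"


class IntentionalProcess(Process):
    own_fact = "is done on purpose"


class SocialInteraction(IntentionalProcess):
    own_fact = "involves more than one agent"


class Communication(SocialInteraction):
    own_fact = "transfers information"


def all_facts(term):
    """Return (name, fact) pairs for a term and everything above it,
    ordered from the top of the chain down to the term itself."""
    chain = [c for c in reversed(term.__mro__) if c is not object]
    return [(c.__name__, c.own_fact) for c in chain]


for name, fact in all_facts(Communication):
    print(f"{name}: {fact}")
```

Asking for everything known about Communication walks the whole chain, so the answer includes what is true of every term above it plus the one fact that sets it apart.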

Another ontology, which I can't name, organizes things a bit differently. Communication is a field of study.

object -> mentalObject -> abstractObject -> fieldOfStudy -> communication

Not only do the terms not mean the same thing, but what I haven't shown is that SUMO considers physical and abstract to partition the world at the highest level, so these aren't even related.

The reader might object that there must be some other term in the second ontology that means what "communication" means in the first. And the reader would be right. It's:

event -> mentalEvent -> communicativeEvent

This more readily represents the "act of information transfer", but it does not carry several pieces of meaning explicitly present in SUMO: it's not necessarily intentional, it's not necessarily social, and it's not necessarily a process implemented strictly through physical means. So even though one might write a bit of code to translate between the two terms, "communication" and "communicativeEvent", it still wouldn't tell you anything about the concepts that each one inherits from.
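
Here is roughly what that "bit of code" might look like, and why it falls short. This is a sketch under stated assumptions: the two dictionaries contain only the chains quoted in this post, the name OTHER stands in for the ontology that can't be named, and TERM_MAP and the helper functions are made up for the example.

```python
# Each ontology is a child -> parent table; None marks the top of the chain.
SUMO = {
    "communication": "socialInteraction",
    "socialInteraction": "intentionalProcess",
    "intentionalProcess": "process",
    "process": "physical",
    "physical": "entity",
    "entity": None,
}
OTHER = {
    "communicativeEvent": "mentalEvent",
    "mentalEvent": "event",
    "event": None,
}

# The hand-written, term-to-term mapping (hypothetical).
TERM_MAP = {"communication": "communicativeEvent"}


def ancestors(ontology, term):
    """List everything above a term, nearest parent first."""
    chain = []
    parent = ontology[term]
    while parent is not None:
        chain.append(parent)
        parent = ontology[parent]
    return chain


def lost_in_translation(term):
    """SUMO ancestors with no counterpart above the translated term."""
    target_ancestors = set(ancestors(OTHER, TERM_MAP[term]))
    return [a for a in ancestors(SUMO, term)
            if TERM_MAP.get(a) not in target_ancestors]


print(TERM_MAP["communication"])
print(lost_in_translation("communication"))
```

The translation itself is a one-line lookup. The second function is the bad news: every concept that "communication" inherits from in SUMO comes back as lost, because nothing above "communicativeEvent" corresponds to any of them.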

Worse, there's no way to automate understanding what's covered in one ontology versus another. Although there have been some fairly serious attempts at it, all of them require some kind of hard-coded mappings to begin with, and the approaches often don't generalize well beyond the pairs of ontologies they're written for. Even when successful term-to-term mappings are found, there's still the problem of enforcing agreement between the data encapsulated inside each term.

XML, RDF, DAML, and future extensions to those languages will all allow you to automate reading the way these things are represented. In that sense, they've simplified the semantic web problem. But unless you enforce the use of a standard ontology and standard features, the applications built with those languages will still not be able to talk to each other. That's why, as nice as the semantic web idea sounds, SC doesn't think it will work until somebody says "enough!", and takes control of what documents mean -- an event nobody is particularly hoping will happen.

(Edited on 1/12/05 at 3:03 p.m. to update SUMO link at request of SUMO editor Adam Pease.)

February 03, 2004

Says who?

Mark Liberman has lately been playing with the notion "attributional abduction", which he defines as "reasoning to the most likely explanation for the publication of this bone-headed remark." Today, it came up in the context of an article about distributed computing in plants. SC was fine with the comments all the way up until Prof. Liberman couldn't resist the temptation to commit Cracker Barrel