May 03, 2007

Linguistics for fun and profit -- mostly profit

From a rather unlikely source, National Review's Jonah Goldberg, comes an excellent Bloomberg News article on the state of the art in natural language processing and artificial intelligence for financial applications. NLP and finance being longtime interests of your host, this was an article he found well worth his time.

There's a bit of the usual hand-wringing about how artificial intelligence didn't live up to its promises in the 1960s, and the resulting skepticism which followed. And SC really wishes the HAL 9000 references whenever AI is discussed would just go away, because it frustrates your host no end that people end up thinking that a murderously malfunctioning computer is an aspirational goal (although if the AI community wants to perpetuate this meme, SC can only shrug). On that note, check out Goldberg's post for a reasonably funny HAL-in-finance parody. Having said all that, there are a couple of projects of interest that you can read about in the article:

  • Professor Michael Kearns' Penn-Lehman Automated Trading Platform, involving machine learning with a stock market simulator to test real-time trading strategies
  • Collective Intellect, a Colorado-based startup, offers a system for filtering financial blogs to find information that might give you a trading edge
  • Professor Kathy McKeown's Newsblaster (you'll need to request a username/password to play with it, or see this example) which hasn't been used for financial applications yet, but which is mentioned for ongoing research in conducting "what-if" analyses of how news events might impact markets

A common feature of these lines of research is that they're largely oriented towards the short term; projects done by Prof. Kearns' students relied on an evaluation procedure that didn't allow overnight holding of positions, never mind anything that might qualify for long-term capital gains (that's 12 months). This strikes SC as not being a wholly desirable approach to things, because while everyone and their brother wishes they could pick off events like the giant spike in Dow Jones stock from two days ago (and some options traders may have done so illegally), there's a lot more to financial analysis than simply trying to guess where the upticks will occur. Prof. McKeown's research sounds like a stab in the direction of real financial fundamental modeling, but at least as presented by the Bloomberg article, right now the emphasis seems to be on minimizing risk and maximizing returns through short-term exposure to high-probability trades. That's not a bad strategy -- the importance of proper risk management cannot possibly be overstated, and it would take a real jerk to complain about a reliably profitable approach -- but it leaves a whole lot of other investing approaches on the table.

April 24, 2007

As the machine laid translating

Courtesy of Arts & Letters Daily, here's a Washington Times book review of a new biography of Edith Wharton that argues for a higher place for her, as well as Willa Cather and Dawn Powell, in the literary canon. SC is regrettably less familiar with the works of Ms. Wharton than he ought to be ([you've read the complete Dune pastiche of Messrs. Herbert and Anderson, but not The Age of Innocence? 'Nuff said -- ed.]), and the article forcefully makes the point that no small part of her comparative neglect is her failure to live a debauched life of crudeness and self-promotion. Put more bluntly, "Misters Fitzgerald, Hemingway, and Faulkner were all alcoholics", and the article follows that up with examples of the hard living that guaranteed them a steady stream of headlines until death.

But the part that caught the eye of this computational linguist was the smackdown of the grossly-overrated Faulkner that went like so:

Miss Powell's New York books re-create a milieu every bit as richly imagined and unforgettable as Mr. Faulkner's Yoknapatawpha County — and a lot more, um, intelligible. "I can't read Faulkner," confesses Mr. Page. "He does absolutely nothing for me."

He's not the only one.

Some enterprising soul has posted on the Internet "Machine translation or Faulkner?" — a quiz asking you to deduce whether quotations are computer-translated text from the German or samples of Mr. Faulkner's prose.

Needless to say, SC immediately Googled "machine translation or Faulkner?" and was rewarded with this site. The machine-translated German texts are revealed to be largely appropriate literary comparisons, albeit not of Faulkner's writing (think Goethe), although some might hold that an ideal test would be a machine translation of a human translation of Faulkner's writing back into English. SC scored a meager 42% on the quiz, and was frankly shocked to see that some of the questions he got wrong revolved around quotes from Faulkner works that he's actually read. Evidently your host didn't read them attentively enough.

June 22, 2006

Red light, green (web) site

Today in the Wall Street Journal, Walter Mossberg has a column (subscription required) about two new nannybots, er, web content safety tools. The tools in question, Scandoo and SiteAdvisor, are supposed to protect you not only from the usual litany of bad words, but also from sites likely to hit you with pop-ups, spam, and the other usual sorts of malware that make web browsing so much fun.

Much like the hated Websense, SiteAdvisor works off of a database of naughty pages which have been categorized through unspecified algorithms. SC hopes that someone with a sense of humor at one of these companies calls their database the Index Librorum Prohibitorum, perhaps with an update to reflect the fact that we're dealing with web pages, not books. Scandoo is probably of greater linguistic interest, as it purports to conduct its scanning on-the-fly, and therefore must be doing some kind of near-real-time classification of words (this is done only on search results, not on whole pages).

Naturally, SC had to test the ability of these sites to perform as advertised. SiteAdvisor irked your host right at the start, by requiring the installation of a plugin for Firefox (the same is true for any browser you'd use it with). The resulting "safety" indicator that displays in the lower-right corner of your browser presumably works in the same manner as the Google Toolbar's PageRank display, pulling in the needed data as you go from site to site. And incidentally, giving SiteAdvisor (as with Google) a convenient way to track your browser history and correlate it with your IP address. No wonder it's free. (In fairness, the plugin includes controls to use SSL encryption for communicating with the SiteAdvisor servers, as well as for turning the plugin off -- mitigating privacy concerns somewhat.)

That's not to say it doesn't do something useful. Once you install the plugin, SiteAdvisor offers you a collection of suggested test links to see their work in action. So SC used their search terms as a test corpus. For "screensavers", SiteAdvisor flagged 7 of Google's top 10 results as dangerous (all of them look the part, for what it's worth), with warnings about links to spyware-related sites, infected downloads, and the number of spam e-mails per week received from the website as a result of visiting. Scandoo flagged two of the same sites, but gave passes to the other 8. For "free ipods", SiteAdvisor flagged just one of the top 10 sites, but 7 of 11 sponsored ads. Having said that, SC only thought that one of the Google results should count as a miss in their precision statistics. Scandoo gave them all a clean bill of health. For "p2p", both pieces of software gave a pass to all 10 of the Google links, but SiteAdvisor flagged 5 of 8 displayed ads as potentially malware-related. For "free downloads", SiteAdvisor marked 2 of the 10 Google hits with "caution" flags, and one as dangerous, as well as marking 3 of 9 ads as dangerous. Scandoo again was fine with them all. For "wallpapers", SiteAdvisor marked 4 of 10 Google hits as dangerous, and one with a "caution", plus one more of each in the 8 ads that showed up. Scandoo gave one of the top 10 wallpaper hits a "sex/nudity" flag, and a "use caution" flag to another. So as far as avoiding intrusive pop-ups and malware goes, SiteAdvisor is actually a pretty useful tool, and SC will be leaving the plugin installed in his copy of Firefox.

However, the linguistic fun is to be had in Scandoo's "security preferences" page, where you can "customize the type of sites you would like advanced warning for". Warnings include the predictable "hate and discrimination", "sex/nudity", "weapons", and "illegal activities" (which are set as your preferences by default). For the more easily intimidated, security warnings can also be issued for "arts and culture", "finance", "food and drink", "nature", "fashion and beauty", and about 15 other categories as well.

Naturally, SC therefore wanted to test Scandoo's effectiveness as a classifier for those other site types. Setting up the filter to just flag "arts and culture", and then searching for "classical music", it flagged 9 of 10 sites correctly (in SC's judgment), the lone exception being this link to Duke's library, which categorizes some 1500 links as various resources for classical music. Not a bad showing, albeit an anecdotal one. SC was less impressed when he set the filter to "Politics and Society", searched for "George W. Bush", and found that it only categorized 6 of the 10 hits as political, missing on two official White House pages, as well as two sites maintained by the Republican National Committee. It performed no better on a search for "John F. Kerry", getting only 5 of 10 right, and failing to flag either Kerry's campaign site or his official Senate homepage. Whoops!
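The three anecdotal trials above can be tallied in a few lines. This is only a sketch of the arithmetic, under the assumption that all ten results for each query really do belong to the category being tested (so that "correct" means "flagged"); the names `RESULTS`, `hit_rate`, and `pooled_hit_rate` are made up for illustration.

```python
from fractions import Fraction

# (correctly categorized, total results) per query, as reported in the post.
RESULTS = {
    "classical music": (9, 10),
    "George W. Bush": (6, 10),
    "John F. Kerry": (5, 10),
}


def hit_rate(correct, total):
    """Share of results that were categorized correctly for one query."""
    return Fraction(correct, total)


def pooled_hit_rate(results):
    """Hit rate over all queries combined, weighting every result equally."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return Fraction(correct, total)


if __name__ == "__main__":
    for query, (correct, total) in RESULTS.items():
        print(query, float(hit_rate(correct, total)))
    print("pooled", pooled_hit_rate(RESULTS))
```

Pooled over all thirty results, that comes to 20 of 30, or two thirds, which is still a sample far too small to say anything rigorous.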

Finally, SC subjected both filters to the dreaded "breast test", to examine their ability not to generate false positives on a subject including a word with both inoffensive and pornographic usages. Both filters gave clean bills of health to all of the links on the first three pages when "breast cancer" was searched for. All of the ads were also cancer-related, and so it was no failure for SiteAdvisor to give them a pass. However, when the search term was just "breast", while Google's unpaid results were all clean on the first page, there were a couple of ads for scams or pornography to go with the results. And these also got a pass from SiteAdvisor (Scandoo has nothing to say about ads). This was a bit of a surprise -- perhaps the engineers have tuned the algorithms to give the benefit of the doubt to sites that might trigger this complaint from civil libertarians. SC was somewhat surprised that the dodgy ads didn't get flagged for spam or malware, though -- do magic pill vendors really have better ethics than screensaver sites?

It would appear that content filtering has taken somewhat of a step forward. While SC doesn't think the topic categorization provided by Scandoo is good enough that he would want to plug it into an application of his own, it makes for a decent enough first approximation. Neither of them is as obtrusive as the filters that made headlines a decade ago, and if a library chose to install SiteAdvisor on their web browsers, SC wouldn't have any objection. Having said that, if you're the sort of parent who doesn't want your child coming anywhere near MySpace (which SiteAdvisor and Scandoo both had no issues with), these aren't quite the solutions you're looking for.

February 22, 2006

On a non-trivial substring recognition problem

Courtesy of SC's favorite information technology scandal rag, The Register, comes news that Yahoo! has banned the registration of usernames containing the string "allah", presumably in order to avoid offending Muslims. This was discovered not by someone wishing to express an opinion on the Islamic faith, but rather by an individual bearing the last name "Callahan". Seeing as this demonstrates that Yahoo's filter must be case-insensitive, for the remainder of the post, SC will continue to spell out the string in all-lowercase, as no reference to anything but a particular string is intended.

At first, your host's instinct was, "That's an idiotic oversight; couldn't they have just done something to ban patterns like "allah2000" or "myopinionaboutallah", where it clearly isn't part of another word?" They've got the data on what sorts of names people are trying to register; surely they could construct a regular expression that would catch the bulk of it, and maybe use a dictionary to filter out the rest?

But actually constructing such a filter is harder in practice than it sounds. Suppose we assumed (to take just one of many possibilities) that "allah" at the end of a proposed username is probably it, and wrote an expression along the lines of "*allah". This would shut down "ILikeChallah", an expression of opinion on a wholly unrelated matter. Of course, this is why you would then need to follow up the expression with a dictionary, to try to match increasingly-long substrings to existing words, which would verify that the occurrence of "allah" was a false positive. But this would be complicated by the potential of having to expand the substring in both directions as part of a search; after all, "LivinInTallahassee" ought to pass as well, not to mention the name that started this whole line of inquiry (Callahan).
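To make the two-directional expansion concrete, here is a minimal sketch in Python. It is not Yahoo's filter, which the post can only guess at; the tiny `DICTIONARY` and the function names are hypothetical, and a real system would need a far larger word list. For each occurrence of the banned string, the window is widened leftward and rightward, and every widened substring is looked up.

```python
# Hypothetical word list; a real filter would load a full dictionary.
DICTIONARY = {"callahan", "challah", "tallahassee", "like", "livin", "in"}
BANNED = "allah"


def covering_words(username, banned=BANNED, dictionary=DICTIONARY):
    """Return dictionary words in `username` that strictly contain `banned`."""
    name = username.lower()  # the filter is assumed to be case-insensitive
    found = set()
    start = name.find(banned)
    while start != -1:
        end = start + len(banned)
        # Widen the window in both directions around this occurrence.
        for left in range(start, -1, -1):
            for right in range(end, len(name) + 1):
                candidate = name[left:right]
                if candidate != banned and candidate in dictionary:
                    found.add(candidate)
        start = name.find(banned, start + 1)
    return found


def is_probable_false_positive(username):
    """True when some longer dictionary word explains the banned substring."""
    return bool(covering_words(username))


if __name__ == "__main__":
    for name in ["Callahan", "ILikeChallah", "LivinInTallahassee", "allah2000"]:
        print(name, is_probable_false_positive(name))
```

The nested loops are quadratic in the length of the username, which is harmless for short strings, but the sketch says nothing about the harder question of what the rest of the username means.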

And then, even if the two stages of the filter concluded that you probably did have a match to the Muslim deity in particular, not just an Irishman or a Jewish baked good, there would have to be some additional processing to determine if the reference was actually offensive. Which would then require some way to parse out the rest of the proposed username, taking into account the tendency of people to abbreviate words in coming up with said usernames. If someone came up with "flwrallah" as a username, is that "follower" or "flower"? One can use finite-state machines to come up with potential word lists from abbreviations pretty easily -- your host used this program for exactly such an application as part of a statistical NLP course -- but suddenly, we're talking about a fairly complicated piece of software just to check for all of the possible words that could be in someone's e-mail address.
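The "flwr" ambiguity is easy to reproduce. The program mentioned above isn't shown here, so what follows is a stand-in rather than a reconstruction: it compiles the abbreviation into a regular expression (itself a finite-state pattern) that allows any letters between the ones that were kept, and matches it against a word list. The list `WORDS` and both function names are invented for the example.

```python
import re

# Hypothetical candidate vocabulary, just large enough to show the ambiguity.
WORDS = ["flower", "follower", "flavor", "flow", "floor"]


def abbreviation_pattern(abbrev):
    """Build a pattern that keeps the letters in order, gaps allowed between."""
    letters = [re.escape(ch) for ch in abbrev.lower()]
    return re.compile(".*".join(letters))


def expansions(abbrev, words=WORDS):
    """Words that start and end like `abbrev` and contain its letters in order."""
    if not abbrev:
        return []
    pattern = abbreviation_pattern(abbrev)
    return [word for word in words if pattern.fullmatch(word)]


if __name__ == "__main__":
    print(expansions("flwr"))
```

Both "flower" and "follower" survive, which is exactly the problem: the pattern generates candidates but offers no way to choose between them.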

In that light, it's not too hard to understand why Yahoo! might! have! just! decided! to! ban! the! string! outright! (to use a frequent Register affectation when writing about the company). But it's actually fairly clear that they didn't go through any such engineering analysis to consider the trade-offs. You see, we know not only that this was their policy as of two days ago, but how the story ends. Rather than just admit that they took the straightforward approach to dealing with a problem they didn't want to have (namely, complaints from irate Muslims), they instead claim to have "recently re-evaluated" the issue, and concluded that it's simply gone away, which strikes SC as rather implausible in light of current headlines (think cartoons).

Of course, they could be telling the truth insofar as they've got the data on what usernames people have actually tried to register (and when they've done so), and SC doesn't. However, if we take their word, then we're also supposed to believe that they independently came to a decision to unblock the string right around the time they got a complaint that they should have easily anticipated -- and thus had a clear policy to deal with -- if they devoted even as much effort to this issue as SC has in writing this post.

February 20, 2006

A translation/search engine for Wikipedia

Every now and then, SC browses Google News for language-related stories, and comes up with something like this, a search engine for Wikis. The interesting linguistics hook here is that Qwika, the search engine in question, performs machine translation of Wikipedia articles into a variety of (admittedly fairly common) languages.

So how does it do? Well, here's the first sentence of the Wikipedia article on semantics:

In the main, semantics (from the Greek and in greek letters "σημαντικός" or in latin letters semantikos, or "significant meaning," derived from sema, sign) is the study of meaning, in some sense of that term.

And here's their Spanish translation:

En la cañería, semántica (de Greek semantikos, o el "significado significativo," derivó de sema, la muestra) es el estudio de meaning, en un cierto sentido de ese término.

El estudio de meaning? It's not bad for gisting purposes, though, which is about as good as you could hope for with large-scale machine translation, especially with so many subject areas to cover. Considering what a useful resource Wikipedia is in English, this looks like a pretty nice start to getting some of that work available in other languages.

September 23, 2005

Optimized for a waste of time

Your host is a little hesitant to write about this topic. This post is inspired by an article he read about a company which is a very, very aggressive litigant. Nevertheless, the topic falls quite squarely within the Semantic Compositions charter, and so we'll just hope they have bigger fish to fry.

Our story begins with an article published in yesterday's Wall Street Journal, on websites that disappear from Google's results as a consequence of the strategies they use to get better placement (subscription required). The lead-in:

Last summer, Gary Pond realized he had a big problem: The Web site for his luggage business no longer showed up in the search results in Google or Yahoo.

He immediately contacted Traffic-Power.com, a company he had hired for $2,400 to do search engine optimization – tactics designed to make a Web site appear higher in Web-search results. But he said he was unable to reach anyone at the company. After further investigating, Mr. Pond discovered that Google and Yahoo had dropped his site, PorterCase.com, because Traffic Power had used methods the search engines deem improper. After weeks of correspondence, Google restored the site, but a search for "Porter Case" still fails to turn up the company's Web site on Yahoo.

So what does Traffic Power do? As their own site describes it:

  • Our professionally trained technicians help you to choose the 20 keyword phrases that best reflect the main focus of your site.
  • We use those keywords to optimize your meta tags and build 280 HTML attraction pages designed to conform to the ranking criteria of the top search engines.
  • Since many search engines now use link popularity in their rankings, we use systems developed by our programmers to build your link popularity.

Now we're getting somewhere. One might immediately suspect that "280 HTML attraction pages" loaded up with nothing but keywords and links might be a crude way to manipulate Google's link analysis, which gives higher rankings for each keyword to pages with large numbers of incoming links. And one would be right.
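For the curious, the effect is easy to simulate with a toy version of link analysis. The sketch below is a textbook-style simplified PageRank, not Google's actual ranking (which is far more elaborate and not described here); the page names, the damping factor of 0.85, and the helper `with_doorways` are all assumptions made for illustration. Two sites start with identical organic links, and then one of them acquires 280 doorway pages pointing at it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def with_doorways(links, target, count):
    """Copy the link graph and add `count` pages that link only to `target`."""
    spammed = {page: list(targets) for page, targets in links.items()}
    for i in range(count):
        spammed[f"doorway{i}"] = [target]
    return spammed


# Hypothetical miniature web: a hub links to two competing shops.
HONEST = {"hub": ["shop", "rival"], "shop": ["hub"], "rival": ["hub"]}
SPAMMED = with_doorways(HONEST, "shop", 280)

if __name__ == "__main__":
    honest, spammed = pagerank(HONEST), pagerank(SPAMMED)
    print("honest: ", honest["shop"], honest["rival"])
    print("spammed:", spammed["shop"], spammed["rival"])
```

In the honest graph the two shops tie; once the doorway pages exist, "shop" outranks "rival" even though no human reader has expressed any new interest in it.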

Odds are good that you've come across attraction pages (also known as "doorway pages" and "Search Engine Entry Pages") before, most likely in search of various consumer products. Linking to them is problematic because they have a way of constantly changing and disappearing. But never fear, SC has devised an ingenious permanent solution for the problem -- try this link to a Google search for "viagra bargain phentermine", which is pretty much guaranteed to have bogus pages disguised as message boards, blogs, and other link farms forever (not to say that Traffic Power is specifically responsible for anything you find). Or at least until Google comes up with a clever way to detect web pages with content that differs markedly from what the directory names would suggest.

While Traffic Power seems to be at the outer edges of aggressiveness -- changing their name and earning no end of ridicule for suing a blogger over comments left on his site -- it's hard to say that their behavior is really qualitatively different from the larger industry of "search engine 'optimizers'", as they call themselves. They would vigorously contest the name "spammers", so SC will let you read their own words and decide for yourself. Consultant Phil Craven, arguing that his goals and Google's are the same:

SEOs have exactly the same aim as Google; that is to see the search results filled with relevant web pages for any given search term. Both SEOs and Google strive for that. The only difference between Google and SEOs is that SEOs want to see one of their relevant pages at or near the top, whereas Google doesn't care about individual websites. Apart from that, the aims and desires of both Google and SEOs are identical.

I realise that Google would like to index the natural web - a web that hasn't been tainted by arranged link exchanges and modifications to pages and sites, for the purpose of improving rankings. But the web wasn't in a natural state when Google arrived, and it will never be in a natural state in the future, so that ideal is something that Google can never have.

Craven has a point -- it's quite naive to think that Google results represent a "natural web" where only pages which have earned their way to the top show up there. As soon as the first spammer discovered the power of hidden text, bogus redirect pages, etc., the link analysis performed by Google became worth only as much as the company could do to defend it.

But the idea that the "optimizers" are anything other than the natural enemies of search engines is laughable. Google's entire business model depends on being able to sell advertising tied to keywords. "Optimizers" attempt to get favorable placement without Google receiving any money. The only way Google's ads are worth money -- and the only way to make it economically viable for them to provide a free search service -- is if their non-advertising links are perceived to be as unmanipulated as possible. If you can get top billing without paying them, their ability to charge for ads is compromised. Not being a Google employee, SC has no obvious reason to shed tears over their loss -- but as a Google user, he sure has an incentive to want them to be able to make enough money to keep their service running and free.

Craven is not wrong when he argues that since Google has no incentive to care about any individual website, the specific order of equally relevant pages is capricious:

Let me give a hypothetical example of a web that hasn't been touched by search engine optimization. Suppose there are 100 hotels in New York, each having its own website, and suppose a surfer, who is looking for a New York hotel, types "new york hotels" into Google's search box. Which of the 100 hotel sites does Google list at or near the top, and which of them get buried? Which of them are more relevant to the searcher's requirements than the others? Of course, none of them are more relevant than the others; they are all equally relevant to the searcher's search. So which does Google display first? Those that, by chance, just happen to have pages with criteria that are the closest to Google's algorithm. Is that a fair way of doing things? Of course not. The few sites at the top would think it's fair, but common sense says that there's nothing fair about that system.

But while this complaint might be valid so far as actual human-readable pages are concerned, Craven's view of what counts as "optimization" has little to do with content -- indeed, he has contempt for "ethical" optimization (his scare quotes, not mine). Just in case we're not clear on his position, he says "There is nothing intrinsically wrong, immoral or unethical with any of the so-called search engine spam techniques and methods." In Craven's view -- and those of "optimizers" more generally -- limiting yourself to Google's preferred optimization strategies, such as "writing copy...giving advice on site architecture and helping to find relevant directories to which a site can be submitted", is a mug's game. Craven gives the game away by saying:

What difference does it make if a site's doorway page or content page occupies a top ranking? None. A relevant page from the site is listed; that's all.

Nonsense. Whether one calls them doorway pages or attraction pages, the point is still to create links that make a site look more popular than it is. If I buy 280 domains and put up one page on each of them for the sole purpose of creating 280 links to this blog from different domains, I've created that many pages which no human actually wants to read, even if they want to read the pages that are linked. What sort of user interface puts up pages you don't want to read ahead of pages you do want? It might be a proxy for a genuinely useful page, but every page that comes up full of advertising links is still one more wasted page that isn't what I actually wanted to read.

One might argue that there is still a difference of degree, if not of kind, between putting up a page pretending to be a search engine or directory listing and putting up fake blogs or fake message boards. Perhaps so. But when computational linguists set out to define useful results in response to keyword searches or real questions, pages that exist solely to redirect users sure aren't what they have in mind.

April 13, 2005

Microsoft Grammar Checker Are Not As Stupid As Advertised

Your host can't believe he's about to write this. Especially since it was covered quite well by the excellent Polyglot Conspiracy some 6 weeks ago. But here go:

Arts & Letters Daily Link to Chronicle Of Ivory Tower Fools Story Making Big Fun Of Microsoft Grammar Checker Program. Story Say Grammar Checker Write like SC on Crack.

Brain-dead Heap Smart Professor University of Washington Show Clever Examples of Grammar Checker Fail. Wise-ass Chronicle Editor Bite Hook, Sinker and Eat Much Line as Well. Pens Jackass Ungrammatical Headline and Think Super Funny That Grammar Checker Not Catch Mistake. Not Consider That Bogus Capitalization Source of Error; Maybe Program Not So Stupid After All.

Try SC Experiment At Home. Paste Block Quote into Word:

Microsoft Word Grammar Checker Are No Good, Scholar Conclude. Microsoft Word grammar checker are no good, scholar conclude.

See How Not Using Headline Capitalization Style Make Grammar Checker Smarter. Not true in all case, but make editor look real dumb print comment that headline fool grammar checker.

Every sentence in this post except for the normally capitalized one passed the grammar checker in your host’s copy of Word 2003. In fact, Word also caught grammar errors in the case of every non-capitalized word appearing in the earlier parts of this post -- they only pass the grammar checker after being corrected to “normal” capitalization. Admittedly, this suggests some rather simple rules are in play – not capitalizing prepositions, for example – but it also suggests some rational decision-making on the part of the Microsoft crowd, like treating sequences of capitalized words as proper nouns that they shouldn’t try too hard to parse with limited resources.
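The proper-noun guess can be illustrated with a toy heuristic. To be clear, this is speculation about the kind of rule that might be in play, not Microsoft's implementation, which isn't public; the function names and the minimum run length of 2 are invented here. The idea is to find runs of consecutive capitalized words and treat each run as an opaque name that the checker declines to parse.

```python
PUNCTUATION = ".,;:!?\"'"


def capitalized_runs(sentence, min_run=2):
    """Return runs of at least `min_run` consecutive capitalized words."""
    runs, current = [], []
    for token in sentence.split():
        word = token.strip(PUNCTUATION)
        if word[:1].isupper():
            current.append(word)
        else:
            if len(current) >= min_run:
                runs.append(current)
            current = []
    if len(current) >= min_run:
        runs.append(current)
    return runs


def hidden_in_run(sentence, word, min_run=2):
    """True when `word` sits inside a run that would be skipped as a name."""
    return any(word in run for run in capitalized_runs(sentence, min_run))


HEADLINE = "Microsoft Word Grammar Checker Are No Good, Scholar Conclude."
PLAIN = "Microsoft Word grammar checker are no good, scholar conclude."

if __name__ == "__main__":
    print(capitalized_runs(HEADLINE))
    print(capitalized_runs(PLAIN))
```

Under this rule the whole headline collapses into one long "name", so the agreement error on "Are" is never examined, while in the normally capitalized version only "Microsoft Word" is skipped and "are" is left exposed to the checker.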

(Edited on 4/13/05 at 10:19 a.m. to properly credit the original scoop on this story in the linguistics blogging community.)

March 15, 2005

Tomatoes are vegetables

Courtesy of Jack Brounstein at The Audhumlan Conspiracy, a story from Reason magazine's blog about a bill designating a variety of tomato as the official state vegetable of New Jersey. Botanists and biologists will no doubt sneer, but SC finds this story strangely compelling.

Let's start by laying out the "scientific" definitions of fruits and vegetables. The catch here is that there really isn't such a definition for vegetables. But we'll deal with that in a second. First, fruit as defined by the Columbia Electronic Encyclopedia:

fruit, matured ovary of the pistil of a flower, containing the seed. After the egg nucleus, or ovum, has been fertilized (see fertilization) and the embryo plantlet begins to form, the surrounding ovule (see pistil) develops into a seed and the ovary wall (pericarp) around the ovule becomes the fruit.

Spend a moment to peruse the "types of fruits" link as well. They're not all exactly prototypical -- peas (when in pods) and beans fall into this classification, as do parts of the carrot plant that you aren't actually likely to find in the grocery store. Bananas are an interesting case -- the inside is a fruit on the technical definition given above, but the skin is actually tissue from the stem of the plant. For more detail on that, see here.

So then there are vegetables. Definitions vary considerably; the UC Davis Vegetable Research and Information Center defines vegetables as "the edible portion of a plant", which would seem to subsume fruits as well, although they then go on to classify vegetables by different plant parts other than ovaries; leaves, stems, roots, etc. A horticulturist observes that, from the standpoint of his profession (and as opposed to botanists), fruits are generally perennial, woody plants, while vegetables are annuals with soft stems. These aren't perfect definitions (watermelon is a vegetable on this reading of things), but as he points out, horticulture isn't an exact science.

Thus, from a purely scientific standpoint, tomatoes are fruits, and there are no such things as vegetables. But this is a plainly counterintuitive result to a typical speaker of English, who knows that he can walk into a grocery store, ask for the location of the vegetable aisle, and get a useful result. Our horticulturist friend points out that:

Tomatoes fit the vegetable category. They are planted every year. We use them in salads and, well, vegetable dishes during the main meal. I haven't had a tomato cake or pie and frankly, don't care if I ever do.

So, that's the foundation. Botanically, the tomato that you eat is the fruit of the plant, sort of like you were the fruit of your mother's womb. Botanically it IS a fruit. But remember, so is a cucumber, green bean pod, pumpkin and zucchini. Most of us probably wouldn't make a fuss over whether a green bean or cucumber is a fruit or vegetable. Common sense says they are a vegetable.

Linguists, especially of the computational variety, have a lot to say about defining terms for common-sense reasoning; that's most of the point of building ontologies, which are essentially taxonomies of concepts. At each level of the ontological hierarchy, the idea is to partition concepts into non-overlapping categories. In principle, this ought to be easy to do. So, for example, we might do as SUMO does, and start by saying that everything that exists is an entity of some sort. All entities exist as either physical or abstract objects. All physical entities are either objects or processes. Once you've got the hang of this sort of partitioning, it's easy to swing away with Occam's axe, hacking the entire world into neatly-defined categories.

However, as we've written before, once you get past the very most abstract levels, it can be anything but obvious what the correct partitions are. And even at those abstract levels, there's often substantial disagreement about terminology. This gets us to the root of the problem with tomatoes; when people argue about whether or not they're vegetables, they're really arguing about their choice of partitioning rules. We've got:

  • The botanist rule: mature plant ovaries are fruits, everything else is a vegetable.
  • The horticulturist rule: perennials are fruits, annuals are vegetables.
  • SC's rule: if your mom had to force you to eat it, it's a vegetable
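The three rules above can be written down as competing classifiers over the same items, which makes the disagreement explicit. This is a minimal sketch: the `Produce` record, the attribute values, and the function names are illustrative assumptions (in particular, who was forced to eat what), and treating SC's rule as a two-way split is also an assumption, since the rule itself only says what counts as a vegetable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Produce:
    name: str
    mature_ovary: bool   # the botanist's criterion
    perennial: bool      # the horticulturist's criterion
    forced_by_mom: bool  # SC's criterion (values here are guesses)


def botanist(item):
    return "fruit" if item.mature_ovary else "vegetable"


def horticulturist(item):
    return "fruit" if item.perennial else "vegetable"


def sc_rule(item):
    # Assumption: anything nobody had to force on you counts as a fruit.
    return "vegetable" if item.forced_by_mom else "fruit"


RULES = {"botanist": botanist, "horticulturist": horticulturist, "SC": sc_rule}


def classify(item):
    """Apply every partitioning rule to one item."""
    return {name: rule(item) for name, rule in RULES.items()}


TOMATO = Produce("tomato", mature_ovary=True, perennial=False, forced_by_mom=True)
APPLE = Produce("apple", mature_ovary=True, perennial=True, forced_by_mom=False)
CARROT = Produce("carrot", mature_ovary=False, perennial=False, forced_by_mom=True)
WATERMELON = Produce("watermelon", mature_ovary=True, perennial=False, forced_by_mom=False)

if __name__ == "__main__":
    for item in (TOMATO, APPLE, CARROT, WATERMELON):
        print(item.name, classify(item))
```

Apples and carrots get the same answer from everyone; tomatoes and watermelons are where the rules part ways, which is the whole argument in miniature.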

The U.S. Supreme Court took up the question in 1893, and decided that:

Botanically speaking, tomatoes are the fruit of a vine, just as are cucumbers, squashes, beans, and peas. But in the common language of the people, whether sellers or consumers of provisions, all these are vegetables...

In other words, they came down for SC's rule and ordinary language as the choice of partitions for plant products. The developers of the Cyc ontology, the world's largest collection of concepts, formalize it similarly:

(DEFINE-OKBC-FRAME vegetable-food :SUBCLASS-OF (food vegetable-matter) :INSTANCE-OF (class existing-stuff-type default-disjoint-food-type) :OWN-SLOT-SPECS ((documentation "A collection of edible stuff. Each element of Vegetable-Food is a foodstuff which is derived from a plant and is ordinarily considered a vegetable; e.g., a carrot (an instance of Carrot-Foodstuff), a potato (an instance of Potato-Foodstuff), a lima bean (a Bean-Foodstuff), a tomato (a Tomato-Foodstuff). Note: Vegetable-Food includes certain plant parts that are technically classified as fruits by botanists, but which are treated as vegetables in food classification -- such as tomatoes. These would, e.g., be found in the vegetables section of a supermarket, and they satisfy more of the axioms about vegetables than those about fruits (e.g., sweetness.)")) ) [emphasis added by me -- SC]

It can't be put more succulently succinctly -- tomatoes really do "satisfy more of the axioms about vegetables". Botanists may cringe, the editors of Reason may scoff, but from the standpoint of a linguist, three cheers for the New Jersey Assembly for standing up for common sense.

January 06, 2005

It depends on the meaning of "search"

In discussing the Edge question of the year yesterday, SC set aside one article for separate discussion, because it's the only one that he can really claim deep professional expertise in. That would be Marti Hearst's assertion that "The Search Problem is solvable".

As a matter of believing something that can't be proved, this isn't exactly a strong claim. Your host doesn't doubt it in the slightest, depending on what is meant by the "Search Problem", which is rather ill-posed. Reading Prof. Hearst's extended comments, it's clear that there are actually several different problems under discussion.

The problem that SC takes for granted as ultimately solvable is "Advances in computational linguistics and user interface design will eventually enable people to find answers to any question they have, so long as the answer is encoded in textual form and stored in a publicly accessible location." As far as your host is concerned, this is reasonably close to true already as long as you're looking for something in English, and have a little patience in combing through Google's results. The best answer isn't always on the first page, but SC would wager that if you've got a question that you can't answer with Google + patience, it probably doesn't meet one of the two conditions Prof. Hearst laid down.

But then there's another version of the problem, implied in a statement Prof. Hearst makes about translation systems. She didn't restrict her original posing of the problem to English, and this belief requires a bit more faith than the previous one. Cross-lingual information retrieval is not at all up to the standards of monolingual information retrieval; the last time it was evaluated at TREC (Text REtrieval Conference), the benchmark of success was getting back documents on the same topic, not necessarily answering specific questions. Defining the task is an ongoing problem; see the Cross-Language Evaluation Forum's homepage for current work.

Another version of the problem is what we might call the "answer on a plate" goal, or, "Google + patience, minus patience". Question-answering is the technical term, and that's a field that's advancing rapidly (again, in the monolingual English context; here's the TREC page). But unlike the goal of being able to get documents on a given topic across languages, it's less clear that question-answering is important to search. If I pull up Google because I want to know the total land surface of the Hawaiian islands, that's a case where a question-answering system would improve on having to read through a page full of links. If I just want to read about Hawaii, though, and don't have a specific question, then it's not clear that Google doesn't already represent a satisfactory solution.

One area that Prof. Hearst pays considerable attention to is the notion of user interface design. Strictly speaking, this doesn't affect the problem of finding answers to questions. But it does affect the happiness of users. She's quite right to point out that search engine providers have "enormous, albeit somewhat impoverished, repositories of information about how people ask for information". Any blogger who looks at their referral logs knows that there is a wide range of ways in which people reach the same pages. Some people seem comfortable feeding Google whole questions in natural language, even though they have to know by now that Google will strip out the punctuation and many grammatical-function words. Other people just put in a list of terms with no operators. Some people (like SC) insist on explicit operators before every term in a query, even though they may not always be necessary (if you aren't using "-", it really doesn't matter much). Some people always use grammatical phrases, and some people put their terms in apparently random order. Google is probably using the most robust technique to handle all these variations -- throw out everything but keywords -- but people wouldn't use all these different variations if they weren't expressing (sometimes subtle) differences in how they wish they could interact with search engines.

Like SC (see here), Prof. Hearst is skeptical about the Semantic Web -- but she seems fairly positive on the contribution that ontologies can make to search engines. Your host is only half-sure about this -- while he likes ontology-building, and believes it has all sorts of uses, it's not clear to him what purpose ontologies serve in searching unless it's for Semantic Web-like uses. If you have a comprehensive organization of concepts that's only relevant to you and perhaps some particular community of interest, then it can't be useful for browsing the data put out by other people and communities.

Finally, Prof. Hearst suggests that ultimately, we'll want something better than the text box interface for searching. SC isn't so sure about that. It's the simplest user interface possible -- much like the command lines that SC vastly prefers to rodent-based window systems -- and in SC's opinion, that makes it attractive to users of all levels of technical competence. The presentation of results might be improvable in some respects, but unless this sort of thing is your idea of an improvement, SC isn't sure that pages full of links will ever be wholly replaced. On the other hand, Prof. Hearst has some interesting demos up that represent useful enhancements to the basic text-box paradigm. You can see the ontology at work underneath her projects, which means the technology can't be transferred without considerable new effort. But maybe that's not such a bad thing. Maybe the real future is in meta-search engines that then direct you to purpose-built search engines which are more robustly engineered for what you're really interested in. The history of search engine development so far is the development of bigger and better general-purpose techniques for culling information from essentially universal databases. Looking at how elegant Prof. Hearst's work is when it's restricted to one domain, though, it might have been more daring for her to predict the demise of general-purpose search engines.

December 08, 2004

Accoona? Gesundheit!

Radagast wrote recently on signs in Afghanistan congratulating Hamid Karzai that are written in English instead of one of the local languages. In reply, a commenter of his noted that "In many countries (Japan comes to mind), English characters look "cool." It doesn't matter what the text says, just as long as there's some there."

In apparently unrelated news, former President Clinton yesterday participated in the unveiling of a new search engine, known as Accoona. Accoona promises to "SuperTarget" your searches, a feat accomplished through three strategies. In Accoona's order, they are:

  1. Using "the meaning of words to get you better searches", which SC will bet $100 means dictionary or thesaurus-based query expansion. Frankly, this is one of the older tricks in the book, and one of the biggest cliches of the field.
  2. The real trick of the bunch, "Accoona artificial intelligence allows the user to highlight one keyword, and will rank the search results starting by every page where the meaning of that one keyword is more important than the meaning of the other four keywords". We'll explain what this has to do with Radagast's commenter in a minute.
  3. Finally, they will push paid links at you. Don't take my word for it, take theirs: "Accoona's Artificial Intelligence merges information from the web and the Accoona Business Information Database in real time". Big whoop.

The notion of ranking the search by weighting keywords is more interesting than anything else they're pushing. So SC set out to test it. But where could he find collections of suitably generic English words that would nevertheless have very particular best matches as search results?

Engrish! No, that's not a typo, it's a slang term denoting "the humorous English mistakes that appear in Japanese advertising and product design". More to the point, the same authors comment that "English is used as a design element in Japanese products and advertising to give them a modern look and feel (or just to "look cool")". The genre is just bursting with examples of meaningless phrases strongly associated with only one or two unique items, which makes it a great testbed.

So to start out, SC tried the slogan from the Cartoon Network's latest and weirdest Japanese import, a show called "Hi Hi Puffy AmiYumi". Rather than search for the show by name, your host used the slogan which appears in its ads, "happy fun cartoon rock band invasion". As can be seen from the first page of results, no sign of the show. Choose to "SuperTarget" rock, though, and you get two relevant hits in the top 5, one of which is the Cartoon Network homepage. Choose "band" instead...and you get the exact same links in the same order. Guess what happens if you choose any other single word from the list?

Hoping to get a more interesting demonstration, your host went back to the Engrish page and got the slogan for Wonda coffee: "coffee with tasty aroma for refined adults". The unrefined search for this product doesn't put it anywhere on the first page. Target "refined", and the first five hits reference Wonda explicitly, or contain clear puns on the slogan. Again, though, picking any other individual term from the group yields the same results as picking "refined". What's more, Accoona shows a rather disturbing tendency in spelling correction, suggesting "coffee with tasty roma for refine adult" as an improvement on the slogan as entered originally. And Google has no difficulty putting a half-dozen Wonda hits in their first 10 results, without the use of any operators, quote marks, or other attempts to direct the search.

So what's going on with Accoona's "artificial intelligence"? SC's best guess is that Accoona uses a vector-based representation of the original documents, where each word represents a dimension in a very high-order space (as many dimensions as words!), and that the "SuperTargeting" is nothing more than cranking up the weight of one dimension of the query over the others. This strategy is wholly ineffective if all of the search terms occur with about the same frequencies in most/all of the documents being searched, which would account for the identical results as weightings changed. Your host can't conclusively prove that this is what's going on, but got some strong supporting evidence by searching for the name of this blog.
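To make that guess concrete, here is a minimal sketch of a vector-space ranker with one boosted query term. It is an assumption about the general technique, not Accoona's code: the function names, the cosine scoring, the boost factor of 5 and the toy documents are all invented for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words vector: one dimension per distinct word."""
    return Counter(text.lower().split())

def cosine(query_weights, doc_vector):
    """Cosine similarity between a weighted query and a document vector."""
    dot = sum(w * doc_vector.get(t, 0) for t, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_vector.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

def rank(query, docs, boost_term=None, boost=5.0):
    """Rank document names by similarity, optionally boosting one query term."""
    weights = {t: 1.0 for t in query.lower().split()}
    if boost_term is not None:
        weights[boost_term] = boost  # hypothetical "SuperTarget" step
    scored = [(cosine(weights, tf_vector(text)), name) for name, text in docs.items()]
    return [name for score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Toy corpus 1: every query term occurs equally often within each document.
UNIFORM_DOCS = {
    "ad_page": "happy fun cartoon rock band invasion",
    "tv_listing": "happy fun cartoon rock band invasion tonight at eight",
    "fan_forum": "happy fun cartoon rock band invasion "
                 "happy fun cartoon rock band invasion discussion thread",
}

# Toy corpus 2: the two query terms have very different weights per document.
SKEWED_DOCS = {
    "linguistics_blog": "semantic semantic semantic compositions",
    "poetry_site": "compositions compositions compositions semantic",
}
```

With `UNIFORM_DOCS`, boosting "rock" or "band" multiplies every document's score by the same factor, so the order never changes. With `SKEWED_DOCS`, boosting "semantic" puts the linguistics page first and boosting "compositions" puts the poetry page first, which matches the behavior observed in the tests described here.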

Searching for just "semantic compositions" got back a number of references to this page, 6 of the 10 being linguistics blogs (including this one). Emphasizing "compositions" cut out all of the other linguistics blogs except for Tenser, Said the Tensor, and returned a number of other links which didn't come back in the first search. Emphasizing "semantic" removed TstT, and changed some of the other returned links, bringing up more Semantic Web pages, and removing a couple of links to sites regarding poetry and rhetoric. So Accoona is capable of recognizing disjoint senses of words to some extent, but only when the differences of weight within documents are very strong. Interestingly, some of the returned pages did not contain one word or the other as written; this one contained "composition" but not "compositions", so clearly stemming is part of the search strategy. Such a technique makes sense in order to reduce the number of elements within the vectors representing each document, but considering how crude the rest of the technology is, the accompanying reduction in selectivity is not helpful.
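The stemming trade-off can be shown with a deliberately crude example. The suffix-stripping rule below is a made-up stand-in, far simpler than any real stemmer and not a claim about what Accoona runs; it only illustrates how conflating word forms shrinks the vocabulary while blurring distinctions.

```python
def crude_stem(word):
    """Strip a plural-looking final 's' (hypothetical rule, for illustration only)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def vocabulary(words, stem=False):
    """Distinct vector dimensions needed for a list of words."""
    return {crude_stem(w) if stem else w.lower() for w in words}

WORDS = ["composition", "compositions", "semantic", "semantics"]
```

Unstemmed, `WORDS` needs four dimensions; stemmed, it needs two. The smaller vectors come at a price: a page about "semantics" and a page that merely uses the adjective "semantic" are now indistinguishable to the ranker.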

To conclude this examination of Accoona on a rather blunt note, Excite did the same thing in a more sophisticated manner, AltaVista did it much more comprehensively, and Google's suite of algorithms outstrips it considerably. The only real reason to use Accoona is the large volume of China-specific business data in its paid-search directory, a feature called out in the AP article linked above. Accoona has a politically-granted monopoly on this information, also noted in the article, and so it will likely succeed as a business for reasons having little-to-nothing to do with its technology. Eckhard Pfeiffer, the CEO of the new company, was CEO of Compaq when it developed AltaVista, and it is hard to believe that he does not recognize that he led a company with a much better product in the same space as recently as 1999. Unless he is completely incompetent ([a notion well-supported by Compaq's stock price after they bought your Digital Equipment -- ed.]), he could not possibly have gotten behind this venture unless he expected it to succeed for reasons having little to do with being a genuinely new and better search engine, which it most definitely is not.

August 12, 2004

Wanted: Named Entity Recognition

In the September issue of Car and Driver -- and couldn't they have waited until at least one full week of August had gone by before putting the September issue in the mail? -- there's a carefully worded story about Mitsubishi's interesting warranty practices.

For those readers who have no interest in cars, or who, like SC, prefer big American muscle cars to rice rockets (and thus ordinarily couldn't care less), the background is that Mitsubishi has made a successful niche market for themselves by taking about the least distinguished econobox on the road, and turning it into the most cheaply-made serious performance vehicle on the road. It's an achievement, of sorts.

Unsurprisingly, the buyers of the Evo, as the hot-rodded Lancer Evolution is generally known, like to drive it aggressively. SC personally suspects that the majority of Evo racing is actually done at stoplights, against competitors who have no idea that they're being "raced", but suffice it to say that some buyers actually compete in formal events, with things like clocks and flags. Needless to say, racing puts considerably more stress on a car than typical street driving (those living in Orange County, CA may disagree), and since the Evo comes with a warranty, one might expect Evo owners to try to get Mitsubishi to bear the cost of their behaviors.

Thus, the story:

In June, several Mitsubishi Lancer Evolution owners began posting notices on Evo-enthusiast Web sites saying their warranties had been canceled because they'd participated in timed racing events or installed aftermarket goodies in their cars. Owners complain that Mitsubishi used the Internet to dig up the names of these offenders. Owners found out about this when they visited dealers for repairs and were told that their warranties had been restricted because of the two activities.

Typical of all carmakers, Mitsubishi's warranty states that "problems or failures related to racing, alteration, and/or vehicle modifications are not covered conditions".

Now here's where things get interesting:

Mitsubishi, meanwhile, denies that it proactively searches in hopes of voiding warranties. Responding to the complaints, the company said, "Mitsubishi does not have any automated Web search system looking for Lancer Evolutions involved in racing events".

Well that pretty much settles it. They don't have an automated system, but they wouldn't put it like that if they didn't have a manual one in place. In case you think that SC needs to put down his copy of 1984, the article later includes the comment that "When a warranty claim is questioned, Mitsubishi concedes that it may launch an investigation, which can include online searches for evidence the car was modified or run in a timed competition. The company adds, however, that owners are always given the benefit of the doubt". Baloney. With repair bills running as high as $8000 per incident (a figure cited anecdotally in the article, but perhaps above their average warranty claim), they've probably got serious incentive to figure out whose claims to deny as soon as the cars roll off the lot. And frankly, it's hard to blame them.

This ought to represent a serious opportunity for someone. Named entity recognition generally works best when it's targeted at specific domains. Internet bulletin boards represent a relatively restricted genre of text, so it ought to be easy to control for all the cutesy spellings that follow from them. It would be considerably harder to develop an application that can do a solid job linking evidence of bragging about vehicle performance to VIN numbers (unique IDs assigned by law to every car on the road), but to a company selling 20,000 to 30,000 units of a performance car a year, and getting even $3k in bills per car, such an application should be worth at least $50,000 -- annually. And Mitsubishi is hardly the only company out there selling hot-rodded factory-built vehicles; the Subaru WRX is generally compared head-to-head with the Evo. SC suddenly isn't feeling so afraid to sell out his fellow Ford owners; did somebody say Mustang Cobra? It's too bad that Toyota is going to stop selling the MR2 and Celica, also popular among the "ricer" crowd. But that's OK -- SC is too busy salivating at the thought of what Honda might pay to make racing claims go away on the most popular ricer car in North America, the Civic.
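As a rough idea of the easy half of that job, here is a gazetteer-style sketch that maps forum spellings onto canonical model names. Everything in it is hypothetical: the `MODEL_PATTERNS` table, the particular spelling variants it accepts and the `find_mentions` helper are assumptions for illustration, and linking a mention to a specific owner or VIN is not attempted.

```python
import re

# Hypothetical gazetteer: canonical model name -> pattern for forum-style variants.
MODEL_PATTERNS = {
    "Mitsubishi Lancer Evolution": re.compile(
        r"\b(?:lancer\s+)?evo(?:lution)?(?:\s*(?:viii|ix|8|9))?\b", re.IGNORECASE
    ),
    "Subaru WRX": re.compile(r"\bwrx\b", re.IGNORECASE),
    "Ford Mustang Cobra": re.compile(r"\b(?:mustang\s+)?cobra\b", re.IGNORECASE),
}

def find_mentions(post):
    """Return the sorted canonical names of every model mentioned in a post."""
    return sorted(name for name, pattern in MODEL_PATTERNS.items() if pattern.search(post))
```

A post such as "ran my EVO8 at the strip last nite, smoked a WRX" yields both the Evo and the WRX, while "evolutionary biology" yields nothing because the pattern insists on a word boundary after the model name.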

June 10, 2004

A possible source of Google error?

For reasons SC can't even recall at this point, a few days ago, he took it into his head to see if he could find out who makes the black hats traditionally worn by Orthodox Jews. After a lot more difficulty than one might expect (for religious reasons, various Orthodox sects do not make use of the Internet, and so few people sell to them that way), your host uncovered two names, Borsalino and Huckel. Being mildly curious, he then attempted to see what he could find written about the two manufacturers. Borsalino turns up about 100k hits through Google, many of which are relevant -- links to various hat retailers and fashion magazines, mixed in with a few other instances of that name.

Huckel, however, turns up only about 15k hits, almost all of which appear to involve either various works by a chemist named Erich Huckel, or misspellings of "huckleberry". Naturally, in order to restrict the search further, SC then added another term, which really made things difficult: "hat".

That had the desired effect of reducing the number of pages, alright, to about 600. But as you can see from looking at the search results, now just about everything that comes back is in German. SC doesn't speak a word of German ([except for schadenfreude, hamburger, and bier -- ed.]), but it quickly becomes apparent that "hat" is a valid German word. Google would seem to have an answer to this problem by providing a link to "search for English results only", but at least in SC's results, the second page contains a German hit. For what it's worth, at least this enabled your host to track down Huckel, apparently a hat maker from Novy Jicin, Czech Republic, which claims to have the only hat museum in the world.

This raises an interesting question about the reliability of search statistics compiled from web search engines (as opposed to using defined corpora like the ones provided by the Linguistic Data Consortium). LDC data can pretty much be guaranteed to be monolingual in the cases where it's supposed to be, but as this example demonstrates, search engine data cannot.

Is this likely to be a major source of error? Frankly, no -- one or two hits different in 127 (the number of hits once "English only" was picked) really doesn't affect frequency estimates to any significant degree. Given the order of magnitude of error in this example, it's no more significant than the changes that can be observed by running the same search a day or two apart, as sites appear and disappear. And for every word that does have an identical string in other languages, there are probably hundreds more that don't. So it would be wrong to assume that "wrong language" pages are contaminating search engine-based linguistic studies in a way that needs to be addressed, but it's nice to know where potential sources of error are coming from.

April 20, 2004

How to lie with word frequency statistics

Recently, a story has been making its way around the blogosphere concerning the results of searching for the word "Jew" with Google. The number one result when the tale began was an anti-Semitic website run by a neo-Nazi organization. In an attempt to get this off the top of the list, a number of bloggers have been linking to the "Jew" entry at Wikipedia. Initially, Google defended itself on the grounds that the ranking was done automatically, and that they absolutely, positively would not interfere with the sacredness of their page-ranking algorithm. Then, it turned out that Google had in fact fiddled with their algorithm to deal with the fact that child pornography turned up near the top for searches of the word "Chester", a fact which bothered the denizens of Chester, England.

However, Google is now not above fiddling with the "Jew" results, not to fix them, but to post a disingenuous disclaimer that only pops up when you search for "Jew". Since SC doesn't wish to distort Google's statistics on searches for Jews any more than they do, here's a direct link to the explanation.

First, Google makes a comment that SC doesn't actually disagree with, although he doesn't have any statistics to back it up:

If you use Google to search for "Judaism," "Jewish" or "Jewish people," the results are informative and relevant. So why is a search for "Jew" different? One reason is that the word "Jew" is often used in an anti-Semitic context. Jewish organizations are more likely to use the word "Jewish" when talking about members of their faith. The word has become somewhat charged linguistically...

This strikes your host as plausible. Searching for "the Jew Sharon", "the Jew Wolfowitz", "the Jew Perle", and "the Jew Kristol" all turned up either anti-Semitic websites, quotes of same, or satirical work intended to mimic anti-Semitic behavior. SC does not refer here to people writing in disagreement with the aforementioned people's beliefs (which is not to say that there are no anti-Semites who disagree with them), but rather to writings insinuating the existence of a sinister cabal ([time to brush up on your Kabballah -- ed.]). So your host will accept this as valid -- but it's also sort of irrelevant.

The Google explanation goes on to state:

Someone searching for information on Jewish people would be more likely to enter terms like "Judaism," "Jewish people," or "Jews" than the single word "Jew." In fact, prior to this incident, the word "Jew" only appeared about once in every 10 million search queries. Now it's likely that the great majority of searches on Google for "Jew" are by people who have heard about this issue and want to see the results for themselves.

This is offered as explanation for why "Jew" turns up an anti-Semitic site. But it's completely off-point. Google's ranking algorithm doesn't rank pages by how often terms within them are searched for; it ranks them by how often they're linked to, and how often the term actually appears. The fact that "Jew" may only be searched for once in 10 million queries doesn't tell us why the offending site is ranked so highly. Aside from that, though, without any comparable statistics on the frequency of searches for "Jewish", "Judaism", etc., there's no way to tell whether or not 1 in 10 million is an unreasonably small number of searches to deal with. Google's frequency statistics going by simple number of documents returned do bear out the claim to some extent -- there are 1.8 million hits for "Jew" and 13.2 million for "Jewish". But again, without any kind of statistics to provide context about the number of searches overall, the number of searches for other Judaism-related words, and maybe some time series data on these points as well, there's no way to tell if searches for "Jew" are effectively getting the anti-Semitic site in front of people or not.
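The point about links can be illustrated with a textbook-style PageRank iteration. This is a simplified sketch under stated assumptions, not Google's production algorithm: the damping factor of 0.85, the handling of pages with no outgoing links and the toy link graph are all choices made here for illustration. Note that query frequency appears nowhere in the computation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three blogs link to one page, one blog links to another.
TOY_LINKS = {
    "blog_a": ["encyclopedia_entry"],
    "blog_b": ["encyclopedia_entry"],
    "blog_c": ["encyclopedia_entry"],
    "blog_d": ["other_site"],
}
```

In this toy graph the page with three inbound links outscores the page with one, no matter how often or how rarely anyone searches for either. That is also why the bloggers' response of linking to the Wikipedia entry can work at all.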

Google isn't doing anybody any favors by trying to recast this as a minor problem because of the number of searches performed. Let's stipulate that it's economically infeasible -- and from a free speech standpoint, undesirable -- to have a team of editors do nothing but check that search terms bring back only inoffensive results ([so if they did other things as well, it would be OK? -- ed.]). Let's also stipulate that while the moral lines are pretty clear in this case, in many other cases, it's a lot harder to decide if a result ought to count as offensive (imagine if the New York Times decided to demand that Google remove all links to Andrew Sullivan's site, on the grounds that his criticisms offend their employees). Even granting these facts, their explanation simply has nothing to do with the actual mechanics of how the Web is structured (or perhaps more accurately, how their algorithms assign structure to it). If the standard is now that searches which occur frequently enough merit action, then it will not be long before activists of all stripes launch campaigns to boost the profile of particular searches for a long enough time to force Google to do something about them. If it's true that enough pages with "Jew" in them link to the offensive site, and their algorithm is doing its job, then they ought either to refuse all requests, including ones like the city of Chester's, or to treat them all equally in some other way. If SC were advising Google, he'd tell them to just quietly handle these requests on an ad hoc basis -- a little common sense could probably go a long way towards avoiding a lot of bad PR, much of which is now deserved due to Google's inconsistent behavior.

UPDATE: After additional discussion with Radagast, as well as actually reading the original Chester site (warning: it's disgusting), your host feels moved to clarify what he means by "handling". We'll start by paraphrasing an argument that the Wall Street Journal recently made (link courtesy of Seth Friedman, who turns up near the top of searches for Chester in this context now). It's not at all clear that the site is illegal, as Google claims; while the content is repellent, it may not actually violate most pornography laws (there are no pictures, real or simulated, and it's arguably satirical in nature). It might be more clearly illegal under anti-obscenity laws, but those vary by jurisdiction; by playing the "the law made us do it" card, Google opens themselves to vulnerability on the question of whose laws they're claiming to be bound by. Given that Google hasn't made it impossible to locate the "Chester" page -- all they've done is remove an association in their database between a keyword and a URL -- it's fair to say that "Bad Chester" hasn't actually been censored.

So what's an equitable solution? Ideally, it should: 1) not involve censorship, 2) preserve Google's reputation as a neutral arbiter of searches, 3) be legally defensible, and 4) minimize the use of googlebombs as a response to this sort of problem (assuming that Google agrees that googlebombing is detrimental to their goal of accurately reflecting the relevance of content). SC is not above telling people -- himself included -- to suck it up when they encounter speech they dislike. Therefore, his original thought was that Google ought to just run complaints by a lawyer or ethicist, and then perhaps engage in a little unannounced delinking of specific keywords here and there, as they clearly have done before. However, while such a solution would largely meet the tests laid out above, perhaps the policy needs to be more cut and dried in order to keep their PR department from turning into a permanent crisis center. So here's SC's stab at a formal policy: given that the intent of googlebombing is to either raise or lower a specific link in the rankings, and in this case, the goal is to lower it, simply lower the weight of the page ranking for a sufficiently hotly disputed page. Google has shown a willingness to allow legal opinions to influence their judgment; if their legal team judges that a page's appearance is clearly grounds for prosecution -- delete the reference in regard to the specific keyword as they did a la Chester, or the German version of Google (see the WSJ article for details). If the page's appearance is clearly not illegal, simply lock it out of being the first returned listing, which seems to spawn particular ire. Irritating pages may well continue to show up as result #2, but freed from the emotional baggage of seeing something offensive as #1, users might well recognize that searches for a particular term are likely to bring back a diverse group of pages, including ones they don't like. 
If that fails, announce that the experiment is over, and go back to a policy of strict neutrality, undoing all non-algorithmically-derived rankings in the database.
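Stated as a procedure, the proposed policy is small enough to fit in one function. This is only a sketch of the rule as described above: the function name, its arguments and the idea of representing one keyword's results as an ordered list of URLs are assumptions made for illustration.

```python
def apply_dispute_policy(results, disputed_url, clearly_illegal):
    """Adjust one keyword's ranked results for a hotly disputed page.

    results: URLs in algorithmic rank order (hypothetical representation).
    """
    if disputed_url not in results:
        return list(results)
    if clearly_illegal:
        # Delete the keyword-to-page association outright.
        return [url for url in results if url != disputed_url]
    adjusted = list(results)
    if adjusted[0] == disputed_url and len(adjusted) > 1:
        # Otherwise just lock the page out of the first position.
        adjusted[0], adjusted[1] = adjusted[1], adjusted[0]
    return adjusted
```

A disputed page that is legal and ranked first drops to second; one ranked lower is left alone; one judged clearly illegal disappears from that keyword's results only.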

It's easy to foresee a string of stories coming out for an indefinite, but long, period of time, demonstrating that Google is even more involved in censorship than is already known. A policy like the above represents an effort to reconcile the reality of what they've already done with the goals of running a reasonably transparent operation and preserving their reputation. Is it ideal? Absolutely not. But they're the ones who chose to proclaim one policy before while observing another, so they might as well end the charade and reestablish some clear user expectations. Radagast thinks it might be realistic for them to simply fess up and go straight back to an uncensored database. I'm not sure they're in a position to do that, though, because they've tried to sell that story before. That's why your host thinks they might as well experiment with another policy -- if it keeps Google from being the target of further embarrassment, then they can call it a success and stick with it. If it fails, then they've got an irrefutable argument that they've tried to be responsive to complaints, and that the results are simply intolerable.

So we'll close with an observation from Southern California talk-show host Larry Elder:

Q. What is the Elvis Factor? I once read that 10 percent of the American people think Elvis is still alive, and 8 percent believe that if you send him a letter, he will answer it. That's the Elvis factor. You have to remember that, no matter what, 10 percent of the people are probably not capable of clear, rational thought.

As with Elvis, so with complaints. If a transparent policy of database editing proves unworkable, but the critics still won't let up, then at some point they just have to be ignored. Unfortunately, their own behavior has foreclosed that option for now. It's a shame that Google has put itself in the position of making these criticisms credible; no matter what policy they end up adopting (including sticking with the present one), they've done a possibly irreparable hatchet job on their reputation for being the most reliable, neutral guide to Web content available.

(Edited at 2:09 a.m. on 4/20/04 to include additional content.)

April 19, 2004

A (very) short history of speech synthesis

Without computers, there would be no computational linguistics. Fred Jelinek (scroll down) might think this was a bad trade, but SC's going to offer some evidence that it has resulted in a generally more humane approach to the field.

Once upon a time, without computers around to do speech synthesis, people used elaborate contraptions made up of organ bellows, reeds (think woodwinds), and even scarier devices. The first known successful speech synthesizer was produced by Christian Kratzenstein in 1779, and made use of a bellows and reed to produce 5 vowels. Later (see same link), Wolfgang von Kempelen used a leather imitation vocal tract to achieve even more articulatory control. Alexander Graham Bell used castings of human skulls and an artificial tongue to do his own brand of speech synthesis. Left uncomputerized, obviously the trend would eventually have led to using real human parts.

Fortunately, after spectrum analysis was invented, people discovered that there were other approaches besides reconstructing the human speech system. Thus, in 1939 Bell Labs demonstrated a device called the "Voice Operating Demonstrator", or "Voder", which provided an electrical waveform as a source (to mimic the vocal folds), ran it through some filters that simulated the resonances of the mouth as they modify the airstream, then ran it through an amplifier and loudspeaker. This alternate track, aside from saving human jawbones from some would-be Bell Jr., led inexorably towards further mechanization, culminating in the peak of human technological achievement in the '80s, the Speak-n-Spell. Readers feeling that SC may have omitted something along the way can comfort themselves with the additional information provided in the links.
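For readers who want to hear the source-and-filter idea rather than just read about it, here is a minimal sketch in Python. It is not a model of the Bell Labs device itself: it assumes a bare impulse train as the "vocal fold" source, a cascade of simple two-pole resonators as the "mouth", and rough textbook formant values for an "ah"-like vowel. The function names and all numeric values are illustrative choices, not anything taken from the historical hardware.

```python
import math

def impulse_train(f0_hz, sample_rate, n_samples):
    """Crude voiced source: one unit pulse per pitch period."""
    period = round(sample_rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, center_hz, bandwidth_hz, sample_rate):
    """Two-pole resonant filter standing in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2 * math.pi * center_hz / sample_rate
    b = 2 * r * math.cos(theta)   # feedback from y[n-1]
    c = -r * r                    # feedback from y[n-2]
    a = 1 - b - c                 # input gain chosen so the DC gain is 1
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0_hz=120, sample_rate=16000, duration_s=0.5):
    """Source -> filters -> 'amplifier' (here, just peak normalization)."""
    n_samples = int(sample_rate * duration_s)
    signal = impulse_train(f0_hz, sample_rate, n_samples)
    for center_hz, bandwidth_hz in formants:
        signal = resonator(signal, center_hz, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]

# Assumed (center, bandwidth) pairs in Hz for an "ah"-like vowel.
AH_FORMANTS = [(700, 80), (1100, 90), (2450, 120)]
samples = synthesize_vowel(AH_FORMANTS)
```

The resulting list of floats can be written out with the standard `wave` module if you want to listen to it; changing the formant values changes the vowel, while changing `f0_hz` changes the pitch.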

All this is a long prelude to a story borrowed from Cronaca, about a tiger-mauling in Bengal that inspired shock, revulsion, pottery -- and an early speech synthesizer. Cronaca quotes the Sotheby's auction catalog for one of the pottery items as follows:

Tipu Sultan of Mysore derived particular pleasure from the young man's misfortune and commissioned his mechanical toy, the Man-Tyger-Organ. Housed within a life-size carved and painted wood model of a tiger attacking a European was a mechanical pipe organ which, when cranked, emitted the growls of the tiger and the screams of its victim.

The organ is now on permanent display at the Victoria and Albert Museum in London, England. Perhaps fortunately, speech synthesis has not always since needed to rely on reconstructing horrific events in order to drive new developments in the field. As your host said before, we can thank computers for introducing a more humane approach to computational linguistics.

As a bit of macabre apocrypha, SC has heard a story from several people, but can't find definitive proof (and thinks they're all capable of pulling his leg) that the great speech synthesis researcher Dennis Klatt arranged for the synthesizer he developed to read the eulogy at his funeral in 1988.

April 12, 2004

Beane Counting

Andrew Sullivan commented Friday on a blog ranking scheme which shows his site to be the second "most influential" among political blogs. The scheme is interesting, especially as it has applications across a variety of domains.

As described by its author, the technique works as follows:

I went to Technorati, Daypop, Blogstreet, and the Truth Laid Bear Ecosystem on Tuesday and counted how many links went to the top 100 POLITICAL blogs listed. Then I went through and weeded out any blog that didn't make the top 100 on at least 3 of the 4 measuring tools.

At that point, there were only 29 blogs left and I took their best 3 scores (or their only 3 scores if that was the case) and added them up. For example, if "Blog X" was the 3rd, 12th, 19th, & 26th, most linked to blog on the 4 top 100 pages I used, the 26th place finish would be dropped and "Blog X" would get a score of 34. Blogs with a score <34 would be ahead of "Blog X" and blogs with a score >34 would be ranked behind it.
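The scoring rule quoted above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described: the function names are made up, and every blog other than the "Blog X" example from the quote is invented data.

```python
def best_three_score(ranks):
    """Sum the three best (lowest) ranks; None if the blog is on fewer than 3 lists."""
    if len(ranks) < 3:
        return None
    return sum(sorted(ranks)[:3])

def rank_blogs(rank_lists):
    """Return (blog, score) pairs, lowest score first, dropping ineligible blogs."""
    scored = {name: best_three_score(ranks) for name, ranks in rank_lists.items()}
    eligible = [(name, score) for name, score in scored.items() if score is not None]
    return sorted(eligible, key=lambda pair: pair[1])

# "Blog X" comes from the quoted description; the others are hypothetical.
rank_lists = {
    "Blog X": [3, 12, 19, 26],  # 26th place is dropped: 3 + 12 + 19 = 34
    "Blog W": [2, 4, 9, 11],    # 11th place is dropped: 2 + 4 + 9 = 15
    "Blog Y": [1, 2, 40],       # only three lists, so all three count: 43
    "Blog Z": [5, 7],           # on fewer than three lists, so weeded out
}
ranking = rank_blogs(rank_lists)
```

Note that a blog with two excellent finishes and one poor one (like the hypothetical "Blog Y") can end up behind a consistently middling one, which is exactly the smoothing effect the scheme is after.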

This reminded SC of a couple of similar evaluation techniques. One favorite of his is Rob Neyer's "Beane Count", named for the general manager of the Oakland A's, Billy Beane. The Beane Count is simply the sum of a team's rankings in four categories: home runs hit/allowed and walks drawn/allowed. A good team will hit a lot of homers and earn a lot of walks, without giving up too many of either; thus, the Beane Count is a proxy for a team's all-around ability to do the things that produce marginal runs, perhaps boosting them over the top. It's not a perfect indicator of a team's ability to win any one game -- last year's World Series champions, the Florida Marlins, were only 8th in the National League according to this system -- but it correlates well with who is in fact atop the standings (here's last year's Beane Count, along with the actual standings; it's too early to be useful for 2004).

A nifty application of this methodology to natural language processing is Manning and Klein's lexicalized, factored parser. It gets better results than a conventional statistically-trained parser by making use of both a probabilistic context-free grammar and a lexical dependency model, and then making inferences about which is right in each case.

Beane Counting can even be done in hardware. One technique popular in high-end digital audio electronics is to parallel a couple of digital-to-analog converters, and subtract the difference of their outputs from the analog signal that ultimately goes out the back panel. This cancels some of the random distortion specific to each converter, while retaining the common signal (which is presumably correct). Done correctly, each doubling of the number of D/A converters can improve the system's signal-to-noise ratio by about 3 dB, which is a meaningful improvement, but only when cost is low on your list of priorities.
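The 3 dB figure can be checked with a small simulation. The sketch below assumes the simplest possible model, which real converters only approximate: each converter outputs the correct signal plus its own independent Gaussian noise, and the outputs are averaged. Under that assumption, averaging N converters cuts noise power by a factor of N, so the expected gain is 10·log10(N) dB, or about 3 dB per doubling. All names and parameter values here are illustrative.

```python
import math
import random

def snr_db(signal, noisy):
    """Signal-to-noise ratio in dB, treating (noisy - signal) as the noise."""
    signal_power = sum(s * s for s in signal)
    noise_power = sum((y - s) ** 2 for s, y in zip(signal, noisy))
    return 10 * math.log10(signal_power / noise_power)

def simulate_converters(n_converters, n_samples=20000, noise_std=0.01, seed=0):
    """Average n_converters noisy copies of a sine wave and report the SNR."""
    rng = random.Random(seed)
    signal = [math.sin(2 * math.pi * n / 100) for n in range(n_samples)]
    # Assumption: each converter's error is independent Gaussian noise.
    outputs = [[s + rng.gauss(0, noise_std) for s in signal]
               for _ in range(n_converters)]
    averaged = [sum(column) / n_converters for column in zip(*outputs)]
    return snr_db(signal, averaged)

for n in (1, 2, 4, 8):
    gain = simulate_converters(n) - simulate_converters(1)
    print(f"{n} converter(s): measured gain {gain:.2f} dB, "
          f"expected {10 * math.log10(n):.2f} dB")
```

The simulation only cancels error that differs between converters. Any distortion the converters share (the same nonlinearity in every chip, say) survives the averaging untouched.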

Beane Counting techniques only really work if you've got a couple of individual models which are each pretty good to start with. While differences definitely exist among the blog ranking algorithms, and Beane Counting can smooth out the noise specific to any one of them, if one ranking algorithm erroneously put Andrew Sullivan all the way down at the bottom, the output of averaging three of them would still have Sullivan badly misranked. If Manning and Klein's parser were deciding between two completely erroneous parses, its performance wouldn't really be any better for having a slick inference engine at the end. And no electrical engineer would design a system with 16 8-bit D/A converters if he had access to one decent 16-bit converter instead. "Garbage in, garbage out" holds for every algorithm ever designed.

As a side note, it's important to separate evaluations of the Beane Count's utility from evaluations of Billy Beane's. While the Oakland A's have done some truly amazing things over the last 5 years, the man also said:

"I wasn't looking to trade Ramon. I've just loved Mark Kotsay for a long time, and (Padres GM) Kevin (Towers) knows I've loved Kotsay since he was at Cal State Fullerton."

SC is a Padres fan (which meant that he was obliged to cheer for Mr. Kotsay for several years), but the notion that the general manager of a winning baseball team could say "I love Mark Kotsay" and use that as justification for trading a quality catcher is grounds for firing.

April 09, 2004

Don't have time to read blogs? Listen to them!

Scanning the referral logs today, I noticed a hit from this site:

Radio Vox Populi

The application is fairly straightforward: after crawling the web for blog entries, they use a text-to-speech engine to feed a RealAudio server, thus creating something like a radio station, but made up completely of random blog posts.

It's kind of cute, but the text-to-speech engine is awful. You can hear some much better ones here (especially the AT&T one).

As for the desirability of such a website, well, it depends on how much you like finding out about new blogs. Unfortunately, I can't see how you find out where to link to what you just heard. What would really be useful is if they would stream the URLs into the RealPlayer title bar, although I'm not sure if that's technically feasible, never having worked with the RealPlayer software (except as a consumer).

March 30, 2004

Why SC isn't as smart as Mark Liberman

Aside from the glaring differences in the lengths of our respective CVs, your host just found further evidence that he does not have nearly the lightning-quick wits of the good doctor.

Writing about R. Robot, an automatically generated blog that uses the same underlying principles as the Chomskybot, the Postmodernism Generator, and other well-known toys, Prof. Liberman says "I tried the interactive feature, supplying 'Geoff Pullum' as the requested name". Not taking the hint, SC promptly put his own nom de blog into the generator, and got this out.

Normally, SC doesn't mind being written about by other blogs, even if it's a denunciation. Bad publicity is better than no publicity, after all. But he finds that he can't untangle the coreferences in this quote:

Semantic Compositions -- was there ever a public official of such obsessive and even dangerous perfidy, such curiously screeching insouciance? But with Vice-President Cheney you get the sense that this is one who will wander into greatness.

Is the interpretation supposed to be that yours truly is a public official, and that SC is destined to wander into greatness? Or is it that the vice president is said public official and destined to wander into greatness? Or are the two sentences wholly unrelated, and your host is trying too hard to find coherent discourse when the generator is really only working at the single-sentence level?

SC will happily take a good denunciation, but good starts with interpretable ([no, you fool, it starts with 'g' -- ed.]), so this mention is a bit regrettable. Those who know SC personally will be collapsing in hysterics for other reasons.

March 28, 2004

Named entity recognition done dirt cheap

SC swears he's not stalking anybody at Language Log. It's just that a certain professor from Penn keeps writing about things that interest him. On that note, the unnamable professor writes, in reference to an automatically inserted link in a New York Times article:

The hyperlink on Laura's last name "Fluor" leads to a page about the Fluor Corporation...[but] [T]here is absolutely nothing in the original Carr article to lead us to believe that Laura Fluor has anything at all to do with the Fluor Corporation.

Prof. Liberman ([oops, you did it again -- ed.]) notes that faulty named entity recognition software seems to be at fault. This immediately cleared up a longstanding mystery for your host, regarding only slightly better-behaved links at a popular audio hobbyist website.

In this discussion, the name "Sony" is occasionally linked to an ad for Sony blank videotapes. It's not consistently applied, but at least they correctly recognized that Sony products are relevant to instances of the string "Sony". It would be nicer if they linked to actually relevant Sony products (in the case at hand, that would be to receivers, not to blank tapes). Oddly, although the names RCA, Zenith and Yamaha also come up, they are never hyperlinked to anything; the recognition software seems to key only on "Sony".

However, they don't just try to link names to ads for names. Audioreview also sells links to generic terms. Thus, in this discussion, the word "computers" is linked to a Dell ad, and the word "cables" is linked to a seller of cables. At least this is also arguably relevant.

But sometimes, the system just completely screws up. In this discussion, a forum member is soliciting suggestions for a Neil Young compilation, and another member responds "Looks great, I'd like a copy please". In the post, the word "copy" is hyperlinked to an ad which states: "How to write killer ad copy. Copywriting tips from (various names snipped out so as to avoid free publicity) about web and salesletter copy." This is wholly irrelevant to the subject being discussed.

Because errors like this are at least as common (in SC's subjective opinion; he's not compiling a corpus to find out) as valid links, your host assumed that, in fact, there's nothing worthy of the name "named entity recognition" going on, and it's just a matter of automatically generating links to any strings that match a predefined list. Since not every instance of each word is linked, perhaps there's also some heuristic built into the software about how often users can/will tolerate this without getting so frustrated as to stop posting. Whoever wrote it guessed wrong -- SC won't participate at all in any discussion group where his copy is subject to this sort of modification (which is very different from moderated discussions).
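To make the "predefined list" hypothesis concrete, here is a minimal sketch of what such a linker might look like. Everything in it is an assumption for illustration: the keyword table, the example.com URLs, and especially the per-term cap, which merely stands in for whatever heuristic (if any) the real software uses to decide how many instances to link.

```python
import re

# Hypothetical keyword-to-ad table; none of these URLs are real.
AD_LINKS = {
    "sony": "https://ads.example.com/sony-blank-tapes",
    "computers": "https://ads.example.com/dell",
    "cables": "https://ads.example.com/cables",
    "copy": "https://ads.example.com/copywriting-tips",
}

def auto_link(text, links=AD_LINKS, max_links_per_term=1):
    """Wrap whole-word matches of listed terms in ad links, up to a cap per term."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in links) + r")\b",
        re.IGNORECASE,
    )
    seen = {}

    def replace(match):
        word = match.group(0)
        key = word.lower()
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_links_per_term:
            return word  # over the cap: leave later instances alone
        return f'<a href="{links[key]}">{word}</a>'

    return pattern.sub(replace, text)

linked = auto_link("Looks great, I'd like a copy please")
```

Nothing here looks at context, which is the point: the string "copy" gets the copywriting ad whether the poster wants ad copy or a copy of a Neil Young compilation. A real named entity recognizer would at least try to decide what kind of thing a string refers to before linking it.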

Perhaps the Times' software also doesn't really deserve to be credited with the "named entity recognition" tag. SC doesn't doubt that it was probably advertised that way, but he also remembers a former manager who tried to market his search engine as a "data mining" tool, even though that's a reasonably well-known term of art which really doesn't include search engines. Our customers were technical enough to see that he was spouting BS -- or maybe they could just smell the alcohol on his breath and figure out that he was untrustworthy. Put another way, it might not be the case that the Times' software is what needs to be replaced.

March 17, 2004

Too soon to tell?

Reader Danny Ayers, having read SC's most recent thoughts on the Semantic Web, had some worthwhile rejoinders to the earlier postings. Since Semantic Compositions readers are unlikely to notice them without the aid of the TypePad interface (which lets the overpaid, underworked editor know what the 5 most-recent comments are), it seems appropriate to bring them up here.

Regarding SC's first post on the subject:

"The basic idea of the semantic web is that making inferences about the meaning of free text is hard."

No it isn't, nor is it about juggling on a unicycle being hard!

The basic idea is that adding explicit resource descriptions to the web is pretty easy and potentially very useful, and drawing inferences from these isn't too difficult.

Even if reading natural language *was* easy for machines, some form of processing would be needed to make use of it, something perhaps like formal-logic based approach behind Semantic Web technologies.

It's also worth noting that the kind of descriptions expressed in RDF/OWL needn't have any connection to natural language at all - e.g. a digital camera can timestamp photos, this can be encoded in RDF/XML and reasoned about/queried in Semantic Web systems.

Perhaps it would have been mildly more accurate for your host to have written "[T]he motivation of the semantic web is that making inferences about the meaning of free text is hard". It certainly is true that having a common resource description language makes it easier to make inferences. SC disagrees, though, that it's "easy" to add this sort of markup to documents on the scale needed to effectively do the sort of tasks done now at TREC, or perhaps more fairly, for the problems tackled by the RKF program. Making inferences isn't too difficult anymore, but in order to answer questions like "If the Taliban falls, will Iran step up their nuclear weapons research?", you need a lot of information about motives, about the relationships between political actors, about the existence (and progress to date) of an Iranian nuclear program, etc. This information can effectively be extracted from news articles and intelligence reports by humans, and the government programs linked above represent at least some progress towards doing it automatically, through "traditional" NLP. Trying to anticipate all the possible questions and hand-code the necessary information in advance -- even with the assistance of a sophisticated RDF/DAML-aware editor -- is prohibitively expensive.

This isn't to dismiss the utility of a common expression language, though. Having had access to the internals of three knowledge representation systems over the past two years, SC can only wish that they had shared a common format. Trying to derive mappings between them is expensive more in time than in money (although someone has to pay for SC's time to do this). So yes, it's probably better that some system replace the proliferation of incompatible representation formats presently out there.

Mr. Ayers' point that "the kind of descriptions expressed in RDF/OWL needn't have any connection to natural language at all" is well-taken. SC isn't aware of any current technology that can take an arbitrarily chosen picture and generate useful metadata about it to be reasoned with. If we stipulate that, as with natural language text, the amount of time needed to adequately encode more than basic information is still high, SC will agree that this is still a useful application which is better than nothing at all.
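Mr. Ayers' camera example can be sketched without any natural language in sight. The snippet below is plain Python standing in for RDF, not RDF/XML and not any real Semantic Web toolkit: statements are (subject, predicate, object) tuples, and the `ex:` vocabulary, the photo identifiers, and the timestamps are all invented for illustration.

```python
from datetime import datetime

# Machine-generated statements of the kind a camera could emit (hypothetical data).
TRIPLES = {
    ("photo:001", "ex:takenAt", "2004-03-17T09:30:00"),
    ("photo:002", "ex:takenAt", "2004-03-17T18:45:00"),
    ("photo:003", "ex:takenAt", "2004-03-18T08:05:00"),
    ("photo:001", "ex:takenWith", "camera:A"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

def photos_taken_on(triples, day):
    """A small 'inference': which photos were taken on a given ISO date?"""
    return sorted(
        s for (s, _, o) in match(triples, predicate="ex:takenAt")
        if datetime.fromisoformat(o).date().isoformat() == day
    )

same_day = photos_taken_on(TRIPLES, "2004-03-17")
```

The timestamps cost nothing to produce because the camera writes them itself. Anything richer, such as who or what is in the picture, still has to come from somewhere, which is where the encoding cost discussed above comes back in.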

Moving on to Part II, Mr. Ayers writes:

I'd suggest that the SemWeb approach implicitly acknowledges that a single, global "world knowledge" is not likely to happen (and probably isn't desirable!). The RDF and OWL languages do make it relatively straightforward to make 'local' ontologies and mappings bet