SC swears he's not stalking anybody at Language Log. It's just that a certain professor from Penn keeps writing about things that interest him. On that note, the unnamable professor writes, in reference to an automatically inserted link in a New York Times article:
The hyperlink on Laura's last name "Fluor" leads to a page about the Fluor Corporation...[but] [T]here is absolutely nothing in the original Carr article to lead us to believe that Laura Fluor has anything at all to do with the Fluor Corporation.
Prof. Liberman ([oops, you did it again -- ed.]) notes that faulty named entity recognition software seems to be at fault. This immediately cleared up a longstanding mystery for your host, regarding only slightly better-behaved links at a popular audio hobbyist website.
In this discussion, the name "Sony" is occasionally linked to an ad for Sony blank videotapes. It's not consistently applied, but at least they correctly recognized that Sony products are relevant to instances of the string "Sony". It would be nicer if they linked to actually relevant Sony products (in the case at hand, that would be to receivers, not to blank tapes). Oddly, although the names RCA, Zenith and Yamaha also come up, they are never hyperlinked to anything; the recognition software seems to key only on "Sony".
However, they don't just try to link names to ads for names. Audioreview also sells links to generic terms. Thus, in this discussion, the word "computers" is linked to a Dell ad, and the word "cables" is linked to a seller of cables. At least this is also arguably relevant.
But sometimes, the system just completely screws up. In this discussion, a forum member is soliciting suggestions for a Neil Young compilation, and another member responds "Looks great, I'd like a copy please". In the post, the word "copy" is hyperlinked to an ad which states: "How to write killer ad copy. Copywriting tips from (various names snipped out so as to avoid free publicity) about web and salesletter copy." This is wholly irrelevant to the subject being discussed.
Because errors like this are at least as common (in SC's subjective opinion; he's not compiling a corpus to find out) as valid links, your host assumed that, in fact, there's nothing worthy of the name "named entity recognition" going on, and it's just a matter of automatically generating links to any strings that match a predefined list. Since not every instance of each word is linked, perhaps there's also some heuristic built into the software about how often users can/will tolerate this without getting so frustrated as to stop posting. Whoever wrote it guessed wrong -- SC won't participate at all in any discussion group where his copy is subject to this sort of modification (which is very different from moderated discussions).
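For the curious, here is a minimal sketch of what that kind of list matching might look like. It is a guess at the general approach, not the site's actual code: the `AD_LINKS` table, the placeholder URLs, the `insert_ad_links` function and the `max_per_term` throttle are all hypothetical names invented for illustration. The point is only that plain string matching with a per-term cap would produce the behavior described above, including the "copy" blunder and the fact that not every instance gets linked.

```python
import re

# Hypothetical keyword-to-ad table. The terms echo the examples above;
# the URLs are placeholders, not real ad targets.
AD_LINKS = {
    "Sony": "https://example.com/ads/sony-tapes",
    "computers": "https://example.com/ads/computers",
    "cables": "https://example.com/ads/cables",
    "copy": "https://example.com/ads/copywriting",
}


def insert_ad_links(text, ad_links=AD_LINKS, max_per_term=1):
    """Wrap up to max_per_term occurrences of each listed string in a link.

    This is pure string matching: nothing here looks at what the word
    means in context, so "copy" is linked whether the poster means a
    CD copy or ad copy.
    """
    counts = {term: 0 for term in ad_links}
    # One alternation over all listed terms, matched as whole words.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in ad_links) + r")\b"
    )

    def replace(match):
        term = match.group(1)
        if counts[term] >= max_per_term:
            # Assumed throttle: leave later occurrences untouched.
            return term
        counts[term] += 1
        return '<a href="{}">{}</a>'.format(ad_links[term], term)

    return pattern.sub(replace, text)


post = "Looks great, I'd like a copy please. One copy is plenty."
print(insert_ad_links(post))
```

Run on that sample post, only the first "copy" is linked and the second is left alone. Names that aren't in the table (RCA, Zenith, Yamaha) pass through untouched, which would also explain why only "Sony" ever gets a link.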
Perhaps the Times' software also doesn't really deserve to be credited with the "named entity recognition" tag. SC doesn't doubt that it was probably advertised that way, but he also remembers a former manager who tried to market his search engine as a "data mining" tool, even though that's a reasonably well-known term of art which really doesn't include search engines. Our customers were technical enough to see that he was spouting BS -- or maybe they could just smell the alcohol on his breath and figure out that he was untrustworthy. Put another way, it might not be the case that the Times' software is what needs to be replaced.