One of SC's favorite sessions at the 2009 LSA meeting was titled "Computational Linguistics: Implementation of Analyses against Data". Go here for a listing of the papers (it's session #30). There was a conscious effort this year, driven by Emily Bender and Terry Langendoen (they had a joint session to themselves earlier for this purpose), to present computational methods as desirable technical approaches to handling theoretical issues, which is exactly the sort of thing your host has always wanted to see develop further. Herewith, a little about each of the talks:
Emily Bender kicked off the discussion with a presentation on a grammar she built for the extinct language Wambaya (making use of 801 examples drawn from the documentation in Rachel Nordlinger's dissertation). Ordinarily, testing all sorts of licensing constraints and making sure that your newer rules don't break your older ones is a process that can take months. However, with the aid of the Grammar Matrix, a tool for writing and testing analyses in the Head-Driven Phrase Structure Grammar formalism, she managed to produce a grammar that correctly analyzed 91% of the cases in her development set and 76% of the cases in a separate test set, spending 210 hours over 5 1/2 weeks on the task. The introduction of formal test and development methods into the construction of theoretical analyses is welcome, and the steadily rising graph she presented to document the improvements in the grammar as a function of time was frankly astounding. If the only thing anyone took away from the presentation was that they should bring a genuine test plan into their work and actually keep metrics as it progresses, the talk was a success. That it made such a convincing case for the utility of automated parsing and generation as core tools in doing theoretical work is a dream come true.
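The test-plan discipline is easy enough to mock up. Below is a toy sketch of the idea, not the Grammar Matrix itself: the "grammar" is just a set of licensed word-order patterns standing in for a real HPSG grammar run through a parser, and the coverage numbers are invented for illustration.

```python
# Toy illustration of tracking coverage metrics during grammar development.
# A real setup would call an actual parser; here a "grammar" is just a set
# of licensed patterns, so membership stands in for "receives an analysis".
def coverage(licensed_patterns, examples):
    """Fraction of examples receiving at least one analysis."""
    hits = sum(1 for ex in examples if ex in licensed_patterns)
    return hits / len(examples)

dev_set = ["SOV", "SV", "OSV", "VS"]
grammar_v1 = {"SOV", "SV"}
grammar_v2 = {"SOV", "SV", "OSV"}   # a new rule added; old ones still pass

print(f"v1 dev coverage: {coverage(grammar_v1, dev_set):.0%}")
print(f"v2 dev coverage: {coverage(grammar_v2, dev_set):.0%}")
```

Run after every change, numbers like these are what produce the steadily rising graph: improvements show up immediately, and so do regressions.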
Next up was a presentation by Jason Baldridge and Katrin Erk on progress in a research project titled Efficient Annotation of Resources by Learning. Their team is tackling the problem of constructing interlinear glosses for text in languages where little prior data is available -- a problem for just about any small minority language in the world, and hence one where an efficient computational solution could reap enormous rewards (scientifically -- the IPO might be a bit more of a pipe dream). For the LSA talk, they described an experiment in which two trained linguists were given 100,000 clauses of the Mayan language Uspanteko (you can see an example at the project wiki); one was a speaker of the language, and the other was a theoretically knowledgeable individual with no Uspanteko experience. The question posed was: how much can you gloss in two weeks with a little help from a computer? And the answer appears to be: with random selections from the corpus (to keep from overtraining on sequential -- and possibly contiguous -- material), enough to get a machine learning algorithm to predict labels for the entire corpus with about 30% accuracy. That's not good enough to leave the job to the machine, obviously, but it is already good enough to help rank possible tags for a user to speed up their manual annotation, which is exactly the application they're developing. If you've never tried to use an annotation interface that doesn't know anything about what you're up to -- or worse, tried to do it in a plain-text editor -- trust SC when he tells you that any further progress these folks make will be a blessing.
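The tag-ranking idea can be sketched in a few lines. This is a crude stand-in, not the EARL system: the words and tags below are invented placeholders (not real Uspanteko), and a real pipeline would use a trained classifier rather than raw counts -- but the shape of the interface help is the same.

```python
from collections import Counter, defaultdict

# Toy labeled data: (word, gloss-tag) pairs a human annotator has confirmed.
# Both the words and the tags are invented for illustration.
annotated = [
    ("xel", "V"), ("xel", "V"), ("li", "PREP"),
    ("winaq", "N"), ("winaq", "N"), ("li", "DET"),
]

tag_counts = defaultdict(Counter)
for word, tag in annotated:
    tag_counts[word][tag] += 1

def rank_tags(word):
    """Return candidate tags for `word`, most likely first, so an
    annotation interface can pre-sort its suggestions for the user."""
    counts = tag_counts.get(word)
    if not counts:
        return []          # unseen word: no suggestions yet
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]

print(rank_tags("li"))     # seen as both PREP and DET, so both are offered
```

Even 30%-accurate suggestions, presented as a ranked list instead of a blank field, cut down on typing and lookup time.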
Following the EARL team, Nianwen Xue, Susan Brown and Martha Palmer presented a paper titled "Computational lexicons: When theory meets data", covering work on building a computational lexicon integrating data from a number of prior projects, which you can browse here. Specifically, they wanted to provide a resource combining the semantic role data found in PropBank (a treebank that encodes data about verb arguments in real sentences) with VerbNet, a very detailed implementation of Beth Levin's work on verb classes. The reason you would want this integration is that sense data is notably lacking from the PropBank, itself an extension of the Penn Treebank, and this is a Bad Thing when trying to train a parser to assign semantic roles to new text. The tagging procedure by which they accomplish their integration is sensible enough, albeit not something to write much about, but the import of the work is clear -- you really can build a computational resource that is faithful to both the needs of statistical parsing and generation algorithms and linguistic theory. It's not hard to imagine building a variety of potentially very interesting applications using a word-sense-aware parser backed by this lexicon, because a little semantic role data is a lot better than nothing at all.
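One can picture the merged resource as a lookup from PropBank-style argument labels to VerbNet thematic roles. The miniature below is a hypothetical sketch of that shape -- the entry follows the general format of PropBank rolesets and VerbNet classes but is not quoted from the actual resource the talk described.

```python
# Hypothetical miniature of a merged lexicon entry linking a PropBank
# roleset to a VerbNet class and its thematic roles. Illustrative only.
merged_lexicon = {
    ("give", "give.01"): {
        "verbnet_class": "give-13.1",
        "roles": {"Arg0": "Agent", "Arg1": "Theme", "Arg2": "Recipient"},
    },
}

def semantic_role(lemma, roleset, arg):
    """Map a PropBank-style argument label to a VerbNet thematic role,
    or None if the lexicon has no entry for this lemma/roleset."""
    entry = merged_lexicon.get((lemma, roleset))
    return entry["roles"].get(arg) if entry else None

print(semantic_role("give", "give.01", "Arg2"))
```

A parser that can attach labels like "Recipient" rather than bare "Arg2" is exactly the kind of word-sense-aware tool the paragraph above envisions.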
Next up, Jason Riggle and John Goldsmith presented a paper with a too-rare title, "Information-theoretic approaches to phonology", which appears to be an update of this 2007 manuscript. Prof. Goldsmith gave a plenary address at the previous LSA meeting on computational methods, based on this paper, which provoked a certain amount of misunderstanding and suspicion that he was somehow not interested in finding out what was going on inside people's heads when they use language. Nothing could be further from the truth; the current paper demonstrates how the classic autosegmental theory of phonological tiers can be expressed in terms of probabilities over consonant and vowel segments. More than that, it introduces a metric for evaluating the quality of a phonological model that starts from zero assumptions, by tying the comparison of models to the number of bits needed to represent segments and words. Now, SC would stipulate that it is not at all clear that the language apparatus always and everywhere chooses the most efficient coding scheme that could be computed. However, as a metric for evaluating whether or not a particular theory has explanatory power, this is an excellent approach. If you can't show that your theory actually buys you something better than a naive n-gram model, you had better have some other compelling reason for adopting your proposal. As it happens, the autosegmental model was not the most efficient from a bits-per-symbol perspective, but the evidence for tiers is compelling enough not to discard them in favor of flat bigrams.
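The bits-per-symbol yardstick is easy to demonstrate on a toy case. The sketch below is SC's own illustration of the general idea, not the authors' code: on a made-up "corpus" of strictly CV-alternating words, a bigram model codes each segment in far fewer bits than a unigram model, because knowing the previous segment removes almost all the uncertainty.

```python
import math
from collections import Counter

# Toy corpus of CV-patterned words; C and V stand in for real segments.
corpus = "CVCV CVC CVCVC VCV CVC".split()
symbols = [s for w in corpus for s in w]

# Unigram model: bits/symbol is the entropy of the symbol distribution.
counts = Counter(symbols)
total = sum(counts.values())
unigram_bits = -sum((n / total) * math.log2(n / total)
                    for n in counts.values())

# Bigram model: conditional entropy H(next | previous), within words.
pair_counts = Counter()
context_counts = Counter()
for w in corpus:
    for a, b in zip(w, w[1:]):
        pair_counts[(a, b)] += 1
        context_counts[a] += 1
pair_total = sum(pair_counts.values())
bigram_bits = -sum(
    (n / pair_total) * math.log2(n / context_counts[a])
    for (a, b), n in pair_counts.items()
)

print(f"unigram: {unigram_bits:.3f} bits/symbol")
print(f"bigram:  {bigram_bits:.3f} bits/symbol")  # lower = cheaper code
```

With perfect CV alternation the bigram cost drops to zero bits, which is exactly the sense in which one model "buys you something" over another: the comparison is in bits, not in anyone's intuitions.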
Finally, the talk that most excited SC was saved for last -- Christopher Potts presenting work with Florian Schwarz on getting pragmatic data out of reviews from TripAdvisor and Amazon. The methodology is brilliantly simple: these sites give you a convenient 5-point scale for rating things, with clearly defined negative and positive opinions. So count up associations of ratings with words, and you've got yourself a taxonomy of emotional baggage. Leaving the details of the computation to the linked paper, the talk demonstrated that "what a" tends to be a useful signal of heightened emotion:
- What a dump!
- What a nice hotel!
- What a completely quite neutral reaction I'm faking to throw off the math!
In all seriousness, phrases like "what a" are found to show up in both 1- and 5-star reviews, indicating extremity of reaction (although not polarity), while other words have more clearly directional connotations, like "wow" (positive) and "never" (negative). Even with noise of the sort introduced above, Potts and Schwarz show their results to be remarkably robust, with spurious examples of the relevant constructions occurring at frequencies orders of magnitude below the cases of interest. These are the sort of lessons one would ordinarily learn through survey-based research with lots of manually tabulated results and much smaller quantities of data. As a pure language-engineering tool, the applications are obvious -- it's easy to imagine conducting tests to start classifying all sorts of words as emotionally laden, positive, negative, and so forth, and integrating that into software that acts on opinions. As a research tool for theoretical inquiry, one can just as easily imagine constructing a program to serve as a filter for finding examples deserving closer scrutiny in a corpus.
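The counting itself is simple enough to sketch. Here is a toy version on invented mini-reviews -- not the actual TripAdvisor/Amazon data or the authors' code -- showing how a word's distribution over star ratings separates polarity (piled at one end) from mere intensity (present at both extremes).

```python
from collections import Counter, defaultdict

# Invented mini-corpus of (star rating, review text) pairs, standing in
# for the much larger review data the talk actually used.
reviews = [
    (1, "what a dump never again"),
    (5, "what a nice hotel wow"),
    (3, "perfectly ordinary stay"),
    (5, "wow great location"),
    (1, "never coming back"),
]

# Count how often each token appears under each star rating.
by_word = defaultdict(Counter)
for stars, text in reviews:
    for token in text.split():
        by_word[token][stars] += 1

def rating_profile(word):
    """Distribution of a word over star ratings -- skew to one end
    suggests polarity; mass at both extremes suggests intensity."""
    counts = by_word[word]
    total = sum(counts.values())
    return {stars: n / total for stars, n in sorted(counts.items())}

print(rating_profile("never"))  # piles up at the low end
print(rating_profile("what"))   # shows up at both extremes
```

At real scale, the same counts become the ranked word lists the paper reports, and the spurious cases wash out by sheer frequency.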
What all of the papers from this symposium have in common is a commitment to the utility of theoretical linguistics, combined with an equally fervent commitment to the idea that systematic counting of examples is a legitimate way to validate your theories. The notion that a good theory ought to be able to survive contact with data doesn't require an abandonment of theoretical work in itself, and bringing a formal development cycle to your work is simply a dose of good-for-you discipline.