One of SC's favorite sessions at the 2009 LSA meeting was titled "Computational Linguistics: Implementation of Analyses against Data". Go here for a listing of the papers (it's session #30). There was a conscious effort this year, driven by Emily Bender and Terry Langendoen (who had a joint session to themselves earlier for just this purpose), to present computational methods as desirable technical approaches to handling theoretical issues, which is exactly the sort of thing your host has always wanted to see develop further. Herewith, a little about each of the talks:
Emily Bender kicked off the discussion with a presentation on a grammar she built for the extinct language Wambaya (making use of 801 examples drawn from the documentation in Rachel Nordlinger's dissertation). Ordinarily, testing all sorts of licensing constraints and making sure that your newer rules don't break your older rules is a process that can take months. However, with the aid of the Grammar Matrix, a tool for writing and testing analyses in the Head-Driven Phrase Structure Grammar formalism, she managed to produce a grammar that correctly analyzed 91% of the cases in her development set, and 76% of cases in a separate test set, spending 210 hours in 5 1/2 weeks to accomplish this task. The introduction of formal test and development methods into the construction of theoretical analyses is welcome, and the steadily rising graph she presented to document the improvements in the grammar as a function of time was frankly astounding. If the only thing anyone took away from the presentation was that they should bring a genuine test plan into their work and actually keep metrics of their work as it progresses, the talk was a success. That it made such a convincing case for the utility of automated parsing and generation as core tools in doing theoretical work is a dream come true.
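For readers curious what this kind of bookkeeping amounts to in practice, here is a minimal sketch of a coverage log of the sort described: run every held-out example through the grammar and record the fraction parsed, once a session. The parse_with_grammar stub and the log format are SC's inventions for illustration, not pieces of the Grammar Matrix itself.

```python
# A minimal sketch of test-plan bookkeeping: log dev/test coverage over time.
# parse_with_grammar() is a stand-in for a real parser call, and the TSV log
# format is an assumption made for this illustration.
import datetime

def parse_with_grammar(sentence):
    """Stand-in for a real HPSG parser; returns True if the grammar
    licenses at least one analysis of the sentence."""
    raise NotImplementedError

def coverage(examples):
    """Fraction of examples receiving at least one parse."""
    parsed = sum(1 for s in examples if parse_with_grammar(s))
    return parsed / len(examples)

def log_coverage(dev_examples, test_examples, logfile="coverage_log.tsv"):
    """Append today's dev and test coverage so progress can be graphed later."""
    with open(logfile, "a") as log:
        log.write("%s\t%.3f\t%.3f\n" % (
            datetime.date.today().isoformat(),
            coverage(dev_examples),
            coverage(test_examples),
        ))
```

Graph that file over time and you get a curve of the same general kind as the one shown in the talk.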
Next up was a presentation by Jason Baldridge and Katrin Erk of progress in a research project titled Efficient Annotation of Resources by Learning. Their team is tackling the problem of constructing interlinear glosses for text in languages where little prior data is available -- a problem for just about any small minority language in the world, and hence one where an efficient computational solution could reap enormous rewards (scientifically -- the IPO might be a bit more of a pipe dream). For the LSA talk, they described an experiment where two trained linguists were given 100,000 clauses of a Mayan language called Uspanteko (you can see an example at the project wiki), one of whom was a speaker of the language, and the other of whom was theoretically knowledgeable but had no Uspanteko experience. The question posed was: how much can you gloss in two weeks with a little help from a computer? And the answer appears to be: with random selections from the corpus (to keep from overtraining on sequential -- and possibly contiguous -- material), enough to get a machine learning algorithm to predict labels for the entire corpus with about 30% accuracy. That's not good enough to leave the job to the machine, obviously, but it is already good enough to help rank possible tags for a user and speed up manual annotation, which is exactly the application they're developing. If you've never tried to use an annotation interface that doesn't know anything about what you're up to -- or worse, tried to do it in a plain-text editor -- trust SC when he tells you that any further progress these folks make will be a blessing.
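To make the "rank possible tags for a user" idea concrete, here is a toy sketch: suggest gloss tags for a morpheme based on how often annotators have already approved them. The real EARL system uses proper machine-learned models; this frequency ranker and the made-up morpheme and tags below are assumptions purely for illustration.

```python
# Toy annotation assistance: rank candidate gloss tags for a morpheme by how
# often human annotators have already assigned them. The morpheme "xyz" and
# its tags are invented for this example.
from collections import Counter, defaultdict

class TagSuggester:
    def __init__(self):
        self.counts = defaultdict(Counter)  # morpheme -> Counter of tags

    def observe(self, morpheme, tag):
        """Record a human-approved (morpheme, tag) pair."""
        self.counts[morpheme][tag] += 1

    def suggest(self, morpheme, n=3):
        """Return the n most frequently seen tags for this morpheme."""
        return [tag for tag, _ in self.counts[morpheme].most_common(n)]

suggester = TagSuggester()
suggester.observe("xyz", "PL")
suggester.observe("xyz", "PL")
suggester.observe("xyz", "ITR")
print(suggester.suggest("xyz"))  # ['PL', 'ITR']
```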
Following the EARL team, Nianwen Xue, Susan Brown and Martha Palmer presented a paper titled "Computational lexicons: When theory meets data", covering work on building a computational lexicon integrating data from a number of prior projects, which you can browse here. Specifically, they wanted to provide a resource combining the semantic role data found in PropBank (a treebank that encodes data about verb arguments in real sentences) with VerbNet, a very detailed implementation of Beth Levin's work on verb classes. The reason you would want this integration is that sense data is notably lacking from the PropBank, itself an extension of the Penn Treebank, and this is a Bad Thing when trying to train a parser to assign semantic roles to new text. The tagging procedure by which they accomplish their integration is sensible enough, albeit not something to write much about, but the import of the work is clear -- you really can build a computational resource that is faithful to both the needs of statistical parsing and generation algorithms and linguistic theory. It's not hard to imagine building a variety of potentially very interesting applications using a word-sense-aware parser backed by this lexicon, because a little semantic role data is a lot better than nothing at all.
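To illustrate what the merged resource buys you, here is a sketch of the kind of lookup it enables: translate a verb's PropBank argument labels into VerbNet thematic roles. The entry and helper function are placeholders invented for this post, not the actual contents or API of the lexicon described in the talk.

```python
# A sketch of the kind of lookup a merged PropBank/VerbNet lexicon makes
# possible: map a PropBank roleset to a VerbNet class and its thematic roles.
# The single entry below is a hand-written placeholder, not real resource data.
LEXICON = {
    "give.01": {
        "verbnet_class": "give-13.1",
        "arg_to_role": {"Arg0": "Agent", "Arg1": "Theme", "Arg2": "Recipient"},
    },
}

def semantic_roles(roleset, args):
    """Translate PropBank argument labels into VerbNet thematic roles."""
    entry = LEXICON.get(roleset)
    if entry is None:
        return {arg: None for arg in args}  # no sense information available
    return {arg: entry["arg_to_role"].get(arg) for arg in args}

print(semantic_roles("give.01", ["Arg0", "Arg1", "Arg2"]))
# {'Arg0': 'Agent', 'Arg1': 'Theme', 'Arg2': 'Recipient'}
```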
Next up, Jason Riggle and John Goldsmith presented a paper with an all-too-rare sort of title, "Information-theoretic approaches to phonology", which appears to be an update of this 2007 manuscript. Prof. Goldsmith gave a plenary address at the previous LSA meeting on computational methods, based on this paper, which provoked a certain amount of misunderstanding and suspicion that he was somehow not interested in finding out what was going on inside people's heads when they use language. Nothing could be further from the truth; the current paper demonstrates how the classic autosegmental theory of phonological tiers can be expressed in terms of probabilities for both consonant and vowel segments. More than that, it introduces a genuinely zero-based metric for evaluating the quality of a phonological model, by tying the comparison of models to the number of bits needed to represent segments and words. Now, SC would stipulate that it is not at all clear that the language apparatus always and everywhere chooses the most efficient coding scheme that could be computed. However, as a metric for evaluating whether or not a particular theory has explanatory power, this is an excellent approach. If you can't show that your theory actually buys you something better than a naive n-gram model, you had better have some other compelling reason for adopting your proposal. Indeed, the autosegmental model was not, in fact, the most efficient from a bits-per-symbol perspective, but the evidence for tiers is compelling enough not to discard them in favor of flat bigrams.
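To see what the bits-per-symbol yardstick looks like in practice, here is a small illustration (not the paper's actual computation, and the three-word "corpus" is invented): score a unigram and a bigram model of segments by average bits per segment, which is the kind of baseline a tier-based model has to beat.

```python
# Average bits per segment under maximum-likelihood unigram and bigram models
# of a toy corpus. A tier-based model would be scored the same way and
# compared against these baselines.
import math
from collections import Counter

corpus = ["banana", "bandana", "cabana"]

def unigram_bits(words):
    counts = Counter(c for w in words for c in w)
    total = sum(counts.values())
    return -sum(n * math.log2(n / total) for n in counts.values()) / total

def bigram_bits(words):
    pairs, contexts = Counter(), Counter()
    for w in words:
        padded = "#" + w  # '#' marks the word boundary
        for prev, cur in zip(padded, padded[1:]):
            pairs[(prev, cur)] += 1
            contexts[prev] += 1
    total = sum(pairs.values())
    return -sum(n * math.log2(n / contexts[prev])
                for (prev, _), n in pairs.items()) / total

print("unigram: %.2f bits/segment" % unigram_bits(corpus))
print("bigram:  %.2f bits/segment" % bigram_bits(corpus))
```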
Finally, the talk that most excited SC was saved for last -- Christopher Potts presenting work with Florian Schwarz on getting pragmatic data out of reviews from TripAdvisor and Amazon. The methodology is brilliantly simple: these sites give you a convenient 5-point scale for rating things, with clearly defined negative and positive opinions. So count up associations of ratings with words, and you've got yourself a taxonomy of emotional baggage. Leaving the details of the computation to the linked paper, the upshot is that "what a" tends to be a useful signal of heightened emotion:
- What a dump!
- What a nice hotel!
- What a completely quite neutral reaction I'm faking to throw off the math!
In all seriousness, phrases like "what a" are found to show up in both 1- and 5-star reviews, indicating extremity of reaction (although not polarity), while other words have more clearly directional connotations, like "wow" (positive) and "never" (negative). Even with noise of the sort introduced above, Potts and Schwarz show their results to be remarkably robust, with spurious examples of the relevant constructions occurring at frequencies that are orders of magnitude below those of the cases of interest. These are the sort of lessons one would ordinarily learn through survey-based research with lots of manually tabulated results and much smaller quantities of data. As a pure language-engineering tool, the applications are obvious -- it's easy to imagine conducting tests to start classifying all sorts of words as emotionally laden, positive, negative, and so forth, and integrating that into software that acts on opinions. As a research tool for theoretical inquiry, one can just as easily imagine constructing a program to serve as a filter for finding examples deserving closer scrutiny in a corpus.
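For the curious, here is a sketch of the counting methodology, using a handful of invented reviews rather than the large TripAdvisor and Amazon corpora the real study draws on: tally a cue phrase's frequency at each star rating, normalized by how much text there is at that rating.

```python
# Tally how often a cue phrase like "what a" appears in reviews at each star
# rating, relative to the amount of text at that rating. The reviews are
# invented for this illustration.
from collections import Counter

reviews = [
    (1, "what a dump, never again"),
    (5, "what a nice hotel, wow"),
    (3, "it was fine, nothing special"),
]

def rating_profile(phrase, reviews):
    """Per-star relative frequency of a phrase (occurrences per word)."""
    hits, words = Counter(), Counter()
    for stars, text in reviews:
        hits[stars] += text.lower().count(phrase)
        words[stars] += len(text.split())
    return {stars: hits[stars] / words[stars] for stars in words}

print(rating_profile("what a", reviews))
# High at 1 and 5 stars, low in the middle: extremity without polarity.
```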
What all of the papers from this symposium have in common is a commitment to the utility of theoretical linguistics, combined with an equally fervent commitment to the idea that systematic counting of examples is a legitimate way to validate your theories. The notion that a good theory ought to be able to survive contact with data doesn't require an abandonment of theoretical work in itself, and bringing a formal development cycle to your work is simply a dose of good-for-you discipline.
hi, thanks for the post! small correction -- the talk on EARL was presented by Jason Baldridge and Alexis Palmer (that's me), not Katrin Erk.
Posted by: alexis | February 05, 2009 at 04:29 PM