Induction of Treebank-Aligned Lexical Resources

Presented at: The Sixth International Language Resources and Evaluation Conference (LREC2008)

by Tejaswini Deoskar, Mats Rooth

Paper: http://www.lrec-conf.org/proceedings/lrec2008/pdf/798_paper.pdf
Slides: http://www.lrec-conf.org/proceedings/lrec2008/slides/798.pdf
Summary: http://www.lrec-conf.org/proceedings/lrec2008/summaries/798.html

We describe the induction, from unannotated corpora, of lexical resources that are aligned with treebank grammars, so that features in the lexical resource correspond systematically to features in a treebank syntactic resource. We first describe a methodology, based on parsing technology, for augmenting a treebank database with linguistic features. A PCFG containing these features is then created from the augmented treebank. Next, we use a procedure based on the inside-outside algorithm to learn, from large unannotated corpora, lexical resources aligned with the treebank PCFG. The method has been applied to create a feature-annotated English treebank based on the Penn Treebank. The unsupervised estimation procedure gives a substantial error reduction (up to 31.6%) on the task of learning the subcategorization preferences of novel verbs, that is, verbs not present in the annotated training sample.
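
For readers unfamiliar with the inside-outside algorithm mentioned above, the following is a minimal sketch of its inside pass only, which computes the probability a PCFG assigns to a sentence. It is not the authors' implementation: the toy grammar, its probabilities, and the names BINARY, LEXICAL, and inside_probability are all assumptions made for illustration, and the grammar is assumed to be in Chomsky normal form.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form. All rules and probabilities are
# invented for illustration; they are not taken from the paper.
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
}
LEXICAL = {  # (category, word) -> rule probability
    ("NP", "dogs"): 0.5,
    ("NP", "cats"): 0.5,
    ("V", "chase"): 1.0,
    ("VP", "sleep"): 0.3,
}


def inside_probability(words, root="S"):
    """Return the total probability that `root` derives `words`."""
    n = len(words)
    # chart[(i, j)][A] = probability that category A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(float))

    # Base case: spans of width 1 are covered by lexical rules.
    for i, w in enumerate(words):
        for (category, word), p in LEXICAL.items():
            if word == w:
                chart[(i, i + 1)][category] += p

    # Recursive case: combine two adjacent sub-spans with a binary rule,
    # summing over every split point k and every applicable rule.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (parent, left, right), p in BINARY.items():
                    chart[(i, j)][parent] += (
                        p * chart[(i, k)][left] * chart[(k, j)][right]
                    )

    return chart[(0, n)][root]


if __name__ == "__main__":
    print(inside_probability(["dogs", "chase", "cats"]))
    print(inside_probability(["dogs", "sleep"]))
```

In a full inside-outside estimation, these inside scores would be combined with outside scores to obtain expected rule counts over an unannotated corpus, and the rule probabilities would be re-estimated from those counts; that step is omitted here.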

Keywords: Lexicon, lexical database, Parsing Systems, Statistical methods, Linguistics


Resource URI on the dog food server: http://data.semanticweb.org/conference/lrec/2008/papers/798

