Presented at: The Sixth International Language Resources and Evaluation Conference (LREC2008)
by Tejaswini Deoskar, Mats Rooth
Webpage: http://www.lrec-conf.org/proceedings/lrec2008/pdf/798_paper.pdf
We describe the induction of lexical resources from unannotated corpora that are aligned with treebank grammars, providing a systematic correspondence between features in the lexical resource and a treebank syntactic resource. We first describe a methodology based on parsing technology for augmenting a treebank database with linguistic features. A PCFG containing these features is created from the augmented treebank. We then use a procedure based on the inside-outside algorithm to learn lexical resources aligned with the treebank PCFG from large unannotated corpora. The method has been applied in creating a feature-annotated English treebank based on the Penn Treebank. The unsupervised estimation procedure gives a substantial error reduction (up to 31.6%) on the task of learning the subcategorization preference of novel verbs that are not present in the annotated training sample.
Keywords: Lexicon, lexical database, Parsing Systems, Statistical methods, Linguistics
Resource URI on the dog food server: http://data.semanticweb.org/conference/lrec/2008/papers/798