Presented at: The Sixth International Language Resources and Evaluation Conference (LREC2008)
by Mariona Taulé, M. Antonia Martí, Marta Recasens
Webpage: http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf
Abstract: This paper presents AnCora, a multilingual corpus annotated at different linguistic levels consisting of 500,000 words in Catalan (AnCora-Ca) and in Spanish (AnCora-Es). At present, AnCora is the largest multilayer annotated corpus of these languages freely available from http://clic.ub.edu/ancora. The two corpora consist mainly of newspaper texts annotated at different levels of linguistic description: morphological (PoS and lemmas), syntactic (constituents and functions), and semantic (argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses). All resulting layers are independent of each other, which makes data management easier. The annotation was performed manually, semiautomatically, or fully automatically, depending on the encoded linguistic information. The development of these basic resources constituted a primary objective, since there was a lack of such resources for these languages. A second goal was the definition of a consistent methodology that can be followed in further annotations. The current versions of AnCora have been used in several international evaluation competitions.
Keywords: Corpus (creation, annotation, etc.), LR national/international projects, organizational/policy issues, Standards for LRs, Linguistics
Resource URI on the dog food server: http://data.semanticweb.org/conference/lrec/2008/papers/35