POWLA: Modeling linguistic corpora in OWL/DL

Presented at: 9th Extended Semantic Web Conference (ESWC2012)

by Christian Chiarcos

This paper describes POWLA, a formalism for representing linguistic corpora in OWL/DL. POWLA is based on data models currently being developed by the NLP community to overcome the heterogeneity of linguistic annotation (Ide and Pustejovsky 2010), in particular PAULA, an XML standoff format that grew out of early sketches of the Linguistic Annotation Framework (LAF, Ide and Romary 2004), which is currently being developed within ISO TC37/SC4. These data models are defined as specializations of directed acyclic (hyper)graphs, and it has been claimed that every kind of linguistic annotation can be represented as a directed (hyper)graph (Bird and Liberman 2001). Linguistic corpora can thus be naturally linearized in RDF. Unlike earlier approaches that represent generic data models for linguistic annotations by means of Semantic Web standards (e.g., Cassidy 2010), POWLA augments the RDF linearization of linguistic data with a data model formalized in an OWL/DL ontology that defines data types for primary data, annotations and linguistic metadata, as well as consistency constraints on linguistic corpora. Unlike other approaches to modeling linguistic corpora in OWL/DL (e.g., Burchardt et al. 2008), POWLA is not specific to a particular type of annotation; it implements a generic data model. This genericity is illustrated here by the conversion of GrAF (the XML linearization of the Linguistic Annotation Framework, Ide and Suderman 2007) to POWLA. That POWLA preserves the linguistic information conveyed in the original GrAF data is shown by an experiment that emulates ANNIS-QL, a query language specifically designed for heterogeneous and richly annotated linguistic corpora (Chiarcos et al. 2008), by means of SPARQL macros on POWLA data. Finally, the paper identifies advantages and disadvantages of OWL/RDF linearizations of generic data models for linguistic corpora (and in particular, POWLA) as compared to traditional XML standoff formats (Ide and Suderman 2007, Chiarcos et al. 2008).
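
The idea that an annotation graph linearizes naturally into subject-predicate-object triples can be illustrated with a minimal sketch. It is an assumption-laden toy, not POWLA itself: the `ex:` names (`ex:Terminal`, `ex:Nonterminal`, `ex:hasChild`, `ex:string`, `ex:cat`) are hypothetical placeholders rather than the actual POWLA vocabulary, plain Python sets stand in for an RDF store, and a hand-written traversal stands in for a SPARQL query. The acyclicity check is only meant to suggest the kind of consistency constraint a data model can impose on a corpus.

```python
# Illustrative only: the "ex:" names are hypothetical, not POWLA vocabulary.
# A two-token phrase annotated as a noun phrase, written as triples.
TRIPLES = {
    ("ex:tok1", "rdf:type", "ex:Terminal"),
    ("ex:tok2", "rdf:type", "ex:Terminal"),
    ("ex:np1", "rdf:type", "ex:Nonterminal"),
    ("ex:np1", "ex:hasChild", "ex:tok1"),
    ("ex:np1", "ex:hasChild", "ex:tok2"),
    ("ex:tok1", "ex:string", "the"),
    ("ex:tok2", "ex:string", "corpus"),
    ("ex:np1", "ex:cat", "NP"),
}


def objects(triples, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def dominated(triples, node):
    """Return every node reachable from `node` via one or more ex:hasChild edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in objects(triples, current, "ex:hasChild"):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_acyclic(triples):
    """Check that no node dominates itself through ex:hasChild edges."""
    parents = {s for s, p, _ in triples if p == "ex:hasChild"}
    return all(n not in dominated(triples, n) for n in parents)


if __name__ == "__main__":
    print(sorted(dominated(TRIPLES, "ex:np1")))
    print(is_acyclic(TRIPLES))
```

In an actual RDF setting, the transitive traversal in `dominated` would more likely be expressed as a query over the triple store than as hand-written Python; the sketch only shows that dominance questions reduce to graph reachability over the triples.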

Keywords: GrAF, PAULA, ANNIS-QL, corpus querying with SPARQL, data modeling with OWL/DL, interoperability of linguistic annotations

Resource URI on the dog food server: http://data.semanticweb.org/conference/eswc/2012/paper/research/213
