Presented at: 16th International World Wide Web Conference (WWW2007)
by Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak
Resource URI on the dog food server: http://data.semanticweb.org/conference/www/2007/paper/main/790
Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the semantics and use of <table> tags. The multitude of ways in which tables are implemented makes these approaches difficult to scale. In this paper, we address the problem of domain-independent information extraction from web tables by shifting the focus from the tree-based representation of web pages to the two-dimensional representation as intended by human authors for human readers. This additional visual information allows us to bridge the gap between syntax and domain-dependent semantics. We show that this approach yields a new set of features that allow each of the steps of table location, recognition, and interpretation to work without relying on domain-specific knowledge or domain-specific table templates.