Page-level Template Detection via Isotonic Smoothing

Presented at: 16th International World Wide Web Conference (WWW2007)

by Deepayan Chakrabarti, Ravi Kumar, Kunal Punera

We develop a novel framework for the "page-level" template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
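
To give a feel for the smoothing idea, here is a minimal sketch of plain least-squares isotonic regression using the pool-adjacent-violators algorithm. It is not the paper's method: it handles only a single root-to-leaf path rather than the whole DOM tree, it omits the regularization term, and it assumes that monotonicity means scores do not decrease from root to leaf. The function name `isotonic_path` and the sample scores are hypothetical.

```python
def isotonic_path(scores):
    """Least-squares non-decreasing fit to a sequence of scores.

    Illustrative assumption: `scores` are raw classifier outputs for the
    nodes on one root-to-leaf path, ordered from root to leaf.
    """
    blocks = []  # each block holds [sum of scores, number of scores]
    for score in scores:
        blocks.append([float(score), 1])
        # Pool adjacent blocks while the earlier mean exceeds the later mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Every position in a pooled block receives the block mean.
    smoothed = []
    for total, count in blocks:
        smoothed.extend([total / count] * count)
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy scores along one path, root first.
    print(isotonic_path([0.2, 0.9, 0.1, 0.8]))
```

In this sketch the two out-of-order scores in the middle of the path are pooled and replaced by their mean, so the result respects the assumed ordering while staying as close as possible to the raw scores in the squared-error sense.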


Resource URI on the dog food server: http://data.semanticweb.org/conference/www/2007/paper/main/588

