Extracting Article Text from the Web with Maximum Subsequence Segmentation

Presented at: 18th International World Wide Web Conference (WWW2009)

by Jeff Pasternack, Dan Roth

Webpage: http://www2009.eprints.org/98/1/p971.pdf

Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are excellent (our semi-supervised algorithm yields an overall F1 score of 97.947%), surpassing even those produced by VIPS with a hypothetical perfect block-selection heuristic. We also evaluate against the recent CleanEval shared task, with surprisingly good cross-task performance in cleaning general web pages, exceeding the top "text-only" score (based on Levenshtein distance), 87.8% versus 84.1%.
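The core idea named in the abstract, a global optimization over per-token classifier outputs, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each token already has a classifier probability of belonging to the article, turns that probability into a signed score by subtracting an assumed threshold of 0.5, and then finds the contiguous run of tokens with the largest total score in a single linear pass. The function names `max_subsequence` and `extract_article`, the threshold, and the sample probabilities in the usage note are all hypothetical.

```python
from typing import List, Sequence, Tuple


def max_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Find the contiguous run of scores with the largest sum.

    Returns (start, end, total) with `end` exclusive. If no run has a
    positive sum, the empty run (0, 0, 0.0) is returned.
    """
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running total cannot help; restart the run here.
            cur_start, cur_sum = i, score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


def extract_article(tokens: List[str], probs: Sequence[float],
                    threshold: float = 0.5) -> List[str]:
    """Select the token span that maximizes the summed signed scores.

    `probs[i]` is an assumed local-classifier estimate that `tokens[i]`
    is article text; `threshold` (hypothetical default 0.5) converts it
    to a positive or negative score.
    """
    scores = [p - threshold for p in probs]
    start, end, _ = max_subsequence(scores)
    return tokens[start:end]
```

For example, with tokens `["Home", "Login", "Markets", "rallied", "today", "as", "Share", "Comments"]` and made-up probabilities `[0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.2, 0.1]`, `extract_article` returns `["Markets", "rallied", "today", "as"]`. The token "today" is kept despite its low individual probability because the surrounding high-scoring tokens outweigh it, which is the benefit of optimizing over the whole sequence rather than thresholding each token independently.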

Keywords: XML and Web Data


Resource URI on the dog food server: http://data.semanticweb.org/conference/www/2009/paper/98

