Presented at: 19th International World Wide Web Conference (WWW2010)
by Suju Rajan, Dragomir Yankov, Scott Gaffney, Adwait Ratnaparkhi
Many web applications, such as ad matching systems, vertical search engines, and page categorization systems, require the identification of a particular type or class of pages on the Web. The sheer number and diversity of pages on the Web, however, make it hard to obtain a good sample of the class of interest. In this paper, we describe a successfully deployed end-to-end system that starts from a manually collected, biased training sample and uses several state-of-the-art machine learning systems working in tandem, including a powerful active learning component, to achieve a good classification system. The system is evaluated on the traffic to a real-world ad-matching platform and is shown to significantly reduce editorial effort and labeling time while maintaining pre-specified performance criteria.
Keywords: Negative content filtering, porn, spam, viruses
Resource URI on the dog food server: http://data.semanticweb.org/conference/www/2010/paper/main/66