Presented at: 16th International World Wide Web Conference (WWW2007)
by Gurmeet Manku, Arvind Jain, Anish Das Sarma
Near-duplicate documents are commonly found on the web. A pair of near-duplicate web pages differ from each other in a very small portion. The differences commonly consist of advertisements and timestamps. Such differences are irrelevant for web search. During web crawling, it is useful to quickly ascertain whether a newly crawled web page is a near-duplicate of a previously crawled web page or not.
In the course of developing a practical system for near-duplicate detection, we make two research contributions. First, we demonstrate the effectiveness of Charikar's fingerprinting technique for identifying near-duplicate web pages. We show that for 8 billion web-pages, a good choice of parameters is 64-bit fingerprints and 3-bit Hamming distances. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
Resource URI on the dog food server: http://data.semanticweb.org/conference/www/2007/paper/main/215
Explore this resource elsewhere: