Presented at: 19th International World Wide Web Conference (WWW2010)
by Ping Li, Arnd König
This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance) with applications in information retrieval, data management, social networks and computational advertising. By only storing the lowest b bits of each (minwise) hashed value (e.g., b=1 or 2), one can gain substantial advantages in terms of computational efficiency and storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any $b$. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space at least by a factor of 21.3 (or 10.7) compared to using b=64 (or b=32), if one is interested in resemblance >0.5. Our theoretical results are validated using a proprietary collection of $10^6$ news articles and a public (UCI) dataset of $300,000$ NYTimes articles.
Keywords: Efficient algorithms for large-scale analysis
Resource URI on the dog food server: http://data.semanticweb.org/conference/www/2010/paper/main/398
Explore this resource elsewhere: