Microsoft
- Publication date
- Publisher
Abstract
Near-duplicate detection is not only an important pre and post processing task in Information Retrieval but also an effective spam-detection technique. Among different approaches to near-replica detection methods based on document signatures are particularly attractive due to their scalability to massive document collections and their ability to handle high throughput rates. Their weakness lies in the potential brittleness of signatures to small changes in content, which makes them vulnerable to various types of noise. In the important spam-filtering application, this vulnerability can also be exploited by dedicated attackers aiming to maximally fragment signatures corresponding to the same email campaign. We focus on the I-Match algorithm and present a method of strengthening it by considering the usage context when deciding which portions of a document should affect signature generation. This substantially (almost 100-fold in some cases) increases the difficulty of dedicated attacks and provides effective protection against document noise in non-adversarial settings. Our analysis is supported by experiments using a real email collection. 1