3 research outputs found

    Using Redundant Bit Vectors for Near-Duplicate Image Detection

    Abstract. Images are amongst the most widely proliferated forms of digital information, owing to affordable imaging technologies and the Web. In such an environment, using digital watermarking to detect image copyright infringement is a challenge. For such tasks, near-duplicate image detection is increasingly attractive because it allows automated content analysis; moreover, the application domain also extends to data management. The combination of PCA-SIFT features and Locality-Sensitive Hashing (LSH) for indexing and retrieval has been shown to be highly effective for this task. In this work, we prune the number of PCA-SIFT features and introduce a modified Redundant Bit Vector (RBV) index. This is the first application of the RBV index to show near-perfect effectiveness. Using the best parameters of our RBV approach, we observe an average recall of 91% and precision of 98%, with query response times of under 10 seconds on a collection of 20,000 images. Compared to the baseline LSH index, the query response time of the RBV index is 12 times faster and its index size is 126 times smaller. Compared to brute-force sequential scan, the RBV index rapidly reduces the search space to 1/80
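
    As a rough illustration of the descriptor-plus-hashing pipeline this abstract describes, the sketch below indexes local descriptors with random-hyperplane locality-sensitive hashing and ranks candidate images by collision votes. It is a minimal sketch under stated assumptions, not the paper's PCA-SIFT or RBV implementation: the descriptors are synthetic stand-ins, and the hash family, table count, bit width, and vote threshold are illustrative choices.

```python
# Hedged sketch: LSH over local image descriptors for near-duplicate retrieval.
# Synthetic descriptors stand in for pruned PCA-SIFT features; parameters are
# illustrative, not those used in the paper.
import numpy as np
from collections import defaultdict

DIM = 36          # PCA-SIFT descriptors are commonly 36-dimensional
NUM_BITS = 16     # bits per hash key
NUM_TABLES = 4    # several redundant tables improve recall

rng = np.random.default_rng(0)
# One random projection matrix per hash table (random-hyperplane LSH).
projections = [rng.normal(size=(NUM_BITS, DIM)) for _ in range(NUM_TABLES)]

def hash_keys(descriptor):
    """One integer key per table for a single local descriptor."""
    keys = []
    for P in projections:
        bits = (P @ descriptor) > 0          # sign of each random projection
        key = 0
        for b in bits:
            key = (key << 1) | int(b)
        keys.append(key)
    return keys

# Index: for each table, hash key -> set of image ids with a colliding descriptor.
tables = [defaultdict(set) for _ in range(NUM_TABLES)]

def index_image(image_id, descriptors):
    for d in descriptors:
        for t, key in enumerate(hash_keys(d)):
            tables[t][key].add(image_id)

def query(descriptors, min_votes=3):
    """Rank indexed images by how many query descriptors collide with them."""
    votes = defaultdict(int)
    for d in descriptors:
        for t, key in enumerate(hash_keys(d)):
            for image_id in tables[t].get(key, ()):
                votes[image_id] += 1
    return sorted((i for i, v in votes.items() if v >= min_votes),
                  key=lambda i: -votes[i])

# Toy usage with synthetic descriptors standing in for real image features.
original = rng.normal(size=(50, DIM))
near_dup = original + 0.01 * rng.normal(size=(50, DIM))   # mild alteration
unrelated = rng.normal(size=(50, DIM))
index_image("original.jpg", original)
index_image("unrelated.jpg", unrelated)
print(query(near_dup))    # "original.jpg" should rank first
```

    In a real system the synthetic vectors would be replaced by descriptors extracted from the images, and the number of tables, bits per key, and vote threshold would be tuned to trade off recall, precision, and index size along the lines reported in the abstract.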

    Detection of near-duplicates in large image collections

    The vast numbers of images on the Web include many duplicates, and an even larger number of near-duplicate variants derived from the same original. These include thumbnails stored by search engines, copies shared by various news portals, and images that appear on multiple web sites, legitimately or otherwise. Such near-duplicates appear in the results of many web image searches; they constitute redundancy and may also represent infringements of copyright. Digital images can be easily altered through simple digital manipulation such as conversion to grey-scale, colour balance change, rescaling, rotation, and cropping. Any of these operations defeats simple duplicate detection methods such as bit-level hashing. The ability to detect such variants with a reasonable degree of reliability and accuracy would support reduction of redundancy in collections and in the presentation of search results, and would also allow detection of possible copyright violations. Some existing methods for identifying near-duplicates are derived from computer vision techniques; these have shown high effectiveness for this domain, but are computationally expensive, and therefore impractical for large image collections. Other methods address the problem using conventional CBIR approaches that are more efficient but are typically not as robust. None of the previous methods have addressed the problem in its entirety, and none have addressed the large-scale near-duplicate problem on the Web; there has been no analysis of the kinds of alterations that are common on the Web, nor any evaluation of whether real cases of near-duplication can in fact be identified.

    In this thesis, we analyse the different types of alterations and near-duplicates present in a range of popular web image searches, and establish a collection and an evaluation ground truth using real-world near-duplicate examples. We present a simple ranking approach to reduce the number of local descriptors, and therefore improve the efficiency of the descriptor-based retrieval method for near-duplicate detection. The descriptor-based method has been shown to produce near-perfect detection of near-duplicates, but was previously computationally very expensive. We show that, while maintaining comparable effectiveness, our method scales well to collections of hundreds of thousands of images. We also explore a more compact indexing structure to support near-duplicate image detection.

    We develop a method to automatically detect the pair-wise near-duplicate relationship of images without the use of a query. We adapt the hash-based probabilistic counting method, originally used for near-duplicate text document detection, to work with local descriptors; our adaptation offers the first effective and efficient non-query-based approach to this domain. We further incorporate our pair-wise detection approach into clustering of near-duplicates. We present a clustering method specifically for near-duplicate images; it is arguably the first clustering method to achieve a high level of effectiveness in this domain. We also show that near-duplicates within a collection of a million images can be effectively clustered using our approach in less than an hour, using relatively modest computational resources.

    Overall, our proposed methods provide practical approaches to the detection and management of near-duplicate images in large collections.
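
    The query-free, pair-wise detection and clustering described above can be illustrated, very loosely, by counting hash collisions between the local descriptors of different images and then taking connected components of the resulting near-duplicate graph. The sketch below is a simplified stand-in rather than the thesis's adapted probabilistic-counting method; the descriptor hashing, the min_shared threshold, and the union-find clustering are assumptions made for illustration only.

```python
# Hedged sketch: non-query-based pair-wise near-duplicate detection by counting
# shared descriptor hash keys, followed by connected-component clustering.
# Not the thesis's pipeline; descriptors and thresholds are illustrative.
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(1)
DIM, NUM_BITS = 36, 24
P = rng.normal(size=(NUM_BITS, DIM))      # one random-hyperplane hash family

def descriptor_key(d):
    """Single integer hash key for one local descriptor."""
    bits = (P @ d) > 0
    key = 0
    for b in bits:
        key = (key << 1) | int(b)
    return key

def pairwise_near_duplicates(images, min_shared=5):
    """images: dict image_id -> (n, DIM) array of local descriptors.
    Returns image pairs whose descriptors share at least `min_shared` hash keys."""
    buckets = defaultdict(set)                       # hash key -> image ids
    for image_id, descs in images.items():
        for d in descs:
            buckets[descriptor_key(d)].add(image_id)
    shared = defaultdict(int)                        # (id_a, id_b) -> shared keys
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return [pair for pair, n in shared.items() if n >= min_shared]

def cluster(pairs, all_ids):
    """Union-find: connected components of the pair-wise near-duplicate graph."""
    parent = {i: i for i in all_ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]            # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in all_ids:
        groups[find(i)].append(i)
    return [g for g in groups.values() if len(g) > 1]

# Toy collection: "b" and "c" are slight variants of "a"; "d" is unrelated.
a = rng.normal(size=(40, DIM))
images = {"a": a,
          "b": a + 0.01 * rng.normal(size=a.shape),
          "c": a + 0.02 * rng.normal(size=a.shape),
          "d": rng.normal(size=(40, DIM))}
pairs = pairwise_near_duplicates(images)
print(cluster(pairs, list(images)))       # expect a single cluster: a, b, c
```

    Because every descriptor is hashed once and pairs are counted only within buckets, the work grows with the number of colliding descriptors rather than with the total number of image pairs, which is the property that makes a non-query-based scan of a large collection plausible in a sketch like this.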