7 research outputs found

    Calculating Sameness: Identifying Early-Modern Image Reuse Outside the Black Box

    Framework for Map Reducing Technique Using Correlation for Duplicate Image Identification Process

    Duplicate image identification is an image deduplication system that prevents duplicate copies of images from being stored on the storage server, reducing the storage space required. The technique improves storage utilization by keeping duplicate images off the storage server, and reduces time complexity by using the MapReduce technique. With the explosive growth of digitization, large volumes of digital data may be uploaded to servers every day, so deduplication schemes are widely used in backup and recovery systems to minimize network and storage overhead by detecting and avoiding redundancy among data. Traditional deduplication schemes work if and only if the second image has exactly the same content as the first. This requirement for exact copies restricts the performance of many applications, and these schemes also suffer from high time complexity when dealing with large amounts of data. In this paper, we propose a duplicate image identification system using the MapReduce technique, which improves the scalability and efficiency of the system. Our approach reduces the time required to identify duplicate images on the storage server by combining MapReduce with a correlation technique.
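
The abstract does not say how the map and reduce steps are defined or which correlation measure is used, so the following is only a minimal single-process sketch under stated assumptions. Images are assumed to be flat lists of grayscale pixel values, the map key is assumed to be the pixel count, and the Pearson correlation with a hypothetical threshold of 0.95 stands in for the paper's correlation technique. The function names are illustrative and not taken from the paper.

```python
from collections import defaultdict
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length pixel lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Flat images have no defined correlation; fall back to equality.
        return 1.0 if a == b else 0.0
    return cov / sqrt(var_a * var_b)

def map_phase(images):
    """Emit (key, value) pairs; the key is the pixel count (an assumption)."""
    for name, pixels in images.items():
        yield len(pixels), (name, pixels)

def reduce_phase(grouped, threshold):
    """Within each key group, report images that correlate with a kept image."""
    duplicates = []
    for items in grouped.values():
        kept = []
        for name, pixels in items:
            match = next(
                (k for k, kp in kept if pearson(pixels, kp) >= threshold), None
            )
            if match is None:
                kept.append((name, pixels))
            else:
                duplicates.append((name, match))
    return duplicates

def find_duplicates(images, threshold=0.95):
    """Group map output by key (the shuffle step), then reduce each group."""
    grouped = defaultdict(list)
    for key, value in map_phase(images):
        grouped[key].append(value)
    return reduce_phase(grouped, threshold)

images = {
    "a": [10, 20, 30, 40],
    "b": [12, 22, 32, 42],   # brightness-shifted copy of "a"
    "c": [40, 10, 35, 5],
    "d": [1, 2, 3],          # different size, so a different key
}
print(find_duplicates(images))
```

Because correlation is unaffected by a uniform brightness shift, "b" is reported as a duplicate of "a" even though the pixel values differ, which is the kind of case an exact-match scheme would miss.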

    A picture is worth a thousand words : content-based image retrieval techniques

    In my dissertation I investigate techniques for improving the state of the art in content-based image retrieval. To place my work into context, I highlight the current trends and challenges in my field by analyzing over 200 recent articles. Next, I propose a novel paradigm called artificial imagination, which gives the retrieval system the power to imagine and think along with the user in terms of what she is looking for. I then introduce a new user interface for visualizing and exploring image collections, empowering the user to navigate large collections based on her own needs and preferences, while simultaneously providing her with an accurate sense of what the database has to offer. In the later chapters I present work dealing with millions of images and focus in particular on high-performance techniques that minimize memory and computational use for both near-duplicate image detection and web search. Finally, I show early work on a scene completion-based image retrieval engine, which synthesizes realistic imagery that matches what the user has in mind.
    LEI Universiteit Leiden; NWO; Imagin

    The role of context in image annotation and recommendation

    With the rise of smart phones, lifelogging devices (e.g. Google Glass) and the popularity of image sharing websites (e.g. Flickr), users are capturing and sharing every aspect of their lives online, producing a wealth of visual content. Of these uploaded images, the majority are poorly annotated or exist in complete semantic isolation, making the process of building retrieval systems difficult, as one must first understand the meaning of an image in order to retrieve it. To alleviate this problem, many image sharing websites offer manual annotation tools which allow users to “tag” their photos. However, these techniques are laborious and as a result have been poorly adopted: Sigurbjörnsson and van Zwol (2008) showed that 64% of images uploaded to Flickr are annotated with < 4 tags. Due to this, an entire body of research has focused on the automatic annotation of images (Hanbury, 2008; Smeulders et al., 2000; Zhang et al., 2012a), where one attempts to bridge the semantic gap between an image’s appearance and its meaning, e.g. the objects present. Despite two decades of research, the semantic gap still largely exists, and as a result automatic annotation models often offer unsatisfactory performance for industrial implementation. Further, these techniques can only annotate what they see, thus ignoring the “bigger picture” surrounding an image (e.g. its location, the event, the people present etc.). Much work has therefore focused on building photo tag recommendation (PTR) methods, which aid the user in the annotation process by suggesting tags related to those already present. These works have mainly focused on computing relationships between tags based on historical images, e.g. that NY and timessquare co-exist in many images and are therefore highly correlated. However, tags are inherently noisy, sparse and ill-defined, often resulting in poor PTR accuracy, e.g. does NY refer to New York or New Year?
    This thesis proposes the exploitation of an image’s context which, unlike textual evidence, is always present, in order to alleviate this ambiguity in the tag recommendation process. Specifically, we exploit the “what, who, where, when and how” of the image capture process in order to complement textual evidence in various photo tag recommendation and retrieval scenarios. In part II, we combine text, content-based (e.g. # of faces present) and contextual (e.g. day-of-the-week taken) signals for tag recommendation purposes, achieving up to a 75% improvement in precision@5 in comparison to a text-only TF-IDF baseline. We then consider external knowledge sources (i.e. Wikipedia & Twitter) as an alternative to (slower moving) Flickr on which to build recommendation models, showing that similar accuracy could be achieved on these faster moving, yet entirely textual, datasets. In part II, we also highlight the merits of diversifying tag recommendation lists before discussing at length various problems with existing automatic image annotation and photo tag recommendation evaluation collections. In part III, we propose three new image retrieval scenarios, namely “visual event summarisation”, “image popularity prediction” and “lifelog summarisation”. In the first scenario, we attempt to produce a ranking of relevant and diverse images for various news events by (i) removing irrelevant images such as memes and visual duplicates, before (ii) semantically clustering images based on the tweets in which they were originally posted. Using this approach, we were able to achieve over 50% precision for images in the top 5 ranks. In the second retrieval scenario, we show that by combining contextual and content-based features from images, we are able to predict whether an image will become “popular” (or not) with 74% accuracy, using an SVM classifier.
    Finally, in chapter 9 we employ blur detection and perceptual-hash clustering in order to remove noisy images from lifelogs, before combining visual and geo-temporal signals in order to capture a user’s “key moments” within their day. We believe that the results of this thesis show an important step towards building effective image retrieval models when sufficient textual content is lacking (i.e. a cold start).
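
The abstract names perceptual-hash clustering without giving details, so the sketch below is one possible reading and not the thesis's implementation. It assumes a simple average hash over a small grayscale grid, a Hamming-distance comparison, and a greedy clustering pass with a hypothetical `max_distance` of 1. All function names are illustrative.

```python
def average_hash(pixels):
    """Build an integer hash: one bit per pixel, set when above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")

def cluster_by_hash(hashes, max_distance=1):
    """Greedy clustering: join the first cluster whose representative is close."""
    clusters = []  # list of (representative_hash, [image names])
    for name, h in hashes.items():
        for rep, members in clusters:
            if hamming(rep, h) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [members for _, members in clusters]

# Tiny 2x2 "images"; real use would first downscale photos to a small grid.
photos = {
    "a": [[10, 200], [220, 15]],
    "b": [[20, 210], [230, 25]],   # brighter copy of "a"
    "c": [[200, 10], [15, 220]],   # inverted pattern
}
hashes = {name: average_hash(px) for name, px in photos.items()}
print(cluster_by_hash(hashes))
```

Keeping one image per cluster would then remove the visual repeats, which is how such a step could thin out a lifelog before key moments are selected.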

    Detection of near-duplicates in large image collections

    The vast numbers of images on the Web include many duplicates, and an even larger number of near-duplicate variants derived from the same original. These include thumbnails stored by search engines, copies shared by various news portals, and images that appear on multiple web sites, legitimately or otherwise. Such near-duplicates appear in the results of many web image searches, constitute redundancy, and may also represent infringements of copyright. Digital images can be easily altered through simple digital manipulation such as conversion to grey-scale, colour balance change, rescaling, rotation, and cropping. Any of these operations defeats simple duplicate detection methods such as bit-level hashing. The ability to detect such variants with a reasonable degree of reliability and accuracy would support reduction of redundancy in collections and in presentation of search results, and also allow detection of possible copyright violations. Some existing methods for identifying near-duplicates are derived from computer vision techniques; these have shown high effectiveness for this domain, but are computationally expensive, and therefore impractical for large image collections. Other methods address the problem using conventional CBIR approaches that are more efficient but are typically not as robust. None of the previous methods have addressed the problem in its entirety, and none have addressed the large scale near-duplicate problem on the Web; there has been no analysis of the kinds of alterations that are common on the Web, nor any evaluation of whether real cases of near-duplication can in fact be identified. In this thesis, we analyse the different types of alterations and near-duplicates present in a range of popular web image searches, and establish a collection and evaluation ground truth using real-world near-duplicate examples.
    We present a simple ranking approach to reduce the number of local descriptors, and therefore improve the efficiency of the descriptor-based retrieval method for near-duplicate detection. The descriptor-based method has been shown to produce near-perfect detection of near-duplicates, but was previously computationally very expensive. We show that while maintaining comparable effectiveness, our method scales well for large collections of hundreds of thousands of images. We also explore a more compact indexing structure to support near-duplicate image detection. We develop a method to automatically detect the pair-wise near-duplicate relationship of images without the use of a query. We adapt the hash-based probabilistic counting method (originally used for near-duplicate text document detection) to work with the local descriptors; our adaptation offers the first effective and efficient non-query-based approach to this domain. We further incorporate our pair-wise detection approach for clustering of near-duplicates. We present a clustering method specifically for near-duplicate images, where our method is arguably the first clustering method to achieve a high level of effectiveness in this domain. We also show that near-duplicates within a large collection of a million images can be effectively clustered using our approach in less than an hour using relatively modest computational resources. Overall, our proposed methods provide practical approaches to the detection and management of near-duplicate images in large collections.
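
The pipeline described above (query-free pair-wise detection followed by clustering) can be illustrated with a small sketch. It does not reproduce the thesis's hash-based probabilistic counting method. As a stand-in, it assumes each image is represented by a set of integer IDs for its quantized local descriptors, uses exact Jaccard similarity with a hypothetical threshold of 0.5 to decide the pair-wise relationship, and merges related pairs with a union-find structure. All names are illustrative.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two descriptor-ID sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def near_duplicate_pairs(descriptors, threshold=0.5):
    """Compare every pair of images directly; no query image is needed."""
    return [
        (x, y)
        for x, y in combinations(sorted(descriptors), 2)
        if jaccard(descriptors[x], descriptors[y]) >= threshold
    ]

def cluster(names, pairs):
    """Merge pair-wise relationships into clusters using union-find."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for x, y in pairs:
        parent[find(x)] = find(y)

    groups = {}
    for n in sorted(names):
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())

descriptors = {
    "orig": {1, 2, 3, 4},
    "crop": {2, 3, 4},         # cropping loses some descriptors
    "grey": {1, 2, 3, 4, 5},   # grey-scale conversion adds a spurious one
    "other": {9, 10},
}
pairs = near_duplicate_pairs(descriptors)
print(cluster(descriptors, pairs))
```

This version compares all pairs, so its cost grows quadratically with collection size. Avoiding that cost is the role the abstract assigns to the hash-based approach, which this sketch deliberately leaves out.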

    Detection of Near-duplicate Images for Web Search

    Among the vast numbers of images on the web are many duplicates and near-duplicates, that is, variants derived from the same original image. Such near-duplicates appear in many web image searches and may represent infringements of copyright or indicate the presence of redundancy. While methods for identifying near-duplicates have been investigated, there has been no analysis of the kinds of alterations that are common on the web or evaluation of whether real cases of near-duplication can in fact be identified. In this paper we use popular queries and a commercial image search service to collect images that we then manually analyse for instances of near-duplication. We show that such duplication is indeed significant, but that not all kinds of image alteration explored in previous literature are evident in web data. Removal of near-duplicates from a collection is impractical, but we propose that they be removed from sets of answers. We evaluate our technique for automatic identification of near duplicates during query evaluation and show that it has promise as an effective mechanism for management of near-duplication in practice