
    Embedding based on function approximation for large scale image search

    The objective of this paper is to design an embedding method that maps local features describing an image (e.g. SIFT) to a higher-dimensional representation useful for the image retrieval problem. First, motivated by the relationship between the linear approximation of a nonlinear function in high-dimensional space and the state-of-the-art feature representation used in image retrieval, i.e., VLAD, we propose a new approach for the approximation. The embedded vectors resulting from the function approximation process are then aggregated to form a single representation for image retrieval. Second, in order to make the proposed embedding method applicable to large-scale problems, we further derive a fast version in which the embedded vectors can be computed efficiently, i.e., in closed form. We compare the proposed embedding methods with the state of the art in the context of image search under various settings: when the images are represented by medium-length vectors, short vectors, or binary vectors. The experimental results show that the proposed embedding methods outperform the state of the art on standard public image retrieval benchmarks.
    Comment: Accepted to TPAMI 2017. The implementation and precomputed features of the proposed F-FAemb are released at the following link: http://tinyurl.com/F-FAem
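    The paper positions its embedding as a generalization of VLAD, so for orientation here is a minimal numpy sketch of the plain VLAD baseline it builds on (not the proposed F-FAemb itself; `descriptors` and `codebook` are assumed inputs):

        import numpy as np

        def vlad(descriptors, codebook):
            """Aggregate local descriptors (n, d) into a VLAD vector
            given a codebook of k centroids (k, d)."""
            # hard-assign each descriptor to its nearest centroid
            dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
            assignments = dists.argmin(axis=1)
            k, d = codebook.shape
            v = np.zeros((k, d))
            for x, c in zip(descriptors, assignments):
                v[c] += x - codebook[c]              # accumulate residuals per centroid
            v = v.reshape(-1)
            v = np.sign(v) * np.sqrt(np.abs(v))      # signed square-root (power-law) normalization
            norm = np.linalg.norm(v)
            return v / norm if norm > 0 else v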

    Selective Deep Convolutional Features for Image Retrieval

    Convolutional Neural Networks (CNNs) are a very powerful approach for extracting discriminative local descriptors for effective image search. Recent work adopts fine-tuning strategies to further improve the discriminative power of the descriptors. Taking a different approach, in this paper we propose a novel framework to achieve competitive retrieval performance. Firstly, we propose various masking schemes, namely SIFT-mask, SUM-mask, and MAX-mask, to select a representative subset of local convolutional features and remove a large number of redundant features. We demonstrate that this effectively addresses the burstiness issue and improves retrieval accuracy. Secondly, we propose to employ recent embedding and aggregating methods to further enhance feature discriminability. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art retrieval accuracy.
    Comment: Accepted to ACM MM 201
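    As a rough illustration of the selection idea, the sketch below keeps only the spatial locations at which some channel of the feature map attains its spatial maximum, which is one plausible reading of a MAX-mask; the paper's exact masking schemes may differ:

        import numpy as np

        def max_mask_select(fmap):
            """Select local convolutional features from an (H, W, C) feature map:
            keep a location only if at least one channel peaks there."""
            h, w, c = fmap.shape
            flat = fmap.reshape(-1, c)          # (H*W, C) local descriptors
            peak_locs = flat.argmax(axis=0)     # per-channel spatial argmax
            keep = np.zeros(h * w, dtype=bool)
            keep[peak_locs] = True
            return flat[keep]                   # representative subset of features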

    Aggregating Deep Features For Image Retrieval

    Measuring visual similarity between two images is useful in several multimedia applications such as visual search and image retrieval. However, measuring visual similarity between two images is an ill-posed problem, which makes it a challenging task. This problem has been tackled extensively by the computer vision and machine learning communities. Nevertheless, with the recent advancements in deep learning, it is now possible to design novel image representations that allow systems to measure visual similarity more accurately than existing and widely adopted approaches, such as Fisher vectors. Unfortunately, deep-learning-based visual similarity approaches typically require post-processing stages that can be computationally expensive. To alleviate this issue, this thesis describes deep-learning-based image representations that allow a system to measure visual similarity without requiring post-processing stages. Specifically, this thesis describes max-pooling-based aggregation layers that, combined with a convolutional-neural-network-based backbone, produce rich image representations for image retrieval without requiring expensive post-processing stages. Moreover, the proposed max-pooling-based aggregation layers are general and can be seamlessly integrated with any existing and pre-trained network. The experiments on large-scale image retrieval datasets confirm that the introduced image representations yield visual similarity measures that achieve comparable or better retrieval performance than state-of-the-art approaches, without requiring expensive post-processing operations.
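    A concrete instance of max-pooling-based aggregation is a MAC-style global descriptor, sketched below in numpy; the thesis integrates such layers into the network itself, so treat this only as the underlying pooling idea:

        import numpy as np

        def mac_descriptor(fmap, eps=1e-12):
            """MAC-style descriptor: max-pool an (H, W, C) convolutional feature
            map over all spatial locations, then L2-normalize the C-vector."""
            v = fmap.reshape(-1, fmap.shape[-1]).max(axis=0)
            return v / (np.linalg.norm(v) + eps)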

    Orientation covariant aggregation of local descriptors with embeddings

    Image search systems based on local descriptors typically achieve orientation invariance by aligning the patches on their dominant orientations. Albeit successful, this choice introduces too much invariance because it does not guarantee that the patches are rotated consistently. This paper introduces an aggregation strategy for local descriptors that achieves this covariance property by jointly encoding the angle in the aggregation stage in a continuous manner. It is combined with an efficient monomial embedding to provide a codebook-free method for aggregating local descriptors into a single vector representation. Our strategy is also compatible with, and employed alongside, several popular encoding methods, in particular bag-of-words, VLAD, and the Fisher vector. Our geometry-aware aggregation strategy is effective for image search, as shown by experiments performed on standard benchmarks for image and particular object retrieval, namely Holidays and Oxford Buildings.
    Comment: European Conference on Computer Vision (2014)
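    A hedged sketch of the covariant-encoding idea: embed each patch angle continuously (here with a small Fourier basis, my choice) and bind it to its descriptor via a Kronecker product before sum-aggregation. The paper additionally applies monomial embeddings to the descriptors and specific normalizations, which are omitted here:

        import numpy as np

        def angle_embedding(theta, m=3):
            """Continuous 2m-dimensional embedding of an orientation angle."""
            ks = np.arange(1, m + 1)
            return np.concatenate([np.cos(ks * theta), np.sin(ks * theta)])

        def covariant_aggregate(descriptors, angles, m=3):
            """Jointly encode each descriptor with its dominant orientation and
            sum-aggregate, making the result covariant with patch rotation."""
            agg = sum(np.kron(angle_embedding(a, m), x)
                      for x, a in zip(descriptors, angles))
            norm = np.linalg.norm(agg)
            return agg / norm if norm > 0 else agg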

    Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features

    The task of open-vocabulary object-centric image retrieval involves the retrieval of images containing a specified object of interest, delineated by an open-set text query. As working on large image datasets becomes standard, solving this task efficiently has gained significant practical importance. Applications include targeted performance analysis of retrieved images using ad-hoc queries and hard example mining during training. Recent advancements in contrastive open-vocabulary systems have yielded remarkable breakthroughs, facilitating large-scale open-vocabulary image retrieval. However, these approaches use a single global embedding per image, thereby constraining the system's ability to retrieve images containing relatively small object instances. Alternatively, incorporating local embeddings from detection pipelines faces scalability challenges, making it unsuitable for retrieval from large databases. In this work, we present a simple yet effective approach to object-centric open-vocabulary image retrieval. Our approach aggregates dense embeddings extracted from CLIP into a compact representation, essentially combining the scalability of image retrieval pipelines with the object identification capabilities of dense detection methods. We show the effectiveness of our scheme on the task by achieving significantly better results than global feature approaches on three datasets, increasing accuracy by up to 15 mAP points. We further integrate our scheme into a large-scale retrieval framework and demonstrate our method's advantages in terms of scalability and interpretability.
    Comment: BMVC 202
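    The abstract does not spell out the aggregation itself, but the gist, compressing dense patch embeddings into a few local representatives so evidence for small objects is not averaged away, can be sketched with a tiny k-means; this is a hypothetical stand-in, and `patch_embs` is assumed to come from CLIP's dense features:

        import numpy as np

        def aggregate_patches(patch_embs, k=8, iters=10, seed=0):
            """Compress dense patch embeddings (n, d) into k L2-normalized
            representatives via a small k-means (assumes n >= k)."""
            rng = np.random.default_rng(seed)
            centers = patch_embs[rng.choice(len(patch_embs), size=k, replace=False)]
            for _ in range(iters):
                d = np.linalg.norm(patch_embs[:, None, :] - centers[None, :, :], axis=2)
                assign = d.argmin(axis=1)
                for j in range(k):
                    members = patch_embs[assign == j]
                    if len(members):
                        centers[j] = members.mean(axis=0)
            return centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)

        def image_score(text_emb, centers):
            """Score an image by the best cosine similarity between the
            open-vocabulary text query embedding and any representative."""
            q = text_emb / (np.linalg.norm(text_emb) + 1e-12)
            return float((centers @ q).max())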

    Image Retrieval via CNNs in TensorFlow2

    This thesis addresses the problem of instance-level image retrieval in large-scale picture collections, intending to find the greatest number of images corresponding to a query. Convolutional neural networks (CNNs) have demonstrated their ability to provide effective descriptors for content-based image retrieval (CBIR). Given the current state of the art, we focused our efforts on utilizing fine-tuned CNNs for global feature extraction, with the goal of applying them to image retrieval problems. Firstly, we examined several methods proposed to improve image retrieval, such as GeM and DELF. As the main result of this thesis, an extendable and highly customizable image retrieval framework based on the work of Radenović et al. was re-implemented in TensorFlow 2. This approach produces state-of-the-art retrieval results while using relatively short descriptors. As validation, we trained the networks on the SfM120k landmark images dataset and performed experiments on two image retrieval benchmarks (revisited Oxford5k and Paris6k). Different training strategies, network architectures, and loss functions were used in the experiments for a comprehensive evaluation of the implemented approach. The final project code was successfully merged into the official TensorFlow repository managed by Google, as part of the DELF research library.
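    GeM, one of the methods mentioned, admits a compact definition; below is a minimal numpy sketch of the pooling formula, whereas the framework itself implements it as a trainable TensorFlow 2 layer with a learnable exponent p:

        import numpy as np

        def gem(fmap, p=3.0, eps=1e-6):
            """Generalized-mean (GeM) pooling of an (H, W, C) feature map:
            clamp activations, raise to power p, average over locations,
            take the 1/p root. p = 1 recovers average pooling; large p
            approaches max pooling."""
            x = np.clip(fmap.reshape(-1, fmap.shape[-1]), eps, None)
            v = (x ** p).mean(axis=0) ** (1.0 / p)
            return v / (np.linalg.norm(v) + 1e-12)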

    Fine-grained Incident Video Retrieval with Video Similarity Learning

    In this thesis, we address the problem of Fine-grained Incident Video Retrieval (FIVR) using video similarity learning methods. FIVR is a video retrieval task that aims to retrieve all videos that depict the same incident given a query video; related video retrieval tasks adopt either very narrow or very broad scopes, considering only near-duplicate or same-event videos. To formulate the case of same-incident videos, we define three video associations taking into account the spatio-temporal spans captured by video pairs. To cover the benchmarking needs of FIVR, we construct a large-scale dataset, called FIVR-200K, consisting of 225,960 YouTube videos from major news events crawled from Wikipedia. The dataset contains four annotation labels according to the FIVR definitions; hence, it can simulate several retrieval scenarios with the same video corpus. To address FIVR, we propose two video-level approaches leveraging features extracted from intermediate layers of Convolutional Neural Networks (CNNs). The first is an unsupervised method that relies on a modified Bag-of-Words scheme, which generates video representations from the aggregation of the frame descriptors based on learned visual codebooks. The second is a supervised method based on Deep Metric Learning, which learns an embedding function that maps videos into a feature space where relevant video pairs are closer than irrelevant ones. However, video-level approaches generate global video representations, losing all spatial and temporal relations between compared videos. Therefore, we propose a video similarity learning approach that captures fine-grained relations between videos for accurate similarity calculation. We train a CNN architecture to compute video-to-video similarity from refined frame-to-frame similarity matrices derived from a pairwise region-level similarity function. The proposed approaches have been extensively evaluated on FIVR-200K and other large-scale datasets, demonstrating their superiority over other video retrieval methods and highlighting the challenging aspects of the FIVR problem.
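    The frame-to-frame similarity matrix at the heart of the final approach can be illustrated with a simple, non-learned Chamfer-style aggregation over L2-normalized frame descriptors; the thesis instead refines this matrix with a trained CNN, so the sketch below is only the starting point:

        import numpy as np

        def chamfer_video_similarity(q_frames, t_frames):
            """Symmetric Chamfer similarity between two videos, given
            L2-normalized frame descriptors of shape (n, d) and (m, d):
            form the frame-to-frame similarity matrix, match each frame
            to its best counterpart, and average both directions."""
            sim = q_frames @ t_frames.T          # (n, m) cosine similarities
            return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())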