526 research outputs found
Embedding based on function approximation for large scale image search
The objective of this paper is to design an embedding method that maps local
features describing an image (e.g., SIFT) to a higher-dimensional representation
useful for the image retrieval problem. First, motivated by the relationship
between the linear approximation of a nonlinear function in high-dimensional
space and the state-of-the-art feature representation used in image retrieval,
i.e., VLAD, we propose a new approach for the approximation. The embedded
vectors resulting from the function approximation process are then aggregated to
form a single representation for image retrieval. Second, in order to make the
proposed embedding method applicable to large-scale problems, we further derive
its fast version in which the embedded vectors can be efficiently computed,
i.e., in closed form. We compare the proposed embedding methods with the
state of the art in the context of image search under various settings: when
the images are represented by medium-length vectors, short vectors, or binary
vectors. The experimental results show that the proposed embedding methods
outperform the state of the art on standard public image retrieval benchmarks.
Comment: Accepted to TPAMI 2017. The implementation and precomputed features
of the proposed F-FAemb are released at the following link:
http://tinyurl.com/F-FAem
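The embed-then-aggregate pipeline this abstract builds on can be illustrated with plain VLAD, the baseline representation the paper starts from. This is a minimal sketch, not the proposed F-FAemb; the function name `vlad_aggregate` is hypothetical.

```python
import numpy as np

def vlad_aggregate(descriptors, codebook):
    """Embed each local descriptor as its residual to the nearest
    codeword, then sum-aggregate the residuals into a single
    image-level vector (the VLAD representation)."""
    # Hard-assign each descriptor to its nearest codeword
    dists = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)
    k, d = codebook.shape
    agg = np.zeros((k, d))
    for i, a in enumerate(assign):
        agg[a] += descriptors[i] - codebook[a]  # residual embedding
    v = agg.ravel()                             # concatenate per-codeword sums
    return v / (np.linalg.norm(v) + 1e-12)      # L2-normalize the final vector
```

The output dimensionality is `k * d`, i.e., the "higher-dimensional representation" the abstract refers to; F-FAemb replaces the hard residual embedding with a function-approximation-based one computed in closed form.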
Selective Deep Convolutional Features for Image Retrieval
Convolutional Neural Networks (CNNs) are a very powerful approach to extracting
discriminative local descriptors for effective image search. Recent work adopts
fine-tuning strategies to further improve the discriminative power of the
descriptors. Taking a different approach, in this paper we propose a novel
framework to achieve competitive retrieval performance. Firstly, we propose
various masking schemes, namely SIFT-mask, SUM-mask, and MAX-mask, to select a
representative subset of local convolutional features and remove a large number
of redundant features. We demonstrate that this can effectively address the
burstiness issue and improve retrieval accuracy. Secondly, we propose to employ
recent embedding and aggregating methods to further enhance feature
discriminability. Extensive experiments demonstrate that our proposed framework
achieves state-of-the-art retrieval accuracy.
Comment: Accepted to ACM MM 201
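The idea of masking a convolutional feature map down to a representative subset can be sketched as follows. This is an illustrative guess at a MAX-mask-style rule (keep only spatial locations that are maximal in at least one channel), not the paper's exact implementation; `max_mask_select` is a hypothetical name.

```python
import numpy as np

def max_mask_select(fmap):
    """From a CNN feature map of shape (H, W, C), keep only the local
    features at spatial locations that attain the maximum activation
    in at least one channel, discarding the redundant rest."""
    h, w, c = fmap.shape
    flat = fmap.reshape(h * w, c)            # one local descriptor per location
    keep = np.unique(np.argmax(flat, axis=0))  # locations winning some channel
    return flat[keep]                        # selected local descriptors
```

At most `C` of the `H * W` locations survive, so the subsequent embedding and aggregation steps operate on far fewer, less bursty features.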
Aggregating Deep Features For Image Retrieval
Measuring visual similarity between two images is useful in several multimedia applications such as visual search and image retrieval. However, measuring visual similarity between two images is an ill-posed problem, which makes it a challenging task. This problem has been tackled extensively by the computer vision and machine learning communities. Nevertheless, with the recent advancements in deep learning, it is now possible to design novel image representations that allow systems to measure visual similarity more accurately than existing and widely adopted approaches, such as Fisher vectors. Unfortunately, deep-learning-based visual similarity approaches typically require post-processing stages that can be computationally expensive. To alleviate this issue, this thesis describes deep-learning-based visual image representations that allow a system to measure visual similarity without requiring post-processing stages. Specifically, this thesis describes max-pooling-based aggregation layers that, combined with a convolutional neural network, produce rich image representations for image retrieval without requiring expensive post-processing stages. Moreover, the proposed max-pooling-based aggregation layers are general and can be seamlessly integrated with any existing and pre-trained network. The experiments on large-scale image retrieval datasets confirm that the introduced image representations yield visual similarity measures that achieve comparable or better retrieval performance than state-of-the-art approaches, without requiring expensive post-processing operations.
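A basic max-pooling aggregation of a convolutional feature map, in the spirit of the layers this thesis describes, can be sketched like this. It is a minimal MAC-style stand-in under the assumption of a single global pooling region, not the thesis's actual layer; `mac_descriptor` is a hypothetical name.

```python
import numpy as np

def mac_descriptor(fmap):
    """Max-pool a conv feature map of shape (H, W, C) over its spatial
    grid, yielding one C-dimensional global descriptor per image; the
    L2 normalization makes dot products act as cosine similarities,
    so no expensive post-processing stage is needed."""
    v = fmap.max(axis=(0, 1))                # per-channel spatial maximum
    return v / (np.linalg.norm(v) + 1e-12)   # unit-normalize for retrieval
```

Because the layer is just a pooling operation, it can sit on top of any pre-trained network's final convolutional output.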
Orientation covariant aggregation of local descriptors with embeddings
Image search systems based on local descriptors typically achieve orientation
invariance by aligning the patches on their dominant orientations. Albeit
successful, this choice introduces too much invariance because it does not
guarantee that the patches are rotated consistently. This paper introduces an
aggregation strategy for local descriptors that achieves this covariance
property by jointly encoding the angle in the aggregation stage in a continuous
manner. It is combined with an efficient monomial embedding to provide a
codebook-free method for aggregating local descriptors into a single vector
representation. Our strategy is also compatible with and employed alongside
several popular encoding methods, in particular bag-of-words, VLAD, and the
Fisher vector. Our geometry-aware aggregation strategy is effective for image
search, as shown by experiments performed on standard benchmarks for image and
particular object retrieval, namely Holidays and Oxford buildings.
Comment: European Conference on Computer Vision (2014
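The key covariance property, that consistently rotating every patch should transform the aggregate predictably rather than be thrown away, can be illustrated with a toy continuous angle encoding via a complex phase. This is a simplified sketch of the principle, not the paper's monomial-embedding formulation; `angle_covariant_aggregate` is a hypothetical name.

```python
import numpy as np

def angle_covariant_aggregate(descs, angles):
    """Jointly encode each patch's dominant orientation as a complex
    phase before summing, so the angle enters the aggregation stage
    continuously instead of being normalized out."""
    phase = np.exp(1j * np.asarray(angles))      # one unit phase per patch
    return (descs * phase[:, None]).sum(axis=0)  # phase-weighted aggregation
```

Rotating all patches by the same angle `delta` multiplies the aggregate by `exp(1j * delta)`: the representation is covariant with a global rotation, yet inconsistent per-patch rotations no longer collapse to the same vector.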
Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features
The task of open-vocabulary object-centric image retrieval involves the
retrieval of images containing a specified object of interest, delineated by an
open-set text query. As working on large image datasets becomes standard,
solving this task efficiently has gained significant practical importance.
Applications include targeted performance analysis of retrieved images using
ad-hoc queries and hard example mining during training. Recent advancements in
contrastive-based open vocabulary systems have yielded remarkable
breakthroughs, facilitating large-scale open vocabulary image retrieval.
However, these approaches use a single global embedding per image, thereby
constraining the system's ability to retrieve images containing relatively
small object instances. Alternatively, incorporating local embeddings from
detection pipelines faces scalability challenges, making it unsuitable for
retrieval from large databases.
In this work, we present a simple yet effective approach to object-centric
open-vocabulary image retrieval. Our approach aggregates dense embeddings
extracted from CLIP into a compact representation, essentially combining the
scalability of image retrieval pipelines with the object identification
capabilities of dense detection methods. We show the effectiveness of our
scheme on this task by achieving significantly better results than global
feature approaches on three datasets, increasing accuracy by up to 15 mAP
points. We further integrate our scheme into a large-scale retrieval framework
and demonstrate our method's advantages in terms of scalability and
interpretability.
Comment: BMVC 202
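The contrast the abstract draws, one compact vector per image built from dense embeddings rather than a single global embedding, can be sketched with simple pooling of normalized patch embeddings. This is an illustrative stand-in, not the paper's aggregation scheme; both function names are hypothetical, and real dense features would come from a CLIP backbone rather than random arrays.

```python
import numpy as np

def aggregate_patch_embeddings(patch_emb):
    """Mean-pool L2-normalized dense (patch-level) embeddings of shape
    (num_patches, dim) into one compact image vector, then renormalize,
    keeping the index as scalable as a global-feature pipeline."""
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    v = p.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def retrieve(query_emb, image_vectors):
    """Rank images by cosine similarity to a unit-norm text query
    embedding; image_vectors has one aggregated row per image."""
    return np.argsort(-(image_vectors @ query_emb))
```

Retrieval cost is one matrix-vector product over the database, regardless of how many dense patch embeddings each image originally had.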
Image Retrieval via CNNs in TensorFlow 2
This thesis addresses the problem of instance-level image retrieval in large-scale picture collections, intending to find the greatest number of images corresponding to a query. Convolutional neural networks (CNNs) have demonstrated their ability to provide effective descriptors for content-based image retrieval (CBIR). Given the current state of knowledge, we focused our efforts on utilizing fine-tuned CNNs for global feature extraction, with the goal of using those features for image retrieval problems. Firstly, we examined several methods proposed to improve image retrieval, such as GeM and DELF. As the main result of this thesis, an extendable and highly customizable image retrieval framework based on the work of Radenović et al. was re-implemented in TensorFlow 2.
This approach produces state-of-the-art retrieval results while using relatively short descriptors. For validation, we trained the networks on the SfM120k landmark images dataset and performed experiments on two image retrieval benchmarks (revisited Oxford5k and Paris6k). Different training strategies, network architectures, and loss functions were used in the experiments. The final project code was successfully merged into the official TensorFlow repository managed by Google, as part of the DELF research library.
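GeM, one of the methods the thesis examines, generalizes average and max pooling with a learnable exponent. A minimal numpy sketch of the pooling step (not the thesis's TensorFlow 2 implementation; `gem_pool` is a hypothetical name):

```python
import numpy as np

def gem_pool(fmap, p=3.0):
    """Generalized-mean (GeM) pooling over the spatial grid of a conv
    feature map of shape (H, W, C); p = 1 recovers average pooling and
    large p approaches max pooling."""
    x = np.clip(fmap, 1e-6, None)  # GeM assumes non-negative (post-ReLU) activations
    v = (x ** p).mean(axis=(0, 1)) ** (1.0 / p)
    return v / (np.linalg.norm(v) + 1e-12)  # short, unit-norm global descriptor
```

In the fine-tuned networks the exponent `p` is typically learned jointly with the backbone, which is part of why such short descriptors remain competitive.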
Fine-grained Incident Video Retrieval with Video Similarity Learning.
PhD Theses
In this thesis, we address the problem of Fine-grained Incident Video Retrieval (FIVR)
using video similarity learning methods. FIVR is a video retrieval task that aims to
retrieve all videos that depict the same incident given a query video; related video
retrieval tasks adopt either very narrow or very broad scopes, considering only near-duplicate
or same-event videos. To formulate the case of same-incident videos, we
define three video associations taking into account the spatio-temporal spans captured
by video pairs. To cover the benchmarking needs of FIVR, we construct a large-scale
dataset, called FIVR-200K, consisting of 225,960 YouTube videos from major news
events crawled from Wikipedia. The dataset contains four annotation labels according
to the FIVR definitions; hence, it can simulate several retrieval scenarios with the same
video corpus. To address FIVR, we propose two video-level approaches leveraging
features extracted from intermediate layers of Convolutional Neural Networks (CNNs).
The first is an unsupervised method that relies on a modified Bag-of-Words scheme,
which generates video representations from the aggregation of the frame descriptors
based on learned visual codebooks. The second is a supervised method based on Deep
Metric Learning, which learns an embedding function that maps videos into a feature
space where relevant video pairs are closer than irrelevant ones. However, video-level
approaches generate global video representations, losing all spatial and temporal
relations between compared videos. Therefore, we propose a video similarity learning
approach that captures fine-grained relations between videos for accurate similarity
calculation. We train a CNN architecture to compute video-to-video similarity from
refined frame-to-frame similarity matrices derived from a pairwise region-level similarity
function. The proposed approaches have been extensively evaluated on FIVR-200K
and other large-scale datasets, demonstrating their superiority over other video
retrieval methods and highlighting the challenging aspect of the FIVR problem
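The frame-to-frame similarity matrix that underpins the fine-grained approach can be sketched directly from frame descriptors. This is a simplified illustration with a Chamfer-style symmetric score standing in for the learned refinement network; `frame_similarity` is a hypothetical name.

```python
import numpy as np

def frame_similarity(a, b):
    """Given two videos as (num_frames, dim) descriptor arrays, build the
    cosine frame-to-frame similarity matrix and derive a simple symmetric
    Chamfer-style video-to-video score from it."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a @ b.T  # sim[i, j] = cosine(frame i of a, frame j of b)
    # Average best-match similarity in both directions
    score = 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())
    return sim, score
```

Unlike a global video vector, the matrix preserves which frames of one video match which frames of the other, which is exactly the spatio-temporal detail the thesis's learned similarity network refines.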