Efficient Image Retrieval via Decoupling Diffusion into Online and Offline Processing
Diffusion is commonly used as a ranking or re-ranking method in retrieval
tasks to achieve higher retrieval performance, and has attracted considerable
attention in recent years. A downside of diffusion is that it is slow
in comparison to the naive k-NN search, which causes a non-trivial online
computational cost on large datasets. To overcome this weakness, we propose a
novel diffusion technique in this paper. In our work, instead of applying
diffusion to the query, we pre-compute the diffusion results of each element in
the database, making the online search a simple linear combination on top of
the k-NN search process. Our proposed method is about 10 times faster in
terms of online search speed. Moreover, we propose to use late truncation
instead of the early truncation used in previous works, achieving better
retrieval performance.
Comment: Accepted by AAAI 201
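The offline/online split described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: diffusion ranking vectors are precomputed for every database element, so a query only needs a plain k-NN search plus a linear combination of cached vectors. All function names are illustrative.

```python
import numpy as np

def precompute_diffusion(S, alpha=0.9, n_iter=20):
    """Offline stage: iterate F <- alpha * S_norm @ F + (1 - alpha) * I,
    approximating the closed-form diffusion solution for all database
    elements at once. Column j of F is the diffusion vector of item j."""
    d = S.sum(axis=1)
    S_norm = S / np.sqrt(np.outer(d, d))  # symmetric normalization
    F = np.eye(len(S))
    for _ in range(n_iter):
        F = alpha * S_norm @ F + (1 - alpha) * np.eye(len(S))
    return F

def online_search(q_sims, F, k=3):
    """Online stage: find the query's top-k neighbours and linearly
    combine their cached diffusion vectors, weighted by raw similarity."""
    topk = np.argsort(q_sims)[::-1][:k]
    return F[:, topk] @ q_sims[topk]
```

The expensive iteration runs once, offline; the online cost is a top-k selection plus one small matrix-vector product.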
Two-stage Discriminative Re-ranking for Large-scale Landmark Retrieval
We propose an efficient pipeline for large-scale landmark image retrieval
that addresses the diversity of the dataset through two-stage discriminative
re-ranking. Our approach is based on embedding the images in a feature-space
using a convolutional neural network trained with a cosine softmax loss. Due to
the variance of the images, which include extreme viewpoint changes such as
having to retrieve images of the exterior of a landmark from images of the
interior, this is very challenging for approaches based exclusively on visual
similarity. Our proposed re-ranking approach improves the results in two steps:
in the sort-step, we use k-nearest neighbor search with soft-voting to sort the
retrieved results based on their label similarity to the query images, and in
the insert-step, we add additional samples from the dataset that were not
retrieved by image-similarity. This approach allows overcoming the low visual
diversity in retrieved images. In-depth experimental results show that the
proposed approach significantly outperforms existing approaches on the
challenging Google Landmarks Datasets. Using our methods, we achieved 1st place
in the Google Landmark Retrieval 2019 challenge and 3rd place in the Google
Landmark Recognition 2019 challenge on Kaggle. Our code is publicly available
here: \url{https://github.com/lyakaap/Landmark2019-1st-and-3rd-Place-Solution}
Comment: 10 pages, 5 figure
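The sort-step and insert-step above can be sketched in a few lines. This is a minimal, hypothetical rendering (not the authors' code): neighbours soft-vote for labels, retrieved candidates are sorted by the vote count of their label, and unretrieved items carrying the winning label are appended. All names are illustrative.

```python
from collections import Counter

def soft_vote(query_nn_labels):
    """Sort-step voting: each retrieved neighbour votes for its label."""
    return Counter(query_nn_labels)

def two_stage_rerank(retrieved, item_labels, votes, all_items):
    # sort-step: stable sort of retrieved items by their label's votes
    ranked = sorted(retrieved, key=lambda i: -votes[item_labels[i]])
    # insert-step: append items with the top-voted label that
    # image-similarity retrieval missed
    top_label = votes.most_common(1)[0][0]
    extra = [i for i in all_items
             if item_labels[i] == top_label and i not in retrieved]
    return ranked + extra
```

The insert-step is what compensates for low visual diversity: samples sharing the winning label enter the ranking even when they look nothing like the query.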
Compositional Servoing by Recombining Demonstrations
Learning-based manipulation policies from image inputs often show weak task
transfer capabilities. In contrast, visual servoing methods allow efficient
task transfer in high-precision scenarios while requiring only a few
demonstrations. In this work, we present a framework that formulates the visual
servoing task as graph traversal. Our method not only extends the robustness of
visual servoing, but also enables multitask capability based on a few
task-specific demonstrations. We construct demonstration graphs by splitting
existing demonstrations and recombining them. In order to traverse the
demonstration graph in the inference case, we utilize a similarity function
that helps select the best demonstration for a specific task. This enables us
to compute the shortest path through the graph. Ultimately, we show that
recombining demonstrations leads to higher task-respective success. We present
extensive simulation and real-world experimental results that demonstrate the
efficacy of our approach.
Comment: http://compservo.cs.uni-freiburg.d
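The graph-traversal formulation above reduces, at inference time, to a shortest-path search over demonstration segments. A minimal sketch, assuming segments are graph nodes and edge weights come from a similarity function (the graph below is illustrative, not from the paper):

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a dict graph {node: [(neighbour, cost), ...]},
    returning the cheapest node sequence from start to goal."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    # reconstruct the path by walking predecessors back from the goal
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

With edge costs derived from (dis)similarity between segment end and start states, the cheapest path chains the best-matching demonstration pieces for the task at hand.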
Viewpoint Invariant Dense Matching for Visual Geolocalization
In this paper we propose a novel method for image matching based on dense local features and tailored for visual geolocalization. Dense local feature matching is robust against changes in illumination and occlusions, but not against viewpoint shifts, which are a fundamental aspect of geolocalization. Our method, called GeoWarp, directly embeds invariance to viewpoint shifts in the process of extracting dense features. This is achieved via a trainable module which learns from the data an invariance that is meaningful for the task of recognizing places. We also devise a new self-supervised loss and two new weakly supervised losses to train this module using only unlabeled data and weak labels. GeoWarp is implemented efficiently as a re-ranking method that can be easily embedded into pre-existing visual geolocalization pipelines. Experimental validation on standard geolocalization benchmarks demonstrates that GeoWarp boosts the accuracy of state-of-the-art retrieval architectures. The code and trained models are available at https://github.com/gmberton/geo_war
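To make the re-ranking role of dense matching concrete, here is a generic sketch (not GeoWarp itself, which additionally learns viewpoint invariance): each shortlisted candidate is re-scored by counting mutual nearest-neighbour matches between dense descriptor sets, assuming unit-normalized descriptor rows. All names are illustrative.

```python
import numpy as np

def mutual_matches(desc_q, desc_c):
    """Count mutual nearest neighbours between two descriptor sets
    (rows are unit-norm descriptors, so dot products are cosine sims)."""
    sims = desc_q @ desc_c.T
    nn_q = sims.argmax(axis=1)  # best candidate descriptor per query descriptor
    nn_c = sims.argmax(axis=0)  # best query descriptor per candidate descriptor
    return int(np.sum(nn_c[nn_q] == np.arange(len(desc_q))))

def rerank_by_matches(desc_q, shortlist):
    """Re-order shortlist indices by descending mutual-match count."""
    scores = [mutual_matches(desc_q, d) for d in shortlist]
    return [int(i) for i in np.argsort(scores)[::-1]]
```

Because only a small retrieval shortlist is re-scored, this stage adds little cost to an existing pipeline, which is what makes the re-ranking formulation attractive.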
Toward Data Efficient Online Sequential Learning
Can machines optimally take sequential decisions over time? For decades, researchers have sought an answer to this question, with the ultimate goal of unlocking the potential of artificial general intelligence (AGI) for a better and more sustainable society. Many sectors would be boosted by machines able to take efficient sequential decisions over time: consider real-world applications such as personalized systems in entertainment (content systems) and healthcare (personalized therapy), smart cities (traffic control, flood prevention), robots (control and planning), etc. However, letting machines take proper decisions in real life is a highly challenging task, owing to the uncertainty behind such decisions (uncertainty on the actual reward, on the context, on the environment, etc.). A viable solution is to learn by experience (i.e., by trial and error), letting the machine uncover the uncertainty while taking decisions and refine its strategy accordingly. However, such refinement is usually highly data-hungry (data-inefficient), requiring a large amount of application-specific data and leading to very slow learning processes -- hence very slow convergence to optimal strategies (curse of dimensionality). Luckily, data are usually intrinsically structured, and identifying and exploiting such structure substantially improves the data-efficiency of sequential learning algorithms. This is the key hypothesis underpinning the research in this thesis, in which novel structural learning methodologies are proposed for decision-making problems such as recommendation systems (RS), multi-armed bandits (MAB), and reinforcement learning (RL), with the ultimate goal of making the learning process more data-efficient. Specifically, we tackle this goal by modelling the problem structure as graphs, embedding tools from graph signal processing into decision learning theory.
As the first step, we study the application of graph-clustering techniques to RS, in which the curse of dimensionality is addressed by grouping data into clusters via graph clustering. Next, we exploit spectral graph structure for MAB problems, which represent online learning problems. A key challenge is to learn the unknown bandit vector sequentially. Exploiting a smoothness prior (i.e., the bandit vector is smooth on a given underlying graph), we study the Laplacian-regularized estimator and provide both empirical evidence and theoretical analysis of the benefits of exploiting the graph structure in MABs. Then, we focus on the theoretical understanding of the Laplacian-regularized estimator. To this end, we derive a theoretical upper bound on the estimation error, which illustrates the impact of the alignment between the data and the graph structure, as well as of the graph spectrum, on the estimation accuracy.
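The Laplacian-regularized estimator mentioned above has a standard closed form that can be sketched directly: minimizing ||y - X theta||^2 + lam * theta^T L theta gives theta_hat = (X^T X + lam * L)^(-1) X^T y, where L is the graph Laplacian encoding the smoothness prior. A minimal sketch (illustrative names, not the thesis code):

```python
import numpy as np

def laplacian_from_adjacency(A):
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A

def laplacian_reg_estimator(X, y, L, lam=1.0):
    """Closed-form solution of the Laplacian-regularized least squares:
    argmin_theta ||y - X theta||^2 + lam * theta^T L theta."""
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

When the true parameter vector really is smooth on the graph (L theta = 0 in the extreme case of a constant vector), the penalty costs nothing and the estimator coincides with ordinary least squares, which is exactly the regime where the smoothness prior helps rather than biases.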
We then move to RL problems, focusing on the specific problem of learning a proper representation of the state-action space (the representation learning problem). Motivated by the fact that a good representation should be informative of the value function, we seek a learning algorithm able to preserve continuity between the value function and the representation space. Showing that state values are intrinsically correlated with the state-transition dynamics and the diffusion of the reward on the MDP graph, we build a new loss function based on a newly defined diffusion distance and propose a novel method to learn state representations with this desirable property.
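A generic diffusion distance on an MDP-style graph, of the kind the loss above could be built on, compares states through their t-step random-walk distributions: d_t(i, j) = || P^t[i, :] - P^t[j, :] ||_2 for a row-stochastic transition matrix P. This sketch is illustrative; the thesis defines its own diffusion distance.

```python
import numpy as np

def diffusion_distance(P, i, j, t=3):
    """Distance between states i and j as the gap between their
    t-step transition distributions under the random walk P."""
    Pt = np.linalg.matrix_power(P, t)
    return float(np.linalg.norm(Pt[i] - Pt[j]))
```

States that diffuse to the same regions of the graph end up close under this metric, so a representation loss that matches embedding distances to diffusion distances keeps dynamically similar states nearby in representation space.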
In summary, in this thesis we address important online sequential learning problems both theoretically and empirically by leveraging the intrinsic data structure, showing the gains of the proposed solutions toward more data-efficient sequential learning strategies.
Deep Image Retrieval: A Survey
In recent years a vast amount of visual content has been generated and shared
from various fields, such as social media platforms, medical images, and
robotics. This abundance of content creation and sharing has introduced new
challenges. In particular, searching databases for similar content, i.e.,
content-based image retrieval (CBIR), is a long-established research area, and
more efficient and accurate methods are needed for real-time retrieval. Artificial
intelligence has made progress in CBIR and has significantly facilitated the
process of intelligent search. In this survey we organize and review recent
CBIR works that are developed based on deep learning algorithms and techniques,
including insights and techniques from recent papers. We identify and present
the commonly-used benchmarks and evaluation methods used in the field. We
collect common challenges and propose promising future directions. More
specifically, we focus on image retrieval with deep learning and organize the
state-of-the-art methods according to the types of deep network structure, deep
features, feature enhancement methods, and network fine-tuning strategies. Our
survey considers a wide variety of recent methods, aiming to promote a global
view of the field of instance-based CBIR.
Comment: 20 pages, 11 figure
Survey of Social Bias in Vision-Language Models
In recent years, the rapid advancement of machine learning (ML) models,
particularly transformer-based pre-trained models, has revolutionized Natural
Language Processing (NLP) and Computer Vision (CV) fields. However, researchers
have discovered that these models can inadvertently capture and reinforce
social biases present in their training datasets, leading to potential social
harms, such as uneven resource allocation and unfair representation of specific
social groups. Addressing these biases and ensuring fairness in artificial
intelligence (AI) systems has become a critical concern in the ML community.
The recent introduction of pre-trained vision-and-language (VL) models in the
emerging multimodal field demands attention to the potential social biases
present in these models as well. Although VL models are susceptible to social
bias, understanding of it remains limited compared to the extensive discussions
of bias in NLP and CV. This survey aims to provide researchers with a high-level
insight into the similarities and differences of social bias studies in
pre-trained models across NLP, CV, and VL. By examining these perspectives, the
survey aims to offer valuable guidelines on how to approach and mitigate social
bias in both unimodal and multimodal settings. The findings and recommendations
presented here can benefit the ML community, fostering the development of
fairer and less biased AI models in various applications and research endeavors.