IRGen: Generative Modeling for Image Retrieval
While generative modeling is ubiquitous in natural language processing
and computer vision, its application to image retrieval remains largely
unexplored. In this paper, we recast image retrieval as a form of generative
modeling by employing a sequence-to-sequence model, contributing to the current
unified theme. Our framework, IRGen, is a unified model that enables end-to-end
differentiable search, achieving superior performance thanks to direct
optimization. In developing IRGen, we tackle the key technical challenge of
converting an image into a short sequence of semantic units in order to
enable efficient and effective retrieval. Empirical experiments demonstrate
that our model yields significant improvements on three commonly used
benchmarks, for example, 22.9% higher precision@10 than the best baseline
method on the In-shop dataset, with a comparable recall@10 score.
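The paper's core mechanism, mapping an image to a short sequence of discrete semantic units that an autoregressive model can generate, can be illustrated with a toy residual quantizer. All names, codebook sizes, and dimensions below are illustrative assumptions, not IRGen's actual configuration:

```python
import numpy as np

def quantize_to_semantic_ids(embedding, codebooks):
    """Residual quantization: convert a continuous image embedding into a
    short sequence of discrete semantic IDs, one per codebook stage.
    `codebooks` is a list of (K, D) centroid arrays (hypothetical setup)."""
    ids, residual = [], embedding.copy()
    for cb in codebooks:
        # pick the centroid nearest to the current residual
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]  # quantize what remains
    return ids

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]  # 4 tokens, vocab 256
emb = rng.normal(size=64)
print(quantize_to_semantic_ids(emb, codebooks))  # a 4-token discrete code
```

Retrieval then reduces to generating such an ID sequence token by token, which is what makes the search end-to-end differentiable.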
Automatic Caption Generation for Aerial Images: A Survey
Aerial images have long attracted attention from the research community. Generating a caption that comprehensively describes the content of an aerial image is a less-studied but important task, with applications in agriculture, defence, disaster management, and many other areas. Although various approaches have been developed for natural-image caption generation, captioning aerial images remains challenging due to their special nature. Emerging techniques from the Artificial Intelligence (AI) and Natural Language Processing (NLP) domains have produced captions of acceptable quality for aerial images, but much remains to be done to fully realize the potential of this task. This paper presents a detailed survey of the approaches researchers have followed for aerial image caption generation. The datasets available for experimentation, the criteria used for performance evaluation, and future directions are also discussed.
Seamless Multimodal Biometrics for Continuous Personalised Wellbeing Monitoring
Artificially intelligent perception is increasingly present in the lives of
every one of us. Vehicles are no exception, (...) In the near future, pattern
recognition will have an even stronger role in vehicles, as self-driving cars
will require automated ways to understand what is happening around (and within)
them and act accordingly. (...) This doctoral work focused on advancing
in-vehicle sensing through the research of novel computer vision and pattern
recognition methodologies for both biometrics and wellbeing monitoring. The
main focus has been on electrocardiogram (ECG) biometrics, a trait well-known
for its potential for seamless driver monitoring. Major efforts were devoted to
achieving improved performance in identification and identity verification in
off-the-person scenarios, well-known for increased noise and variability. Here,
end-to-end deep learning ECG biometric solutions were proposed and important
topics were addressed such as cross-database and long-term performance,
waveform relevance through explainability, and interlead conversion. Face
biometrics, a natural complement to the ECG in seamless unconstrained
scenarios, was also studied in this work. The open challenges of masked face
recognition and interpretability in biometrics were tackled in an effort to
evolve towards algorithms that are more transparent, trustworthy, and robust to
significant occlusions. Within the topic of wellbeing monitoring, improved
solutions to multimodal emotion recognition in groups of people and
activity/violence recognition in in-vehicle scenarios were proposed. Finally,
we proposed a novel way to learn template security within end-to-end
models, dismissing additional separate encryption processes, and a
self-supervised learning approach tailored to sequential data, in order to
ensure data security and optimal performance. (...)
Comment: Doctoral thesis presented and approved on the 21st of December 2022
to the University of Port
Metric Optimization and Mainstream Bias Mitigation in Recommender Systems
The first part of this thesis focuses on maximizing the overall
recommendation accuracy. This accuracy is usually evaluated with some
user-oriented metric tailored to the recommendation scenario, but because
recommendation is usually treated as a machine learning problem, recommendation
models are trained to maximize some generic criterion that does not
necessarily align with the criteria ultimately captured by the user-oriented
evaluation metric. Recent research aims at bridging this gap between training
and evaluation via direct ranking optimization, but still assumes that the
metric used for evaluation should also be the metric used for training. We
challenge this assumption, mainly because some metrics are more informative
than others. Indeed, we show that models trained via the optimization of a loss
inspired by Rank-Biased Precision (RBP) tend to yield higher accuracy, even
when accuracy is measured with metrics other than RBP. However, the superiority
of this RBP-inspired loss stems from further benefiting users who are already
well-served, rather than helping those who are not.
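As a concrete reference point for the metric named above, Rank-Biased Precision discounts each rank geometrically by a persistence parameter p, which is what makes it so top-heavy. A minimal implementation of standard RBP (the evaluation metric, not the paper's training loss itself):

```python
def rank_biased_precision(relevances, p=0.8):
    """RBP: a user inspects rank k with probability p^(k-1); the score is the
    expected rate at which relevant items are seen. `relevances` are binary
    relevance judgements in ranked order."""
    return (1.0 - p) * sum(rel * p ** k for k, rel in enumerate(relevances))

# Early ranks dominate due to the geometric discount: a ranking that puts
# the relevant item first scores far higher than one that buries it.
print(rank_biased_precision([1, 0, 0, 0]))  # 0.2
print(rank_biased_precision([0, 0, 0, 1]))  # 0.1024
```

Because the weight on rank k decays as p^(k-1), a loss built around RBP concentrates gradient on the top of the ranking, which is consistent with the observation that it mostly benefits already well-served users.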
This observation inspires the second part of this thesis, where our focus
turns to helping non-mainstream users. These are users who are difficult to
recommend to either because there is not enough data to model them, or because
they have niche taste and thus few similar users to look at when recommending
in a collaborative way. These differences in mainstreamness introduce a bias
reflected in an accuracy gap between users or user groups, which we try to
narrow.
Comment: PhD Thesis defended on Nov 14, 202
A Billion-scale Foundation Model for Remote Sensing Images
As the potential of foundation models in visual tasks has garnered
significant attention, pretraining these models before downstream tasks has
become a crucial step. The three key factors in pretraining foundation models
are the pretraining method, the size of the pretraining dataset, and the number
of model parameters. Recently, research in the remote sensing field has focused
primarily on the pretraining method and the size of the dataset, with limited
emphasis on the number of model parameters. This paper addresses this gap by
examining the effect of increasing the number of model parameters on the
performance of foundation models in downstream tasks such as rotated object
detection and semantic segmentation. We pretrained foundation models with
varying numbers of parameters, including 86M, 605.26M, 1.3B, and 2.4B, to
determine whether performance in downstream tasks improved with an increase in
parameters. To the best of our knowledge, this is the first billion-scale
foundation model in the remote sensing field. Furthermore, we propose an
effective method for scaling up and fine-tuning a vision transformer in the
remote sensing field. To evaluate general performance in downstream tasks, we
employed the DOTA v2.0 and DIOR-R benchmark datasets for rotated object
detection, and the Potsdam and LoveDA datasets for semantic segmentation.
Experimental results demonstrated that, across all benchmark datasets and
downstream tasks, the performance of the foundation models and data efficiency
improved as the number of parameters increased. Moreover, our models achieve
state-of-the-art performance on several datasets, including DIOR-R, Potsdam,
and LoveDA.
Comment: This work has been submitted to the IEEE for possible publication
Information Retrieval: Recent Advances and Beyond
In this paper, we provide a detailed overview of the models used for
information retrieval in the first and second stages of the typical processing
chain. We discuss the current state-of-the-art models, including term-based
methods, semantic retrieval, and neural approaches. Additionally, we delve into
the key topics related to the learning process of these models. In this way,
the survey offers a comprehensive understanding of the field and is of interest
to researchers and practitioners entering or working in the information
retrieval domain.
Is Solving Graph Neural Tangent Kernel Equivalent to Training Graph Neural Network?
A rising trend in theoretical deep learning is to understand why deep
learning works through Neural Tangent Kernel (NTK) [jgh18], a kernel method
that is equivalent to using gradient descent to train a multi-layer
infinitely-wide neural network. NTK is a major step forward in theoretical
deep learning because it allows researchers to use traditional mathematical
tools to analyze properties of deep neural networks and to explain various
neural network techniques from a theoretical view. A natural extension of NTK
to graph learning is the \textit{Graph Neural Tangent Kernel (GNTK)};
researchers have already provided a GNTK formulation for graph-level regression
and shown empirically that this kernel method can achieve accuracy similar to
GNNs on various bioinformatics datasets [dhs+19]. The remaining question is
whether solving GNTK regression is equivalent to training an infinitely-wide
multi-layer GNN using gradient descent. In this paper, we provide three new
theoretical results. First, we formally prove this equivalence for graph-level
regression. Second, we present the first GNTK formulation for node-level
regression. Finally, we prove the equivalence for node-level regression.
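The equivalence at issue is between gradient-descent training of an infinitely wide network and closed-form kernel regression with the corresponding tangent kernel. A minimal sketch of the kernel-regression side, using an RBF kernel as a stand-in (the actual GNTK is computed by a layer-wise recursive formula over the GNN's aggregation and ReLU layers, omitted here):

```python
import numpy as np

def kernel_regression(K_train, y_train, K_test_train, reg=1e-6):
    """Closed-form kernel (ridge) regression: with the GNTK as K, this
    predictor is the object shown equivalent to gradient-descent training
    of an infinitely wide GNN. Here K may be any PSD kernel matrix."""
    alpha = np.linalg.solve(K_train + reg * np.eye(len(y_train)), y_train)
    return K_test_train @ alpha

def rbf(A, B, gamma=1.0):
    # Stand-in kernel; not the GNTK itself.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
preds = kernel_regression(rbf(X, X), y, rbf(X[:5], X))
# with tiny regularisation, the predictor interpolates the training labels
print(np.allclose(preds, y[:5], atol=1e-3))
```

The theoretical results then say that, for graph-level and node-level regression alike, running gradient descent on the infinitely wide GNN converges to exactly this kernel predictor.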
Visual place recognition for improved open and uncertain navigation
Visual place recognition localises a query place image by comparing it against a reference database of known place images, a fundamental element of robotic navigation.
Recent work focuses on using deep learning to learn image descriptors for this task
that are invariant to appearance changes from dynamic lighting, weather and seasonal
conditions. However, these descriptors require greater computational resources
than are available on robotic hardware; few SLAM frameworks are designed to
utilise them; they return a relative comparison between image descriptors that
is difficult to interpret; they cannot be used for appearance invariance in
other navigation tasks such as scene classification; and they are unable to
identify query images from an open environment that have no true match in the
reference database. This thesis addresses these challenges with three
contributions. The first is a lightweight visual place recognition descriptor
combined with a probabilistic filter to address a subset of the visual SLAM
problem in real time. The second combines visual place recognition and scene
classification for appearance-invariant scene classification, extended to
recognise unknown scene classes when navigating an open environment. The final
contribution uses comparisons between query and reference image descriptors to
classify whether a localisation is a true or false positive, and whether a true
match for the query image exists in the reference database.
Edinburgh Centre for Robotics and Engineering and Physical Sciences Research Council (EPSRC) funding
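The verification idea in the final contribution, deciding from descriptor similarity whether a localisation is genuine or whether the query has no true match at all, can be sketched as a thresholded nearest-neighbour search. The threshold, descriptor dimension, and database size below are illustrative assumptions:

```python
import numpy as np

def localise(query_desc, ref_descs, reject_threshold=0.5):
    """Match a query place descriptor against a reference database by cosine
    similarity; below the threshold we declare the place unseen, i.e. an
    open-environment query with no true match in the database."""
    q = query_desc / np.linalg.norm(query_desc)
    R = ref_descs / np.linalg.norm(ref_descs, axis=1, keepdims=True)
    sims = R @ q                      # cosine similarity to every reference
    best = int(np.argmax(sims))
    if sims[best] < reject_threshold:
        return None, float(sims[best])  # likely outside the mapped area
    return best, float(sims[best])

rng = np.random.default_rng(2)
refs = rng.normal(size=(100, 128))            # 100 mapped places
noisy_view = refs[42] + 0.05 * rng.normal(size=128)
match, score = localise(noisy_view, refs)
print(match)  # a noisy view of place 42 is still recognised
```

The raw similarity score is exactly the "relative comparison" the thesis notes is hard to interpret on its own, which motivates learning a classifier over such comparisons instead of hand-picking the threshold.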
Symmetric Contrastive Learning On Programming Languages
Contrastive pre-training has been shown to learn good features by finding the inner differences and shared latent traits among samples. Paired data of programming languages and natural language also have a strong inner relationship that can be exploited in downstream tasks. Pre-trained models for Natural Language (NL) have recently been shown to transfer well to Programming Languages (PL), primarily benefiting various code-related intelligence tasks, such as code search, clone detection, programming translation, and code document generation. However, existing pre-training methods for programming languages rely mainly on masked language modelling. This restricted form limits their performance and transferability, since PL and NL have different syntax rules. Here we introduce C3P, a Contrastive Code-Comment Pre-training approach, to solve various downstream tasks by pre-training multi-representation features on both programming and natural syntax. The model encodes the code syntax and its natural language description (comment) with two encoders, and the encoded embeddings are projected into a multi-modal space for learning the latent representation. In this latent space, C3P jointly trains the code and comment encoders with a symmetric loss function, which aims to maximize the cosine similarity of correct code-comment pairs while minimizing the similarity of unrelated pairs. We verify the empirical performance of the proposed pre-trained models on multiple downstream code-related tasks. Comprehensive experiments demonstrate that C3P outperforms previous work on the understanding tasks of code search and code clone detection, as well as the generation tasks of programming translation and document generation.
Furthermore, we validate the transferability of C3P to new programming languages. The results show that our model surpasses all supervised methods and, for some programming languages, even outperforms prior pre-trained approaches.
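The symmetric objective described here follows the familiar two-directional contrastive pattern: cross-entropy over cosine similarities in both the code-to-comment and comment-to-code directions. A minimal NumPy sketch of such a loss (an illustration of the general technique, not the authors' exact implementation; the temperature value is an assumption):

```python
import numpy as np

def symmetric_contrastive_loss(code_emb, comment_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N code-comment pairs:
    matched pairs sit on the diagonal of the similarity matrix; we maximise
    their cosine similarity and minimise it for mismatched pairs, averaging
    the code->comment and comment->code directions."""
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    t = comment_emb / np.linalg.norm(comment_emb, axis=1, keepdims=True)
    logits = (c @ t.T) / temperature          # (N, N) scaled cosine similarities
    labels = np.arange(len(logits))           # i-th code matches i-th comment

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()       # -log softmax on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(3)
E = rng.normal(size=(8, 32))
print(symmetric_contrastive_loss(E, E))                         # near zero: pairs aligned
print(symmetric_contrastive_loss(E, rng.normal(size=(8, 32))))  # high: pairs unrelated
```

The in-batch negatives on the off-diagonal are what push unrelated code-comment pairs apart without needing explicit negative mining.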