Training deep retrieval models with noisy datasets
In this thesis we study loss functions that allow Convolutional Neural
Networks (CNNs) to be trained on noisy datasets for the particular task of
Content-Based Image Retrieval (CBIR). In particular, we propose two novel losses to fit
models that generate global image representations. First, a Soft-Matching (SM)
loss, exploiting both image content and metadata, is used to specialize general
CNNs to particular cities or regions using weakly annotated datasets. Second,
a Bag Exponential (BE) loss inspired by the Multiple Instance Learning (MIL)
framework is employed to train CNNs for CBIR on noisy datasets.
The first part of the thesis introduces a novel training framework that, relying
on image content and metadata, learns location-adapted deep models that
provide fine-tuned image descriptors for specific visual contents. Our networks,
which start from a baseline model originally learned for a different task, are specialized
using a custom pairwise loss function, our proposed SM loss, that uses
weak labels based on image content and metadata.
The experimental results show that the proposed location-adapted CNNs
achieve an improvement of up to 55% over the baseline networks on a landmark
discovery task. This implies that the models successfully learn the visual
clues and peculiarities of the region for which they are trained, and generate
image descriptors that are better location-adapted. In addition, for landmarks
that are not present in the training set, or even for other cities, our proposed
models perform at least as well as the baseline network, which indicates good
resilience against overfitting.
The second part of the thesis introduces the BE loss function to train CNNs
for image retrieval, borrowing inspiration from the MIL framework. The loss
combines an exponential function acting as a soft margin with a MIL-based
mechanism working with bags of positive and negative pairs of images.
The method allows deep retrieval networks to be trained on noisy datasets by
weighting the influence of the different samples at the loss level, which increases the
performance of the generated global descriptors. The rationale behind the improvement
is that we are handling noise in an end-to-end manner and, therefore,
avoiding its negative influence as well as the unintentional biases due to fixed
pre-processing cleaning procedures. In addition, our method is general enough
to suit other scenarios requiring different weights for the training instances (e.g.
boosting the influence of hard positives during training). The proposed bag exponential
function can be seen as a back door to guide the learning process
according to a certain objective in an end-to-end manner, allowing the model to
approach such an objective smoothly and progressively.
Our results show that our loss allows CNN-based retrieval systems to be
trained with noisy training sets and achieve state-of-the-art performance. Furthermore,
we have found that it is better to use training sets that are highly
correlated with the final task, even if they are noisy, than to train with a clean set that is only weakly related to the topic at hand. From our point of view,
this result represents a big leap in the applicability of retrieval systems and helps
to reduce the effort needed to set up new CBIR applications: e.g. by allowing
a fast automatic generation of noisy training datasets and then using our bag
exponential loss to deal with noise. Moreover, we also consider that this result
opens a new line of research for CNN-based image retrieval: let the models decide
not only on the best features to solve the task but also on the most relevant
samples to do it.
Doctoral Program in Multimedia and Communications at Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Chair: Luis Salgado Álvarez de Sotomayor. Secretary: Pablos Martínez Olmos. Member: Ernest Valveny Llobe
Analysis of Vision based Techniques for the Translation of Indian Sign Language
Sign language acts as a medium of communication among members of the hearing-impaired and mute community. However, it cannot be easily understood by the general public. Considerable research has been done to bridge this gap by developing Sign Language Recognition (SLR) methodologies. Studies say that 1 in every 5 deaf people is Indian.
In this paper, a thorough review of these methodologies is carried out to compare and contrast their various aspects. This includes an overview of the different preprocessing methods used, such as segmentation, image morphological processing and cropping, and of feature extraction techniques such as Fourier descriptors, image moments, eigenvalues, MediaPipe and others. The study also covers classification models spanning from distance metrics to kernel-based approaches and feedforward neural networks, along with Deep Learning based methods such as CNNs, LSTMs, GANs, Transformers, etc.
ATiTHi: A Deep Learning Approach for Tourist Destination Classification using Hybrid Parametric Optimization
A picture is the best way to explore a tourist destination through visual content. Content-based image classification of tourist destinations makes it possible to understand tourists' preferences and so provide a more satisfactory tour. It also provides an important reference for tourist destination marketing. To enhance the competitiveness of the tourism market in India, this research proposes an innovative tourist spot identification mechanism that identifies the content of a large number of tourist photos using a convolutional neural network (CNN) approach. It overcomes the limitations of manual approaches by recognizing visual information in photos. In this study, six thousand photos from different tourist destinations of India were identified and categorized into six major categories to form a new dataset of Indian Trajectory. This research employed Transfer learning (TF) strategies, which help to obtain a good performance measure with a very small dataset for image classification. VGG-16, VGG-19, MobileNetV2, InceptionV3, ResNet-50 and AlexNet CNN models with pretrained weights from the ImageNet dataset were used for initialization, and then an adapted classifier was used to classify tourist destination images from the newly prepared dataset. Hybrid hyperparameter optimization was employed to find the hyperparameters for the proposed ATiTHi model, which led to a more efficient classification model. To analyse and compare the performance of the models, known performance indicators were selected. Compared to the AlexNet (0.83), MobileNetV2 (0.93), VGG-19 (0.918), InceptionV3 (0.89) and ResNet-50 (0.852) models, the VGG-16 model performed best in terms of accuracy (0.95). These results show the effectiveness of the current model in tourist destination image classification.
UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering
In recent years, artificial intelligence has played an important role in
medicine and disease diagnosis, with many applications, one of
which is Medical Visual Question Answering (MedVQA). By combining computer
vision and natural language processing, MedVQA systems can assist experts in
extracting relevant information from medical images based on a given question
and providing precise diagnostic answers. The ImageCLEFmed-MEDVQA-GI-2023
challenge featured a visual question answering task in the gastrointestinal
domain, which includes gastroscopy and colonoscopy images. Our team approached
Task 1 of the challenge by proposing a multimodal learning method with image
enhancement to improve the VQA performance on gastrointestinal images. The
multimodal architecture is set up with a BERT encoder and different pre-trained
vision models based on convolutional neural network (CNN) and Transformer
architectures for feature extraction from the question and the endoscopy image. The
results of this study highlight the dominance of Transformer-based vision
models over the CNNs and demonstrate the effectiveness of the image
enhancement process, with six out of the eight vision models achieving a better
F1-Score. Our best method, which takes advantage of BERT+BEiT fusion and image
enhancement, achieves up to 87.25% accuracy and 91.85% F1-Score on the
development test set, while also producing good results on the private test set
with an accuracy of 82.01%. Comment: ImageCLEF2023 published version:
https://ceur-ws.org/Vol-3497/paper-129.pd
Dual Embedding Expansion for Vehicle Re-identification
Vehicle re-identification plays a crucial role in the management of
transportation infrastructure and traffic flow. However, this is a challenging
task due to large viewpoint variations in appearance and to environmental and
instance-related factors. Modern systems deploy CNNs to produce unique
representations from the images of each vehicle instance. Most work focuses on
leveraging new losses and network architectures to improve the descriptiveness
of these representations. In contrast, our work concentrates on re-ranking and
embedding expansion techniques. We propose an efficient approach for combining
the outputs of multiple models at various scales while exploiting tracklet and
neighbor information, called dual embedding expansion (DEx). Additionally, a
comparative study of several common image retrieval techniques is presented in
the context of vehicle re-ID. Our system yields competitive performance in the
2020 NVIDIA AI City Challenge with promising results. We demonstrate that DEx,
when combined with other re-ranking techniques, can produce an even larger gain
without any additional attribute labels or manual supervision.