
    The computer nose best


    Training deep retrieval models with noisy datasets

    In this thesis we study loss functions that allow Convolutional Neural Networks (CNNs) to be trained on noisy datasets for the particular task of Content-Based Image Retrieval (CBIR). In particular, we propose two novel losses to fit models that generate global image representations. First, a Soft-Matching (SM) loss, exploiting both image content and metadata, is used to specialize general CNNs to particular cities or regions using weakly annotated datasets. Second, a Bag Exponential (BE) loss inspired by the Multiple Instance Learning (MIL) framework is employed to train CNNs for CBIR on noisy datasets. The first part of the thesis introduces a novel training framework that, relying on image content and metadata, learns location-adapted deep models that provide fine-tuned image descriptors for specific visual contents. Our networks, which start from a baseline model originally learned for a different task, are specialized using a custom pairwise loss function, our proposed SM loss, that uses weak labels based on image content and metadata. The experimental results show that the proposed location-adapted CNNs achieve an improvement of up to 55% over the baseline networks on a landmark discovery task. This implies that the models successfully learn the visual cues and peculiarities of the region for which they are trained, and generate image descriptors that are better adapted to that location. In addition, for landmarks that are not present in the training set, or that even belong to other cities, our proposed models perform at least as well as the baseline network, which indicates good resilience against overfitting. The second part of the thesis introduces the BE loss function to train CNNs for image retrieval, borrowing inspiration from the MIL framework. The loss combines an exponential function acting as a soft margin with a MIL-based mechanism working with bags of positive and negative pairs of images.
    The method allows deep retrieval networks to be trained on noisy datasets by weighting the influence of the different samples at the loss level, which increases the performance of the generated global descriptors. The rationale behind the improvement is that we handle noise in an end-to-end manner and therefore avoid its negative influence as well as the unintentional biases introduced by fixed pre-processing cleaning procedures. In addition, our method is general enough to suit other scenarios requiring different weights for the training instances (e.g. boosting the influence of hard positives during training). The proposed bag exponential function can be seen as a back door to guide the learning process according to a certain objective in an end-to-end manner, allowing the model to approach such an objective smoothly and progressively. Our results show that our loss allows CNN-based retrieval systems to be trained with noisy training sets and achieve state-of-the-art performance. Furthermore, we have found that it is better to use training sets that are highly correlated with the final task, even if they are noisy, than to train with a clean set that is only weakly related to the topic at hand. From our point of view, this result represents a big leap in the applicability of retrieval systems and helps to reduce the effort needed to set up new CBIR applications, e.g. by allowing a fast automatic generation of noisy training datasets and then using our bag exponential loss to deal with noise. Moreover, we also consider that this result opens a new line of research for CNN-based image retrieval: let the models decide not only on the best features to solve the task but also on the most relevant samples to do it.
    Doctoral Program in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. President: Luis Salgado Álvarez de Sotomayor. Secretary: Pablos Martínez Olmos. Member: Ernest Valveny Llobe

    Analysis of Vision based Techniques for the Translation of Indian Sign Language

    Sign language acts as a medium of communication within the hearing-impaired and mute community. However, it is not easily understood by people outside that community. Considerable research has been done to bridge this gap by developing Sign Language Recognition (SLR) methodologies. Studies say that 1 in every 5 deaf people is Indian. In this paper, a thorough review of these methodologies is presented, comparing and contrasting their various aspects. This includes an overview of the preprocessing methods used, such as segmentation, morphological image processing and cropping, and of feature extraction techniques such as Fourier descriptors, image moments, eigenvalues, MediaPipe and others. The study also covers classification models ranging from distance metrics to kernel-based approaches and feedforward neural networks, along with deep learning based methods such as CNNs, LSTMs, GANs and Transformers.

    ATiTHi: A Deep Learning Approach for Tourist Destination Classification using Hybrid Parametric Optimization

    A picture is the best way to explore a tourist destination through visual content. Content-based image classification of tourist destinations makes it possible to understand tourists' preferences and so provide a more satisfactory tour. It also provides an important reference for tourist destination marketing. To enhance the competitiveness of the tourism market in India, this research proposes an innovative tourist spot identification mechanism that identifies the content of a large number of tourist photos using a convolutional neural network (CNN) approach. It overcomes the limitations of manual approaches by recognizing visual information in photos. In this study, six thousand photos from different tourist destinations in India were identified and grouped into six major categories to form a new dataset, Indian Trajectory. This research employed transfer learning (TF) strategies, which help to obtain good performance with a very small dataset for image classification. VGG-16, VGG-19, MobileNetV2, InceptionV3, ResNet-50 and AlexNet CNN models with weights pretrained on the ImageNet dataset were used for initialization, and an adapted classifier was then used to classify tourist destination images from the newly prepared dataset. Hybrid hyperparameter optimization was employed to find the hyperparameters for the proposed Atithi model, which led to a more efficient classification model. To analyse and compare the performance of the models, well-known performance indicators were selected. Compared with the AlexNet (0.83), MobileNetV2 (0.93), VGG-19 (0.918), InceptionV3 (0.89) and ResNet-50 (0.852) models, the VGG-16 model performed best in terms of accuracy (0.95). These results show the effectiveness of the current model in tourist destination image classification.

    UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering

    In recent years, artificial intelligence has played an important role in medicine and disease diagnosis, with many applications, one of which is Medical Visual Question Answering (MedVQA). By combining computer vision and natural language processing, MedVQA systems can assist experts in extracting relevant information from medical images based on a given question and in providing precise diagnostic answers. The ImageCLEFmed-MEDVQA-GI-2023 challenge set a visual question answering task in the gastrointestinal domain, which includes gastroscopy and colonoscopy images. Our team approached Task 1 of the challenge by proposing a multimodal learning method with image enhancement to improve VQA performance on gastrointestinal images. The multimodal architecture combines a BERT encoder with different pre-trained vision models, based on convolutional neural network (CNN) and Transformer architectures, for feature extraction from the question and the endoscopy image. The results of this study highlight the dominance of Transformer-based vision models over CNNs and demonstrate the effectiveness of the image enhancement process, with six out of the eight vision models achieving a better F1-Score. Our best method, which takes advantage of BERT+BEiT fusion and image enhancement, achieves up to 87.25% accuracy and 91.85% F1-Score on the development test set, while also producing good results on the private test set with an accuracy of 82.01%.
    Comment: ImageCLEF2023 published version: https://ceur-ws.org/Vol-3497/paper-129.pd

    Dual Embedding Expansion for Vehicle Re-identification

    Vehicle re-identification plays a crucial role in the management of transportation infrastructure and traffic flow. However, it is a challenging task due to large viewpoint variations in appearance, as well as environmental and instance-related factors. Modern systems deploy CNNs to produce unique representations from the images of each vehicle instance. Most work focuses on leveraging new losses and network architectures to improve the descriptiveness of these representations. In contrast, our work concentrates on re-ranking and embedding expansion techniques. We propose an efficient approach for combining the outputs of multiple models at various scales while exploiting tracklet and neighbor information, called dual embedding expansion (DEx). Additionally, a comparative study of several common image retrieval techniques is presented in the context of vehicle re-ID. Our system yields competitive performance in the 2020 NVIDIA AI City Challenge with promising results. We demonstrate that DEx, when combined with other re-ranking techniques, can produce an even larger gain without any additional attribute labels or manual supervision.
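    The abstract above does not spell out how DEx works, so the following is only a minimal sketch of a generic embedding expansion step of the kind commonly used in image retrieval (average query expansion), not the authors' method. The function names, the toy 2-D embeddings and the `top_k` parameter are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))

def rank(query, gallery):
    """Return gallery indices sorted by descending similarity to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: cosine(query, gallery[i]),
                  reverse=True)

def expand_query(query, gallery, top_k=2):
    """Average the query with its top_k nearest gallery embeddings.

    This is plain average query expansion (an assumed stand-in),
    not the dual embedding expansion described in the abstract.
    """
    summed = list(query)
    for i in rank(query, gallery)[:top_k]:
        for d in range(len(summed)):
            summed[d] += gallery[i][d]
    return l2_normalize(summed)

# Toy gallery of hypothetical 2-D vehicle embeddings.
gallery = [l2_normalize(v)
           for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.6, 0.8])]
query = l2_normalize([1.0, 0.2])

expanded = expand_query(query, gallery, top_k=2)
ranking = rank(expanded, gallery)
```

    The expanded query is re-normalized so that the dot product remains a cosine similarity; in a real system the gallery embeddings would come from the CNN and `top_k` would be tuned on validation data.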