
    Sampling Based On Natural Image Statistics Improves Local Surrogate Explainers

    Many problems in computer vision have recently been tackled using models whose predictions cannot be easily interpreted, most commonly deep neural networks. Surrogate explainers are a popular post-hoc interpretability method to further understand how a model arrives at a particular prediction. By training a simple, more interpretable model to locally approximate the decision boundary of a non-interpretable system, we can estimate the relative importance of the input features on the prediction. Focusing on images, surrogate explainers, e.g., LIME, generate a local neighbourhood around a query image by sampling in an interpretable domain. However, these interpretable domains have traditionally been derived exclusively from the intrinsic features of the query image, not taking into consideration the manifold of the data the non-interpretable model has been exposed to in training (or, more generally, the manifold of real images). This leads to suboptimal surrogates trained on potentially low-probability images. We address this limitation by aligning the local neighbourhood on which the surrogate is trained with the original training data distribution, even when this distribution is not accessible. We propose two approaches to do so, namely (1) altering the method for sampling the local neighbourhood and (2) using perceptual metrics to convey some of the properties of the distribution of natural images. Comment: 12 pages, 7 figures.
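
    A minimal sketch of the kind of local surrogate pipeline the abstract refers to (perturb superpixels, query the black box, fit an interpretable linear model). The sampling here is plain uniform masking, not the paper's natural-image-statistics-aware scheme, and `black_box_predict` and `apply_mask` are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import Ridge

def explain(image, segments, black_box_predict, apply_mask,
            n_samples=500, keep_prob=0.7, rng=None):
    """Fit a LIME-style linear surrogate over superpixel on/off masks."""
    rng = rng or np.random.default_rng(0)
    n_segments = segments.max() + 1

    # Sample binary masks over superpixels; the paper argues this sampling
    # should respect natural-image statistics rather than be purely uniform.
    masks = rng.random((n_samples, n_segments)) < keep_prob

    # Query the black-box model on each perturbed image.
    preds = np.array([black_box_predict(apply_mask(image, segments, m))
                      for m in masks])

    # Weight samples by proximity to the original (all-segments-on) image.
    distances = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distances ** 2) / 0.25)

    # Coefficients of the interpretable surrogate estimate the importance
    # of each superpixel for the black-box prediction.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks.astype(float), preds, sample_weight=weights)
    return surrogate.coef_
```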

    Multimodal Emotion Recognition on RAVDESS Dataset Using Transfer Learning

    Emotion recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as in healthcare or in road safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically, embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that the training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizer, we propose a framework that consists of a pre-trained Spatial Transformer Network on saliency maps and facial images followed by a bi-LSTM with an attention mechanism. The error analysis reported that the frame-based systems could present some problems when used directly to solve a video-based task despite the domain adaptation, which opens a new line of research to discover ways to correct this mismatch and take advantage of the embedded knowledge of these pre-trained models. Finally, by combining these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. The results revealed that these modalities carry relevant information to detect users' emotional state and that their combination improves system performance.
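
    A hedged sketch of one plausible reading of the "late fusion strategy" mentioned above: score-level averaging of per-class probabilities from the speech and facial recognizers. The weighting and the random placeholder scores are assumptions, not the paper's exact setup.

```python
import numpy as np

def late_fusion(speech_probs, face_probs, alpha=0.5):
    """Combine per-class probabilities from the two modalities.

    speech_probs, face_probs: arrays of shape (n_clips, 8) with softmax
    outputs for the eight RAVDESS emotion classes.
    alpha: weight given to the speech modality (assumed, not from the paper).
    """
    fused = alpha * speech_probs + (1.0 - alpha) * face_probs
    return fused.argmax(axis=1)

# Random scores stand in for real model outputs.
rng = np.random.default_rng(0)
speech = rng.dirichlet(np.ones(8), size=4)
face = rng.dirichlet(np.ones(8), size=4)
print(late_fusion(speech, face))
```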

    Inferring the affective response of the spectators of a video

    In this project we propose the automatic analysis of the relation between the audiovisual characteristics of a multimedia production and the impact caused in its audience. With this aim, potential synergies are explored between different areas of knowledge including, among others: audiovisual communication, computer vision, multimodal systems, biometric sensors, social network analysis, opinion mining, and affective computing. Our efforts are oriented towards combining these technologies to introduce novel computational models that could predict the reactions of spectators to multimedia elements across different media and moments. On the one hand, we study the cognitive and emotional response of the spectators while they are watching the media instances, using neuroscience techniques and biometric sensors. On the other hand, we also study the reaction shown by the audience on social networks by relying on the automatic collection and analysis of different metadata related to the media elements, such as popularity, sharing patterns, ratings and commentaries. The work leading to these results has been supported by the Spanish Ministry of Economy, Industry and Competitiveness through the ESITUR (MINECO, RTC-2016-5305-7), CAVIAR (MINECO, TEC2017-84593-C2-1-R), and AMIC (MINECO, TIN2017-85854-C4-4-R) projects (AEI/FEDER, UE).

    A multi-threshold approach and a realistic error measure for vanishing point detection in natural landscapes

    Vanishing Point (VP) detection is a computer vision task that can be useful in many different fields of application. In this work, we present a VP detection algorithm for natural landscape images based on a multi-threshold edge extraction process that combines several representations of an image, and on novel clustering and cluster refinement procedures. Our algorithm identifies a VP candidate in images with single-point perspective and improves detection results on two datasets that have already been tested for this task. Furthermore, we study how VP detection results have been reported in the literature, pointing out the main drawbacks of previous approaches. To overcome these drawbacks, we present a novel error measure based on a probabilistic consistency measure between edges and a VP hypothesis, which can be tuned to vary the strictness of the results. The soundness of our measure is supported by an intuitive analysis, simulations and an experimental validation. The work leading to these results has been supported by the Spanish Ministry of Economy and Competitiveness and the Ministry of Science, Innovation and Universities through the ESITUR (MINECO, RTC-2016-5305-7), CAVIAR (MICINN, TEC2017-84593-C2-1-R), and AMIC (MICINN, TIN2017-85854-C4-4-R) projects (AEI/FEDER, UE). We also gratefully acknowledge the support of NVIDIA Corporation.
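
    An illustrative sketch of an edge-to-VP consistency score based on angular deviation, in the spirit of the probabilistic measure described above. The Gaussian form and the strictness parameter are assumptions, not the paper's exact formulation.

```python
import numpy as np

def edge_vp_consistency(edge_mid, edge_dir, vp, sigma_deg=5.0):
    """Return a score in [0, 1] for how well an edge points at a VP hypothesis.

    edge_mid: (x, y) midpoint of the edge segment.
    edge_dir: unit vector along the edge.
    vp: (x, y) vanishing point hypothesis.
    sigma_deg: strictness parameter; smaller values penalise deviation more.
    """
    to_vp = np.asarray(vp, float) - np.asarray(edge_mid, float)
    to_vp /= np.linalg.norm(to_vp) + 1e-12
    cos_angle = abs(float(np.dot(edge_dir, to_vp)))   # orientation-agnostic
    angle_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return float(np.exp(-(angle_deg ** 2) / (2.0 * sigma_deg ** 2)))

# A horizontal edge pointing toward a VP directly to its right scores ~1.0.
print(edge_vp_consistency((100, 100), np.array([1.0, 0.0]), (300, 100)))
```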

    Predicting image aesthetics for intelligent tourism information systems

    Image perception can vary considerably between subjects, yet some sights are regarded as aesthetically pleasant more often than others due to their specific visual content, this being particularly true in tourism-related applications. We introduce the ESITUR project, oriented towards the development of 'smart tourism' solutions aimed at improving the touristic experience. The idea is to convert conventional tourist showcases into fully interactive information points accessible from any smartphone, enriched with contents automatically extracted from the analysis of public photos uploaded to social networks by other visitors. Our baseline, knowledge-driven system reaches a classification accuracy of 64.84 ± 4.22% when telling suitable images from unsuitable ones for a tourism guide application. As an alternative, we adopt a data-driven Mixture of Experts (MEX) approach, in which multiple learners specialize in partitions of the problem space. In our case, a location tag is attached to every picture, providing a criterion by which to segment the data, and the MEX model defined accordingly achieves an accuracy of 85.08 ± 2.23%. We conclude that ours is a successful approach in environments in which some kind of data segmentation can be applied, such as touristic photographs. The work leading to these results has been supported by the Spanish Ministry of Economy, Industry and Competitiveness through the ESITUR (MINECO, RTC-2016-5305-7), CAVIAR (MINECO, TEC2017-84593-C2-1-R), and AMIC (MINECO, TIN2017-85854-C4-4-R) projects (AEI/FEDER, UE).
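
    A minimal sketch of the location-tag Mixture-of-Experts idea: one expert per location, each trained only on photos carrying its tag, with prediction routed by the same tag. The image features and the choice of a logistic-regression expert are assumptions for illustration only.

```python
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

class LocationMoE:
    """One expert classifier per location tag; samples are routed by tag."""

    def __init__(self):
        self.experts = {}

    def fit(self, features, labels, locations):
        # Partition the training data by location tag.
        groups = defaultdict(list)
        for i, loc in enumerate(locations):
            groups[loc].append(i)
        # Train one expert on each partition.
        for loc, idx in groups.items():
            clf = LogisticRegression(max_iter=1000)
            clf.fit([features[i] for i in idx], [labels[i] for i in idx])
            self.experts[loc] = clf
        return self

    def predict(self, features, locations):
        # Route each sample to the expert responsible for its location.
        return [int(self.experts[loc].predict([x])[0])
                for x, loc in zip(features, locations)]
```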

    Video Memorability Prediction From Jointly-learnt Semantic and Visual Features

    The memorability of a video is defined as an intrinsic property of its visual features that dictates the fraction of people who recall having watched it on a second viewing within a memory game. Still, unravelling which features are key to predicting memorability remains an open problem. This challenge is addressed here by fine-tuning text and image encoders using a cross-modal strategy known as Contrastive Language-Image Pre-training (CLIP). The resulting video-level representations capture semantic and topic-descriptive information from both modalities, hence enhancing the predictive power of our algorithms. Our proposal achieves in the text domain a significantly greater Spearman Rank Correlation Coefficient (SRCC) than a default pre-trained text encoder (0.575 ± 0.007 and 0.538 ± 0.007, respectively) over the Memento10K dataset. A similar trend, although less pronounced, can be noticed in the visual domain. We believe these findings signal the potential benefits that cross-modal predictive systems can extract from being fine-tuned to the specific issue of media memorability.
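
    A small sketch of the evaluation step only: the Spearman Rank Correlation Coefficient (SRCC) between predicted and annotated memorability scores. The scores below are random placeholders, not outputs of the CLIP-based encoders described above.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
ground_truth = rng.random(100)                          # annotated memorability scores
predictions = ground_truth + 0.1 * rng.standard_normal(100)  # stand-in model outputs

# SRCC compares the rank orderings of predictions and ground truth.
srcc, p_value = spearmanr(predictions, ground_truth)
print(f"SRCC = {srcc:.3f} (p = {p_value:.1e})")
```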

    Fine-Tuning BERT Models for Intent Recognition Using a Frequency Cut-Off Strategy for Domain-Specific Vocabulary Extension

    Intent recognition is a key component of any task-oriented conversational system. The intent recognizer can be used first to classify the user's utterance into one of several predefined classes (intents) that help to understand the user's current goal. Then, the most adequate response can be provided accordingly. Intent recognizers also often appear as part of joint models that perform the natural language understanding and dialog management tasks together as a single process, thus simplifying the set of problems that a conversational system must solve. This is especially true for frequently asked question (FAQ) conversational systems. In this work, we first present an exploratory analysis in which different deep learning (DL) models for intent detection and classification were evaluated. In particular, we experimentally compare and analyze conventional recurrent neural networks (RNN) and state-of-the-art transformer models. Our experiments confirmed that the best performance is achieved with transformers. Specifically, the best results were obtained by fine-tuning the so-called BETO model (a Spanish pretrained bidirectional encoder representations from transformers (BERT) model from the Universidad de Chile) on our intent detection task. Then, as the main contribution of the paper, we analyze the effect of inserting unseen domain words to extend the vocabulary of the model as part of the fine-tuning or domain-adaptation process. In particular, a very simple word frequency cut-off strategy is experimentally shown to be a suitable method for driving the vocabulary learning decisions over unseen words. The results of our analysis show that the proposed method helps to effectively extend the original vocabulary of the pretrained models. We validated our approach on a selection of the corpus acquired with the Hispabot-Covid19 system, obtaining satisfactory results.
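
    A hedged sketch of the frequency cut-off vocabulary extension, using the Hugging Face Transformers API. The checkpoint name, the cut-off value, the number of intents and the toy corpus are placeholders; the paper's exact preprocessing is not reproduced here.

```python
from collections import Counter
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"   # a BETO checkpoint (assumed)
CUTOFF = 5                                             # assumed frequency threshold

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=20)                         # placeholder number of intents

domain_corpus = ["texto de ejemplo del dominio ..."]   # domain-specific utterances
counts = Counter(w for line in domain_corpus for w in line.lower().split())

# Keep only frequent domain words that the pretrained vocabulary lacks.
vocab = tokenizer.get_vocab()
new_words = [w for w, c in counts.items() if c >= CUTOFF and w not in vocab]

num_added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))          # new rows are randomly initialised
print(f"Added {num_added} domain-specific tokens before fine-tuning.")
```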

    A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset

    Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we proposed an automatic emotion recognition system consisting of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy results were achieved when we fine-tuned the whole model by appending a multilayer perceptron on top of it, confirming that the training was more robust when it did not start from scratch and the network's prior knowledge was similar to the target task. Regarding the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models. Results showed that sequential models beat static models by a narrow margin. Error analysis reported that the visual systems could improve with a detector of high-emotional-load frames, which opens a new line of research to discover new ways to learn from videos. Finally, by combining these two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset on a subject-wise 5-CV evaluation, classifying eight emotions. Results demonstrated that these modalities carry relevant information to detect users' emotional state and that their combination improves the final system performance.
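
    A sketch of the "embedding extraction" transfer-learning variant for the SER: a frozen xlsr-Wav2Vec2.0 backbone produces utterance-level embeddings that a small classifier would consume. The checkpoint name and mean pooling are assumptions, not necessarily the paper's configuration.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53"   # assumed xlsr checkpoint

extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
backbone = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

def utterance_embedding(waveform, sampling_rate=16_000):
    """Mean-pool the last hidden states of a single utterance."""
    inputs = extractor(waveform, sampling_rate=sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = backbone(**inputs).last_hidden_state   # (1, frames, 1024)
    return hidden.mean(dim=1).squeeze(0)                # (1024,)

# One second of silence stands in for a real RAVDESS clip.
print(utterance_embedding(torch.zeros(16_000).numpy()).shape)
```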

    Interpreting Sign Language Recognition Using Transformers and MediaPipe Landmarks

    Sign Language Recognition (SLR) is a challenging task that aims to bridge the communication gap between the deaf and hearing communities. In recent years, deep learning-based approaches have shown promising results in SLR. However, the lack of interpretability remains a significant challenge. In this paper, we seek to understand which hand and pose MediaPipe Landmarks are deemed the most important for prediction as estimated by a Transformer model. We propose to embed a learnable array of parameters into the model that performs an element-wise multiplication of the inputs. This learned array highlights the most informative input features for solving the recognition task, resulting in a human-interpretable vector that lets us interpret the model's predictions. We evaluate our approach on the public WLASL100 (SLR) and IPNHand (gesture recognition) datasets. We believe that the insights gained in this way could be exploited for the development of more efficient SLR pipelines.
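
    A minimal PyTorch sketch of the interpretability idea described above: a learnable array multiplied element-wise with the landmark inputs before a Transformer encoder. The feature dimension, model size and encoder configuration are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedLandmarkEncoder(nn.Module):
    def __init__(self, n_features=150, d_model=128, n_classes=100):
        super().__init__()
        # One learnable weight per input landmark coordinate; after training,
        # its magnitude indicates how informative each feature is.
        self.feature_gate = nn.Parameter(torch.ones(n_features))
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, frames, n_features) MediaPipe hand/pose landmarks.
        x = x * self.feature_gate          # element-wise multiplication
        x = self.proj(x)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))    # average over frames

model = GatedLandmarkEncoder()
logits = model(torch.randn(2, 30, 150))
print(logits.shape, model.feature_gate.shape)
```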