6,230 research outputs found

    3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

    Full text link
    Audio-visual recognition (AVR) has been considered a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method for speaker verification in multi-speaker scenarios. AVR systems leverage the information extracted from one modality to improve the recognition ability of the other by complementing the missing information. The essential problem, and the goal of this work, is to find the correspondence between the audio and visual streams. We propose a coupled 3D Convolutional Neural Network (3D-CNN) architecture that maps both modalities into a shared representation space and evaluates the correspondence of audio-visual streams using the learned multimodal features. The proposed architecture incorporates spatial and temporal information jointly to effectively find the correlation between the temporal information of the different modalities. Using a relatively small network architecture and a much smaller training dataset, our method surpasses the performance of existing 3D-CNN-based methods for audio-visual matching. We also demonstrate that an effective pair selection method can significantly increase performance. The proposed method achieves relative improvements of over 20% on the Equal Error Rate (EER) and over 7% on the Average Precision (AP) in comparison to the state-of-the-art method.
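
    A minimal sketch of the coupled-encoder idea (assuming PyTorch; layer sizes, input shapes, and names are illustrative, not the paper's exact architecture): two small 3D-CNNs embed the video and audio streams into a shared space, and correspondence is scored by similarity of the embeddings.

    # Hedged sketch: a coupled pair of small 3D-CNN encoders mapping video and
    # audio streams into a shared embedding space, so matching pairs score
    # higher than mismatched ones. Shapes and layer sizes are illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Encoder3D(nn.Module):
        def __init__(self, in_channels, embed_dim=256):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),          # collapse time x height x width
            )
            self.proj = nn.Linear(64, embed_dim)

        def forward(self, x):                      # x: (batch, C, T, H, W)
            h = self.features(x).flatten(1)
            return F.normalize(self.proj(h), dim=1)

    video_enc = Encoder3D(in_channels=3)           # mouth-region frames
    audio_enc = Encoder3D(in_channels=1)           # e.g. a stacked spectrogram "cube"

    video = torch.randn(8, 3, 16, 64, 64)
    audio = torch.randn(8, 1, 16, 32, 32)
    score = (video_enc(video) * audio_enc(audio)).sum(dim=1)   # cosine similarity per pair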

    A Bimodal Learning Approach to Assist Multi-sensory Effects Synchronization

    Full text link
    In mulsemedia applications, traditional media content (text, image, audio, video, etc.) can be related to media objects that target other human senses (e.g., smell, haptics, taste). Such applications aim at bridging the virtual and real worlds through sensors and actuators. Actuators are responsible for executing sensory effects (e.g., wind, heat, light), which produce sensory stimulations on the users. In these applications, sensory stimulation must happen in a timely manner with respect to the other traditional media content being presented. For example, at the moment an explosion appears in the audiovisual content, it may be appropriate to activate actuators that produce heat and light. It is common to use a declarative multimedia authoring language to relate the timestamp at which each media object is presented to the execution of some sensory effect. One problem in this setting is that the synchronization of media objects and sensory effects is done manually by the author(s) of the application, a process that is time-consuming and error-prone. In this paper, we present a bimodal neural network architecture to assist the synchronization task in mulsemedia applications. Our approach is based on the idea that audio and video signals can be used simultaneously to identify the timestamps at which some sensory effect should be executed. Our learning architecture combines audio and video signals to predict scene components. For evaluation purposes, we construct a dataset based on Google's AudioSet. We provide experiments to validate our bimodal architecture. Our results show that the bimodal approach produces better results than several variants of unimodal architectures.
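
    A hedged sketch of the late-fusion idea described above (assuming PyTorch; dimensions, class names, and the concatenation-fusion choice are assumptions, not the authors' exact model): per-window audio and video embeddings are fused to predict scene components whose detection would trigger a sensory effect at that timestamp.

    # Hedged sketch: a bimodal classifier over per-window audio/video features;
    # detected components (e.g. "explosion") mark timestamps for sensory effects.
    import torch
    import torch.nn as nn

    class BimodalEffectDetector(nn.Module):
        def __init__(self, audio_dim=128, video_dim=512, n_components=10):
            super().__init__()
            self.audio_branch = nn.Sequential(nn.Linear(audio_dim, 128), nn.ReLU())
            self.video_branch = nn.Sequential(nn.Linear(video_dim, 128), nn.ReLU())
            self.head = nn.Linear(256, n_components)   # concatenation fusion

        def forward(self, audio_feat, video_feat):
            fused = torch.cat([self.audio_branch(audio_feat),
                               self.video_branch(video_feat)], dim=1)
            return self.head(fused)                    # multi-label logits per window

    model = BimodalEffectDetector()
    logits = model(torch.randn(4, 128), torch.randn(4, 512))  # 4 time windows
    triggers = torch.sigmoid(logits) > 0.5             # which effects to schedule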

    Continuous Multimodal Emotion Recognition Approach for AVEC 2017

    Full text link
    This paper reports the analysis of audio and visual features in predicting the continuous emotion dimensions under the seventh Audio/Visual Emotion Challenge (AVEC 2017), carried out as part of a B.Tech. second-year internship project. For visual features we used HOG (Histogram of Oriented Gradients) features, Fisher encodings of SIFT (Scale-Invariant Feature Transform) features based on a Gaussian mixture model (GMM), and pretrained Convolutional Neural Network layers as features, all extracted for each video clip. For audio features we used the Bag-of-Audio-Words (BoAW) representation of the low-level descriptors (LLDs) generated by openXBOW, provided by the organisers of the event. We then trained a fully connected neural network regression model on the dataset for each of these modalities. We applied multimodal fusion on the output models to obtain the Concordance Correlation Coefficient on the Development set as well as the Test set.
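
    For reference, the Concordance Correlation Coefficient used for scoring in AVEC follows the standard definition CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2). A minimal NumPy sketch, independent of the authors' feature pipeline:

    # Standard CCC definition; not project code.
    import numpy as np

    def concordance_cc(pred, gold):
        pred, gold = np.asarray(pred, float), np.asarray(gold, float)
        mp, mg = pred.mean(), gold.mean()
        vp, vg = pred.var(), gold.var()
        cov = ((pred - mp) * (gold - mg)).mean()
        return 2 * cov / (vp + vg + (mp - mg) ** 2)

    print(concordance_cc([0.1, 0.4, 0.35, 0.8], [0.0, 0.5, 0.3, 0.9]))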

    Updating the silent speech challenge benchmark with deep learning

    Full text link
    The 2010 Silent Speech Challenge benchmark is updated with new results obtained with a Deep Learning strategy, using the same input features and decoding strategy as in the original article. A Word Error Rate of 6.4% is obtained, compared to the published value of 17.4%. Additional results comparing new auto-encoder-based features with the original features at reduced dimensionality, as well as decoding scenarios on two different language models, are also presented. The Silent Speech Challenge archive has been updated to contain both the original and the new auto-encoder features, in addition to the original raw data.
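
    The Word Error Rate figures above follow the standard definition: the Levenshtein distance between hypothesis and reference word sequences, normalized by the reference length. A minimal sketch in plain Python (not the project's scoring code):

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # 2 errors / 6 words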

    Modality Dropout for Improved Performance-driven Talking Faces

    Full text link
    We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real time on resource-limited hardware (e.g., a smartphone), it is user agnostic, and it does not depend on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences compared with only 18% for video-driven animation. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, but decreases to 8% for video-only.
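
    A hedged sketch of the modality-dropout idea (assuming PyTorch; the probabilities and input shapes are illustrative, not the paper's values): when building a training batch, each sample randomly keeps audio only, video only, or both, so the model cannot rely on a single modality.

    import torch

    def modality_dropout(audio, video, p_audio_only=0.25, p_video_only=0.25):
        """Zero one modality per sample with the given probabilities."""
        batch = audio.shape[0]
        r = torch.rand(batch)
        drop_video = (r < p_audio_only).view(-1, *([1] * (video.dim() - 1)))
        drop_audio = ((r >= p_audio_only) &
                      (r < p_audio_only + p_video_only)).view(-1, *([1] * (audio.dim() - 1)))
        return audio * (~drop_audio).float(), video * (~drop_video).float()

    audio = torch.randn(8, 40, 100)        # e.g. filterbank features
    video = torch.randn(8, 3, 25, 64, 64)  # e.g. face crops
    audio_in, video_in = modality_dropout(audio, video)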

    Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

    Full text link
    Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is trained using only video-level event labels without any timing information. An important feature of our method is its capacity to learn from unsynchronized audio-visual events. We achieve state-of-the-art results on a large-scale dataset of weakly-labeled audio event videos. Visualizations of localized visual regions and audio segments substantiate our system's efficacy, especially when dealing with noisy situations where modality-specific cues appear asynchronously.
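
    A hedged sketch of the multiple-instance-learning setup (assuming PyTorch; the max-pooling aggregation, dimensions, and names are illustrative, not the paper's exact module): every segment of a video is scored, segment scores are aggregated into one video-level prediction, and training uses only video-level labels, while the per-segment scores localize the event.

    import torch
    import torch.nn as nn

    class MILEventClassifier(nn.Module):
        def __init__(self, feat_dim=512, n_events=10):
            super().__init__()
            self.segment_scorer = nn.Linear(feat_dim, n_events)

        def forward(self, segments):                  # (batch, n_segments, feat_dim)
            seg_logits = self.segment_scorer(segments)
            video_logits, where = seg_logits.max(dim=1)   # MIL max pooling
            return video_logits, seg_logits, where    # `where` localizes the event

    model = MILEventClassifier()
    video_logits, seg_logits, where = model(torch.randn(2, 30, 512))
    loss = nn.BCEWithLogitsLoss()(video_logits, torch.zeros(2, 10))  # video-level labels only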

    Deep Learning for Sentiment Analysis : A Survey

    Full text link
    Deep learning has emerged as a powerful machine learning technique that learns multiple layers of representations or features of the data and produces state-of-the-art prediction results. Along with its success in many other application domains, deep learning has also been widely applied to sentiment analysis in recent years. This paper first gives an overview of deep learning and then provides a comprehensive survey of its current applications in sentiment analysis.

    Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning

    Full text link
    Emotion recognition has become an important field of research in Human-Computer Interaction as we improve the techniques for modelling the various aspects of behaviour. As technology advances and our understanding of emotions deepens, there is a growing need for automatic emotion recognition systems. One of the directions this research is heading is the use of Neural Networks, which are adept at estimating complex functions that depend on a large number of diverse input sources. In this paper we exploit this effectiveness of Neural Networks to perform multimodal emotion recognition on the IEMOCAP dataset using data from speech, text, and motion capture of facial expressions, rotation, and hand movements. Prior research has concentrated on emotion detection from speech on the IEMOCAP dataset, but our approach is the first to use the multiple modes of data offered by IEMOCAP for more robust and accurate emotion detection.
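
    A hedged sketch of a three-branch fusion classifier for this kind of input (assuming PyTorch; the GRU encoders, feature dimensions, hidden sizes, and four-class output are assumptions, not the paper's exact configuration): each modality is encoded separately and the final states are concatenated before classification.

    import torch
    import torch.nn as nn

    class TrimodalEmotionClassifier(nn.Module):
        def __init__(self, speech_dim=40, text_dim=300, mocap_dim=165, n_classes=4):
            super().__init__()
            self.speech = nn.GRU(speech_dim, 64, batch_first=True)
            self.text = nn.GRU(text_dim, 64, batch_first=True)
            self.mocap = nn.GRU(mocap_dim, 64, batch_first=True)
            self.classifier = nn.Linear(3 * 64, n_classes)

        def forward(self, speech, text, mocap):       # each: (batch, time, dim)
            _, hs = self.speech(speech)
            _, ht = self.text(text)
            _, hm = self.mocap(mocap)
            fused = torch.cat([hs[-1], ht[-1], hm[-1]], dim=1)
            return self.classifier(fused)             # emotion logits

    model = TrimodalEmotionClassifier()
    logits = model(torch.randn(2, 100, 40), torch.randn(2, 20, 300), torch.randn(2, 100, 165))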

    An Attempt towards Interpretable Audio-Visual Video Captioning

    Full text link
    Automatically generating a natural language sentence to describe the content of an input video is a very challenging problem. It is an essential multimodal task in which auditory and visual contents are equally important. Although audio information has been exploited to improve video captioning in previous works, it is usually regarded as an additional feature fed into a black-box fusion machine. How the words in the generated sentences are associated with the auditory and visual modalities has not yet been investigated. In this paper, we make the first attempt to design an interpretable audio-visual video captioning network to discover the association between words in sentences and audio-visual sequences. To achieve this, we propose a multimodal convolutional neural network-based audio-visual video captioning framework and introduce a modality-aware module for exploring modality selection during sentence generation. In addition, we collect new audio captioning and visual captioning datasets to further explore the interactions between auditory and visual modalities for high-level video understanding. Extensive experiments demonstrate that the modality-aware module makes our model interpretable with respect to modality selection during sentence generation. Even with the added interpretability, our video captioning network still achieves performance comparable to recent state-of-the-art methods.
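
    A hedged sketch of what a modality-aware gating step could look like (assuming PyTorch; the scalar gate, names, and dimensions are assumptions, not the paper's exact module): at each word-generation step a learned gate mixes the audio and visual context vectors, and the gate value itself can be inspected to see which modality a word relied on.

    import torch
    import torch.nn as nn

    class ModalityGate(nn.Module):
        def __init__(self, ctx_dim=512, state_dim=512):
            super().__init__()
            self.gate = nn.Linear(2 * ctx_dim + state_dim, 1)

        def forward(self, audio_ctx, visual_ctx, decoder_state):
            g = torch.sigmoid(self.gate(
                torch.cat([audio_ctx, visual_ctx, decoder_state], dim=1)))
            mixed = g * audio_ctx + (1 - g) * visual_ctx
            return mixed, g                            # g near 1 -> audio-driven word

    gate = ModalityGate()
    mixed, g = gate(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))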

    Deep Learning in Robotics: A Review of Recent Research

    Full text link
    Advances in deep learning over the last decade have led to a flurry of research in the application of deep artificial neural networks to robotic systems, with at least thirty papers published on the subject between 2014 and the present. This review discusses the applications, benefits, and limitations of deep learning vis-à-vis physical robotic systems, using contemporary research as exemplars. It is intended to communicate recent advances to the wider robotics community and inspire additional interest in and application of deep learning in robotics.