7,431 research outputs found

    End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Full text link
    Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea proposing a \emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach models the temporal dynamic of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements up to 1.2% under practical scenarios over a VAD baseline using only audio implemented with deep neural network (DNN). The proposed approach achieves 92.7% F1-score when it is evaluated using the sensors from a portable tablet under noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone).Comment: Submitted to Speech Communicatio

    The COGs (context, object, and goals) in multisensory processing

    Get PDF
    Our understanding of how perception operates in real-world environments has been substantially advanced by studying both multisensory processes and “top-down” control processes influencing sensory processing via activity from higher-order brain areas, such as attention, memory, and expectations. As the two topics have been traditionally studied separately, the mechanisms orchestrating real-world multisensory processing remain unclear. Past work has revealed that the observer’s goals gate the influence of many multisensory processes on brain and behavioural responses, whereas some other multisensory processes might occur independently of these goals. Consequently, other forms of top-down control beyond goal dependence are necessary to explain the full range of multisensory effects currently reported at the brain and the cognitive level. These forms of control include sensitivity to stimulus context as well as the detection of matches (or lack thereof) between a multisensory stimulus and categorical attributes of naturalistic objects (e.g. tools, animals). In this review we discuss and integrate the existing findings that demonstrate the importance of such goal-, object- and context-based top-down control over multisensory processing. We then put forward a few principles emerging from this literature review with respect to the mechanisms underlying multisensory processing and discuss their possible broader implications

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Get PDF
    Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective. The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines. From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research

    Time-delay neural network for continuous emotional dimension prediction from facial expression sequences

    Get PDF
    "(c) 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works."Automatic continuous affective state prediction from naturalistic facial expression is a very challenging research topic but very important in human-computer interaction. One of the main challenges is modeling the dynamics that characterize naturalistic expressions. In this paper, a novel two-stage automatic system is proposed to continuously predict affective dimension values from facial expression videos. In the first stage, traditional regression methods are used to classify each individual video frame, while in the second stage, a Time-Delay Neural Network (TDNN) is proposed to model the temporal relationships between consecutive predictions. The two-stage approach separates the emotional state dynamics modeling from an individual emotional state prediction step based on input features. In doing so, the temporal information used by the TDNN is not biased by the high variability between features of consecutive frames and allows the network to more easily exploit the slow changing dynamics between emotional states. The system was fully tested and evaluated on three different facial expression video datasets. Our experimental results demonstrate that the use of a two-stage approach combined with the TDNN to take into account previously classified frames significantly improves the overall performance of continuous emotional state estimation in naturalistic facial expressions. The proposed approach has won the affect recognition sub-challenge of the third international Audio/Visual Emotion Recognition Challenge (AVEC2013)1

    First impressions: A survey on vision-based apparent personality trait analysis

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Personality analysis has been widely studied in psychology, neuropsychology, and signal processing fields, among others. From the past few years, it also became an attractive research area in visual computing. From the computational point of view, by far speech and text have been the most considered cues of information for analyzing personality. However, recently there has been an increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches are able to accurately analyze human faces, body postures and behaviors, and use these information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact that this sort of methods could have in society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting edge works on the subject, discussing and comparing their distinctive features and limitations. Future venues of research in the field are identified and discussed. Furthermore, aspects on the subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push the research on the field are reviewed.Peer ReviewedPostprint (author's final draft

    Musical experience may help the brain respond to second language reading

    Get PDF
    A person's native language background exerts constraints on the brain's automatic responses while learning a second language. It remains unclear, however, whether and how musical experience may help the brain overcome such constraints and meet the requirements of a second language. This study compared native Chinese English learners who were musicians, non-musicians and native English readers on their automatic brain automatic integration of English letter-sounds with an ERP cross-modal audiovisual mismatch negativity paradigm. The results showed that native Chinese-speaking musicians successfully integrated English letters and sounds, but their non-musician peers did not, despite of their comparable English learning experience and proficiency level. However, native Chinese-speaking musicians demonstrated enhanced cross-modal MMN for both synchronized and delayed letter-sound integration, while native English readers only showed enhanced cross-modal MMN for synchronized integration. Moreover, native Chinese-speaking musicians showed stronger theta oscillations when integrating English letters and sounds, suggesting that they had better top-down modulation. In contrast, native English readers showed stronger delta oscillations for synchronized integration, and their cross-modal delta oscillations significantly correlated with English reading performance. These findings suggest that long-term professional musical experience may enhance the top-down modulation, then help the brain efficiently integrating letter-sounds required by the second language. Such benefits from musical experience may be different from those from specific language experience in shaping the brain's automatic responses to reading.Peer reviewe

    Psychophysiology-based QoE assessment : a survey

    Get PDF
    We present a survey of psychophysiology-based assessment for quality of experience (QoE) in advanced multimedia technologies. We provide a classification of methods relevant to QoE and describe related psychological processes, experimental design considerations, and signal analysis techniques. We summarize multimodal techniques and discuss several important aspects of psychophysiology-based QoE assessment, including the synergies with psychophysical assessment and the need for standardized experimental design. This survey is not considered to be exhaustive but serves as a guideline for those interested to further explore this emerging field of research
    corecore