
    Multimodal person recognition for human-vehicle interaction

    Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology, however, prevents such systems from operating satisfactorily under adverse conditions. A framework is proposed that achieves person recognition by combining different biometric modalities, as borne out in two case studies.

    Anti-social behavior detection in audio-visual surveillance systems

    In this paper we propose a general-purpose framework for the detection of unusual events. The proposed system is based on the unsupervised method for unusual scene detection in webcam images that was introduced in [1]. We extend that algorithm to accommodate data from different modalities and introduce the concept of time-space blocks. In addition, we evaluate early and late fusion techniques for our audio-visual data features. The experimental results on 192 hours of data show that fusing audio and video outperforms using either modality alone.
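    To illustrate the early- versus late-fusion distinction evaluated in the abstract above, here is a minimal Python sketch contrasting the two schemes on toy feature blocks. The feature vectors and the anomaly scorer are hypothetical stand-ins, not the paper's detectors or data.

```python
import numpy as np

def anomaly_score(features: np.ndarray) -> float:
    """Toy unsupervised score: spread of the feature block around its mean."""
    return float(np.linalg.norm(features - features.mean()))

def early_fusion(audio_feat: np.ndarray, video_feat: np.ndarray) -> float:
    # Early fusion: concatenate modality features, score the joint vector.
    joint = np.concatenate([audio_feat, video_feat])
    return anomaly_score(joint)

def late_fusion(audio_feat: np.ndarray, video_feat: np.ndarray,
                w_audio: float = 0.5, w_video: float = 0.5) -> float:
    # Late fusion: score each modality separately, then combine the decisions.
    return w_audio * anomaly_score(audio_feat) + w_video * anomaly_score(video_feat)

# One "time-space block" of features per modality (random stand-ins).
rng = np.random.default_rng(0)
audio_block = rng.normal(size=32)
video_block = rng.normal(size=128)
print(early_fusion(audio_block, video_block), late_fusion(audio_block, video_block))
```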

    Bimodal Emotion Recognition using Speech and Physiological Changes

    With exponentially evolving technology, it is no exaggeration to say that any interface fo…

    A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition

    A key requirement for developing any innovative computing system is a sufficiently friendly interface for the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. The work of this thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips. Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than traditional approaches based on Principal Component Analysis (PCA). ICA can capture the higher-order statistics needed to understand the motion of the mouth, which matters because lip movement is complex in nature: it involves large image velocities, self-occlusion (due to the appearance and disappearance of the teeth), and considerable non-rigidity. Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used. The addressed issues are challenging problems and are central to developing an audio-visual speech recognition framework that maximizes the information gathered about the uttered words and minimizes the impact of noise.
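    To make the visual front end described in this abstract concrete, the sketch below derives ICA basis flow fields from flattened optical-flow fields and uses the mixing coefficients as per-frame visual features. It assumes scikit-learn's FastICA and uses random arrays in place of real optical flow computed from mouth-region frames.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Random stand-ins for optical-flow fields of consecutive mouth-region frames:
# n_frames fields, each h x w pixels with a (u, v) velocity per pixel.
n_frames, h, w = 200, 32, 48
rng = np.random.default_rng(0)
flows = rng.normal(size=(n_frames, h, w, 2))

X = flows.reshape(n_frames, -1)                     # one flattened flow field per row
ica = FastICA(n_components=16, random_state=0, max_iter=500)
coeffs = ica.fit_transform(X)                       # visual feature vector per frame
basis_fields = ica.mixing_.T.reshape(16, h, w, 2)   # learned basis flow fields

print(coeffs.shape, basis_fields.shape)             # (200, 16) (16, 32, 48, 2)
```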

    Audiovisual head orientation estimation with particle filtering in multisensor scenarios

    This article presents a multimodal approach to head pose estimation of individuals in environments equipped with multiple cameras and microphones, such as SmartRooms or automatic video conferencing. Determining an individual's head orientation is the basis for many forms of more sophisticated interaction between humans and technical devices and can also be used for automatic sensor selection (camera, microphone) in communications or video surveillance systems. The use of particle filters as a unified framework for estimating head orientation in both monomodal and multimodal cases is proposed. In video, we estimate head orientation from color information by exploiting spatial redundancy among cameras. Audio information is processed to estimate the direction of the voice produced by a speaker, making use of the directivity characteristics of the head radiation pattern. Furthermore, two different particle filter multimodal information fusion schemes for combining the audio and video streams are analyzed in terms of accuracy and robustness. In the first, fusion is performed at the decision level by combining each monomodal head pose estimate, while the second uses a joint estimation system combining information at the data level. Experimental results conducted on the CLEAR 2006 evaluation database are reported, and the comparison of the proposed multimodal head pose estimation algorithms with the reference monomodal approaches demonstrates the effectiveness of the proposed approach.
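    The two fusion schemes compared in this abstract can be sketched as follows for a single pan angle: data-level fusion multiplies the audio and video likelihoods inside one particle filter update, while decision-level fusion runs a filter per modality and combines the resulting estimates. The likelihood functions and the single update step below are toy placeholders, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500                                  # number of particles
true_angle = np.deg2rad(30.0)            # ground-truth pan angle of the head

def video_likelihood(angles):
    # Placeholder for a colour-based head-orientation likelihood.
    return np.exp(-0.5 * ((angles - true_angle) / 0.3) ** 2)

def audio_likelihood(angles):
    # Placeholder for a likelihood based on the head radiation pattern.
    return np.exp(-0.5 * ((angles - true_angle) / 0.5) ** 2)

def pf_estimate(likelihoods):
    """Single update/estimate step of a particle filter over the pan angle."""
    particles = rng.uniform(-np.pi, np.pi, size=N)   # diffuse prior
    weights = np.ones(N)
    for lik in likelihoods:                          # data-level fusion multiplies likelihoods
        weights *= lik(particles)
    weights /= weights.sum()
    return np.average(particles, weights=weights)

# Data-level fusion: one filter weighted by the joint audio-visual likelihood.
joint_est = pf_estimate([audio_likelihood, video_likelihood])

# Decision-level fusion: one filter per modality, then combine the estimates.
decision_est = 0.5 * (pf_estimate([audio_likelihood]) + pf_estimate([video_likelihood]))

print(np.rad2deg(joint_est), np.rad2deg(decision_est))
```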

    3D audio-visual speaker tracking with an adaptive particle filter

    We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting audio-visual cues from the individual modalities, we fuse them adaptively using their reliability in a particle filter framework. The reliability of the audio signal is measured from the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching, in which detection results are compared with a reference image in RGB space. Experiments on the AV16.3 dataset show that the proposed adaptive audio-visual tracker outperforms both the individual modalities and a classical approach with fixed parameters in terms of tracking accuracy.
    Qian, Xinyuan; Brutti, Alessio; Omologo, Maurizio; Cavallaro, Andrea
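    A minimal sketch of the adaptive weighting idea described above: per-frame stream weights derived from an audio reliability cue (the GCF peak value) and a visual cue (colour-histogram similarity to a reference image). The cue values, the intersection-based similarity, and the final weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def histogram_similarity(hist, ref_hist):
    """Visual reliability cue: normalised histogram intersection (assumed form)."""
    return float(np.minimum(hist, ref_hist).sum() / ref_hist.sum())

def adaptive_weights(gcf_peak, hist_sim, eps=1e-6):
    """Normalise the two reliability cues into stream weights that sum to one."""
    r = np.array([gcf_peak, hist_sim], dtype=float) + eps
    return r / r.sum()

# Example frame: a moderately strong GCF peak and a good colour match
# against the reference histogram (all numbers illustrative).
ref_hist = np.array([0.20, 0.50, 0.30])
cur_hist = np.array([0.25, 0.45, 0.30])
w_audio, w_video = adaptive_weights(gcf_peak=0.4,
                                    hist_sim=histogram_similarity(cur_hist, ref_hist))

# The weights would then scale each modality's contribution to the particle
# weights, e.g. w = audio_lik ** w_audio * video_lik ** w_video.
print(w_audio, w_video)
```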