2,626 research outputs found
Multimodal person recognition for human-vehicle interaction
Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, as borne out in two case studies.
Anti-social behavior detection in audio-visual surveillance systems
In this paper we propose a general-purpose framework for detection of unusual events. The proposed system is based on the unsupervised method for unusual scene detection in webcam images that was introduced in [1]. We extend their algorithm to accommodate data from different modalities and introduce the concept of time-space blocks. In addition, we evaluate early and late fusion techniques for our audio-visual data features. The experimental results on 192 hours of data show that fusion of audio and video outperforms using a single modality.
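The early/late fusion distinction the abstract evaluates can be sketched as follows. This is a minimal illustration with made-up feature dimensions and weights, not the paper's actual features or classifier:

```python
import numpy as np

def early_fusion(audio_feats, video_feats):
    """Early (feature-level) fusion: concatenate the per-frame feature
    vectors of both modalities before classification."""
    return np.concatenate([audio_feats, video_feats], axis=1)

def late_fusion(audio_scores, video_scores, w_audio=0.5):
    """Late (decision-level) fusion: weighted sum of per-class scores
    produced by separate audio and video classifiers."""
    return w_audio * audio_scores + (1.0 - w_audio) * video_scores

# toy example: 4 time-space blocks, 3-dim audio and 2-dim video features
audio = np.random.rand(4, 3)
video = np.random.rand(4, 2)
fused = early_fusion(audio, video)
print(fused.shape)  # (4, 5)
```

Early fusion lets a single classifier model cross-modal correlations; late fusion keeps the modalities' classifiers independent and only combines their decisions.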
Bimodal Emotion Recognition using Speech and Physiological Changes
With exponentially evolving technology it is no exaggeration to say that any interface fo
A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition
A key requirement for developing any innovative system in a computing environment is to provide an interface that is sufficiently friendly to the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. This thesis aims at improving speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips.
Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields; the coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than traditional approaches based on Principal Component Analysis (PCA). ICA can capture the higher-order statistics needed to understand the motion of the mouth: lip movement is complex in nature, involving large image velocities, self-occlusion (due to the appearance and disappearance of the teeth), and considerable non-rigidity.
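The visual front end described above can be sketched as follows, assuming flow fields are flattened to vectors and using scikit-learn's FastICA. The data below is synthetic and the dimensions are illustrative, not the thesis's actual preprocessing:

```python
import numpy as np
from sklearn.decomposition import FastICA

# synthetic stand-in for optical-flow fields of the mouth region:
# 200 frames, each a flattened 32x32 field of (u, v) velocities
rng = np.random.default_rng(0)
flows = rng.laplace(size=(200, 32 * 32 * 2))

# ICA yields basis flow fields; each frame's mixing coefficients in this
# basis form its visual feature vector
ica = FastICA(n_components=16, random_state=0, max_iter=500)
visual_feats = ica.fit_transform(flows)   # (200, 16) ICA coefficients
basis_fields = ica.components_            # (16, 2048) basis flow fields

print(visual_feats.shape, basis_fields.shape)
```

Swapping `FastICA` for `PCA` in the same pipeline gives the baseline the thesis compares against.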
Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed, with a statistical approach to account for dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams, based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood; for this purpose, genetic algorithms are used.
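The two-step scheme can be illustrated as follows. The dispersion measure and the reliability-to-weight mapping below are simplified stand-ins with made-up scores; the thesis instead learns the mapping with genetic algorithms:

```python
import numpy as np

def nbest_dispersion(nbest_log_scores):
    """Reliability measure for one stream: dispersion of its N-best
    hypothesis scores. A large gap between the best hypothesis and the
    rest suggests a confident, reliable stream."""
    s = np.sort(np.asarray(nbest_log_scores))[::-1]
    return float(np.mean(s[0] - s[1:]))

def fused_score(audio_ll, video_ll, w_audio):
    """Multi-stream combination: weighted sum of the stream
    log-likelihoods (exponent weighting in the probability domain)."""
    return w_audio * audio_ll + (1.0 - w_audio) * video_ll

# hypothetical N-best log scores: audio is peaked (clean), video is flat
audio_nbest = [-10.0, -25.0, -27.0, -30.0]
video_nbest = [-12.0, -12.5, -13.0, -13.2]
r_a = nbest_dispersion(audio_nbest)
r_v = nbest_dispersion(video_nbest)

# naive reliability-to-weight mapping, for illustration only
w_audio = r_a / (r_a + r_v)
print(round(w_audio, 2))  # audio dominates, ~0.95
```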
These are challenging problems, and addressing them is essential for developing an audio-visual speech recognition framework that maximizes the information gathered about the uttered words and minimizes the impact of noise.
Audiovisual head orientation estimation with particle filtering in multisensor scenarios
This article presents a multimodal approach to head pose estimation of individuals in environments equipped with multiple cameras and microphones, such as SmartRooms or automatic video conferencing. Determining an individual's head orientation is the basis for many forms of more sophisticated interaction between humans and technical devices, and can also be used for automatic sensor selection (camera, microphone) in communications or video surveillance systems. The use of particle filters as a unified framework for estimating head orientation in both monomodal and multimodal cases is proposed. In video, we estimate head orientation from color information by exploiting spatial redundancy among cameras. Audio information is processed to estimate the direction of the voice produced by a speaker, making use of the directivity characteristics of the head radiation pattern. Furthermore, two particle filter multimodal information fusion schemes for combining the audio and video streams are analyzed in terms of accuracy and robustness. In the first, fusion is performed at the decision level by combining the monomodal head pose estimates, while the second uses a joint estimation system combining information at the data level. Experimental results over the CLEAR 2006 evaluation database are reported, and the comparison of the proposed multimodal head pose estimation algorithms with the reference monomodal approaches proves the effectiveness of the proposed approach.
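Decision-level fusion of two monomodal orientation estimates can be sketched with a confidence-weighted circular mean. This is illustrative only: the article combines full particle distributions, and the estimates and confidences below are made up:

```python
import numpy as np

def circular_mean(angles, weights):
    """Confidence-weighted mean of orientation angles (radians), computed
    on the unit circle so wrap-around at +/- pi is handled correctly."""
    s = np.sum(weights * np.sin(angles))
    c = np.sum(weights * np.cos(angles))
    return np.arctan2(s, c)

# decision-level fusion of two hypothetical monomodal estimates
theta_video, conf_video = np.deg2rad(30.0), 0.7
theta_audio, conf_audio = np.deg2rad(40.0), 0.3
theta_fused = circular_mean(np.array([theta_video, theta_audio]),
                            np.array([conf_video, conf_audio]))
print(round(float(np.rad2deg(theta_fused)), 1))  # ~33.0, pulled toward video
```

Averaging on the unit circle matters here: a naive arithmetic mean of 350 and 10 degrees gives 180 instead of 0.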
3D Audio-Visual Speaker Tracking with an Adaptive Particle Filter
We propose an audio-visual fusion algorithm for 3D speaker tracking from a localised multi-modal sensor platform composed of a camera and a small microphone array. After extracting audio-visual cues from the individual modalities, we fuse them adaptively, using their reliability, in a particle filter framework. The reliability of the audio signal is measured from the maximum Global Coherence Field (GCF) peak value at each frame. The visual reliability is based on colour-histogram matching of detection results against a reference image in RGB space. Experiments on the AV16.3 dataset show that the proposed adaptive audio-visual tracker outperforms both the individual modalities and a classical approach with fixed parameters in terms of tracking accuracy.
Qian, Xinyuan; Brutti, Alessio; Omologo, Maurizio; Cavallaro, Andrea
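The adaptive weighting idea can be sketched as follows, assuming normalized reliabilities drive a log-linear combination of modality likelihoods in the particle weight update. The exact measures and update rule in the paper may differ; all values below are toy data:

```python
import numpy as np

def adaptive_weights(gcf_peak, hist_match):
    """Per-frame modality reliabilities: audio from the GCF peak value,
    video from colour-histogram similarity (both assumed in [0, 1])."""
    r_a = float(np.clip(gcf_peak, 0.0, 1.0))
    r_v = float(np.clip(hist_match, 0.0, 1.0))
    s = r_a + r_v
    return (r_a / s, r_v / s) if s > 0 else (0.5, 0.5)

def update_particle_weights(w, lik_audio, lik_video, a_audio, a_video):
    """Log-linear fusion: each particle's weight is multiplied by the
    modality likelihoods raised to their reliability exponents, then
    renormalized."""
    w = w * (lik_audio ** a_audio) * (lik_video ** a_video)
    return w / w.sum()

# toy frame with 5 particles; audio is the more reliable modality
a_a, a_v = adaptive_weights(gcf_peak=0.8, hist_match=0.2)
w = update_particle_weights(np.full(5, 0.2),
                            np.array([0.9, 0.5, 0.1, 0.1, 0.1]),
                            np.full(5, 0.4),
                            a_a, a_v)
print(a_a, w.argmax())  # the particle favoured by audio dominates
```

When one modality degrades (e.g. the GCF peak collapses in silence), its exponent shrinks and the update is driven by the other stream, which is the advantage over a fixed-parameter fusion.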