11,416 research outputs found

    A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition

    A key requirement for developing any innovative system in a computing environment is to integrate a sufficiently friendly interface with the average end user. Accurate design of such a user-centered interface, however, means more than just the ergonomics of the panels and displays. It also requires that designers precisely define what information to use and how, where, and when to use it. Recent advances in user-centered design of computing systems have suggested that multimodal integration can provide different types and levels of intelligence to the user interface. This thesis aims to improve speech recognition-based interfaces by making use of the visual modality conveyed by the movements of the lips. Designing a good visual front end is a major part of this framework. For this purpose, this work derives the optical flow fields for consecutive frames of people speaking. Independent Component Analysis (ICA) is then used to derive basis flow fields. The coefficients of these basis fields comprise the visual features of interest. It is shown that using ICA on optical flow fields yields better classification results than the traditional approaches based on Principal Component Analysis (PCA). In fact, ICA can capture the higher-order statistics that are needed to understand the motion of the mouth. This is because lip movement is inherently complex: it involves large image velocities, self-occlusion (due to the appearance and disappearance of the teeth) and considerable non-rigidity. Another issue of great interest to designers of audio-visual speech recognition systems is the integration (fusion) of the audio and visual information into an automatic speech recognizer. For this purpose, a reliability-driven sensor fusion scheme is developed. A statistical approach is developed to account for the dynamic changes in reliability. This is done in two steps. The first step derives suitable statistical reliability measures for the individual information streams. These measures are based on the dispersion of the N-best hypotheses of the individual stream classifiers. The second step finds an optimal mapping between the reliability measures and the stream weights that maximizes the conditional likelihood. For this purpose, genetic algorithms are used. The addressed issues are challenging problems and are essential to developing an audio-visual speech recognition framework that can maximize the information gathered about the words uttered and minimize the impact of noise.
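    The dispersion-based reliability measure described in this abstract can be illustrated with a minimal sketch. It assumes that dispersion is the mean absolute pairwise difference between the N-best log-likelihoods of a stream, which is one common definition and not necessarily the one used in the thesis. The proportional mapping in `stream_weights` is a hypothetical stand-in for the mapping that the thesis optimizes with genetic algorithms; both function names are invented for illustration.

```python
from itertools import combinations
from typing import Sequence, Tuple


def nbest_dispersion(log_likelihoods: Sequence[float]) -> float:
    """Mean absolute pairwise difference between N-best hypothesis scores.

    A large value means the classifier separates its hypotheses clearly,
    which is taken here as a sign that the stream is reliable.
    """
    if len(log_likelihoods) < 2:
        raise ValueError("need at least two hypotheses")
    pairs = list(combinations(log_likelihoods, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def stream_weights(
    audio_nbest: Sequence[float], video_nbest: Sequence[float]
) -> Tuple[float, float]:
    """Hypothetical mapping: weights proportional to dispersion, summing to 1.

    This is NOT the optimized mapping from the thesis, only a placeholder.
    """
    d_audio = nbest_dispersion(audio_nbest)
    d_video = nbest_dispersion(video_nbest)
    total = d_audio + d_video
    if total == 0.0:
        # Neither stream discriminates between hypotheses: split evenly.
        return 0.5, 0.5
    return d_audio / total, d_video / total


if __name__ == "__main__":
    # Audio hypotheses are well separated, video hypotheses are not.
    print(stream_weights([-10.0, -20.0, -30.0], [-10.0, -11.0, -12.0]))
```

    With these illustrative scores the audio stream receives most of the weight, because its hypotheses are spread further apart than those of the video stream.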

    Discriminatively trained features using fMPE for multi-stream audio-visual speech recognition

    fMPE is a recently introduced discriminative training technique that uses the Minimum Phone Error (MPE) discriminative criterion to train a feature-level transformation. In this paper we investigate fMPE-trained audio/visual features for multistream HMM-based audio-visual speech recognition. A flexible, layer-based implementation of fMPE allows us to combine the visual information with the audio stream using the discriminative training process, and dispense with the multiple stream approach. Experiments are reported on the IBM infrared headset audio-visual database. Averaged over 20-speaker, 1-hour speaker-independent test data, the fMPE-trained acoustic features achieve a 33% relative gain. Adding video layers on top of audio layers gives an additional 10% gain over fMPE-trained features from the audio stream alone. The fMPE-trained visual features achieve a 14% relative gain, while the decision fusion of audio/visual streams with fMPE-trained features achieves a 29% relative gain. However, fMPE-trained models do not improve over the original models on the mismatched noisy test data.

    On dynamic stream weighting for Audio-Visual Speech Recognition

    The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of the stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper we propose a new confidence measure, compare it to existing ones and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition over different system architectures and confidence measures, leading to an increase in performance more significant than any difference between the proposed confidence measures.
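    The role of stream weights and silence detection in this abstract can be sketched as follows. The sketch assumes the standard weighted log-likelihood combination for two streams. How the paper actually uses the voice activity detector is not stated in the abstract, so the rule below (fall back to a fixed weight when no speech is detected) is purely an assumption, and `fuse_scores` and `fallback_weight` are hypothetical names.

```python
from typing import Dict


def fuse_scores(
    audio_ll: Dict[str, float],
    video_ll: Dict[str, float],
    audio_weight: float,
    speech_active: bool,
    fallback_weight: float = 0.5,
) -> Dict[str, float]:
    """Combine per-class log-likelihoods from the audio and video streams.

    Each class score is w * audio + (1 - w) * video. When the (assumed)
    voice activity detector reports silence, the reliability-derived
    audio weight is ignored and a fixed fallback weight is used instead.
    """
    w = audio_weight if speech_active else fallback_weight
    w = min(1.0, max(0.0, w))  # keep the weight inside [0, 1]
    return {c: w * audio_ll[c] + (1.0 - w) * video_ll[c] for c in audio_ll}


if __name__ == "__main__":
    audio = {"yes": -1.0, "no": -3.0}
    video = {"yes": -4.0, "no": -2.0}
    fused = fuse_scores(audio, video, audio_weight=0.9, speech_active=True)
    print(max(fused, key=fused.get))
```

    Recomputing `audio_weight` for every frame or segment from a reliability measure is what would make the weighting dynamic in the sense of the abstract.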

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Framework for proximal personified interfaces


    A framework for context-aware driver status assessment systems

    The automotive industry is actively supporting research and innovation to meet manufacturers' requirements related to safety issues, performance and environment. The Green ITS project is among the efforts in that regard. Safety is a major customer and manufacturer concern. Therefore, much effort has been directed to developing cutting-edge technologies able to assess driver status in terms of alertness and suitability. In that regard, this thesis aims to create a framework for a context-aware driver status assessment system. Context-aware means that the machine uses background information about the driver and environmental conditions to better ascertain and understand driver status. The system also relies on multiple sensors, mainly video and audio. Using context and multi-sensor data, we need to perform multi-modal analysis and data fusion in order to infer as much knowledge as possible about the driver. Finally, the project is to be continued by other students, so the system should be modular and well-documented. With this in mind, a driving simulator integrating multiple sensors was built. This simulator is a starting point for experimentation related to driver status assessment, and a prototype of software for real-time driver status assessment is integrated into the platform. To make the system context-aware, we designed a driver identification module based on audio-visual data fusion. Thus, at the beginning of driving sessions, the users are identified and background knowledge about them is loaded to better understand and analyze their behavior. A driver status assessment system was then constructed based on two different modules. The first one is for driver fatigue detection, based on an infrared camera. Fatigue is inferred via percentage of eye closure, which is the best indicator of fatigue for vision systems. The second one is a driver distraction recognition system, based on a Kinect sensor. Using body, head, and facial expressions, a fusion strategy is employed to deduce the type of distraction a driver is subject to. Of course, fatigue and distraction are only a fraction of all possible driver states, but these two aspects have been studied here primarily because of their dramatic impact on traffic safety. Through experimental results, we show that our system is efficient for driver identification and driver inattention detection tasks. It is also very modular and could be further complemented by driver status analysis, context or additional sensor acquisition.
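    The percentage-of-eye-closure indicator mentioned in this abstract is commonly computed over a sliding window of video frames. The sketch below assumes a per-frame eye openness value in [0, 1] is already available from the camera pipeline and counts a frame as closed when openness is at most 0.2 (the usual "80% closed" convention). The class name, window length and threshold are illustrative choices, not values taken from the thesis.

```python
from collections import deque


class PerclosEstimator:
    """Sliding-window estimate of the fraction of frames with closed eyes."""

    def __init__(self, window_frames: int, closed_threshold: float = 0.2):
        if window_frames <= 0:
            raise ValueError("window_frames must be positive")
        self._closed_threshold = closed_threshold
        # deque with maxlen drops the oldest frame automatically.
        self._window = deque(maxlen=window_frames)

    def update(self, openness: float) -> float:
        """Add one frame's eye openness and return the current PERCLOS."""
        self._window.append(openness <= self._closed_threshold)
        return sum(self._window) / len(self._window)


if __name__ == "__main__":
    estimator = PerclosEstimator(window_frames=4)
    for openness in [1.0, 0.1, 0.1, 0.9, 0.0]:
        print(round(estimator.update(openness), 3))
```

    A fatigue detector built on this would compare the returned value against an alert threshold chosen experimentally.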