
    Bimodal Fusion in Audio-Visual Speech Recognition

    Extending automatic speech recognition (ASR) to the visual modality has been shown to greatly increase recognition accuracy and improve system robustness over purely acoustic systems, especially in acoustically hostile environments. An important aspect of designing such systems is how to incorporate the visual component into the acoustic speech recognizer to achieve optimal performance. In this paper, we investigate methods of integrating the audio and visual modalities within HMM-based classification models. We examine existing integration schemes and propose the use of a coupled hidden Markov model (CHMM) to exploit audio-visual interaction. Our experimental results demonstrate that the CHMM consistently outperforms other integration models over a large range of acoustic noise levels and suggest that it better captures the temporal correlations between the two streams of information.
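
    For context, one common integration scheme of the kind the paper compares the CHMM against is state-synchronous multi-stream fusion, in which per-stream observation log-likelihoods are combined with an exponent weight reflecting the relative reliability of each modality. The sketch below (Python) is illustrative only: the Gaussian state parameters, feature dimensions, and weight value are assumptions, not the paper's configuration.

        import numpy as np
        from scipy.stats import multivariate_normal

        def multistream_log_likelihood(x_audio, x_video, state, weight_audio=0.7):
            """Combine per-stream state log-likelihoods with exponent weights.

            `state` is assumed to hold Gaussian parameters for each stream; the
            weight trades off reliance on audio vs. video (e.g. a lower audio
            weight under heavy acoustic noise).
            """
            ll_a = multivariate_normal.logpdf(x_audio, state["mu_a"], state["cov_a"])
            ll_v = multivariate_normal.logpdf(x_video, state["mu_v"], state["cov_v"])
            return weight_audio * ll_a + (1.0 - weight_audio) * ll_v

        # Toy usage: a single state with 2-D audio and 2-D video features.
        state = {"mu_a": np.zeros(2), "cov_a": np.eye(2),
                 "mu_v": np.ones(2), "cov_v": 0.5 * np.eye(2)}
        print(multistream_log_likelihood(np.array([0.1, -0.2]),
                                         np.array([0.9, 1.1]), state))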

    Digital Signal Processing

    Contains reports on three research projects. U.S. Navy Office of Naval Research (Contract N00014-67-A-0204-0064). National Science Foundation (Grant GK-31353).

    Automatic Speechreading with Application to Speaker Verification

    Speech not only conveys linguistic information, but also characterizes the talker's identity, and can therefore be used in personal authentication. While most of the speech information is contained in the acoustic channel, the lip movement during speech production also provides useful information. In this paper we investigate the effectiveness of visual speech features in a speaker verification task. We first present the visual front-end of the automatic speechreading system. We then develop a recognition engine to train and recognize sequences of visual parameters. The experimental results based on the XM2VTS database [1] demonstrate that visual information is highly effective in reducing both false acceptance and false rejection rates in speaker verification tasks.
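
    Since performance is reported in terms of false acceptance and false rejection rates, a short sketch of how those two error rates are computed from verification scores may help; the score values and threshold below are invented for illustration and are not drawn from the XM2VTS experiments.

        import numpy as np

        def far_frr(genuine_scores, impostor_scores, threshold):
            """False acceptance / false rejection rates at a score threshold.

            Scores are assumed to be similarity scores (higher = more likely
            the claimed speaker); the inputs here are illustrative only.
            """
            genuine = np.asarray(genuine_scores)
            impostor = np.asarray(impostor_scores)
            frr = np.mean(genuine < threshold)    # true claimants wrongly rejected
            far = np.mean(impostor >= threshold)  # impostors wrongly accepted
            return far, frr

        # Toy usage with made-up score distributions.
        print(far_frr([0.9, 0.8, 0.7, 0.95], [0.2, 0.4, 0.6, 0.1], threshold=0.65))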

    Lip Feature Extraction Towards an Automatic Speechreading System

    The use of color information can significantly improve the efficiency and robustness of lip feature extraction over purely grayscale-based methods. Edge information provides another useful tool for characterizing lip boundaries. In this paper we present a method that integrates both types of information to address the problem of lip feature extraction for the purpose of speechreading. We first examine various color models and identify hue as an effective descriptor for characterizing the lips, owing to its invariance to luminance and to human skin color, and to its discriminative properties. We use prominent red hue as an indicator to locate the position of the lips. Based on the identified lip area, we further refine the interior and exterior lip boundaries using both color and spatial edge information, combined within a Markov random field (MRF) framework. Experimental results are presented to show the effectiveness of this method.
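
    A rough sketch of the hue-based localization step is given below, written with OpenCV (Python); the hue/saturation thresholds, morphology settings, and OpenCV 4 return conventions are assumptions made for illustration, and the MRF-based boundary refinement is not reproduced.

        import cv2
        import numpy as np

        def locate_lip_region(bgr_image):
            """Rough lip localization by prominent red hue (illustrative thresholds).

            Returns a bounding box (x, y, w, h) around the largest red-hue blob,
            or None if nothing is found.
            """
            hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
            h, s, v = cv2.split(hsv)
            # Red hue wraps around 0 in OpenCV's 0-179 hue range.
            red_mask = ((h < 10) | (h > 170)) & (s > 60) & (v > 40)
            mask = red_mask.astype(np.uint8) * 255
            # Remove speckle before picking the dominant component.
            mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            if not contours:
                return None
            largest = max(contours, key=cv2.contourArea)
            return cv2.boundingRect(largest)

        # Toy usage: a gray frame with a synthetic red patch where the lips would be.
        frame = np.full((120, 160, 3), 128, np.uint8)
        frame[70:90, 60:100] = (0, 0, 220)  # BGR red block
        print(locate_lip_region(frame))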

    Digital reconstruction of multidimensional signals from their projections.

    Massachusetts Institute of Technology. Dept. of Electrical Engineering. Thesis. 1973. Sc.D. Microfiche copy also available in Barker Engineering Library. Vita. Bibliography: leaves 181-184.

    Novel methods for video signal analysis and compression

    Issued as final report.

    Multiple Global Affine Motion Models Used in Video Coding

    In low bit rate scenarios, a hybrid video coder (e.g. AVC/H.264) tends to allocate a greater portion of bits to motion vectors while saving bits on residual errors. Motivated by this fact, a coding scheme is proposed that combines non-normative global motion models with conventional local motion vectors, describing the motion of a frame by affine motion parameter sets obtained from motion segmentation of the luminance channel. The motion segmentation task is capable of adapting the number of motion objects to the video content. Six-parameter affine model sets are derived by linear regression from the scalable block-based motion fields estimated by the existing MPEG encoder. When the number of motion objects exceeds a certain threshold, the global affine models are disabled. Otherwise, the four scaling factors of the affine models are compressed by a vector quantizer designed with a unique cache memory for efficient searching and coding. The affine motion information is written in the slice header as a syntax element. The global motion information is used to compensate those macroblocks whose Lagrange cost is minimized by the AFFINE mode. The rate-distortion cost is computed by a modified Lagrange equation that takes into consideration the perceptual discrimination of human vision in different areas. Besides increasing coding efficiency, the global affine model exhibits two features that improve the compressed video quality: i) when the number of slices per frame is more than one, the global affine motion model can enhance the error resilience of the video stream, because the affine motion parameters are duplicated in the headers of the different slices of the same frame; ii) the global motion model predicts a frame by warping the whole reference frame, which helps to decrease blocking artifacts in the compensated frame. Ph.D. Committee Chair: Jackson, Joel; Committee Member: Anderson, David; Committee Member: Fritz, Hermann; Committee Member: Mersereau, Russell; Committee Member: Yezzi, Anthony.
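
    The regression step that fits a six-parameter affine model to a block-based motion field can be sketched as an ordinary least-squares problem, as below (Python). The block centers and motion vectors are assumed to come from the encoder's block-matching stage; motion segmentation, vector quantization of the scaling factors, and the rate-distortion mode decision are not shown.

        import numpy as np

        def fit_affine_from_block_motion(centers, motion_vectors):
            """Least-squares six-parameter affine model from block motion vectors.

            `centers` is an (N, 2) array of block-center coordinates (x, y) and
            `motion_vectors` is an (N, 2) array of the corresponding (dx, dy)
            displacements. The model maps (x, y) to (a1*x + a2*y + a3,
            a4*x + a5*y + a6); we regress on the displaced positions.
            """
            centers = np.asarray(centers, dtype=float)
            targets = centers + np.asarray(motion_vectors, dtype=float)
            design = np.hstack([centers, np.ones((centers.shape[0], 1))])  # [x, y, 1]
            # Two independent least-squares problems, one per coordinate.
            params_x, *_ = np.linalg.lstsq(design, targets[:, 0], rcond=None)
            params_y, *_ = np.linalg.lstsq(design, targets[:, 1], rcond=None)
            return np.concatenate([params_x, params_y])  # (a1, ..., a6)

        # Toy usage: a pure translation of (2, -1) should recover a3 = 2, a6 = -1.
        cent = np.array([[8, 8], [24, 8], [8, 24], [24, 24]])
        mv = np.tile([2.0, -1.0], (4, 1))
        print(fit_affine_from_block_motion(cent, mv))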

    Audio-Visual Speech Recognition by Speechreading

    Speechreading increases intelligibility in human speech perception. This suggests that conventional acoustic-based speech processing can benefit from the addition of visual information. This paper exploits speechreading for joint audio-visual speech recognition. We first present a color-based feature extraction algorithm that is able to extract salient visual speech features reliably from a frontal view of the talker in a video sequence. Then, a new fusion strategy using a coupled hidden Markov model (CHMM) is proposed to incorporate the visual modality into the acoustic subsystem. By maintaining temporal coupling across the two modalities at the feature level while allowing asynchrony between their states, a CHMM provides a better model for capturing the temporal correlations between the two streams of information. The experimental results demonstrate that the combined audio-visual system outperforms the acoustic-only recognizer over a wide range of noise levels.
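
    To make the notion of coupling concrete, the sketch below (Python) shows a coupled transition structure in which each stream's next state is conditioned on the previous states of both streams, so the two chains stay correlated while remaining free to drift out of lock-step. The state counts and randomly drawn transition probabilities are illustrative, not parameters trained in the paper.

        import numpy as np

        rng = np.random.default_rng(0)
        N_A, N_V = 3, 3  # number of audio and video states (illustrative)

        # Coupled transitions: P(next audio state | prev audio, prev video state),
        # and symmetrically for video, so each chain is influenced by the other.
        A_audio = rng.dirichlet(np.ones(N_A), size=(N_A, N_V))  # (N_A, N_V, N_A)
        A_video = rng.dirichlet(np.ones(N_V), size=(N_A, N_V))  # (N_A, N_V, N_V)

        def step(prev_a, prev_v):
            """Sample the next (audio, video) state pair given the previous pair."""
            next_a = rng.choice(N_A, p=A_audio[prev_a, prev_v])
            next_v = rng.choice(N_V, p=A_video[prev_a, prev_v])
            return next_a, next_v

        # Toy usage: generate a short coupled state sequence starting from (0, 0).
        state = (0, 0)
        sequence = [state]
        for _ in range(10):
            state = step(*state)
            sequence.append(state)
        print(sequence)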