Speechreading increases intelligibility in human speech perception. This suggests that conventional acoustic-based speech processing can benefit from the addition of visual information. This paper exploits speechreading for joint audio-visual speech recognition. We first present a color-based feature extraction algorithm that is able to extract salient visual speech features reliably from a frontal view of the talker in a video sequence. Then, a new fusion strategy using a coupled hidden Markov model (CHMM) is proposed to incorporate visual modality into the acoustic subsystem. By maintaining temporal coupling across the two modalities at the feature level and allowing asynchrony in the state at the same time, a CHMM provides a better model for capturing temporal correlations between the two streams of information. The experimental results demonstrate that the combined audio-visual system outperforms the acoustic-only recognizer over a wide range of noise levels.

Clements, Mark A.

Mersereau, Russell M.

Zhang, Xiaozheng

English

DigitalCommons@CalPoly

AUDIO-VISUAL SPEECH RECOGNITION BY SPEECHREADING

Xiaozheng Zhang, Russell M. Mersereau, and Mark A. Clements

1. INTRODUCTION

Hearing-impaired people use speechreading as a primary source of information for speech perception. Even listeners with normal hearing can enhance their speech perception by seeing the speaker's face, particularly in noisy conditions. The benefit gained from the presence of the visual signal has been quantitatively estimated to be equivalent to an increase of 15 dB in the SNR when noisy environments are encountered [1]. This is due in part to the complementary nature of the audio and visual aspects of speech.

The first attempt to use vision to aid speech recognition was made by Petajan in 1984 [2]. He demonstrated that visual speech yields information that is not always present in the acoustic signal and enables improved recognition accuracy over purely acoustic-based systems.
Since then, there has been increasing interest in supplementing acoustic recognizers with the visual modality to overcome their limitations. While yielding excellent results in a controlled environment, the performance of acoustic-only systems degrades dramatically in noisy real-world environments, such as in an automobile or in a typical office with noise from ringing telephones, fans, and human conversations. Robust automatic speech recognition has been an engineering goal for several decades. The use of additional visual information has opened new possibilities.

Research in automatic speechreading is primarily directed at two areas: the design of a visual front end where visual speech features are accurately and reliably extracted, and the development of an effective strategy to integrate the two separate information sources. In this paper, we examine both of these issues.

This paper is organized into three parts. Section 2 describes the novel visual front end that we use to extract the visual features. Section 3 addresses the problem of audio-visual integration and introduces the coupled hidden Markov model (CHMM) for fusing the two speech modalities. Finally, some initial experiments on audio-visual speech recognition and performance evaluations are presented in Section 4 for both speaker-dependent and speaker-independent cases.

2. VISUAL ANALYSIS

2.1 Previous Work

Most visual speech information is contained in the lips. Thus, visual analysis in automatic speechreading usually focuses on lip feature extraction. Existing approaches to visual feature extraction generally fall under two main categories: image-based techniques and explicit feature extraction.

In the image-based approach, the whole image containing the mouth area is used as a feature, either directly [3] or after some preprocessing such as a principal component analysis [4] or vector quantization [5].
In a more recent study [6], the image was processed by a discrete cosine transform followed by a linear discriminant analysis projection and a maximum likelihood linear transform feature rotation. The advantage of an image-based approach is that no information is lost; however, it is left to the recognition engine to determine the relevant features in the image. A common criticism of this approach is that it tends to be very sensitive to changes in illumination, position, camera distance, rotation, and speaker.

The alternative to the image-based method aims at explicitly extracting relevant visual speech features. Here, model-based methods are commonly considered, in which a geometric model of the lip contour is applied. Typical examples are deformable templates, "snakes", and active shape models (ASM). Recently, an active appearance model (AAM) extending the ASM was proposed [7]. It adds a statistical model of gray-level appearance. However, most of these methods use intensity-based images. The difficulty with these approaches usually arises when the contrast is poor along the lip contours, which occurs quite often under natural lighting. In particular, edges on the lower lip are difficult to distinguish because of shading and reflection. These algorithms are difficult to extend to various lighting conditions, different skin colors, and people with facial hair. In addition, it is difficult to detect the teeth and tongue using intensity information only, because the skin-lip and lip-teeth edges are highly confusable.

2.2 Our Approach

We propose a color-based approach for lip feature extraction. Color is an important identifying feature for the lips. Prominent colors can be used as a far more efficient search criterion for detecting and extracting certain objects, e.g., red for identifying the lips. In our previous work [8], we derived a modified version of the hue representation for lip images.
Hue is easily justified because of its color constancy across genders and races and its high discriminative power for detecting the lips. Thus, the first step in our analysis is a transformation from RGB to modified HSV.

Figure 1 shows an overview of the visual front end for the feature extraction. It consists of three visual analysis stages: lip region localization, lip segmentation, and a final lip feature extraction.

Figure 1: Visual processing steps.

In [9], we describe in detail how we locate the speaker's mouth region reliably from a color video sequence by using hue, saturation, and motion information. Next, we combine both color and edge information to segment the lip from its surroundings by using a Markov random field (MRF) framework. Under the MRF modeling assumption, image interpretation is formulated as a problem of maximizing the a posteriori probability of correct labeling given prior knowledge and the actual observed data. Finally, the key points that define the lip position are detected and the relevant visual speech parameters are derived. Fig. 2 shows examples of the extracted feature key points.

Figure 2: Measured feature points on the lips.

Based on the extracted feature key points, the following geometric dimensions of the lips are derived: mouth width w2, upper/lower lip width (h1, h3), lip opening height/width (h2, w1), and the distance between the horizontal lip line and the upper lip (h4). An illustration of the geometry is shown in Fig. 3.

Figure 3: Illustration of the extracted geometric features of the lips.

In addition to the geometric dimensions of the lips, we also detect the visibility of the tongue and teeth [9]. The parameter for the tongue is the total number of lip-color pixels that lie within the inner lip contour, while the parameter for the teeth is the total number of white pixels that lie within the bounding box.
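
The two visibility parameters described above amount to pixel counts over masked regions. The following is a minimal sketch of that idea, not the authors' implementation. It assumes standard HSV channels scaled to [0, 1] rather than the modified hue of [8], and the function name, the thresholds for "lip-color" and "white", and the caller-supplied rectangle standing in for the bounding box are all hypothetical choices made for illustration.

```python
import numpy as np

def count_tongue_teeth(hue, sat, val, inner_mask, bbox,
                       lip_hue_halfwidth=0.05,
                       white_sat_max=0.25,
                       white_val_min=0.7):
    """Return (tongue_count, teeth_count) as pixel counts.

    hue, sat, val : 2-D float arrays in [0, 1] (standard HSV, an assumption).
    inner_mask    : 2-D boolean array, True inside the inner lip contour.
    bbox          : (top, bottom, left, right) rectangle, end-exclusive.
    All thresholds are illustrative values, not taken from the paper.
    """
    # Red hues wrap around 0/1, so use the circular distance to hue 0.
    hue_dist = np.minimum(hue, 1.0 - hue)
    lip_colored = hue_dist <= lip_hue_halfwidth
    tongue = int(np.count_nonzero(lip_colored & inner_mask))

    # "White" is approximated as low saturation and high brightness.
    top, bottom, left, right = bbox
    box = np.zeros_like(inner_mask, dtype=bool)
    box[top:bottom, left:right] = True
    white = (sat <= white_sat_max) & (val >= white_val_min)
    teeth = int(np.count_nonzero(white & box))
    return tongue, teeth
```

Because both parameters are raw counts, they scale with image resolution and camera distance, which is one reason a per-speaker normalization step is useful before recognition.
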
We applied the feature extraction algorithm to the Carnegie Mellon University database [10] with ten test subjects. The database includes head-and-shoulder, full frontal face color video sequences of a person talking. The feature extraction algorithm works well for the data sets, which contain several hours of video sequences. In a few cases, inaccuracies of a few pixels were observed.

3. AUDIO-VISUAL INTEGRATION

3.1 Previous Approaches

The second major issue in an automatic speechreading system is how to incorporate the visual component into an acoustic speech recognizer so that optimal performance can be achieved by using both modalities together. For engineering applications, two AV integration models are commonly used in automatic speechreading systems: early integration (EI) and late integration (LI). In early integration, audio-visual fusion is performed in the feature space to form a composite feature vector. Recognition is based on the augmented feature vector. In late integration, each modality is first preclassified independently of the other. The resulting audio and visual recognition scores are then combined using a rule.

Late integration offers several advantages over early integration because its implementation is simple and it does not require synchronization of the acoustic and visual features. In late integration, each independent subsystem can be developed and trained separately. However, the use of separate models assumes conditional independence between the two feature sets and, therefore, it fails to model the correlations between the visual and acoustic channels. Early integration provides a more general model by integrating the two components before recognition. However, the classification is based on training a single HMM using the concatenated audio and visual feature vectors. It forces the same state sequences upon the audio and visual components, which does not correspond to the way that people talk.
Often the lips start moving before voicing commences. Therefore, an early integration model restricts the asynchrony between the two streams of information that naturally occurs in speech production.

3.2 Our Approach

We propose the use of a generalized model, the coupled hidden Markov model (CHMM), to model the audio-visual interaction for speech recognition. The coupled hidden Markov model was first introduced by Brand in 1996 and was successfully used for modeling Tai Chi gestures [11]. In a coupled HMM, as shown in Fig. 4, the traditional left-right HMM is expanded to a model containing two Markov chains, representing the audio and visual channels. The coupling between the two sub-processes is introduced by conditional probabilities between the hidden state variables, $\Pr(S_t^A \mid S_{t-1}^A, S_{t-1}^V)$ and $\Pr(S_t^V \mid S_{t-1}^A, S_{t-1}^V)$. On the one hand, this architecture relaxes the restriction of early integration by allowing asynchrony between the two channels. On the other hand, unlike late integration, it incorporates temporal coupling terms across the two sub-systems.

Figure 4: A 3-state coupled hidden Markov model.

Although the topology of a coupled HMM resembles that of an ordinary HMM, the inference and learning algorithms of ordinary HMMs are not directly applicable. To solve the inference problem in a coupled HMM, we employ the approximate approach proposed by Boyen and Koller [12]. The key ingredient of the BK algorithm is the propagation of an approximate probability distribution over the entire system using factored products defined over independent clusters. The accumulated error arising from the repeated approximation was proved to remain bounded indefinitely over time. The BK algorithm has been shown to be an efficient approach to solving inference problems in general dynamic Bayesian networks.

For learning parameters in the CHMM, forward and backward variables are first approximated.
The BK algorithm represents the forward variable as a product of marginal random variables over the two sub-processes. The approximated forward variable at time t is then propagated through the transition model and conditioned on the evidence at time t+1 using the junction tree algorithm [13]. To allow the algorithm to continue, the forward variable at t+1 is approximated using a random variable that admits a compact representation by computing marginals over each cluster. The same procedure can be applied to approximating the backward variable $\beta$. These two variables are then used in an EM algorithm that learns the model in an iterative manner.

4. AUDIO-VISUAL SPEECH RECOGNITION

We performed our experiments on audio-visual speech recognition using the audio-visual database available from Carnegie Mellon University [10]. This database includes ten test subjects (three females, seven males) speaking 78 isolated words repeated ten times. In our experiment, we use the data set consisting of the 31 "number" words: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, thirty, forty, fifty, sixty, seventy, eighty, ninety, hundred, thousand, million, billion.

In the visual sub-system, we used the six geometric features defined in Fig. 3 and the two parameters for the teeth and tongue. Delta features (first derivatives of the features over consecutive frames), computed with a regression formula over a few frames before and after the current frame, were also included in the visual features, forming a 16-dimensional feature vector. The visual feature vectors were preprocessed by normalizing with respect to the average mouth width, w2, of each speaker to account for the difference in scale between different speakers and different recording settings for the same person.
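
The construction of the 16-dimensional visual vector can be illustrated with a short sketch. The text does not give the regression formula or the window size, so the code below assumes the widely used form $d_t = \sum_{k=1}^{K} k\,(c_{t+k} - c_{t-k}) \,/\, (2 \sum_{k=1}^{K} k^2)$ with K = 2 and edge frames repeated at the sequence boundaries. The function names, the column index of w2, and the choice to divide all eight static features by the mean mouth width are likewise assumptions made for illustration.

```python
import numpy as np

def normalize_by_mouth_width(static, w2_index):
    """Divide every static feature by the speaker's mean mouth width w2."""
    mean_w2 = static[:, w2_index].mean()
    return static / mean_w2

def delta_features(static, half_window=2):
    """Regression-based first derivatives over +/- half_window frames.

    static: array of shape (n_frames, n_features).
    """
    n_frames = static.shape[0]
    # Repeat the first and last frames so every frame has full context.
    padded = np.pad(static, ((half_window, half_window), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    deltas = np.zeros_like(static, dtype=float)
    for k in range(1, half_window + 1):
        ahead = padded[half_window + k: half_window + k + n_frames]
        behind = padded[half_window - k: half_window - k + n_frames]
        deltas += k * (ahead - behind)
    return deltas / denom

def visual_feature_vectors(static, w2_index=0):
    """Stack normalized static features with their deltas (8 -> 16 dims)."""
    normed = normalize_by_mouth_width(np.asarray(static, dtype=float), w2_index)
    return np.hstack([normed, delta_features(normed)])
```

Since the regression is linear, dividing by the mean mouth width before or after taking deltas gives the same result, so the order of the two steps in this sketch is not significant.
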
In the acoustic subsystem, we used twelve mel frequency cepstral coefficients (MFCCs) and their corresponding delta parameters as the features, giving a 24-dimensional feature vector. The MFCCs were derived from FFT-based log spectra with a frame period of 11 ms and a window length of 25 ms.

We conducted tests for both speaker-dependent and speaker-independent tasks. For the speaker-dependent task, the test was evaluated by using a leave-one-out procedure. For the speaker-independent task, we used different speakers for training and testing.

In all cases, the HMMs have ten states, and we modeled the observation vectors using two Gaussian mixture components for the speaker-independent task. Because of the limited training data available, we used a single Gaussian component in the speaker-dependent case. For early integration (EI), the classification was based on training a traditional HMM on the concatenated audio-visual observation vectors. Since the video has a frame period of 33 ms, linear interpolation was applied to the visual features to fill in values between the existing feature data points and match the audio frame period of 11 ms. For late integration (LI), the combined score was computed as a weighted combination of $P_A$ and $P_V$, the probability scores of the audio and visual components, with the weighting factor $\lambda$ set to 0.7 in our experiments. For comparison, we also include results for a multistream HMM (MS), which is characterized by its output distribution:

$b_j(o_t) = \{b_j(o_t^A)\}^{\gamma_A} \cdot \{b_j(o_t^V)\}^{\gamma_V}$

The exponents $\gamma_A$ and $\gamma_V$ are the weighting factors for each stream. We set $\gamma_A = 0.7$ and $\gamma_V = 0.3$ in our experiments.

Model training and Viterbi decoding of the HMMs were implemented using the HTK Toolkit [14]. The BK algorithm for the coupled HMM was implemented using the Bayes Net Toolbox [15]. Prior to employing the BK algorithm, it is essential that the model parameters be well initialized.
For this, we applied a traditional EM algorithm to the two separate HMMs and used the model parameters trained on the separate HMMs as the initial parameters in the CHMM.

In the following, we present our experimental results on audio-visual speech recognition over a range of noise levels using these four models. Artificial white Gaussian noise was added to simulate various noise conditions. The experiment was conducted under a mismatched condition: the recognizers were trained at 30 dB SNR and tested under varying noise levels. Tables 1 and 2 summarize the recognition performance using the four integration schemes for the speaker-dependent and speaker-independent tasks, respectively. As can be seen, all four integration models demonstrate improved recognition accuracy over audio-only performance. While the recognition accuracy of the CHMM is very close to the best results from the three other models in the speaker-dependent task, the CHMM consistently outperforms the others in the speaker-independent task. This might indicate that the CHMM requires a larger amount of training data for better model parameter estimates.

SNR      video   audio   EI      LI      MS      CHMM
0 dB     48.94    7.48   39.10   14.74   27.29   39.34
10 dB    48.94   34.90   72.19   57.06   75.06   75.27
30 dB    48.94   84.29   88.48   91.23   91.52   91.23

Table 1: Audio-visual speech recognition performance in the speaker-dependent mode. The numbers represent the percent correct recognition.

SNR      video   audio   EI      LI      MS      CHMM
0 dB     26.90    8.29   15.84    7.13   11.77   20.84
10 dB    26.90   31.13   45.03   32.45   42.77   50.18
30 dB    26.90   68.03   68.74   63.52   68.65   72.19

Table 2: Audio-visual speech recognition performance in the speaker-independent mode. The numbers represent the percent correct recognition.

5. SUMMARY

In this paper, we demonstrated an automatic speechreading system for audio-visual speech recognition.
By combining the visual speech features extracted from our visual front end and a traditional acoustic front end, we performed bimodal speech recognition using a coupled hidden Markov model. The combined system demonstrated significant performance improvement over an audio-only subsystem. This gain is most distinct at low SNR, where traditional ASR performs poorly.

Acknowledgments

We would like to acknowledge the use of the audio-visual database [10] from the Advanced Multimedia Processing Lab at Carnegie Mellon University.

REFERENCES

[1] W. H. Sumby and I. Pollack, "Visual contribution to speech intelligibility in noise," J. Acoust. Soc. Amer., vol. 26, pp. 212-215, 1954.
[2] E. D. Petajan, Automatic Lipreading to Enhance Speech Recognition, Ph.D. thesis, U. of Illinois, 1984.
[3] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, "Integration of acoustic and visual speech signals using neural nets," IEEE Commun. Mag., pp. 65-71, Nov. 1989.
[4] C. Bregler and Y. Konig, "Eigenlips for robust speech recognition," Proc. IEEE ICASSP, pp. 669-672, 1994.
[5] P. L. Silsbee and A. C. Bovik, "Computer lipreading for improved accuracy in automatic speech recognition," IEEE Trans. Speech Aud. Processing, vol. 4, pp. 337-351, 1996.
[6] G. Potamianos, J. Luettin, and C. Neti, "Hierarchical discriminant features for audio-visual LVCSR," Proc. IEEE ICASSP, 2001.
[7] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, "Extraction of visual features for lipreading," IEEE Trans. PAMI, vol. 24, 2002.
[8] X. Zhang and R. M. Mersereau, "Lip feature extraction towards an automatic speechreading system," Proc. IEEE ICIP, 2000.
[9] X. Zhang, C. C. Broun, R. M. Mersereau, and M. A. Clements, "Automatic speechreading with applications to human-computer interfaces," submitted to EURASIP J. Appl. Sig. Proc., 2001.
[10] URL: amp.ece.cmu.edu/intel/teature.data.html.
[11] M. Brand, "Coupled hidden Markov models for modeling interacting processes," Tech. Rept. TR 405, MIT Media Lab, 1996.
[12] X. Boyen and D. Koller, "Tractable inference for complex stochastic processes," Proc. 14th Ann. Conf. Uncertainty in Artif. Intel., pp. 33-42, 1998.
[13] C. Huang and A. Darwiche, "Inference in belief networks: a procedural guide," Int. J. Approx. Reasoning, vol. 11, pp. 1-158, 1994.
[14] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Entropic Ltd., Cambridge, 1999.
[15] K. Murphy, "The Bayes' net toolbox for Matlab," in Proc. Computing Science and Statistics Interface, vol. 33, 2001.

Audio-Visual Speech Recognition by Speechreading

https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=1261&context=eeng_fac
