International audienceWe present an inversion framework to identify speech production properties from audiovisual information. Our system is built on a multimodal articulatory dataset comprising ultrasound, X-ray, magnetic resonance images as well as audio and stereovisual recordings of the speaker. Visual information is captured via stereovision while the vocal tract state is represented by a properly trained articulatory model. Inversion is based on an adaptive piecewise linear approximation of the audiovisualto- articulation mapping. The presented system can recover the hidden vocal tract shapes and may serve as a basis for a more widely applicable inversion setup

Aron, Michael

Berger, Marie-Odile

Katsamanis, Athanassios

Maragos, Petros

Roussos, Anastasios

INRIA a CCSD electronic archive server

HAL Id: inria-00327031https://hal.inria.fr/inria-00327031Submitted on 6 Jan 2009HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.Inversion from Audiovisual Speech to ArticulatoryInformation by Exploiting Multimodal DataAthanassios Katsamanis, Anastasios Roussos, Petros Maragos, Michael Aron,Marie-Odile BergerTo cite this version:Athanassios Katsamanis, Anastasios Roussos, Petros Maragos, Michael Aron, Marie-Odile Berger. In-version from Audiovisual Speech to Articulatory Information by Exploiting Multimodal Data. 8th In-ternational Seminar On Speech Production - ISSP’08, Dec 2008, Strasbourg, France. ￿inria-00327031￿Inversion from Audiovisual Speech to Articulatory Information by ExploitingMultimodal DataA. Katsamanis1, A. Roussos1, P. Maragos1, M. Aron2, M.-O. Berger21 National Technical University of Athens 2LORIA/INRIA Nancy Grand-EstE-mail: {nkatsam,troussos,maragos}@cs.ntua.gr {aron,berger}@loria.frAbstractWe present an inversion framework to identifyspeech production properties from audiovisual in-formation. Our system is built on a multimodalarticulatory dataset comprising ultrasound, X-ray,magnetic resonance images, electromagnetic artic-ulography data as well as audio and stereovisualrecordings of the speaker. Visual information iscaptured via stereovision while the vocal tract stateis represented by a properly trained articulatorymodel. The audiovisual-to-articulation relationshipis approximated by an adaptive piecewise linearmapping. The presented system can recover the hid-den vocal tract shapes and may serve as a basis fora more widely applicable inversion setup.1 IntroductionFrom an engineering viewpoint, inversion fromspeech to articulatory information refers to the prob-lem of identification of the underlying speech pro-duction system given the observed output. For thedevelopment of an inversion setup, three primary de-sign issues have to be addressed, namely the choiceof the speech representation, the description of theproduction system and the adoption of a proper com-putational framework. Related decisions are typi-cally motivated by speech production theory, the na-ture and amount of the available data and the specificapplication goals which the inversion scheme willhave to serve. In this context, we build on availablemultimodal articulatory data and present an inver-sion framework to recover articulation from audio-visual speech information. Speech is represented byboth audio spectral features and visual cues from thespeaker’s face. Articulation is described by means ofan appropriate articulatory model and the speech-to-articulation mapping is approximated in an adaptivepiecewise linear manner.Speech inversion has been traditionally regardedas the determination of the vocal tract (VT) shapefrom the audio speech signal only [3, 8, 7]. For-mant values, Line Spectral Frequencies or Mel-Frequency Cepstral Coefficients (MFCCs) are al-ternative acoustic parameterizations that have beenused. Introduction of the visual modality in thespeech inversion process can significantly improveinversion accuracy [5, 4]. Independent componentanalysis of the region around the lips [5] or active ap-pearance modeling of the face [4] provide the prac-tical means to achieve this.Regarding the representation of the vocal tract,several options have been proposed, each satisfy-ing certain requirements. For example, the tubeletmodel in [9] allows inversion from formants basedon the linear speech production theory. It is quiterestrictive though and does not easily account forsounds other than vowels. The representation viathe coordinates of points on significant articulatorsas used in [3] is more realistic but spatially sparseand not so informative for the entire VT state. How-ever, such data can be relatively easily acquired viaElectromagnetic Articulography (EMA) and haveallowed the application of data-driven techniques forinversion. A much more informative representationis the one achieved via an articulatory model [8, 7]that can describe the geometry or the cross-sectionalarea of the vocal tract and is controlled by a limitednumber of parameters. Such models have been builtfrom real articulatory data, i.e., X-ray [6] or Mag-netic Resonance Images (MRIs) [7, 2] of the vocaltract. The amount of corresponding data is howeverlimited and thus such models are not easily usable inInternational Seminar on Speech Production (ISSP 2008), Strasbourg, France, Dec. 2008 1  FH1−170(a) X-ray, contours & grid50 100 150 20050100150200  FH1−170(b) Intersection Points  49% Variance Expl.Max component 1Min component 1Mean Shape(c) 1st Component  29% Variance Expl.Max component 2Min component 2Mean Shape(d) 2nd ComponentFigure 1: Building the articulatory model from theX-ray data; positioning the grid and finding the in-tersection points with the vocal tract boundary. Thefirst model components are shown after PCA.a data-driven inversion scenario.As far as the inversion method is concerned, bothmodel-based [9, 8] and data-driven approaches [3]have been reported. In [8] efficient methods are de-scribed for the use of codebooks relating formantsand articulatory model parameters. To also exploitdynamic information, a piecewise-linear approxima-tion of the audio-articulatory relation is presentedin [3]. Each phoneme is modeled by a context-dependent Hidden Markov Model (HMM) and aseparate linear regression mapping is trained at eachHMM state between the observed MFCCs and thecorresponding articulatory parameters.Given this setting, in the current paper we pro-pose an audiovisual speech inversion frameworkbuilt on multimodal articulatory data. Visual infor-mation is incorporated in the form of 3D coordi-nates of markers painted on the speaker’s face andrecovered via stereovision. The VT shape is repre-sented by an articulatory model that is constructedfrom manually annotated X-ray data. Inversion isachieved by an HMM-based framework similar tothe one presented in [4]. The system has beenproperly adapted to recover the hidden articulatorymodel parameters. Experiments were performedon a recently acquired articulatory dataset compris-ing concurrent audio recordings, stereo-videos ofthe speaker’s face, EMA data and Ultrasound (US)videos of his tongue. To extract proper articulatoryparameters from this dataset we fitted the articula-tory model to the visible tongue contour part in ev-ery US frame. Registration of the X-ray referencesystem in the US images is achieved by properlyexploiting available head MRIs of the speaker andthe stereovisual data. Recovered articulatory shapesafter inversion closely follow the original ones anddemonstrate the potential of the approach.2 Framework DescriptionThe articulatory dataset on which the frameworkhas been built is described in detail in [1]. It in-cludes audio (44kHz), stereo-videos (120Hz) of theface, US images of the tongue (65Hz) and record-ings of EM sensors (40Hz) on the US-probe, thespeaker’s tongue and head. In the performed exper-iments, approximately 6 minutes of recordings havebeen exploited. There is only one French speakerand the uttered corpus includes a wide variety of iso-lated phonetic sequences (vowel-consonant-vowelor vowel-vowel) and a set of phonetically balancedFrench sentences. Additionally, 3D MRI data of thespeaker’s head are included (for 3 sustained vow-els). There also exist approximately 700 VT shapes,corresponding to roughly 30 seconds of manuallyannotated x-ray images of the speaker’s vocal tract(25Hz). These contours have been used to train aproper articulatory model.Articulatory Model The role of the articula-tory model is critical in our setup. It describesthe midsagittal VT shape and is constructed asin [6]. A semi-polar grid (to which we refer asVT grid) is properly positioned on the midsagittalplane, Fig.1(a), and the coordinates of the intersec-tion points with the vocal tract boundary are found,Fig.1(b). Principal Component Analysis (PCA) ofthe derived vectors determines the components of alinear model that can describe almost 96% of theshape variance using only 6 parameters. The firsttwo components are shown in Figs.1(c), 1(d) for themaximum and minimum values of the correspond-ing parameters. The goal is then to fit this modelto the US data of the tongue in order to derive anefficient description of the VT shape for the wholedataset. For this purpose, registration of the grid onInternational Seminar on Speech Production (ISSP 2008), Strasbourg, France, Dec. 2008 2−100−50050100050100−250−200−150 Face markersForeheadOuter VT wallEM sensorsUS img areaFigure 2: Registration of the multimodal articulatorydata for a particular time instance. The HCS is used.the US images is necessary.Registration of the Articulatory Data As a pre-liminary step, preprocessing is applied separately onthe different image modalities. The speaker’s outerVT wall and forehead surfaces are reconstructedfrom 3D MRI of his head (see Fig.2). The 3D po-sitions of the painted face markers at every time in-stance are tracked automatically using the image se-quences of the stereo camera pair (see Fig.3). TheUS image sequences are filtered using the prepro-cessing method described in [1], which emphasizesthe tongue contour (see Fig.4).The basic steps to combine and exploit the differ-ent articulatory data modalities are the following:Conversion to the EM coordinate system. The 3Dpositions of three EM sensors (behind the ears andon the probe) at the stereo coordinate system are ap-proximated using the stereo images, at a number oftime instances. In this way, we manage to have thesame positions expressed in both the stereo and EMsensor coordinate system. So, the coordinate trans-formation from the one system to the other can becomputed by registration.Compensation of the head movement. A coordi-nate system whose position and orientation are fixedin time w.r.t. the speaker’s head (which we refer toas Head Coordinate System - HCS) is used. The co-ordinate transformation from the EM system to theHCS is computed by registering the positions of theupper head part markers at every time to the posi-Figure 3: A stereovision image pair, with the trackedface markers. The markers on the upper head part(“+”) are used at the registration, whereas the mark-ers on the lips (“×”) are used for inversion.tions of the same markers at a reference time in-stance. The 3D trajectories of the EM sensors andthe rest of the face markers are expressed at the HCS(see Fig.2). Among these, the trajectories of the lipmarkers are afterwards used for the audiovisual in-version (see Sec.3).Registration of the VT grid on the US images. TheVT semipolar grid is originally configured on the X-ray images and then is positioned in the MRI mid-sagittal slice by registering the contours of the outerVT wall at the two modalities. This grid is extendedto 3D by considering that it is fixed to all MRI slices.Afterwards, it is expressed in the HCS by registeringthe forehead surface that is extracted from the MRIwith the corresponding face markers. Further, the3D position and orientation (in HCS) of the movingUS image plane are recovered using the 6 degrees-of-freedom EM sensor on the US probe, Fig.2. Inthe end, the intersection of the US image plane withthe 3D VT grid is computed for every time instance.Model fitting to tongue points in US data Usingthe filtered US images, each grid line is consideredto intersect the visible part of the tongue contouronly if the maximum image intensity on that lineis above a global threshold. If this is the case, thepoints of the current grid line whose intensities passa line-specific threshold are kept and finally the clos-est point to the outer vocal tract boundary is markedas tongue point (Fig.4-Left).At the last stage of the articulatory parameter ex-traction, the articulatory model is fitted to best matchthe coordinates of the tongue points on the US plane.Essentially we minimize the squared distance of thereconstructed shape from these specific points. Themissing articulatory parameters are considered to benormally distributed with statistical properties de-International Seminar on Speech Production (ISSP 2008), Strasbourg, France, Dec. 2008 3Figure 4: Left: Extraction of tongue points (red dots)on the VT semipolar grid (blue lines), for the sametime instance as in Fig. 2. The corresponding pre-procesed US frame is used. Right: Fitting the artic-ulatory model to tongue points “∗” determined froman ultrasound image. The green line corresponds tothe fitted model while the red line is the mean shape.termined from the model training data. Consider-ing these distributions as priors, the problem is thensolved by Bayesian inference. A result of the ap-plied model fitting process is shown in Fig. 4-Right.3 Inversion, Results and DiscussionHaving extracted articulatory parameters for thewhole dataset, as a result of the model fitting pro-cess, we then train an HMM-based audiovisual-articulatory mapping similar to the one describedin [4]. Acoustic information is represented by 16MFCCs while visual information is given in theform of the 3D coordinates of 8 markers on thespeaker’s lips. Phoneme-based multistream HMMsare trained and for each state a linear mapping be-tween the audiovisual and the articulatory parame-ters is determined. For inversion, the optimal HMMstate is first found via Viterbi decoding from the se-quence of the audiovisual observations. Then theunderlying sequence of articulatory parameters, i.e.,VT shapes, is computed as a result of Maximum APosteriori (MAP) estimation [4, 3].Recovered VT shapes along with the referenceones are shown in Fig.5, for the phonemes /i/ and/s/ respectively. We are currently working on the in-troduction of an articulatory synthesizer in the pro-posed framework in order to further improve the in-version results in an analysis-by-synthesis manner.Detailed evaluation is under way. Though safe con-clusions for the quality of the results cannot yet bedrawn, it is clear that the main benefit of the pro-posed framework is that it exploits a rich variety ofPhoneme i Phoneme sFigure 5: Recovered vocal tract shapes for thephonemes /i/ and /s/. The reference shapes are alsogiven in dashed lines.multimodal articulatory data, allows improved flex-ibility and can be more widely applicable.Acknowledgements The authors would like to thankY. Laprie and E. Kerrien at Loria for their help in mak-ing the articulatory data available and all the ASPI partic-ipants for very fruitful discussions. This work was sup-ported by European Community FP6 FET ASPI (contractno. 021324) and partially by grant ΠENEΔ-2003-EΔ866{co-financed by E.U.-European Social Fund (80%) andthe Greek Ministry of Development-GSRT (20%)}.References[1] M. Aron, A. Roussos, M.-O. Berger, E. Kerrien, andP. Maragos. Multimodality acquisition of articula-tory data and processing. In EUSIPCO, 2008.[2] P. Badin and A. Serrurier. Three-dimensional linearmodeling of tongue: Articulatory data and models.In Proc. of ISSP, 2006.[3] S. Hiroya and M. Honda. Estimation of articulatorymovements from speech acoustics using an HMM-based speech production model. IEEE Trans. on Sp.and Au. Proc., 12(2):175–185, March 2004.[4] A. Katsamanis, G. Papandreou, and P. Maragos.Face active appearance modeling and speech acous-tic information to recover articulation. To appear inIEEE Trans. on Au., Sp. and Lang. Proc., 2009.[5] H. Kjellstrom, O. Engwall. Audiovisual-to-articulatory inversion. to appear in Sp. Comm.,2008.[6] S. Maeda. Compensatory articulation duringspeech: evidence from the analysis and synthesisof vocal tract shapes using an articulatory model,chapter in Speech Production and Speech Model-ing, pages 131–149 , 1990.[7] P. Mokhtari, T. Kitamura, H. Takemoto, andK. Honda. Principal components of vocal-tract areafunctions and inversion of vowels by linear regres-sion of cepstrum coefficients. Journ. of Phonetics,35(1):20–39, Jan. 2007.[8] S. Ouni and Y. Laprie. Modeling the articulatoryspace using a hypercube codebook for acoustic-to-articulatory inversion. Journ. of the Ac. Soc. ofAmerica, 118(1):444–460, 2005.[9] J. Schoentgen and S. Ciocea. Kinematic formant-to-area mapping. Sp. Comm., 21(4):227–244, 1997.International Seminar on Speech Production (ISSP 2008), Strasbourg, France, Dec. 2008 4

Inversion from Audiovisual Speech to Articulatory Information by Exploiting Multimodal Data

HAL-Rennes 1

https://hal.inria.fr/inria-00327031/document

Inversion from Audiovisual Speech to Articulatory Information by Exploiting Multimodal Data

Abstract

Similar works

Full text

Available Versions

INRIA a CCSD electronic archive server

HAL-Rennes 1