    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that takes acoustic features as input, and one that takes a phonetic transcription as input. Both synthesizers are trained on the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often-ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
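
    To illustrate the last point, the following is a minimal sketch of a dynamic time warp (DTW) cost between a synthesized and a ground-truth parameter sequence. It assumes each frame is a vector of AAM parameters, uses Euclidean distance between frames, and allows the three standard DTW steps; the function name `dtw_cost` and these choices are assumptions made for illustration, not details taken from the paper.

```python
import math


def dtw_cost(synth, truth):
    """Cumulative cost of the cheapest DTW alignment of two sequences.

    Each sequence is a list of frames; each frame is a list of floats
    (for example AAM parameters). Frame distance is Euclidean (assumed).
    """
    n, m = len(synth), len(truth)
    if n == 0 or m == 0:
        raise ValueError("both sequences need at least one frame")
    inf = float("inf")
    # acc[i][j] holds the minimal cost of aligning synth[:i] with truth[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(synth[i - 1], truth[j - 1])
            # Allowed steps: advance synth only, truth only, or both.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

    Unlike a frame-by-frame error, this cost does not penalize a sequence that is correct in shape but slightly stretched or shifted in time, which may be one reason it tracks viewer judgements more closely.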

    Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic

    This paper investigates the use of hidden Markov models (HMM) for Modern Standard Arabic speech synthesis. HMM-based speech synthesis systems require a description of each speech unit with a set of contextual features that specifies phonetic, phonological and linguistic aspects. To apply this method to the Arabic language, a study of its particularities was conducted to extract suitable contextual features. Two phenomena are highlighted: vowel quantity and gemination. This work focuses on how to model geminated consonants (resp. long vowels), either considering them as fully-fledged phonemes or as the same phonemes as their simple (resp. short) counterparts but with a different duration. Four modelling approaches have been proposed for this purpose. Results of subjective and objective evaluations show that there is no important difference between differentiating modelling units associated to geminated consonants (resp. long vowels) from modelling units associated to simple consonants (resp. short vowels) and merging them, as long as gemination and vowel quantity information is included in the set of features.

    Arabic Speech Corpus

    Identifying prosodic prominence patterns for English text-to-speech synthesis

    This thesis proposes to improve and enrich the expressiveness of English Text-to-Speech (TTS) synthesis by identifying and generating natural patterns of prosodic prominence. In most state-of-the-art TTS systems the prediction from text of prosodic prominence relations between words in an utterance relies on features that very loosely account for the combined effects of syntax, semantics, word informativeness and salience on prosodic prominence. To improve prosodic prominence prediction we first follow up the classic approach in which prosodic prominence patterns are flattened into binary sequences of pitch accented and pitch unaccented words. We propose and motivate statistical and syntactic dependency based features that are complementary to the most predictive features proposed in previous works on automatic pitch accent prediction and show their utility on both read and spontaneous speech. Different accentuation patterns can be associated to the same sentence. Such variability raises the question of how to evaluate pitch accent predictors when more than one pattern is allowed. We carry out a study on the variability of prosodic symbols on a speech corpus where different speakers read the same text and propose an information-theoretic definition of optionality of symbolic prosodic events that leads to a novel evaluation metric in which prosodic variability is incorporated as a factor affecting prediction accuracy. We additionally propose a method to take advantage of the optionality of prosodic events in unit-selection speech synthesis. To better account for the tight links between the prosodic prominence of a word and the discourse/sentence context, part of this thesis goes beyond the accent/no-accent dichotomy and is devoted to a novel task, the automatic detection of contrast, where contrast is meant as an Information Structure relation that ties two words that explicitly contrast with each other. 
This task is mainly motivated by the fact that contrastive words tend to be prosodically marked with particularly prominent pitch accents. The identification of contrastive word pairs is achieved by combining lexical information, syntactic information (which mainly aims to identify the syntactic parallelism that often activates contrast) and semantic information (mainly drawn from the WordNet semantic lexicon), within a Support Vector Machines classifier. Once we have identified patterns of prosodic prominence we propose methods to incorporate such information in TTS synthesis and test its impact on synthetic speech naturalness through large-scale perceptual experiments. The results of these experiments cast some doubts on the utility of a simple accent/no-accent distinction in Hidden Markov Model based speech synthesis while highlighting the importance of contrastive accents.
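
    The abstract does not give the information-theoretic definition of optionality. As a rough illustration only, one plausible reading is the binary entropy of the accent decision for a word across the speakers who read the same text: 0 bits when all speakers agree, 1 bit when they split evenly. The function name `accent_entropy` and the choice of binary Shannon entropy are assumptions, not the thesis's definition.

```python
import math


def accent_entropy(labels):
    """Binary entropy, in bits, of accent labels for one word.

    `labels` holds one entry per speaker: 1 if that speaker accented
    the word, 0 otherwise. This is an assumed stand-in for the
    thesis's optionality measure, which the abstract does not specify.
    """
    if not labels:
        raise ValueError("need at least one speaker label")
    p = sum(labels) / len(labels)  # proportion of speakers who accent the word
    if p == 0.0 or p == 1.0:
        return 0.0  # full agreement: the event is not optional
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

    Under this reading, a predictor's error on a high-entropy word could be down-weighted when scoring, since human speakers themselves disagree there.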

    Intonation Modelling for Speech Synthesis and Emphasis Preservation

    Speech-to-speech translation is a framework which recognises speech in an input language, translates it to a target language and synthesises speech in this target language. In such a system, variations in the speech signal which are inherent to natural human speech are lost, as the information goes through the different building blocks of the translation process. The work presented in this thesis addresses aspects of speech synthesis which are lost in traditional speech-to-speech translation approaches. The main research axis of this thesis is the study of prosody for speech synthesis and emphasis preservation. A first investigation of regional accents of spoken French is carried out to understand the sensitivity of native listeners with respect to accented speech synthesis. Listening tests show that standard adaptation methods for speech synthesis are not sufficient for listeners to perceive accentedness. On the other hand, combining adaptation with original prosody allows perception of accents. Addressing the need for a more suitable prosody model, a physiologically plausible intonation model is proposed. Inspired by the command-response model, it has basic components which can be related to muscle responses to nerve impulses. These components are assumed to be a representation of muscle control of the vocal folds. A motivation for such a model is its theoretical language independence, based on the fact that humans share the same vocal apparatus. An automatic parameter extraction method which integrates a perceptually relevant measure is proposed with the model. This approach is evaluated and compared with the standard command-response model. Two corpora including sentences with emphasised words are presented, in the context of the SIWIS project. The first is a multilingual corpus with speech from multiple speakers; the second is a high quality speech synthesis oriented corpus from a professional speaker. Two broad uses of the model are evaluated. 
The first shows that it is difficult to predict model parameters; however, the second shows that parameters can be transferred in the context of emphasis synthesis. A relation between model parameters and linguistic features such as stress and accent is demonstrated. Similar observations are made between the parameters and emphasis. Following this, we investigate the extraction of atoms in emphasised speech and their transfer to neutral speech, which turns out to elicit emphasis perception. Using clustering methods, this is extended to the emphasis of other words, using linguistic context. This approach is validated by listening tests, in the case of English.

    Animation and Interaction of Responsive, Expressive, and Tangible 3D Virtual Characters

    This thesis is framed within the field of 3D Character Animation. Virtual characters are used in many Human Computer Interaction applications such as video games and serious games. Within these virtual worlds they move and act in similar ways to humans controlled by users through some form of interface or by artificial intelligence. This work addresses the challenges of developing smoother movements and more natural behaviors driving motions in real-time, intuitively, and accurately. The interaction between virtual characters and intelligent objects will also be explored. With these subjects researched the work will contribute to creating more responsive, expressive, and tangible virtual characters. The navigation within virtual worlds uses locomotion such as walking, running, etc. To achieve maximum realism, actors' movements are captured and used to animate virtual characters. This is the philosophy of motion graphs: a structure that embeds movements where the continuous motion stream is generated from concatenating motion pieces. However, locomotion synthesis, using motion graphs, involves a tradeoff between the number of possible transitions between different kinds of locomotion, and the quality of these, meaning smooth transition between poses. To overcome this drawback, we propose the method of progressive transitions using Body Part Motion Graphs (BPMGs). This method deals with partial movements, and generates specific, synchronized transitions for each body part (group of joints) within a window of time. Therefore, the connectivity within the system is not linked to the similarity between global poses allowing us to find more and better quality transition points while increasing the speed of response and execution of these transitions in contrast to standard motion graphs method. Secondly, beyond getting faster transitions and smoother movements, virtual characters also interact with each other and with users by speaking. 
This interaction requires the creation of gestures appropriate to the speech that they reproduce. Gestures are the nonverbal language that accompanies voiced language. The credibility of virtual characters when speaking is linked to the naturalness of their movements and to how well these match the voice, especially its intonation. Consequently, we analyzed the relationship between gestures and speech, and the consequent generation of gestures according to that speech. We defined intensity indicators for both gestures (GSI, Gesture Strength Indicator) and speech (PSI, Pitch Strength Indicator). We studied the relationship in time and intensity of these cues in order to establish synchronicity and intensity rules. Later we adapted the mentioned rules to select the appropriate gestures for the speech input (tagged text from speech signal) in the Gesture Motion Graph (GMG). The evaluation of the resulting animations shows the importance of relating the intensity of speech and gestures to generate believable animations beyond time synchronization. Subsequently, we present a system that automatically generates gestures and facial animation from a speech signal: BodySpeech. This system also includes animation improvements such as increased use of input data and more flexible time synchronization, and new features like editing the style of output animations. In addition, facial animation also takes into account speech intonation. Finally, we have moved virtual characters from virtual environments to the physical world in order to explore their interaction possibilities with real objects. To this end, we present AvatARs, virtual characters that have tangible representation and are integrated into reality through augmented reality apps on mobile devices. Users choose a physical object to manipulate in order to control the animation. They can select and configure the animation, and the object also serves as a support for the virtual character represented. 
Then, we explored the interaction of AvatARs with intelligent physical objects like the Pleo social robot. Pleo is used to assist hospitalized children in therapy or simply for playing. Despite its benefits, there is a lack of emotional relationship and interaction between the children and Pleo which makes children lose interest eventually. This is why we have created a mixed reality scenario where Vleo (AvatAR as Pleo, virtual element) and Pleo (real element) interact naturally. This scenario has been tested and the results conclude that AvatARs enhances children's motivation to play with Pleo, opening a new horizon in the interaction between virtual characters and robots.


    In this article, we describe and interpret a set of acoustic and linguistic features that characterise emotional/emotion-related user states – confined to the one database processed: four classes in a German corpus of children interacting with a pet robot. To this end, we collected a very large feature vector consisting of more than 4000 features extracted at different sites. We performed extensive feature selection (Sequential Forward Floating Search) for seven acoustic and four linguistic types of features, ending up with a small number of ‘most important’ features, which we try to interpret by discussing the impact of different feature and extraction types. We establish different measures of impact and discuss the mutual influence of acoustics and linguistics.
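
    As an illustration of the selection procedure named above, here is a minimal sketch of Sequential Forward Floating Search. It assumes a caller-supplied `score` function (for example cross-validated classification accuracy on a feature subset) and a target subset size `k`; the function name `sffs`, its signature, and the tie-breaking behaviour are assumptions for illustration, not the article's implementation.

```python
def sffs(features, score, k):
    """Select k features with Sequential Forward Floating Search.

    `features` is a list of feature names and `score(subset)` returns a
    number that is higher for better subsets (hypothetical interface).
    """
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and len(features)")
    selected = []
    best = {}  # best score seen so far for each subset size
    while len(selected) < k:
        # Forward step: add the feature that raises the score the most.
        remaining = [f for f in features if f not in selected]
        add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        size = len(selected)
        best[size] = max(best.get(size, float("-inf")), score(selected))
        # Floating step: drop the least useful feature as long as the
        # smaller subset beats the best subset of that size seen so far.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            reduced_score = score(reduced)
            if reduced_score > best.get(len(reduced), float("-inf")):
                selected = reduced
                best[len(reduced)] = reduced_score
            else:
                break
    return selected
```

    The backward step is what distinguishes the floating variant from plain forward selection: a feature chosen early can be discarded later if it becomes redundant once other features have been added.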

    Modelling talking human faces

    This thesis investigates a number of new approaches for visual speech synthesis using data-driven methods to implement a talking face. The main contributions in this thesis are the following. The accuracy of the shared Gaussian process latent variable model (SGPLVM) built using the active appearance model (AAM) and relative spectral transform-perceptual linear prediction (RASTA-PLP) features is improved by employing a more accurate AAM. This is the first study to report that using a more accurate AAM improves the accuracy of SGPLVM. Objective evaluation via reconstruction error is performed to compare the proposed approach against previously existing methods. In addition, it is shown experimentally that the accuracy of AAM can be improved by using a larger number of landmarks and/or larger number of samples in the training data. The second research contribution is a new method for visual speech synthesis utilising a fully Bayesian method, namely the manifold relevance determination (MRD), for modelling dynamical systems through probabilistic non-linear dimensionality reduction. This is the first time MRD was used in the context of generating talking faces from the input speech signal. The expressive power of this model is in the ability to consider non-linear mappings between audio and visual features within a Bayesian approach. An efficient latent space has been learnt using a fully Bayesian latent representation relying on a conditional nonlinear independence framework. In the SGPLVM the structure of the latent space cannot be automatically estimated because of using a maximum likelihood formulation. In contrast to SGPLVM the Bayesian approaches allow the automatic determination of the dimensionality of the latent spaces. The proposed method compares favourably against several other state-of-the-art methods for visual speech generation, which is shown in quantitative and qualitative evaluation on two different datasets. 
Finally, the possibility of incremental learning of AAM for inclusion in the proposed MRD approach for visual speech generation is investigated. The quantitative results demonstrate that using MRD in conjunction with incremental AAMs produces only slightly less accurate results than using batch methods. These results support a way of training these kinds of models on computers with limited resources, for example in mobile computing. Overall, this thesis proposes several improvements to the current state-of-the-art in generating talking faces from a speech signal, leading to perceptually more convincing results.

    HMM-based Speech Synthesis from Audio Book Data

    In contrast to hand-crafted speech databases, which contain short out-of-context sentences in a fairly unemphatic speech style, audio books contain rich prosody including intonation contours, pitch accents and phrasing patterns, which is a good prerequisite for building a natural sounding synthetic voice. The following paper will give an overview of the steps that are involved in building a synthetic voice from audio book data. After an introduction to the theory of HMM-based speech synthesis, the properties of the speech database will be described in detail. It will be argued that it is necessary to model specific properties of the database, such as higher pitched speech or questions, to achieve a better quality synthetic voice. Furthermore, the acoustic modelling of these properties will be explained in detail. Finally, the synthetic voice is evaluated on the basis of an online listening test.

    Predicting Head Pose From Speech

    Speech animation, the process of animating a human-like model to give the impression it is talking, most commonly relies on the work of skilled animators, or performance capture. These approaches are time consuming, expensive, and lack the ability to scale. This thesis develops algorithms for content driven speech animation; models that learn visual actions from data without semantic labelling, to predict realistic speech animation from recorded audio. We achieve these goals by first forming a multi-modal corpus that represents the style of speech we want to model; speech that is natural, expressive and prosodic. This allows us to train deep recurrent neural networks to predict compelling animation. We first develop methods to predict the rigid head pose of a speaker. Predicting the head pose of a speaker from speech is not wholly deterministic, so our methods provide a large variety of plausible head pose trajectories from a single utterance. We then apply our methods to learn how to predict the head pose of the listener while in conversation, using only the voice of the speaker. Finally, we show how to predict the lip sync, facial expression, and rigid head pose of the speaker, simultaneously, solely from speech.
