
    King's speech: pronounce a foreign language with style

    Computer-assisted pronunciation training requires strategies that capture the attention of learners and guide them along the learning pathway. In this paper, we introduce an immersive storytelling scenario for creating appropriate learning conditions. The proposed learning interaction is orchestrated by a spoken karaoke. We motivate the concept of the spoken karaoke and describe our design. Driven by the requirements of the proposed scenario, we suggest a modular architecture designed for immersive learning applications. We present our prototype system and our approach to processing the spoken and visual interaction modalities. Finally, we discuss how technological challenges can be addressed in order to enable the learner's self-evaluation.

    Lip syncing method for realistic expressive 3D face model

    Lip synchronization of a 3D face model is now being used in a multitude of important fields. It brings a more human, social and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism can be employed in demanding applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. This research proposed a lip syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the myriad shapes the human face experiences during speech, and a method to produce the correct lip shape at the correct time. The paper presented a 3D face model designed to support lip syncing that aligns with an input audio file. It deforms using a Raised Cosine Deformation (RCD) function that is grafted onto the input facial geometry. The face model was based on the MPEG-4 Facial Animation (FA) standard. This paper proposed a method to animate the 3D face model over time to create animated lip syncing using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. The proposed research integrated emotions, based on the Ekman model and Plutchik’s wheel, with emotive eye movements by implementing the Emotional Eye Movements Markup Language (EEMML) to produce a realistic 3D face model. © 2017 Springer Science+Business Media New York
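
    The abstract does not spell out the exact form of the Raised Cosine Deformation, so the following Python sketch is only a hedged illustration of the general idea: mouth-region vertices are displaced along a direction with a raised-cosine falloff. The function name, parameters and sample values are assumptions, not the paper's implementation.

        # Illustrative raised-cosine deformation on a mesh region; the centre,
        # radius and amplitude values are invented, not taken from the paper.
        import numpy as np

        def raised_cosine_deform(vertices, centre, direction, amplitude, radius):
            """Displace vertices within `radius` of `centre` along `direction`,
            weighted by a raised cosine (1 at the centre, 0 at the rim)."""
            offsets = vertices - centre                       # (N, 3)
            dist = np.linalg.norm(offsets, axis=1)            # distance to the centre
            weight = np.where(dist < radius,
                              0.5 * (1.0 + np.cos(np.pi * dist / radius)),
                              0.0)
            return vertices + amplitude * weight[:, None] * direction

        # Example: pull lower-lip vertices downward to approximate an open-mouth viseme.
        verts = np.random.rand(500, 3)                        # stand-in facial geometry
        opened = raised_cosine_deform(verts,
                                      centre=np.array([0.0, -0.1, 0.1]),
                                      direction=np.array([0.0, -1.0, 0.0]),
                                      amplitude=0.02, radius=0.15)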

    The computer synthesis of expressive three-dimensional facial character animation.

    This research is concerned with the design, development and implementation of three-dimensional computer-generated facial images capable of expression, gesture and speech. A review of previous work in chapter one shows that, to date, the model of computer-generated faces has been one in which construction and animation were not separated, and which therefore possessed only a limited expressive range. It is argued in chapter two that the physical description of the face cannot be seen as originating from a single generic mould. Chapter three therefore describes data acquisition techniques employed in the computer generation of free-form surfaces which are applicable to three-dimensional faces. Expressions are the result of the distortion of the surface of the skin by the complex interactions of bone, muscle and skin. Chapter four demonstrates, with static images and short animation sequences on video, that a muscle model process algorithm can simulate the primary characteristics of the facial muscles. Three-dimensional speech synchronization was the most complex problem to achieve effectively. Chapter five describes two successful approaches: the direct mapping of mouth shapes in two dimensions onto the model in three dimensions, and geometric distortions of the mouth created by the contraction of specified muscle combinations. Chapter six describes the implementation of software for this research and argues the case for a parametric approach. Chapter seven is concerned with the control of facial articulations and discusses a more biological approach to these. Finally, chapter eight draws conclusions from the present research and suggests further extensions.
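
    Chapter four's muscle model is only summarised above, so the sketch below is a non-authoritative illustration of the general family of techniques it belongs to: a linear muscle vector pulling skin vertices toward its attachment point with a smooth falloff. Every parameter and the falloff shape are assumptions for illustration.

        # Hedged sketch of a linear (vector) muscle acting on skin vertices;
        # the zone of influence and falloff terms are illustrative assumptions.
        import numpy as np

        def linear_muscle(vertices, head, tail, contraction, influence_radius):
            """Pull vertices toward the muscle head, weighted by how close they
            lie to the head-tail axis and how far along it they sit."""
            axis = tail - head
            axis_len = np.linalg.norm(axis)
            axis_dir = axis / axis_len

            to_verts = vertices - head                        # (N, 3)
            along = to_verts @ axis_dir                       # projection on the axis
            radial = np.linalg.norm(to_verts - np.outer(along, axis_dir), axis=1)

            # Cosine falloff along the muscle and away from its axis,
            # zero outside the zone of influence.
            falloff = (np.cos(0.5 * np.pi * np.clip(along / axis_len, 0.0, 1.0))
                       * np.cos(0.5 * np.pi * np.clip(radial / influence_radius, 0.0, 1.0)))
            inside = (along >= 0.0) & (along <= axis_len) & (radial <= influence_radius)
            falloff = np.where(inside, falloff, 0.0)

            # On contraction, displace each vertex toward the attachment point.
            lengths = np.maximum(np.linalg.norm(to_verts, axis=1, keepdims=True), 1e-8)
            pull_dir = -to_verts / lengths
            return vertices + contraction * falloff[:, None] * pull_dir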

    A comprehensive multimodal dataset for contactless lip reading and acoustic analysis

    Small-scale motion detection using non-invasive remote sensing techniques has recently garnered significant interest in the field of speech recognition. This dataset paper aims to facilitate the enhancement and restoration of speakers' speech information from diverse data sources. In this paper, we introduce a novel multimodal dataset based on Radio Frequency, visual, text, audio, laser and lip landmark information, also called RVTALL. Specifically, the dataset consists of 7.5 GHz Channel Impulse Response (CIR) data from ultra-wideband (UWB) radars, 77 GHz frequency modulated continuous wave (FMCW) data from millimeter wave (mmWave) radar, visual and audio information, lip landmarks and laser data, offering a unique multimodal approach to speech recognition research. In addition, a depth camera is used to record the subject's lip landmarks alongside the voice. Approximately 400 minutes of annotated speech profiles are provided, collected from 20 participants speaking 5 vowels, 15 words, and 16 sentences. The dataset has been validated and has potential for the investigation of lip reading and multimodal speech recognition.
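
    The abstract does not describe RVTALL's file layout or field names, so the Python sketch below is purely a hypothetical illustration of how one synchronised sample spanning the listed modalities could be organised; every field name and shape is an assumption.

        # Hypothetical container for one synchronised RVTALL-style sample; the
        # field names and types are assumptions for illustration only.
        from dataclasses import dataclass
        import numpy as np

        @dataclass
        class SpeechSample:
            subject_id: int            # one of the 20 participants
            utterance: str             # vowel, word, or sentence label
            uwb_cir: np.ndarray        # 7.5 GHz UWB channel impulse response frames
            mmwave_fmcw: np.ndarray    # 77 GHz FMCW mmWave radar frames
            laser: np.ndarray          # laser vibration signal
            audio: np.ndarray          # microphone waveform
            video: np.ndarray          # RGB frames of the lower face
            lip_landmarks: np.ndarray  # per-frame lip landmark coordinates
            transcript: str            # ground-truth text

        def vowel_subset(samples):
            """Filter a loaded corpus down to the 5-vowel utterances."""
            vowels = {"a", "e", "i", "o", "u"}
            return [s for s in samples if s.utterance in vowels]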

    Closing the gap: human factors in cross-device media synchronization

    The continuing growth in the mobile phone arena, particularly in terms of device capabilities and ownership, is having a transformational impact on media consumption. It is now possible to consider orchestrated multi-stream experiences delivered across many devices, rather than the playback of content from a single device. However, there are significant challenges in realising such a vision, particularly around the management of synchronicity between associated media streams. This is compounded by the heterogeneous nature of user devices, the networks upon which they operate, and the perceptions of users. This paper describes IMSync, an open inter-stream synchronisation framework that is QoE-aware. IMSync adopts efficient monitoring and control mechanisms, alongside a QoE perception model that has been derived from a series of subjective user experiments. Based on an observation of lag, IMSync is able to use this impact model to determine an appropriate strategy to catch up with playback whilst minimising the potential detrimental impact on a user's QoE. The impact model adopts a balanced approach: trading off the potential impact on QoE of initiating a re-synchronisation process against that of retaining the current level of non-synchronicity, in order to maintain high levels of QoE. A series of experiments demonstrate the potential of the framework as a basis for enabling new, immersive media experiences.
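
    The trade-off described above is only sketched at a high level in the abstract; the snippet below is a schematic, hedged reading of that decision rule, where both cost functions, their constants and the catch-up rate are invented placeholders rather than IMSync's actual QoE model.

        # Schematic QoE-aware resynchronisation decision; the cost functions and
        # constants are placeholders, not the model derived in the paper.

        def cost_of_lag(lag_ms: float) -> float:
            """Assumed penalty for leaving streams out of sync (grows quickly with lag)."""
            return (lag_ms / 100.0) ** 2

        def cost_of_resync(lag_ms: float, catchup_rate: float = 1.25) -> float:
            """Assumed penalty for a catch-up phase: a fixed disruption cost plus a
            term proportional to how long the catch-up would last."""
            catchup_duration_ms = lag_ms / (catchup_rate - 1.0)
            return 0.5 + 0.002 * catchup_duration_ms

        def should_resync(lag_ms: float) -> bool:
            """Re-synchronise only when staying out of sync hurts QoE more than
            the disruption of catching up."""
            return cost_of_lag(lag_ms) > cost_of_resync(lag_ms)

        # With these placeholder costs, small lags are tolerated and larger ones
        # trigger a catch-up.
        for lag in (50, 200, 400):
            print(lag, should_resync(lag))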

    A semantic memory bank assisted by an embodied conversational agent for mobile devices

    Alzheimer’s disease is a type of dementia that causes memory loss and seriously interferes with intellectual abilities. It currently has no cure, and the therapeutic efficacy of existing medication is limited. However, there is evidence that non-pharmacological treatments can be useful to stimulate cognitive abilities. In the last few years, several studies have focused on describing and understanding how Virtual Coaches (VC) could be key drivers for health promotion in home care settings. The use of VCs is gaining increased attention in considerations of medical innovation. In this paper, we propose an approach that exploits semantic technologies and Embodied Conversational Agents to help patients train cognitive abilities using mobile devices. In this work, semantic technologies are used to provide knowledge about the memory of a specific person, exploiting the structured data stored in a linked data repository and taking advantage of the flexibility provided by ontologies to define search domains and expand the agent’s capabilities. Our Memory Bank Embodied Conversational Agent (MBECA) is used to interact with the patient and ease the interaction with new devices. The framework is oriented to Alzheimer’s patients, caregivers, and therapists.
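
    The ontology and linked data repository behind MBECA are not detailed in the abstract, so the snippet below only illustrates the general mechanism with rdflib and SPARQL over a tiny invented personal-memory graph; the vocabulary, predicates and facts are all hypothetical.

        # Hypothetical personal-memory lookup over a small RDF graph; the
        # namespace and facts are invented for illustration, not MBECA's ontology.
        from rdflib import Graph, Literal, Namespace, URIRef

        MEM = Namespace("http://example.org/memory#")

        graph = Graph()
        person = URIRef("http://example.org/people/maria")
        graph.add((person, MEM.grandchildName, Literal("Lucia")))
        graph.add((person, MEM.weddingYear, Literal(1968)))

        # The conversational agent could ground a dialogue turn in facts like these.
        results = graph.query("""
            SELECT ?name WHERE {
                ?p <http://example.org/memory#grandchildName> ?name .
            }
        """)
        for row in results:
            print(f"Your grandchild's name is {row.name}.")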

    Lip syncing method for realistic expressive three-dimensional face model

    Lip synchronization of a 3D face model is now being used in a multitude of important fields. It brings a more human and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism can be employed in demanding applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. Thus, this study proposes a lip syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the movement of face muscles during speech and a method to produce the correct lip shape at the correct time. The 3D face model is designed based on the MPEG-4 facial animation standard to support lip syncing that is aligned with an input audio file. It deforms using a Raised Cosine Deformation function that is grafted onto the input facial geometry. This study also proposes a method to animate the 3D face model over time to create animated lip syncing using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. Finally, this study integrates emotions by considering both the Ekman model and Plutchik’s wheel with emotive eye movements by implementing the Emotional Eye Movements Markup Language to produce a realistic 3D face model. The experimental results show that the proposed model can generate visually satisfactory animations with a Mean Square Error of 0.0020 for the neutral expression, 0.0024 for happy, 0.0020 for angry, 0.0030 for fear, 0.0026 for surprise, 0.0010 for disgust, and 0.0030 for sad.
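
    The ProPhone viseme set itself is not listed in the abstract, so the sketch below illustrates only the general timing mechanism: phoneme timings are mapped to viseme keyframes, and each viseme is faded in over a short window before it starts. The mapping table, timings and blend length are invented.

        # Illustrative phoneme-to-viseme keyframing; the mapping and timings are
        # invented examples, not the ProPhone set used in the study.

        # Assumed reduced phoneme -> viseme mapping (hypothetical).
        PHONEME_TO_VISEME = {"p": "closed", "aa": "open", "f": "lip_bite", "uw": "round"}

        def viseme_keyframes(phoneme_timings):
            """Turn (phoneme, start_sec, end_sec) triples into (start, viseme) keyframes."""
            return [(start, PHONEME_TO_VISEME.get(ph, "neutral"))
                    for ph, start, end in phoneme_timings]

        def active_viseme_weights(keyframes, t, blend=0.05):
            """Return the visemes active at time t, fading each one in linearly
            over a `blend`-second window before its start time."""
            weights = {}
            for i, (start, vis) in enumerate(keyframes):
                end = keyframes[i + 1][0] if i + 1 < len(keyframes) else start + blend
                if start - blend <= t < end:
                    fade_in = min(1.0, (t - (start - blend)) / blend)
                    weights[vis] = max(weights.get(vis, 0.0), fade_in)
            return weights

        timings = [("p", 0.00, 0.08), ("aa", 0.08, 0.25), ("uw", 0.25, 0.40)]
        print(active_viseme_weights(viseme_keyframes(timings), t=0.10))  # {'open': 1.0}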

    MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model

    Face-to-face communication is a common scenario involving the roles of speaker and listener. Most existing research methods focus on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker while expressing an attitude or viewpoint, maintaining diversity in interaction patterns and accuracy in the listener's identity information. To achieve this goal, we propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs a probabilistic denoising diffusion model to predict diverse head pose and expression features. In order to produce multi-faceted responses to the speaker video while maintaining accurate listener identity preservation, we design a Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer finetuned with an identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net achieves multi-faceted responses not only in diversity and speaker identity information but also in attitude and viewpoint expression.
    Comment: Accepted by ACM MM 202
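
    MFR-Net's architecture is only summarised above, so the sketch below is a non-authoritative structural outline of how such a pipeline could fit together: a generic DDPM-style reverse process samples pose/expression features conditioned on speaker features, which are then fused with listener identity features and rendered. The denoiser, aggregation module, renderer and noise schedule are all placeholders, not the published implementation.

        # Schematic outline of a diffusion-driven listener-head pipeline; every
        # network below is a placeholder callable, not the MFR-Net implementation.
        import numpy as np

        def denoise_step(x_t, t, speaker_feats, predict_noise, alphas, alpha_bars):
            """One generic DDPM-style reverse step on pose/expression features."""
            eps = predict_noise(x_t, t, speaker_feats)         # learned noise estimate
            a_t, ab_t = alphas[t], alpha_bars[t]
            mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)
            noise = np.random.randn(*x_t.shape) if t > 0 else 0.0
            return mean + np.sqrt(1.0 - a_t) * noise

        def generate_listener_video(speaker_feats, listener_id_feats,
                                    predict_noise, aggregate, render,
                                    steps=50, feat_dim=64):
            """Sample motion features with reverse diffusion, fuse them with the
            listener identity features, then render the listening-head frames."""
            betas = np.linspace(1e-4, 0.02, steps)             # assumed noise schedule
            alphas = 1.0 - betas
            alpha_bars = np.cumprod(alphas)
            x = np.random.randn(feat_dim)                      # start from pure noise
            for t in reversed(range(steps)):
                x = denoise_step(x, t, speaker_feats, predict_noise, alphas, alpha_bars)
            fused = aggregate(x, listener_id_feats)            # identity-aware fusion
            return render(fused)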