King's speech: pronounce a foreign language with style
Computer-assisted pronunciation training requires strategies that capture the attention of learners and guide them along the learning pathway. In this paper, we introduce an immersive storytelling scenario for creating appropriate learning conditions. The proposed learning interaction is orchestrated by a spoken karaoke; we motivate the concept of the spoken karaoke and describe our design. Driven by the requirements of the proposed scenario, we suggest a modular architecture designed for immersive learning applications. We present our prototype system and our approach for processing the spoken and visual interaction modalities. Finally, we discuss how technological challenges can be addressed in order to enable the learner's self-evaluation.
Lip syncing method for realistic expressive 3D face model
Lip synchronization of a 3D face model is now used in a multitude of important fields. It brings a more human, social and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema, yet authoring lip syncing with complex and subtle expressions remains difficult and fraught with problems. This research proposes a lip-syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the myriad shapes the human face assumes during speech, and a method to produce the correct lip shape at the correct time. The paper presents a 3D face model designed to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation (RCD) function that is grafted onto the input facial geometry. The face model is based on the MPEG-4 Facial Animation (FA) standard. The paper also proposes a method to animate the 3D face model over time, creating animated lip syncing from a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. The research integrates emotions by combining the Ekman model and Plutchik's wheel with emotive eye movements, implemented via the Emotional Eye Movements Markup Language (EEMML), to produce a realistic 3D face model. © 2017 Springer Science+Business Media New York
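The abstract does not give the exact form of the Raised Cosine Deformation, so the sketch below is only a plausible illustration of the idea: a raised-cosine falloff weight applied to mouth-region vertices within a radius of influence. The vertex layout, radius, centre and displacement direction are assumptions, not the paper's actual parameters.

```python
import numpy as np

def raised_cosine_weight(dist, radius):
    """Raised-cosine falloff: 1 at the centre, 0 at (and beyond) the radius."""
    return 0.5 * (1.0 + np.cos(np.pi * np.clip(dist / radius, 0.0, 1.0)))

def deform_lips(vertices, centre, direction, amplitude, radius):
    """Displace vertices near `centre` along `direction`, scaled by a raised-cosine weight.

    vertices  : (N, 3) array of face-mesh vertex positions
    centre    : (3,) point of maximal deformation (e.g. mid lower lip)
    direction : (3,) unit vector of displacement (e.g. jaw-opening direction)
    """
    dist = np.linalg.norm(vertices - centre, axis=1)
    w = raised_cosine_weight(dist, radius)
    return vertices + amplitude * w[:, None] * direction

# Hypothetical usage: open the mouth slightly for an "ah"-like viseme.
mesh = np.random.rand(500, 3)            # stand-in for a face mesh
lip_centre = np.array([0.0, -0.2, 0.1])  # assumed mouth centre
open_dir = np.array([0.0, -1.0, 0.0])    # assumed opening direction
deformed = deform_lips(mesh, lip_centre, open_dir, amplitude=0.05, radius=0.15)
```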
The computer synthesis of expressive three-dimensional facial character animation.
The present research is concerned with the design, development and implementation of three-dimensional computer-generated facial images capable of expression, gesture and speech.

A review of previous work in chapter one shows that, to date, the model of computer-generated faces has been one in which construction and animation were not separated, and which therefore possessed only a limited expressive range. It is argued in chapter two that the physical description of the face cannot be seen as originating from a single generic mould. Chapter three therefore describes data acquisition techniques employed in the computer generation of free-form surfaces which are applicable to three-dimensional faces.

Expressions are the result of the distortion of the surface of the skin by the complex interactions of bone, muscle and skin. Chapter four demonstrates, with static images and short animation sequences on video, that a muscle model process algorithm can simulate the primary characteristics of the facial muscles.

Three-dimensional speech synchronization was the most complex problem to achieve effectively. Chapter five describes two successful approaches: the direct mapping of two-dimensional mouth shapes onto the three-dimensional model, and geometric distortions of the mouth created by the contraction of specified muscle combinations.

Chapter six describes the implementation of software for this research and argues the case for a parametric approach. Chapter seven is concerned with the control of facial articulations and discusses a more biological approach to these. Finally, chapter eight draws conclusions from the present research and suggests further extensions.
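The thesis abstract does not specify the muscle model algorithm itself. As a rough, hedged illustration of the general technique it refers to, the sketch below implements a simple linear muscle pull in the spirit of conventional facial-muscle models: skin vertices near the muscle's insertion point are drawn toward its bony attachment point with a cosine falloff. All geometry and constants are made up for the example.

```python
import numpy as np

def linear_muscle_pull(vertices, attachment, insertion, contraction, influence_radius):
    """Minimal linear-muscle sketch: vertices near the muscle's insertion point
    are pulled toward its (bone) attachment point, with a cosine falloff.
    This is an illustrative simplification, not the thesis's actual algorithm."""
    pull = attachment - vertices                          # pull direction per vertex
    dist = np.linalg.norm(vertices - insertion, axis=1)   # distance from insertion
    falloff = np.where(dist < influence_radius,
                       np.cos(0.5 * np.pi * dist / influence_radius), 0.0)
    return vertices + contraction * falloff[:, None] * pull

# Hypothetical zygomatic-major-like contraction raising a mouth corner.
face = np.random.rand(300, 3)
moved = linear_muscle_pull(face,
                           attachment=np.array([0.4, 0.5, 0.0]),
                           insertion=np.array([0.2, -0.1, 0.1]),
                           contraction=0.3, influence_radius=0.25)
```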
A comprehensive multimodal dataset for contactless lip reading and acoustic analysis
Small-scale motion detection using non-invasive remote sensing techniques has recently garnered significant interest in the field of speech recognition. This dataset paper aims to facilitate the enhancement and restoration of speech information for speakers from diverse data sources. We introduce a novel multimodal dataset based on radio-frequency, visual, text, audio, laser and lip-landmark information, called RVTALL. Specifically, the dataset consists of 7.5 GHz Channel Impulse Response (CIR) data from ultra-wideband (UWB) radars, 77 GHz frequency-modulated continuous-wave (FMCW) data from millimeter-wave (mmWave) radar, visual and audio information, lip landmarks and laser data, offering a unique multimodal approach to speech recognition research. A depth camera is used to record the subject's lip landmarks alongside the voice recordings. Approximately 400 minutes of annotated speech profiles are provided, collected from 20 participants speaking 5 vowels, 15 words and 16 sentences. The dataset has been validated and supports research into lip reading and multimodal speech recognition.
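To make the multimodal composition concrete, here is a minimal sketch of how one recording from such a dataset could be represented in code. The field names, shapes and helper are hypothetical and do not reflect the published RVTALL schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeechSample:
    """One multimodal recording (illustrative layout only, not the RVTALL schema)."""
    subject_id: int
    transcript: str                 # text prompt (vowel, word or sentence)
    audio: np.ndarray               # mono waveform, e.g. shape (samples,)
    video_frames: np.ndarray        # RGB frames, e.g. shape (T, H, W, 3)
    lip_landmarks: np.ndarray       # per-frame lip points, e.g. shape (T, K, 3)
    uwb_cir: np.ndarray             # UWB channel impulse response frames
    mmwave_fmcw: np.ndarray         # mmWave FMCW radar frames
    laser: np.ndarray               # laser vibration/displacement signal

def duration_minutes(audio: np.ndarray, sample_rate: int) -> float:
    """Length of a recording in minutes, given its sampling rate."""
    return audio.shape[0] / sample_rate / 60.0
```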
Closing the gap: human factors in cross-device media synchronization
The continuing growth in the mobile phone arena, particularly in terms of device capabilities and ownership, is having a transformational impact on media consumption. It is now possible to consider orchestrated multi-stream experiences delivered across many devices, rather than the playback of content from a single device. However, there are significant challenges in realising such a vision, particularly around managing synchronicity between associated media streams. This is compounded by the heterogeneous nature of user devices, the networks on which they operate, and the perceptions of users. This paper describes IMSync, an open, QoE-aware inter-stream synchronisation framework. IMSync adopts efficient monitoring and control mechanisms, alongside a QoE perception model derived from a series of subjective user experiments. Based on an observation of lag, IMSync uses this impact model to determine an appropriate strategy for playback to catch up while minimising the potential detrimental impact on a user's QoE. The impact model adopts a balanced approach, trading off the potential QoE impact of initiating a re-synchronisation process against that of retaining the current level of non-synchronicity, in order to maintain high QoE. A series of experiments demonstrates the potential of the framework as a basis for enabling new, immersive media experiences.
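IMSync's actual perception model comes from subjective experiments and is not given in this abstract. Purely to illustrate the trade-off described above, the following sketch re-synchronises only when the assumed cost of staying out of sync exceeds the assumed cost of a catch-up action; the cost functions and constants are invented.

```python
# Illustrative QoE-aware re-synchronisation decision. Cost functions and
# constants are assumptions, not IMSync's actual model.

def lag_cost(lag_ms: float) -> float:
    """Assumed QoE penalty of staying out of sync, growing with observed lag."""
    return 0.002 * lag_ms ** 1.5

def catchup_cost(lag_ms: float, rate_change: float) -> float:
    """Assumed QoE penalty of a catch-up action: a fixed disruption cost plus a
    term for how aggressive the playback-rate correction is."""
    return 5.0 + 20.0 * abs(rate_change - 1.0) + 0.0005 * lag_ms

def choose_strategy(lag_ms: float) -> str:
    """Pick the cheapest option: keep playing, speed up slightly, or skip ahead."""
    options = {
        "stay":     lag_cost(lag_ms),
        "speed_up": catchup_cost(lag_ms, rate_change=1.05),
        "skip":     catchup_cost(lag_ms, rate_change=2.0),
    }
    return min(options, key=options.get)

for lag in (40, 250, 1200):   # milliseconds of observed lag
    print(lag, "->", choose_strategy(lag))
```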
A semantic memory bank assisted by an embodied conversational agents for mobile devices
Alzheimer's disease is a type of dementia that causes memory loss and seriously interferes with intellectual abilities. It currently has no cure, and the therapeutic efficacy of existing medication is limited. However, there is evidence that non-pharmacological treatments can help stimulate cognitive abilities. In the last few years, several studies have focused on describing and understanding how Virtual Coaches (VC) could be key drivers for health promotion in home care settings, and their use is gaining increasing attention in discussions of medical innovation. In this paper, we propose an approach that exploits semantic technologies and Embodied Conversational Agents to help patients train cognitive abilities using mobile devices. Semantic technologies are used to provide knowledge about the memory of a specific person: the approach exploits structured data stored in a linked data repository and takes advantage of the flexibility provided by ontologies to define search domains and expand the agent's capabilities. Our Memory Bank Embodied Conversational Agent (MBECA) interacts with the patient and eases the interaction with new devices. The framework is oriented towards Alzheimer's patients, caregivers and therapists.
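The paper's ontology and repository layout are not given here. As a hedged sketch of the general idea of an agent querying a personal linked-data store for memory facts, the snippet below runs a SPARQL query with rdflib; the file name, namespace and predicates are hypothetical placeholders, not MBECA's actual vocabulary.

```python
# Minimal sketch of querying a personal linked-data "memory bank" with rdflib.
# File name, namespace and predicates are hypothetical, not the MBECA ontology.
from rdflib import Graph

g = Graph()
g.parse("memory_bank.ttl", format="turtle")   # assumed personal RDF store

QUERY = """
PREFIX mem: <http://example.org/memory#>
SELECT ?event ?label ?date WHERE {
    ?event a mem:LifeEvent ;
           mem:label ?label ;
           mem:date  ?date .
}
ORDER BY ?date
"""

# A conversational agent could turn each result into a reminiscence prompt.
for row in g.query(QUERY):
    print(f"Do you remember {row.label} on {row.date}?")
```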
Lip syncing method for realistic expressive three-dimensional face model
Lip synchronization of a 3D face model is now used in a multitude of important fields. It brings a more human and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema, yet authoring lip syncing with complex and subtle expressions remains difficult and fraught with problems. Thus, this study proposes a lip-syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the movement of face muscles during speech and a method to produce the correct lip shape at the correct time. The 3D face model is designed, based on the MPEG-4 facial animation standard, to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation function grafted onto the input facial geometry. The study also proposes a method to animate the 3D face model over time, creating animated lip syncing from a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. Finally, the study integrates emotions by combining the Ekman model and Plutchik's wheel with emotive eye movements, implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model. The experimental results show that the proposed model generates visually satisfactory animations, with a Mean Square Error of 0.0020 for the neutral expression, 0.0024 for happy, 0.0020 for angry, 0.0030 for fear, 0.0026 for surprise, 0.0010 for disgust, and 0.0030 for sad.
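Beyond "the correct lip shape at the correct time", the abstract does not specify how per-phoneme lip shapes are scheduled. The sketch below shows one simple way such a timeline could work: phoneme timings are mapped to viseme keyframes and blended linearly over time. The phoneme-to-viseme table, the scalar "mouth openness" stand-in for a full lip shape, and the timings are all made up for illustration.

```python
import numpy as np

# Hypothetical phoneme-to-viseme mapping and viseme lip shapes (a single
# "mouth openness" scalar per viseme stands in for a full mesh pose).
PHONEME_TO_VISEME = {"AA": "open", "M": "closed", "F": "dental", "UW": "round"}
VISEME_SHAPE = {"open": 1.0, "closed": 0.0, "dental": 0.2, "round": 0.5}

def lip_shape_at(t, phoneme_timings):
    """Linearly blend viseme keyframes to get the lip shape at time t (seconds).

    phoneme_timings: list of (start_time, phoneme), assumed sorted by time.
    """
    times = np.array([start for start, _ in phoneme_timings])
    shapes = np.array([VISEME_SHAPE[PHONEME_TO_VISEME[p]]
                       for _, p in phoneme_timings])
    return float(np.interp(t, times, shapes))

# Made-up timings for the word "farm": F-AA-M
timings = [(0.00, "F"), (0.08, "AA"), (0.30, "M")]
for t in (0.0, 0.1, 0.2, 0.35):
    print(t, "->", lip_shape_at(t, timings))
```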
MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model
Face-to-face communication is a common scenario involving the roles of speaker and listener. Most existing research focuses on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker by expressing an attitude or viewpoint while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs a probabilistic denoising diffusion model to predict diverse head pose and expression features. To perform multi-faceted responses to the speaker video while preserving accurate listener identity, we design a Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer fine-tuned with an identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net achieves multi-faceted responses not only in diversity and listener identity information but also in attitude and viewpoint expression.

Comment: Accepted by ACM MM 2023
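The abstract does not give MFR-Net's exact formulation, so the block below is only a generic denoising-diffusion reverse step for a head pose and expression feature vector, conditioned on speaker-related and listener-identity features. The noise predictor, schedules and shapes are assumptions standing in for the trained network.

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_model, cond, betas):
    """One standard DDPM reverse (denoising) step for a pose/expression vector.

    x_t       : current noisy feature vector at step t
    eps_model : callable (x_t, t, cond) -> predicted noise, same shape as x_t
                (stand-in for a trained denoising network)
    cond      : conditioning features (e.g. speaker motion + listener identity)
    betas     : noise schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    eps = eps_model(x_t, t, cond)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * np.random.randn(*x_t.shape)

# Toy usage with a dummy noise predictor and a linear schedule.
T, dim = 50, 8                                 # assumed steps and feature size
betas = np.linspace(1e-4, 0.02, T)
dummy_eps = lambda x, t, c: np.zeros_like(x)   # placeholder network
cond = np.zeros(16)                            # placeholder conditioning features
x = np.random.randn(dim)                       # start from pure noise
for t in reversed(range(T)):
    x = ddpm_reverse_step(x, t, dummy_eps, cond, betas)
```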