King's speech: pronounce a foreign language with style
Computer-assisted pronunciation training requires strategies that capture the attention of learners and guide them along the learning pathway. In this paper, we introduce an immersive storytelling scenario for creating appropriate learning conditions. The proposed learning interaction is orchestrated by a spoken karaoke; we motivate the concept of the spoken karaoke and describe our design. Driven by the requirements of the proposed scenario, we suggest a modular architecture designed for immersive learning applications. We present our prototype system and our approach for processing the spoken and visual interaction modalities. Finally, we discuss how technological challenges can be addressed in order to enable the learner's self-evaluation.
Lip syncing method for realistic expressive 3D face model
Lip synchronization of a 3D face model is now used in a multitude of important fields. It brings a more human, social and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema, yet authoring lip syncing with complex and subtle expressions remains difficult and fraught with problems. This research proposes a lip-syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the myriad shapes the human face assumes during speech, and a method to produce the correct lip shape at the correct time. The paper presents a 3D face model designed to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation (RCD) function that is grafted onto the input facial geometry. The face model is based on the MPEG-4 Facial Animation (FA) standard. The paper also proposes a method to animate the 3D face model over time, creating animated lip syncing from a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. The research integrates emotions by combining the Ekman model and Plutchik's wheel with emotive eye movements, implemented via the Emotional Eye Movements Markup Language (EEMML), to produce a realistic 3D face model. © 2017 Springer Science+Business Media New York
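The abstract does not give the exact form of the Raised Cosine Deformation, so the sketch below is only a plausible illustration of the idea: a raised-cosine falloff weight applied to mouth-region vertices within a radius of influence. The vertex layout, radius, centre and displacement direction are assumptions, not the paper's actual parameters.

```python
import numpy as np

def raised_cosine_weight(dist, radius):
    """Raised-cosine falloff: 1 at the centre, 0 at (and beyond) the radius."""
    return 0.5 * (1.0 + np.cos(np.pi * np.clip(dist / radius, 0.0, 1.0)))

def deform_lips(vertices, centre, direction, amplitude, radius):
    """Displace vertices near `centre` along `direction`, scaled by a raised-cosine weight.

    vertices  : (N, 3) array of face-mesh vertex positions
    centre    : (3,) point of maximal deformation (e.g. mid lower lip)
    direction : (3,) unit vector of displacement (e.g. jaw-opening direction)
    """
    dist = np.linalg.norm(vertices - centre, axis=1)
    w = raised_cosine_weight(dist, radius)
    return vertices + amplitude * w[:, None] * direction

# Hypothetical usage: open the mouth slightly for an "ah"-like viseme.
mesh = np.random.rand(500, 3)            # stand-in for a face mesh
lip_centre = np.array([0.0, -0.2, 0.1])  # assumed mouth centre
open_dir = np.array([0.0, -1.0, 0.0])    # assumed opening direction
deformed = deform_lips(mesh, lip_centre, open_dir, amplitude=0.05, radius=0.15)
```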
The computer synthesis of expressive three-dimensional facial character animation.
The present research is concerned with the design, development and implementation of three-dimensional computer-generated facial images capable of expression, gesture and speech.

A review of previous work in chapter one shows that, to date, the model of computer-generated faces has been one in which construction and animation were not separated, and which therefore possessed only a limited expressive range. It is argued in chapter two that the physical description of the face cannot be seen as originating from a single generic mould. Chapter three therefore describes data acquisition techniques employed in the computer generation of free-form surfaces which are applicable to three-dimensional faces.

Expressions are the result of the distortion of the surface of the skin by the complex interactions of bone, muscle and skin. Chapter four demonstrates, with static images and short animation sequences on video, that a muscle model process algorithm can simulate the primary characteristics of the facial muscles.

Three-dimensional speech synchronization was the most complex problem to achieve effectively. Chapter five describes two successful approaches: the direct mapping of two-dimensional mouth shapes onto the three-dimensional model, and geometric distortions of the mouth created by the contraction of specified muscle combinations.

Chapter six describes the implementation of software for this research and argues the case for a parametric approach. Chapter seven is concerned with the control of facial articulations and discusses a more biological approach to these. Finally, chapter eight draws conclusions from the present research and suggests further extensions.
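The thesis abstract does not specify the muscle model algorithm itself. As a rough, hedged illustration of the general technique it refers to, the sketch below implements a simple linear muscle pull in the spirit of conventional facial-muscle models: skin vertices near the muscle's insertion point are drawn toward its bony attachment point with a cosine falloff. All geometry and constants are made up for the example.

```python
import numpy as np

def linear_muscle_pull(vertices, attachment, insertion, contraction, influence_radius):
    """Minimal linear-muscle sketch: vertices near the muscle's insertion point
    are pulled toward its (bone) attachment point, with a cosine falloff.
    This is an illustrative simplification, not the thesis's actual algorithm."""
    pull = attachment - vertices                          # pull direction per vertex
    dist = np.linalg.norm(vertices - insertion, axis=1)   # distance from insertion
    falloff = np.where(dist < influence_radius,
                       np.cos(0.5 * np.pi * dist / influence_radius), 0.0)
    return vertices + contraction * falloff[:, None] * pull

# Hypothetical zygomatic-major-like contraction raising a mouth corner.
face = np.random.rand(300, 3)
moved = linear_muscle_pull(face,
                           attachment=np.array([0.4, 0.5, 0.0]),
                           insertion=np.array([0.2, -0.1, 0.1]),
                           contraction=0.3, influence_radius=0.25)
```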
A comprehensive multimodal dataset for contactless lip reading and acoustic analysis
Small-scale motion detection using non-invasive remote sensing techniques has recently garnered significant interest in the field of speech recognition. This dataset paper aims to facilitate the enhancement and restoration of speech information for speakers from diverse data sources. We introduce a novel multimodal dataset based on radio-frequency, visual, text, audio, laser and lip-landmark information, called RVTALL. Specifically, the dataset consists of 7.5 GHz Channel Impulse Response (CIR) data from ultra-wideband (UWB) radars, 77 GHz frequency-modulated continuous-wave (FMCW) data from millimeter-wave (mmWave) radar, visual and audio information, lip landmarks and laser data, offering a unique multimodal approach to speech recognition research. A depth camera is used to record the subject's lip landmarks alongside the voice recordings. Approximately 400 minutes of annotated speech profiles are provided, collected from 20 participants speaking 5 vowels, 15 words and 16 sentences. The dataset has been validated and supports research into lip reading and multimodal speech recognition.
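To make the multimodal composition concrete, here is a minimal sketch of how one recording from such a dataset could be represented in code. The field names, shapes and helper are hypothetical and do not reflect the published RVTALL schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SpeechSample:
    """One multimodal recording (illustrative layout only, not the RVTALL schema)."""
    subject_id: int
    transcript: str                 # text prompt (vowel, word or sentence)
    audio: np.ndarray               # mono waveform, e.g. shape (samples,)
    video_frames: np.ndarray        # RGB frames, e.g. shape (T, H, W, 3)
    lip_landmarks: np.ndarray       # per-frame lip points, e.g. shape (T, K, 3)
    uwb_cir: np.ndarray             # UWB channel impulse response frames
    mmwave_fmcw: np.ndarray         # mmWave FMCW radar frames
    laser: np.ndarray               # laser vibration/displacement signal

def duration_minutes(audio: np.ndarray, sample_rate: int) -> float:
    """Length of a recording in minutes, given its sampling rate."""
    return audio.shape[0] / sample_rate / 60.0
```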
Closing the gap: human factors in cross-device media synchronization
The continuing growth in the mobile phone arena, particularly in terms of device capabilities and ownership, is having a transformational impact on media consumption. It is now possible to consider orchestrated multi-stream experiences delivered across many devices, rather than the playback of content from a single device. However, there are significant challenges in realising such a vision, particularly around managing synchronicity between associated media streams. This is compounded by the heterogeneous nature of user devices, the networks on which they operate, and the perceptions of users. This paper describes IMSync, an open, QoE-aware inter-stream synchronisation framework. IMSync adopts efficient monitoring and control mechanisms, alongside a QoE perception model derived from a series of subjective user experiments. Based on an observation of lag, IMSync uses this impact model to determine an appropriate strategy for playback to catch up while minimising the potential detrimental impact on a user's QoE. The impact model adopts a balanced approach, trading off the potential QoE impact of initiating a re-synchronisation process against that of retaining the current level of non-synchronicity, in order to maintain high QoE. A series of experiments demonstrates the potential of the framework as a basis for enabling new, immersive media experiences.
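IMSync's actual perception model comes from subjective experiments and is not given in this abstract. Purely to illustrate the trade-off described above, the following sketch re-synchronises only when the assumed cost of staying out of sync exceeds the assumed cost of a catch-up action; the cost functions and constants are invented.

```python
# Illustrative QoE-aware re-synchronisation decision. Cost functions and
# constants are assumptions, not IMSync's actual model.

def lag_cost(lag_ms: float) -> float:
    """Assumed QoE penalty of staying out of sync, growing with observed lag."""
    return 0.002 * lag_ms ** 1.5

def catchup_cost(lag_ms: float, rate_change: float) -> float:
    """Assumed QoE penalty of a catch-up action: a fixed disruption cost plus a
    term for how aggressive the playback-rate correction is."""
    return 5.0 + 20.0 * abs(rate_change - 1.0) + 0.0005 * lag_ms

def choose_strategy(lag_ms: float) -> str:
    """Pick the cheapest option: keep playing, speed up slightly, or skip ahead."""
    options = {
        "stay":     lag_cost(lag_ms),
        "speed_up": catchup_cost(lag_ms, rate_change=1.05),
        "skip":     catchup_cost(lag_ms, rate_change=2.0),
    }
    return min(options, key=options.get)

for lag in (40, 250, 1200):   # milliseconds of observed lag
    print(lag, "->", choose_strategy(lag))
```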
A semantic memory bank assisted by an embodied conversational agents for mobile devices
Alzheimer's disease is a type of dementia that causes memory loss and seriously interferes with intellectual abilities. It currently has no cure, and the therapeutic efficacy of existing medication is limited. However, there is evidence that non-pharmacological treatments can help stimulate cognitive abilities. In the last few years, several studies have focused on describing and understanding how Virtual Coaches (VC) could be key drivers for health promotion in home care settings, and their use is gaining increasing attention in discussions of medical innovation. In this paper, we propose an approach that exploits semantic technologies and Embodied Conversational Agents to help patients train cognitive abilities using mobile devices. Semantic technologies are used to provide knowledge about the memory of a specific person: the approach exploits structured data stored in a linked data repository and takes advantage of the flexibility provided by ontologies to define search domains and expand the agent's capabilities. Our Memory Bank Embodied Conversational Agent (MBECA) interacts with the patient and eases the interaction with new devices. The framework is oriented towards Alzheimer's patients, caregivers and therapists.
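The paper's ontology and repository layout are not given here. As a hedged sketch of the general idea of an agent querying a personal linked-data store for memory facts, the snippet below runs a SPARQL query with rdflib; the file name, namespace and predicates are hypothetical placeholders, not MBECA's actual vocabulary.

```python
# Minimal sketch of querying a personal linked-data "memory bank" with rdflib.
# File name, namespace and predicates are hypothetical, not the MBECA ontology.
from rdflib import Graph

g = Graph()
g.parse("memory_bank.ttl", format="turtle")   # assumed personal RDF store

QUERY = """
PREFIX mem: <http://example.org/memory#>
SELECT ?event ?label ?date WHERE {
    ?event a mem:LifeEvent ;
           mem:label ?label ;
           mem:date  ?date .
}
ORDER BY ?date
"""

# A conversational agent could turn each result into a reminiscence prompt.
for row in g.query(QUERY):
    print(f"Do you remember {row.label} on {row.date}?")
```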
Lip syncing method for realistic expressive three-dimensional face model
Lip synchronization of a 3D face model is now used in a multitude of important fields. It brings a more human and dramatic reality to computer games, films and interactive multimedia, and is growing in use and importance. A high level of realism is demanded in applications such as computer games and cinema, yet authoring lip syncing with complex and subtle expressions remains difficult and fraught with problems. Thus, this study proposes a lip-syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the movement of face muscles during speech and a method to produce the correct lip shape at the correct time. The 3D face model is designed, based on the MPEG-4 facial animation standard, to support lip syncing aligned with an input audio file. It deforms using a Raised Cosine Deformation function grafted onto the input facial geometry. The study also proposes a method to animate the 3D face model over time, creating animated lip syncing from a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. Finally, the study integrates emotions by combining the Ekman model and Plutchik's wheel with emotive eye movements, implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model. The experimental results show that the proposed model generates visually satisfactory animations, with a Mean Square Error of 0.0020 for the neutral expression, 0.0024 for happy, 0.0020 for angry, 0.0030 for fear, 0.0026 for surprise, 0.0010 for disgust, and 0.0030 for sad.
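Beyond "the correct lip shape at the correct time", the abstract does not specify how per-phoneme lip shapes are scheduled. The sketch below shows one simple way such a timeline could work: phoneme timings are mapped to viseme keyframes and blended linearly over time. The phoneme-to-viseme table, the scalar "mouth openness" stand-in for a full lip shape, and the timings are all made up for illustration.

```python
import numpy as np

# Hypothetical phoneme-to-viseme mapping and viseme lip shapes (a single
# "mouth openness" scalar per viseme stands in for a full mesh pose).
PHONEME_TO_VISEME = {"AA": "open", "M": "closed", "F": "dental", "UW": "round"}
VISEME_SHAPE = {"open": 1.0, "closed": 0.0, "dental": 0.2, "round": 0.5}

def lip_shape_at(t, phoneme_timings):
    """Linearly blend viseme keyframes to get the lip shape at time t (seconds).

    phoneme_timings: list of (start_time, phoneme), assumed sorted by time.
    """
    times = np.array([start for start, _ in phoneme_timings])
    shapes = np.array([VISEME_SHAPE[PHONEME_TO_VISEME[p]]
                       for _, p in phoneme_timings])
    return float(np.interp(t, times, shapes))

# Made-up timings for the word "farm": F-AA-M
timings = [(0.00, "F"), (0.08, "AA"), (0.30, "M")]
for t in (0.0, 0.1, 0.2, 0.35):
    print(t, "->", lip_shape_at(t, timings))
```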
MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model
Face-to-face communication is a common scenario involving the roles of speaker and listener. Most existing research focuses on producing speaker videos, while the generation of listener heads remains largely overlooked. Responsive listening head generation is an important task that aims to model face-to-face communication scenarios by generating a listener head video given a speaker video and a listener head image. An ideal generated responsive listening video should respond to the speaker by expressing an attitude or viewpoint while maintaining diversity in interaction patterns and accuracy in listener identity information. To achieve this goal, we propose the Multi-Faceted Responsive Listening Head Generation Network (MFR-Net). Specifically, MFR-Net employs a probabilistic denoising diffusion model to predict diverse head pose and expression features. To perform multi-faceted responses to the speaker video while preserving accurate listener identity, we design a Feature Aggregation Module to boost listener identity features and fuse them with other speaker-related features. Finally, a renderer fine-tuned with an identity consistency loss produces the final listening head videos. Our extensive experiments demonstrate that MFR-Net achieves multi-faceted responses not only in diversity and listener identity information but also in attitude and viewpoint expression.

Comment: Accepted by ACM MM 2023
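The abstract does not give MFR-Net's exact formulation, so the block below is only a generic denoising-diffusion reverse step for a head pose and expression feature vector, conditioned on speaker-related and listener-identity features. The noise predictor, schedules and shapes are assumptions standing in for the trained network.

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_model, cond, betas):
    """One standard DDPM reverse (denoising) step for a pose/expression vector.

    x_t       : current noisy feature vector at step t
    eps_model : callable (x_t, t, cond) -> predicted noise, same shape as x_t
                (stand-in for a trained denoising network)
    cond      : conditioning features (e.g. speaker motion + listener identity)
    betas     : noise schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    eps = eps_model(x_t, t, cond)
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t == 0:
        return mean
    return mean + np.sqrt(betas[t]) * np.random.randn(*x_t.shape)

# Toy usage with a dummy noise predictor and a linear schedule.
T, dim = 50, 8                                 # assumed steps and feature size
betas = np.linspace(1e-4, 0.02, T)
dummy_eps = lambda x, t, c: np.zeros_like(x)   # placeholder network
cond = np.zeros(16)                            # placeholder conditioning features
x = np.random.randn(dim)                       # start from pure noise
for t in reversed(range(T)):
    x = ddpm_reverse_step(x, t, dummy_eps, cond, betas)
```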