EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation
Speech-driven 3D face animation aims to generate realistic facial expressions
that match the speech content and emotion. However, existing methods often
neglect emotional facial expressions or fail to disentangle them from speech
content. To address this issue, this paper proposes an end-to-end neural
network to disentangle different emotions in speech so as to generate rich 3D
facial expressions. Specifically, we introduce the emotion disentangling
encoder (EDE) to disentangle the emotion and content in the speech by
cross-reconstructed speech signals with different emotion labels. Then an
emotion-guided feature fusion decoder is employed to generate a 3D talking face
with enhanced emotion. The decoder is driven by the disentangled identity,
emotional, and content embeddings so as to generate controllable personal and
emotional styles. Finally, considering the scarcity of the 3D emotional talking
face data, we resort to the supervision of facial blendshapes, which enables
the reconstruction of plausible 3D faces from 2D emotional data, and contribute
a large-scale 3D emotional talking face dataset (3D-ETF) to train the network.
Our experiments and user studies demonstrate that our approach outperforms
state-of-the-art methods and exhibits more diverse facial movements. We
recommend watching the supplementary video:
https://ziqiaopeng.github.io/emotalk
Comment: Accepted by ICCV 2023
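The abstract describes disentangling emotion from content by cross-reconstructing speech with swapped emotion cues and supervising the decoder with facial blendshapes. Below is a minimal sketch, not the authors' code, of that cross-reconstruction idea under the assumption of training pairs (a1, a2) that share speech content but carry different emotion labels; module names, dimensions, and the GRU encoders are illustrative choices.

```python
import torch
import torch.nn as nn

class DisentangleModel(nn.Module):
    """Illustrative content/emotion disentanglement with a blendshape decoder."""
    def __init__(self, feat_dim=768, emb_dim=256, n_blendshapes=52):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, emb_dim, batch_first=True)
        # Maps fused (content, emotion) embeddings to per-frame blendshape weights.
        self.decoder = nn.Sequential(
            nn.Linear(2 * emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, n_blendshapes),
        )

    def encode(self, audio_feats):
        c, _ = self.content_enc(audio_feats)   # (B, T, emb_dim) per-frame content
        _, e = self.emotion_enc(audio_feats)   # (1, B, emb_dim) clip-level emotion
        return c, e.squeeze(0)

    def decode(self, content, emotion):
        emo = emotion.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, emo], dim=-1))

def cross_reconstruction_loss(model, a1, a2, bs1, bs2):
    """Swap emotion embeddings between two clips that share content, and require
    each combination to reconstruct the matching blendshape sequence."""
    c1, e1 = model.encode(a1)
    c2, e2 = model.encode(a2)
    l1 = nn.functional.l1_loss
    return (l1(model.decode(c1, e1), bs1) + l1(model.decode(c2, e2), bs2) +
            l1(model.decode(c1, e2), bs2) + l1(model.decode(c2, e1), bs1))
```

The cross terms are what force the separation: if content leaked into the emotion embedding, decoding content from clip 1 with emotion from clip 2 could not match clip 2's blendshapes.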
Realistic Lip Syncing for Virtual Character Using Common Viseme Set
Speech is one of the most important methods of interaction between humans; therefore, most avatar research focuses on this area. Creating animated speech requires a facial model capable of representing the myriad shapes the human face assumes during speech, as well as a method to produce the correct shape at the correct time. One of the main challenges is to create precise lip movements for the avatar and synchronize them with recorded audio. This paper proposes a new lip synchronization algorithm for realistic applications, which can be employed to generate synchronized facial movements for audio produced from natural speech or through a text-to-speech engine. The method requires an animator to construct animations using a canonical set of visemes for all pairwise combinations of a reduced phoneme set. These animations are then stitched together smoothly to construct the final animation.
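As a rough illustration of the stitching idea described above, the sketch below assumes timed phonemes (e.g., from forced alignment), reduces them to a small canonical viseme set, and chains pre-authored clips for each pairwise viseme transition; the phoneme-to-viseme table, clip store, and crossfade length are assumptions, not the paper's actual data.

```python
from dataclasses import dataclass

# Illustrative partial mapping from phonemes to a reduced canonical viseme set.
PHONEME_TO_VISEME = {
    "p": "PBM", "b": "PBM", "m": "PBM",
    "f": "FV", "v": "FV",
    "aa": "AA", "iy": "IY", "uw": "UW",
    # ... remaining phonemes collapse into the reduced viseme set
}

@dataclass
class TimedPhoneme:
    phoneme: str
    start: float  # seconds
    end: float

def stitch_viseme_track(phonemes, clip_store, fade=0.05):
    """Build a list of (start, end, clip, fade) segments, where each clip is a
    pre-authored animation for one pairwise viseme transition."""
    visemes = [(PHONEME_TO_VISEME.get(p.phoneme, "REST"), p.start, p.end)
               for p in phonemes]
    track = []
    for (v0, s0, _), (v1, _, e1) in zip(visemes, visemes[1:]):
        clip = clip_store[(v0, v1)]           # transition animation for this pair
        track.append((s0, e1, clip, fade))    # short crossfade smooths the seams
    return track
```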
Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis
Deep person generation has attracted extensive research attention due to its
wide applications in virtual agents, video conferencing, online shopping and
art/movie production. With the advancement of deep learning, visual appearances
(face, pose, cloth) of a person image can be easily generated or manipulated on
demand. In this survey, we first summarize the scope of person generation, and
then systematically review recent progress and technical trends in deep person
generation, covering three major tasks: talking-head generation (face),
pose-guided person generation (pose) and garment-oriented person generation
(cloth). More than two hundred papers are covered for a thorough overview, and
milestone works are highlighted to trace the major technical breakthroughs.
Based on these fundamental tasks, a number of applications are investigated,
e.g., virtual fitting, digital humans, and generative data augmentation. We hope
this survey can shed some light on the future prospects of deep person
generation and provide a helpful foundation for practical applications toward
digital humans.
Assistive technologies for severe and profound hearing loss: beyond hearing aids and implants
Assistive technologies offer capabilities that were previously inaccessible to individuals with severe and profound hearing loss who have no or limited access to hearing aids and implants. This literature review explores existing assistive technologies and identifies what still needs to be done. It finds a lack of focus on the overall objectives of assistive technologies. Several other issues are also identified: only a very small number of assistive technologies developed within a research context have led to commercial devices, there is a predisposition to use the latest expensive technologies, and there is a tendency to avoid designing products universally. Finally, further development of plug-ins that translate the text content of a website into various sign languages is needed to make information on the internet more accessible.
Hi Sheldon! Creating Deep Personalized Characters from TV Shows
Imagine a multimodal interactive scenario in which you can see, hear, and chat
with an AI-generated digital character who is capable of behaving like Sheldon
from The Big Bang Theory, as a DEEP copy from appearance to personality. Toward
this multimodal chatting scenario, we propose a novel task, named Deep
Personalized Character Creation (DPCC): creating personalized multimodal chat
characters from multimodal data such as TV shows. Specifically, given a single-
or multi-modality input (text, audio, video), the goal of DPCC is to generate a
multi-modality (text, audio, video) response that matches the personality of a
specific character such as Sheldon and is of high quality as well. To support
this novel task, we further collect a character-centric multimodal dialogue
dataset, named Deep Personalized Character Dataset (DPCD), from TV shows. DPCD
contains character-specific multimodal dialogue data of ~10k utterances and ~6
hours of audio/video per character, which is around 10 times larger than
existing related datasets. On DPCD, we present a baseline method for the DPCC
task and create 5 Deep personalized digital Characters (DeepCharacters) from
The Big Bang Theory. We conduct both subjective and objective experiments to
evaluate the multimodal responses from DeepCharacters in terms of
characterization and quality. The results demonstrate that, on the collected
DPCD dataset, the proposed baseline can create personalized digital characters
that generate multimodal responses. The DPCD dataset, the data collection code,
and our baseline will be published soon.
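The task as stated (multimodal input in, personality-consistent multimodal output) suggests a staged pipeline. The sketch below is only an assumed arrangement of pluggable components, not the paper's baseline; every interface name here (asr, persona_lm, tts, talking_head) is hypothetical.

```python
class DeepCharacterPipeline:
    """Illustrative DPCC-style pipeline: transcribe, reply in character, render."""
    def __init__(self, asr, persona_lm, tts, talking_head, persona_prompt):
        self.asr = asr                    # audio -> text (hypothetical component)
        self.persona_lm = persona_lm      # (prompt, history, text) -> reply text
        self.tts = tts                    # text -> character-voice audio
        self.talking_head = talking_head  # audio -> character video frames
        self.persona_prompt = persona_prompt
        self.history = []

    def respond(self, user_audio):
        # 1. Transcribe the user's query (audio branch of the multimodal input).
        user_text = self.asr(user_audio)
        # 2. Generate a personality-consistent text reply conditioned on the
        #    character prompt and the dialogue history.
        reply_text = self.persona_lm(self.persona_prompt, self.history, user_text)
        self.history.append((user_text, reply_text))
        # 3. Render the multimodal response: speech in the character's voice and
        #    a talking-head video driven by that speech.
        reply_audio = self.tts(reply_text)
        reply_video = self.talking_head(reply_audio)
        return reply_text, reply_audio, reply_video
```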
Lip syncing method for realistic expressive three-dimensional face model
Lip synchronization of a 3D face model is now used in a multitude of important fields. It brings a more human and dramatic realism to computer games, films, and interactive multimedia, and is growing in use and importance. High-level realism can be employed in demanding applications such as computer games and cinema. Authoring lip syncing with complex and subtle expressions is still difficult and fraught with problems in terms of realism. Thus, this study proposes a lip syncing method for a realistic, expressive 3D face model. Animated lips require a 3D face model capable of representing the movement of facial muscles during speech and a method to produce the correct lip shape at the correct time. The 3D face model is designed based on the MPEG-4 facial animation standard to support lip syncing aligned with an input audio file. It is deformed using a Raised Cosine Deformation function that is grafted onto the input facial geometry. This study also proposes a method to animate the 3D face model over time to create animated lip syncing, using a canonical set of visemes for all pairwise combinations of a reduced phoneme set called ProPhone. Finally, this study integrates emotions by considering both the Ekman model and Plutchik's wheel, together with emotive eye movements implemented via the Emotional Eye Movements Markup Language, to produce a realistic 3D face model. The experimental results show that the proposed model can generate visually satisfactory animations with a Mean Square Error of 0.0020 for the neutral expression, 0.0024 for happy, 0.0020 for angry, 0.0030 for fear, 0.0026 for surprise, 0.0010 for disgust, and 0.0030 for sad.
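For readers unfamiliar with raised-cosine deformers, the sketch below shows the general technique of displacing vertices near a control point with a smooth raised-cosine falloff; it is an assumption about the family of deformation the abstract names, not the paper's exact formulation, and the radius, amplitude, and direction parameters are illustrative.

```python
import numpy as np

def raised_cosine_deform(vertices, center, direction, radius, amplitude):
    """vertices: (N, 3) mesh positions; center: (3,) control point on the lips;
    direction: (3,) unit displacement direction. Returns a deformed copy."""
    d = np.linalg.norm(vertices - center, axis=1)  # distance to the control point
    # Influence falls smoothly from 1 at the center to 0 at the radius.
    w = np.where(d < radius, 0.5 * (1.0 + np.cos(np.pi * d / radius)), 0.0)
    return vertices + amplitude * w[:, None] * direction

# Example (hypothetical values): push lower-lip vertices downward to open the mouth.
# verts = raised_cosine_deform(verts, center=lip_center,
#                              direction=np.array([0.0, -1.0, 0.0]),
#                              radius=0.03, amplitude=0.005)
```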