140 research outputs found
An Implementation of Multimodal Fusion System for Intelligent Digital Human Generation
With the rapid development of artificial intelligence (AI), digital humans
have attracted more and more attention and are expected to achieve a wide range
of applications in several industries. However, most of the existing digital
humans still rely on manual modeling by designers, which is a cumbersome
process and has a long development cycle. Therefore, facing the rise of digital
humans, there is an urgent need for a digital human generation system combined
with AI to improve development efficiency. In this paper, an implementation
scheme of an intelligent digital human generation system with multimodal fusion
is proposed. Specifically, text, speech and image are taken as inputs, and
interactive speech is synthesized using a large language model (LLM), voiceprint
extraction, and text-to-speech conversion techniques. Then the input image is
age-transformed and a suitable image is selected as the driving image. Then,
the modification and generation of digital human video content is realized by
digital human driving, novel view synthesis, and intelligent dressing
techniques. Finally, we enhance the user experience through style transfer,
super-resolution, and quality evaluation. Experimental results show that the
system can effectively realize digital human generation. The related code is
released at https://github.com/zyj-2000/CUMT_2D_PhotoSpeaker
TEXT-DRIVEN MOUTH ANIMATION FOR HUMAN COMPUTER INTERACTION WITH PERSONAL ASSISTANT
Personal assistants are becoming more pervasive in our environments but still do not provide natural interactions. Their lack of realism in terms of expressiveness and their lack of visual feedback can create frustrating experiences and make users lose patience. To address this, we propose an end-to-end trainable neural architecture for text-driven 3D mouth animations. Previous works showed that such architectures provide better realism and could open the door to integrated affective Human Computer Interfaces (HCI). Our study shows that such visual feedback significantly improves users' comfort for 78% of the candidates while slightly improving their time perception.
VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
We present VideoReTalking, a new system to edit the faces of a real-world
talking head video according to input audio, producing a high-quality and
lip-syncing output video even with a different emotion. Our system disentangles
this objective into three sequential tasks: (1) face video generation with a
canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for
improving photo-realism. Given a talking-head video, we first modify the
expression of each frame according to the same expression template using the
expression editing network, resulting in a video with the canonical expression.
This video, together with the given audio, is then fed into the lip-sync
network to generate a lip-syncing video. Finally, we improve the photo-realism
of the synthesized faces through an identity-aware face enhancement network and
post-processing. We use learning-based approaches for all three steps and all
our modules can be tackled in a sequential pipeline without any user
intervention. Furthermore, our system is a generic approach that does not need
to be retrained to a specific person. Evaluations on two widely-used datasets
and in-the-wild examples demonstrate the superiority of our framework over
other state-of-the-art methods in terms of lip-sync accuracy and visual
quality. Comment: Accepted by SIGGRAPH Asia 2022 Conference Proceedings. Project page:
https://vinthony.github.io/video-retalking
Final Report to NSF of the Standards for Facial Animation Workshop
The human face is an important and complex communication channel. It is a very familiar and sensitive object of human perception. The facial animation field has grown greatly in the past few years as fast computer graphics workstations have made the modeling and real-time animation of hundreds of thousands of polygons affordable and almost commonplace. Many applications have been developed such as teleconferencing, surgery, information assistance systems, games, and entertainment. To solve these different problems, different approaches for both animation control and modeling have been developed.
DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation
In recent years, audio-driven 3D facial animation has gained significant
attention, particularly in applications such as virtual reality, gaming, and
video conferencing. However, accurately modeling the intricate and subtle
dynamics of facial expressions remains a challenge. Most existing studies
approach the facial animation task as a single regression problem, which often
fails to capture the intrinsic inter-modal relationship between speech signals
and 3D facial animation and overlook their inherent consistency. Moreover, due
to the limited availability of 3D-audio-visual datasets, approaches learning
with small-size samples have poor generalizability that decreases the
performance. To address these issues, in this study, we propose a cross-modal
dual-learning framework, termed DualTalker, aiming at improving data usage
efficiency as well as relating cross-modal dependencies. The framework is
trained jointly with the primary task (audio-driven facial animation) and its
dual task (lip reading) and shares common audio/motion encoder components. Our
joint training framework facilitates more efficient data usage by leveraging
information from both tasks and explicitly capitalizing on the complementary
relationship between facial motion and audio to improve performance.
Furthermore, we introduce an auxiliary cross-modal consistency loss to mitigate
the potential over-smoothing underlying the cross-modal complementary
representations, enhancing the mapping of subtle facial expression dynamics.
Through extensive experiments and a perceptual user study conducted on the VOCA
and BIWI datasets, we demonstrate that our approach outperforms current
state-of-the-art methods both qualitatively and quantitatively. We have made
our code and video demonstrations available at
https://github.com/sabrina-su/iadf.git
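The joint objective described in this abstract (a primary audio-to-motion task, a dual lip-reading task, and an auxiliary cross-modal consistency term) can be illustrated with a minimal Python sketch. The abstract does not give the loss definitions, so everything here is an assumption made for illustration: the mean-squared-error and cross-entropy forms, the weights `w_dual` and `w_cons`, and the function names are hypothetical and are not taken from the DualTalker code.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])

def joint_loss(pred_motion, true_motion, text_probs, text_label,
               audio_feat, motion_feat, w_dual=1.0, w_cons=0.1):
    """Weighted sum of three terms (all forms are assumptions).

    primary:     audio-driven facial animation, scored as regression error
    dual:        lip reading, scored as classification error
    consistency: distance between audio and motion features
    """
    primary = mse(pred_motion, true_motion)
    dual = cross_entropy(text_probs, text_label)
    consistency = mse(audio_feat, motion_feat)
    return primary + w_dual * dual + w_cons * consistency

# Toy values only; they do not come from any dataset.
loss = joint_loss([0.1, 0.2], [0.0, 0.2], [0.25, 0.75], 1, [1.0, 0.0], [0.5, 0.0])
```

Summing the terms with scalar weights is only one common way to train two tasks jointly; the paper may combine them differently.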
OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions
One-shot talking head generation has no explicit head movement reference,
thus it is difficult to generate talking heads with head motions. Some existing
works only edit the mouth area and generate still talking heads, leading to
unrealistic talking head performance. Other works construct a one-to-one mapping
between audio signals and head motion sequences, introducing ambiguous
correspondences into the mapping since people can behave differently in head
motions when speaking the same content. This unreasonable mapping form fails to
model the diversity and produces either nearly static or even exaggerated head
motions, which are unnatural and strange. Therefore, the one-shot talking head
generation task is actually a one-to-many ill-posed problem and people present
diverse head motions when speaking. Based on the above observation, we propose
OSM-Net, a one-to-many one-shot talking head generation network with
natural head motions. OSM-Net constructs a motion space that contains rich and
various clip-level head motion features. Each basis of the space represents a
feature of meaningful head motion in a clip rather than just a frame, thus
providing more coherent and natural motion changes in talking heads. The
driving audio is mapped into the motion space, around which various motion
features can be sampled within a reasonable range to achieve the one-to-many
mapping. In addition, the landmark constraint and time window feature input improve
the accuracy of expression feature extraction and video generation. Extensive
experiments show that OSM-Net generates more natural and realistic head motions
under a reasonable one-to-many mapping paradigm than other methods. Comment: Paper Under Review
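The one-to-many idea in this abstract, mapping the driving audio to a point in a motion space and then sampling nearby motion features within a bounded range, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OSM-Net's method: the uniform sampling, the `radius` bound, the vector representation, and the function name are all hypothetical.

```python
import random

def sample_motion_features(center, radius, n_samples, seed=0):
    """Draw several motion feature vectors near one audio-derived point.

    center:  point in the motion space that the audio was mapped to
    radius:  largest allowed per-dimension offset (assumed bound)
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [
        [c + rng.uniform(-radius, radius) for c in center]
        for _ in range(n_samples)
    ]

# One audio clip yields several plausible head-motion features.
audio_point = [0.2, -0.1, 0.05, 0.0]
candidates = sample_motion_features(audio_point, radius=0.05, n_samples=3)
```

Each candidate stays within `radius` of the audio-derived point, which is one simple way to keep sampled motions plausible while still producing different results for the same audio.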
Coarticulation and speech synchronization in MPEG-4 based facial animation
In this paper, we present a novel coarticulation and speech synchronization framework compliant with MPEG-4 facial animation. The system we have developed uses the MPEG-4 facial animation standard and other developments to enable the creation, editing and playback of high-resolution 3D models and MPEG-4 animation streams, and is compatible with well-known related systems such as Greta and Xface. It supports text-to-speech for dynamic speech synchronization. The framework enables real-time model simplification using quadric-based surfaces. Our coarticulation approach provides realistic and high-performance lip-sync animation, based on Cohen-Massaro’s model of coarticulation adapted to the MPEG-4 facial animation (FA) specification. The preliminary experiments show that the coarticulation technique we have developed gives overall good and promising results when compared to related techniques.
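The Cohen-Massaro model referred to in this abstract blends per-segment targets using dominance functions, so that neighbouring speech segments influence each other's lip shapes. The Python sketch below shows that blending for a single animation parameter, assuming negative-exponential dominance functions. The parameter values, the segment timings, and the function names are hypothetical, and the adaptation to MPEG-4 parameters described in the paper is not reproduced here.

```python
import math

def dominance(t, center, alpha, theta, c=1.0):
    """Negative-exponential dominance of one speech segment at time t."""
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blend(t, segments):
    """Dominance-weighted average of the segment targets at time t."""
    weights = [dominance(t, s["center"], s["alpha"], s["theta"]) for s in segments]
    total = sum(weights)
    return sum(w * s["target"] for w, s in zip(weights, segments)) / total

# Hypothetical values for one lip-opening parameter across two segments.
segments = [
    {"target": 0.0, "center": 0.10, "alpha": 1.0, "theta": 20.0},  # closed lips
    {"target": 1.0, "center": 0.30, "alpha": 1.0, "theta": 20.0},  # open vowel
]

# Sample the blended trajectory every 50 ms from 0.0 s to 0.4 s.
trajectory = [blend(i * 0.05, segments) for i in range(9)]
```

Because both segments have the same dominance magnitude and decay rate here, the parameter is halfway between the two targets at the midpoint between the segment centers. Giving one segment a larger `alpha` would pull the trajectory toward its target.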
A physically-based muscle and skin model for facial animation
Facial animation is a popular area of research which has been around for over thirty years, but even with this long time scale, automatically creating realistic facial expressions is still an unsolved goal. This work furthers the state of the art in computer facial animation by introducing a new muscle and skin model and a method of easily transferring a full muscle and bone animation setup from one head mesh to another with very little user input.
The developed muscle model allows muscles of any shape to be accurately simulated, preserving volume during contraction and interacting with surrounding muscles and skin in a lifelike manner. The muscles can drive a rigid body model of a jaw, giving realistic physically-based movement to all areas of the face.
The skin model has multiple layers, mimicking the natural structure of skin and it connects onto the muscle model and is deformed realistically by the movements of the muscles and underlying bones. The skin smoothly transfers underlying movements into skin surface movements and propagates forces smoothly across the face.
Once a head model has been set up with muscles and bones, moving this muscle and bone set to another head is a simple matter using the developed techniques. The developed software employs principles from forensic reconstruction, using specific landmarks on the head to map the bone and muscles to the new head model. Once the muscles and skull have been transferred, they provide animation capabilities on the new mesh within minutes.