SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation
Generating talking head videos from a face image and a piece of speech
audio still poses many challenges, i.e., unnatural head movement, distorted
expressions, and identity modification. We argue that these issues mainly
stem from learning from coupled 2D motion fields. On the other hand,
explicitly using 3D information also suffers from stiff expressions and
incoherent video. We present SadTalker, which generates 3D motion coefficients
(head pose, expression) of the 3DMM from audio and implicitly modulates a novel
3D-aware face renderer for talking head generation. To learn realistic motion
coefficients, we explicitly model the connections between audio and the different
types of motion coefficients individually. Precisely, we present ExpNet to
learn accurate facial expressions from audio by distilling both coefficients
and 3D-rendered faces. As for the head pose, we design PoseVAE, a
conditional VAE, to synthesize head motion in different styles. Finally, the
generated 3D motion coefficients are mapped to the unsupervised 3D keypoint
space of the proposed face renderer, which synthesizes the final video. We conduct
extensive experiments to show the superiority of our method in terms of motion and
video quality.
Comment: Project page: https://sadtalker.github.i
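To make the two-branch coefficient prediction concrete, here is a minimal PyTorch sketch of the general idea: one network regresses 3DMM expression coefficients from audio features, and a conditional VAE samples head-pose sequences conditioned on the same audio. The module names, dimensions, and MLP architectures are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the SadTalker code): audio -> 3DMM expression
# coefficients, plus a conditional VAE for head pose.
import torch
import torch.nn as nn

class ExpNetSketch(nn.Module):
    """Map per-frame audio features to 64-dim expression coefficients."""
    def __init__(self, audio_dim=80, exp_dim=64, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, exp_dim),
        )

    def forward(self, audio_feat):           # (B, T, audio_dim)
        return self.net(audio_feat)          # (B, T, exp_dim)

class PoseVAESketch(nn.Module):
    """Conditional VAE: reconstruct/sample head poses conditioned on audio."""
    def __init__(self, audio_dim=80, pose_dim=6, latent=32):
        super().__init__()
        self.enc = nn.Linear(pose_dim + audio_dim, 2 * latent)
        self.dec = nn.Linear(latent + audio_dim, pose_dim)

    def forward(self, pose, audio_feat):     # (B, T, pose_dim), (B, T, audio_dim)
        h = self.enc(torch.cat([pose, audio_feat], dim=-1))
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(torch.cat([z, audio_feat], dim=-1))
        return recon, mu, logvar

audio = torch.randn(2, 100, 80)              # dummy mel-style audio features
exp_coeffs = ExpNetSketch()(audio)           # (2, 100, 64) expression coefficients
pose = torch.randn(2, 100, 6)                # yaw/pitch/roll + translation (assumed)
recon, mu, logvar = PoseVAESketch()(pose, audio)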
CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding
This paper proposes a talking face generation method named "CP-EB" that takes
an audio signal as input and a person image as reference, and synthesizes a
photo-realistic talking video of that person with head poses controlled by a short
video clip and proper eye blinking embedded. It is worth noting that both the head
pose and eye blinking are important cues for deep fake detection.
Implicit control of poses by video has already been achieved by state-of-the-art
work. According to recent research, eye blinking has only a weak correlation with the
input audio, which means extracting eye blinks from audio and generating them is
possible. Hence, we propose a GAN-based architecture to extract eye-blink
features from the input audio and the reference video, respectively, apply
contrastive training between them, and then embed them into the concatenated
identity and pose features to generate talking face images. Experimental results
show that the proposed method can generate photo-realistic talking faces with
synchronized lip motions, natural head poses, and blinking eyes.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and
Distributed Processing with Applications (IEEE ISPA 2023)
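The contrastive pairing of audio-derived and video-derived eye-blink features could look roughly like the following sketch: an InfoNCE-style loss that pulls together embeddings from the same clip and pushes apart mismatched clips. The loss form, temperature, and embedding size are assumptions; the paper's encoders are not reproduced here.

# Minimal sketch (not the CP-EB implementation): contrastive loss between
# clip-level eye-blink embeddings extracted from audio and from video.
import torch
import torch.nn.functional as F

def blink_contrastive_loss(audio_blink, video_blink, temperature=0.07):
    """audio_blink, video_blink: (B, D) embeddings for the same B clips."""
    a = F.normalize(audio_blink, dim=-1)
    v = F.normalize(video_blink, dim=-1)
    logits = a @ v.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric InfoNCE: audio->video and video->audio directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = blink_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))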
Localization of sound sources: a systematic review
Sound localization is a vast field of research used in many applications, including communication, radar, medical aid, and speech enhancement, to name but a few. Many different methods have been proposed in this field in recent years. Various types of microphone arrays serve the purpose of sensing the incoming sound. This paper presents an overview of the importance of sound localization in different applications, along with the use and limitations of ad-hoc microphones compared with other microphones. Approaches to overcome these limitations are also presented. A detailed explanation is given of some of the existing methods in the recent literature that use microphone arrays for sound localization. Existing methods are studied in a comparative fashion, along with the factors that influence the choice of one method over the others. This review is done in order to form a basis for choosing the best-fit method for our use case.
RADIO: Reference-Agnostic Dubbing Video Synthesis
One of the most challenging problems in audio-driven talking head generation
is achieving high-fidelity detail while ensuring precise synchronization. Given
only a single reference image, extracting meaningful identity attributes
becomes even more challenging, often causing the network to mirror the facial
and lip structures too closely. To address these issues, we introduce RADIO, a
framework engineered to yield high-quality dubbed videos regardless of the pose
or expression in the reference image. The key is to modulate the decoder layers
using a latent space composed of audio and reference features. Additionally, we
incorporate ViT blocks into the decoder to emphasize high-fidelity details,
especially in the lip region. Our experimental results demonstrate that RADIO
achieves high synchronization without loss of fidelity. Especially in harsh
scenarios where the reference frame deviates significantly from the ground
truth, our method outperforms state-of-the-art methods, highlighting its
robustness. The pre-trained model and code will be made public after the review.
Comment: Under review
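As a rough illustration of modulating decoder layers with an audio-and-reference latent, the following sketch applies a FiLM-style scale-and-shift, derived from the concatenated latents, to a decoder feature map. The fusion by concatenation plus a linear layer and all dimensions are assumptions, not the RADIO implementation.

# Minimal sketch (not the RADIO code): FiLM-style modulation of a decoder
# feature map by a latent built from audio and reference features.
import torch
import torch.nn as nn

class ModulatedDecoderBlock(nn.Module):
    def __init__(self, channels=256, audio_dim=128, ref_dim=256):
        super().__init__()
        self.to_scale_shift = nn.Linear(audio_dim + ref_dim, 2 * channels)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, audio_lat, ref_lat):     # feat: (B, C, H, W)
        z = torch.cat([audio_lat, ref_lat], dim=-1)  # fused latent (B, A+R)
        scale, shift = self.to_scale_shift(z).chunk(2, dim=-1)
        feat = feat * (1 + scale[..., None, None]) + shift[..., None, None]
        return torch.relu(self.conv(feat))

block = ModulatedDecoderBlock()
out = block(torch.randn(2, 256, 32, 32), torch.randn(2, 128), torch.randn(2, 256))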
VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild
We present VideoReTalking, a new system to edit the faces of a real-world
talking head video according to input audio, producing a high-quality,
lip-synced output video even with a different emotion. Our system disentangles
this objective into three sequential tasks: (1) face video generation with a
canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for
improving photo-realism. Given a talking-head video, we first modify the
expression of each frame according to the same expression template using the
expression editing network, resulting in a video with the canonical expression.
This video, together with the given audio, is then fed into the lip-sync
network to generate a lip-syncing video. Finally, we improve the photo-realism
of the synthesized faces through an identity-aware face enhancement network and
post-processing. We use learning-based approaches for all three steps, and all
our modules can be run in a sequential pipeline without any user
intervention. Furthermore, our system is a generic approach that does not need
to be retrained for a specific person. Evaluations on two widely used datasets
and in-the-wild examples demonstrate the superiority of our framework over
other state-of-the-art methods in terms of lip-sync accuracy and visual
quality.
Comment: Accepted by SIGGRAPH Asia 2022 Conference Proceedings. Project page:
https://vinthony.github.io/video-retalking
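The three-stage structure can be summarized as a short pipeline sketch; the module objects below are hypothetical stand-ins, and only the ordering (expression canonicalization, then audio-driven lip-sync, then identity-aware enhancement) follows the abstract.

# Minimal sketch of the sequential pipeline, with placeholder callables
# rather than the released VideoReTalking modules.
def retalk(frames, audio, expression_editor, lip_sync_net, enhancer):
    # 1) normalize every frame to a canonical (neutral) expression
    canonical = [expression_editor(f) for f in frames]
    # 2) drive the mouth region from the input audio
    lip_synced = lip_sync_net(canonical, audio)
    # 3) restore photo-realism and identity details per frame
    return [enhancer(f) for f in lip_synced]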
Emergent leaders through looking and speaking: from audio-visual data to multimodal recognition
In this paper we present a multimodal analysis of emergent leadership in small groups using audio-visual features, and discuss our experience in designing and collecting a data corpus for this purpose. The ELEA Audio-Visual Synchronized corpus (ELEA AVS) was collected using a light portable setup and contains recordings of small group meetings. The participants in each group performed the winter survival task and filled in questionnaires related to personality and several social concepts such as leadership and dominance. In addition, the corpus includes annotations on participants' performance in the survival task, as well as annotations of social concepts from external viewers. Based on this corpus, we show the feasibility of predicting the emergent leader in small groups using automatically extracted audio and visual features based on speaking turns and visual attention, and we focus specifically on multimodal features that make use of the "looking at participants while speaking" and "looking at while not speaking" measures. Our findings indicate that emergent leadership is related, but not equivalent, to dominance, and while multimodal features bring a moderate degree of effectiveness in inferring the leader, much simpler features extracted from the audio channel are found to give better performance.
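One plausible reading of the gaze-and-speech measures is sketched below: counting, per participant, how much time they spend looking at other participants while speaking versus while silent, given frame-level speaking-status and visual-attention annotations. The array layout and this particular reading of the measures are assumptions, not the ELEA feature-extraction code.

# Minimal illustrative sketch of gaze-while-speaking features.
import numpy as np

def gaze_speaking_features(speaking, gaze):
    """
    speaking: (P, T) binary matrix, 1 if participant p speaks at frame t.
    gaze:     (P, P, T) binary tensor, 1 if participant p looks at q at frame t.
    Returns, per participant, frames spent looking at someone while speaking
    and while not speaking.
    """
    looking = gaze.any(axis=1)                        # (P, T): p looks at anyone
    look_while_speaking = (looking * speaking).sum(axis=1)
    look_while_silent = (looking * (1 - speaking)).sum(axis=1)
    return look_while_speaking, look_while_silent

P, T = 4, 1000
speaking = np.random.randint(0, 2, (P, T))
gaze = np.random.randint(0, 2, (P, P, T))
print(gaze_speaking_features(speaking, gaze))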
GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance
Although existing speech-driven talking face generation methods have achieved
significant progress, they are far from real-world application due to the
avatar-specific training demand and unstable lip movements. To address the
above issues, we propose GSmoothFace, a novel two-stage generalized talking
face generation model guided by a fine-grained 3D face model, which can
synthesize smooth lip dynamics while preserving the speaker's identity. Our
proposed GSmoothFace model mainly consists of the Audio to Expression
Prediction (A2EP) module and the Target Adaptive Face Translation (TAFT)
module. Specifically, we first develop the A2EP module to predict expression
parameters synchronized with the driving speech. It uses a transformer to
capture the long-term audio context and learns the parameters from the
fine-grained 3D facial vertices, resulting in accurate and smooth
lip-synchronization performance. Afterward, the well-designed TAFT module,
empowered by Morphology Augmented Face Blending (MAFB), takes the predicted
expression parameters and the target video as inputs to modify the facial region of
the target video without distorting the background content. The TAFT module
effectively exploits the identity appearance and background context in the
target video, which makes it possible to generalize to different speakers
without retraining. Both quantitative and qualitative experiments confirm the
superiority of our method in terms of realism, lip synchronization, and visual
quality. Code, data, and instructions for requesting pre-trained models are
available on the project page: https://zhanghm1995.github.io/GSmoothFace
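A minimal sketch of an A2EP-style predictor is given below, under the assumption of a plain transformer encoder over per-frame audio features that outputs 3DMM expression parameters; the feature dimensions and front end are illustrative, not the released GSmoothFace code.

# Minimal sketch: transformer over audio context -> expression parameters.
import torch
import torch.nn as nn

class A2EPSketch(nn.Module):
    def __init__(self, audio_dim=80, d_model=256, exp_dim=64, layers=4):
        super().__init__()
        self.proj = nn.Linear(audio_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(d_model, exp_dim)

    def forward(self, audio_feat):               # (B, T, audio_dim)
        h = self.encoder(self.proj(audio_feat))  # long-term audio context
        return self.head(h)                      # (B, T, exp_dim)

exp_params = A2EPSketch()(torch.randn(2, 120, 80))   # (2, 120, 64)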
Reprogramming Audio-driven Talking Face Synthesis into Text-driven
In this paper, we propose a method to reprogram pre-trained audio-driven
talking face synthesis models so that they can operate with text inputs. Since an
audio-driven talking face synthesis model takes speech audio as input, a speech
recording must be made in advance in order to generate a talking avatar with the
desired speech content. However, recording audio for every video to be generated
is burdensome. To alleviate this problem, we propose a novel method that embeds
input text into the learned audio latent space of the pre-trained audio-driven
model. To this end, we design a Text-to-Audio Embedding Module (TAEM) which is
guided to learn to map a given text input to the audio latent features. Moreover,
to model the speaker characteristics contained in the audio features, we propose
to inject a visual speaker embedding, obtained from a single face image, into the
TAEM. After training, we can synthesize talking face videos from either text or
speech audio.
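A rough sketch of a TAEM-like mapping is shown below: text features, conditioned on a visual speaker embedding, are projected into the latent space expected by a frozen audio-driven model and trained to match the frozen audio encoder's output. The encoders, dimensions, and the plain MSE matching loss are assumptions, not the paper's implementation.

# Minimal sketch: text + visual speaker embedding -> pseudo audio latents.
import torch
import torch.nn as nn

class TAEMSketch(nn.Module):
    def __init__(self, text_dim=512, spk_dim=256, audio_latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + spk_dim, 512), nn.ReLU(),
            nn.Linear(512, audio_latent_dim),
        )

    def forward(self, text_feat, speaker_emb):   # (B, T, D_t), (B, D_s)
        spk = speaker_emb.unsqueeze(1).expand(-1, text_feat.size(1), -1)
        return self.net(torch.cat([text_feat, spk], dim=-1))

taem = TAEMSketch()
text_feat = torch.randn(2, 50, 512)        # assumed text-encoder output
spk_emb = torch.randn(2, 256)              # embedding from one face image
pseudo_audio_latent = taem(text_feat, spk_emb)      # fed to the frozen model
target_audio_latent = torch.randn(2, 50, 512)       # frozen audio encoder output
loss = nn.functional.mse_loss(pseudo_audio_latent, target_audio_latent)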
Hi Sheldon! Creating Deep Personalized Characters from TV Shows
Imagine an interesting multimodal interactive scenario in which you can see,
hear, and chat with an AI-generated digital character who is capable of
behaving like Sheldon from The Big Bang Theory, as a DEEP copy from appearance
to personality. Towards this fantastic multimodal chatting scenario, we propose
a novel task, named Deep Personalized Character Creation (DPCC): creating
multimodal chat personalized characters from multimodal data such as TV shows.
Specifically, given a single- or multi-modality input (text, audio, video), the
goal of DPCC is to generate a multi-modality (text, audio, video) response that
is well matched to the personality of a specific character such as Sheldon, and
of high quality as well. To support this novel task, we further collect a
character-centric multimodal dialogue dataset, named Deep Personalized Character
Dataset (DPCD), from TV shows. DPCD contains character-specific multimodal
dialogue data of ~10k utterances and ~6 hours of audio/video per character,
which is around 10 times larger than existing related datasets. On DPCD, we
present a baseline method for the DPCC task and create 5 Deep personalized
digital Characters (DeepCharacters) from The Big Bang Theory. We conduct both
subjective and objective experiments to evaluate the multimodal responses from
DeepCharacters in terms of characterization and quality. The results
demonstrate that, on our collected DPCD dataset, the proposed baseline can
create personalized digital characters for generating multimodal responses. Our
collected DPCD dataset, the code for data collection, and our baseline will be
published soon.