64 research outputs found
Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks
Given an audio clip and a reference face image, the goal of talking head
generation is to synthesize a high-fidelity talking head video. Although
audio-driven talking head generation has made progress, most existing methods
focus only on lip-audio synchronization and cannot reproduce the facial
expressions of the target person. To this end, we propose a talking head
generation model consisting of a Memory-Sharing Emotion Feature extractor
(MSEF) and an Attention-Augmented Translator based on U-Net (AATU). First,
MSEF extracts implicit emotional auxiliary features from audio to estimate
more accurate emotional facial landmarks. Second, AATU acts as a translator
between the estimated landmarks and photo-realistic video frames. Extensive
qualitative and quantitative experiments show the superiority of the proposed
method over previous works. Code will be made publicly available.
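The abstract names only the two modules, so the sketch below is a guess at one plausible shape for the pipeline: MSEF attends over a shared memory bank to turn audio features into emotional landmark estimates, and AATU is a small attention-augmented U-Net that renders frames from a landmark map plus a reference image. All layer sizes, the memory design, and the landmark rasterization are assumptions, not the paper's architecture.

```python
# Minimal PyTorch sketch of the two-stage pipeline described in the abstract.
# Module internals are assumptions; the paper specifies only the module names.
import torch
import torch.nn as nn

class MSEF(nn.Module):
    """Memory-Sharing Emotion Feature extractor (hypothetical internals):
    maps audio features to emotion-aware facial landmark estimates."""
    def __init__(self, audio_dim=80, mem_slots=32, feat_dim=128, n_landmarks=68):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, feat_dim, batch_first=True)
        # Learned memory bank shared across samples (assumed design).
        self.memory = nn.Parameter(torch.randn(mem_slots, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.to_landmarks = nn.Linear(feat_dim, n_landmarks * 2)

    def forward(self, audio):                      # audio: (B, T, audio_dim)
        h, _ = self.encoder(audio)                 # (B, T, feat_dim)
        mem = self.memory.unsqueeze(0).expand(h.size(0), -1, -1)
        emo, _ = self.attn(h, mem, mem)            # audio queries shared memory
        return self.to_landmarks(h + emo)          # (B, T, 2 * n_landmarks)

class AATU(nn.Module):
    """Attention-augmented U-Net translator (hypothetical internals):
    renders a frame from a reference image plus a rasterized landmark map."""
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.ReLU())
        self.attn = nn.MultiheadAttention(base * 2, num_heads=4, batch_first=True)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(base * 2, 3, 4, 2, 1)   # skip-concat input

    def forward(self, ref_rgb, landmark_map):      # (B,3,H,W), (B,1,H,W)
        x = torch.cat([ref_rgb, landmark_map], dim=1)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        b, c, h, w = d2.shape
        flat = d2.flatten(2).transpose(1, 2)       # (B, h*w, c) token sequence
        a, _ = self.attn(flat, flat, flat)         # self-attention at bottleneck
        d2 = (flat + a).transpose(1, 2).view(b, c, h, w)
        u1 = self.up1(d2)
        return torch.tanh(self.up2(torch.cat([u1, d1], dim=1)))
```

At inference time, one would rasterize MSEF's per-frame landmark predictions into single-channel maps and call AATU once per frame to produce the output video.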
AudioViewer: Learning to Visualize Sounds
A long-standing goal in the field of sensory substitution is to enable sound
perception for deaf and hard of hearing (DHH) people by visualizing audio
content. Unlike existing models that translate to sign language, between
speech and text, or between text and images, we target immediate, low-level
audio-to-video translation that applies to generic environmental sounds as
well as human speech. Since such a substitution is artificial and no labels
exist for supervised learning, our core contribution is to build a mapping
from audio to video that learns from unpaired examples via high-level
constraints. For speech, we additionally disentangle content from style, such
as gender and dialect. Qualitative and quantitative results, including a human
study, demonstrate that our unpaired translation approach preserves important
audio features in the generated video, and that videos of faces and numbers
are well suited for visualizing high-dimensional audio features that humans
can parse to match and distinguish sounds and words. Code and models are
available at https://chunjinsong.github.io/audioviewe
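The abstract specifies only the learning principle (unpaired examples, high-level constraints), so the following is a toy sketch of one way such a mapping can be trained: a latent cycle term keeps audio structure recoverable from the generated frames, and a moment-matching term stands in for whatever distribution constraint the paper actually uses. Every module, loss weight, and resolution here is hypothetical.

```python
# Toy sketch of unpaired audio-to-video mapping via high-level constraints.
# None of these modules or losses come from the paper; they illustrate the
# general idea of training a cross-modal mapping without paired labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 64

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.GRU(n_mels, LATENT, batch_first=True)
    def forward(self, mel):                        # (B, T, n_mels)
        _, h = self.net(mel)
        return h.squeeze(0)                        # (B, LATENT)

class VideoDecoder(nn.Module):
    """Decodes a latent code into one small frame (toy 32x32 resolution)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT, 256), nn.ReLU(),
            nn.Linear(256, 3 * 32 * 32), nn.Tanh())
    def forward(self, z):
        return self.net(z).view(-1, 3, 32, 32)

class VideoEncoder(nn.Module):
    """Maps a frame back to the latent space, enabling the cycle constraint."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
            nn.Linear(256, LATENT))
    def forward(self, frame):
        return self.net(frame)

enc_a, dec_v, enc_v = AudioEncoder(), VideoDecoder(), VideoEncoder()
opt = torch.optim.Adam(
    list(enc_a.parameters()) + list(dec_v.parameters()) + list(enc_v.parameters()),
    lr=1e-4)

def training_step(mel, real_frames):
    """mel and real_frames are UNPAIRED samples from the two domains."""
    z_a = enc_a(mel)
    fake = dec_v(z_a)
    # Cycle constraint: the generated frame must encode back to its source
    # latent, so audio structure survives the translation. Detaching z_a
    # stops the audio encoder from collapsing to satisfy this term trivially.
    loss_cycle = F.mse_loss(enc_v(fake), z_a.detach())
    # Distribution constraint (batch moment matching as a simple stand-in
    # for an adversarial critic): fakes should share statistics with reals.
    loss_dist = (F.mse_loss(fake.mean(0), real_frames.mean(0))
                 + F.mse_loss(fake.std(0), real_frames.std(0)))
    loss = loss_cycle + 0.1 * loss_dist
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The key design point the abstract implies is that no audio-video pairs are ever required: each loss term only compares a sample to its own round-trip encoding or to aggregate statistics of the other domain.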
- …