FaceLit: Neural 3D Relightable Faces
We propose a generative framework, FaceLit, capable of generating a 3D face
that can be rendered at various user-defined lighting conditions and views,
learned purely from in-the-wild 2D images without any manual annotation. Unlike
existing works that require a careful capture setup or human labor, we rely on
off-the-shelf pose and illumination estimators. With these estimates, we
incorporate the Phong reflectance model in the neural volume rendering
framework. Our model learns to generate the shape and material properties of a
face such that, when rendered according to the natural statistics of pose and
illumination, it produces photorealistic face images with multiview 3D and
illumination consistency. Our method enables photorealistic generation of faces
with explicit illumination and view controls on multiple datasets: FFHQ,
MetFaces, and CelebA-HQ. We show state-of-the-art photorealism among 3D-aware
GANs on the FFHQ dataset, achieving an FID score of 3.5.
Comment: CVPR 2023
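The Phong model the abstract names is the classic three-term reflectance. As a rough illustration (a minimal sketch, not the authors' code; all names here are assumptions), this is how it might be evaluated at a single shaded point inside a volume renderer:

    # Illustrative Phong shading at one point (all names hypothetical).
    # albedo, normal, light_dir, view_dir are (3,) numpy arrays; the
    # direction vectors are unit length and point away from the surface.
    import numpy as np

    def phong_shade(albedo, normal, light_dir, view_dir,
                    k_a=0.1, k_d=0.7, k_s=0.2, shininess=32.0):
        # Diffuse term: Lambertian cosine falloff toward the light.
        diffuse = max(np.dot(normal, light_dir), 0.0)
        # Specular term: mirror the light about the normal, compare to view.
        reflect = 2.0 * np.dot(normal, light_dir) * normal - light_dir
        specular = max(np.dot(reflect, view_dir), 0.0) ** shininess
        # Ambient and diffuse tint the albedo; specular adds a white highlight.
        return albedo * (k_a + k_d * diffuse) + k_s * specular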
HUGS: Human Gaussian Splats
Recent advances in neural rendering have improved both training and rendering
times by orders of magnitude. While these methods demonstrate state-of-the-art
quality and speed, they are designed for photogrammetry of static scenes and do
not generalize well to freely moving humans in the environment. In this work,
we introduce Human Gaussian Splats (HUGS), which represents an animatable human
together with the scene using 3D Gaussian Splatting (3DGS). Our method takes
only a monocular video with a small number of frames (50-100), and it
automatically learns to disentangle the static scene and a fully animatable
human avatar within 30 minutes. We utilize the SMPL body model to initialize
the human Gaussians. To capture details that are not modeled by SMPL (e.g.,
clothing and hair), we allow the 3D Gaussians to deviate from the human body model.
Utilizing 3D Gaussians for animated humans brings new challenges, including the
artifacts created when articulating the Gaussians. We propose to jointly
optimize the linear blend skinning weights to coordinate the movements of
individual Gaussians during animation. Our approach enables novel-pose
synthesis of the human and novel-view synthesis of both the human and the scene. We
achieve state-of-the-art rendering quality with a rendering speed of 60 FPS
while being ~100x faster to train than previous work. Our code will be
announced here: https://github.com/apple/ml-hugs
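The jointly optimized skinning the abstract mentions boils down to linear blend skinning applied to Gaussian centers. A minimal sketch, assuming per-Gaussian weights over the SMPL joints (names and shapes are our assumptions, not the HUGS code):

    # Illustrative LBS over Gaussian centers (not the HUGS implementation).
    import numpy as np

    def skin_gaussian_centers(centers, weights, joint_transforms):
        # centers:          (N, 3)    canonical Gaussian positions
        # weights:          (N, J)    learnable skinning weights, rows sum to 1
        # joint_transforms: (J, 4, 4) rigid transform of each joint for a pose
        homog = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
        # Blend per-joint transforms with the learned weights, then apply.
        blended = np.einsum('nj,jab->nab', weights, joint_transforms)
        posed = np.einsum('nab,nb->na', blended, homog)
        return posed[:, :3]

Optimizing the weights jointly with the Gaussian parameters is what lets neighboring Gaussians move coherently, rather than producing articulation artifacts.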
Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications
We consider the task of animating 3D facial geometry from a speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one
mapping from speech signals to 3D face meshes on small datasets with limited
speakers. While these models can achieve high-quality lip articulation for
speakers in the training set, they are unable to capture the full and diverse
distribution of 3D facial motions that accompany speech in the real world.
Importantly, the relationship between speech and facial motion is one-to-many,
containing both inter-speaker and intra-speaker variations and necessitating a
probabilistic approach. In this paper, we identify and address key challenges
that have so far limited the development of probabilistic models: lack of
datasets and metrics that are suitable for training and evaluating them, as
well as the difficulty of designing a model that generates diverse results
while remaining faithful to a strong conditioning signal such as speech. We first
propose large-scale benchmark datasets and metrics suitable for probabilistic
modeling. Then, we demonstrate a probabilistic model that achieves both
diversity and fidelity to speech, outperforming other methods across the
proposed benchmarks. Finally, we showcase useful applications of probabilistic
models trained on these large-scale datasets: we can generate diverse
speech-driven 3D facial motion that matches unseen speaker styles extracted
from reference clips; and our synthetic meshes can be used to improve the
performance of downstream audio-visual models.
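One generic way to quantify the diversity such a probabilistic model is judged on (a sketch of a common metric, not necessarily the benchmark's definition) is the average pairwise distance between motion samples drawn for the same speech clip:

    # Illustrative diversity score over K sampled motions (assumed metric).
    import numpy as np

    def sample_diversity(samples):
        # samples: (K, T, V, 3) array -- K sequences, T frames, V vertices.
        k = len(samples)
        dists = [np.linalg.norm(samples[i] - samples[j], axis=-1).mean()
                 for i in range(k) for j in range(i + 1, k)]
        return float(np.mean(dists))  # higher = more diverse (needs K >= 2)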
Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
Adapting generic speech recognition models to specific individuals is a
challenging problem due to the scarcity of personalized data. Recent works have
proposed boosting the amount of training data using personalized text-to-speech
synthesis. Here, we ask two fundamental questions about this strategy: when is
synthetic data effective for personalization, and why is it effective in those
cases? To address the first question, we adapt a state-of-the-art automatic
speech recognition (ASR) model to target speakers from four benchmark datasets
representative of different speaker types. We show that ASR personalization
with synthetic data is effective in all cases, but particularly when (i) the
target speaker is underrepresented in the global data, and (ii) the capacity of
the global model is limited. To address the second question of why personalized
synthetic data is effective, we use controllable speech synthesis to generate
speech with varied styles and content. Surprisingly, we find that the text
content of the synthetic data, rather than style, is important for speaker
adaptation. These results lead us to propose a data selection strategy for ASR
personalization based on speech content.
Comment: ICASSP 2023
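A content-based selection strategy could look like the greedy sketch below (the coverage criterion is our assumption; the abstract only states that selection is based on speech content): pick synthetic utterances whose transcripts add the most words not yet covered.

    # Illustrative greedy content-coverage selection (assumed criterion).
    def select_by_content(candidates, budget):
        # candidates: list of (text, audio) synthetic pairs; budget: max picks.
        covered, selected = set(), []
        for _ in range(budget):
            best = max(candidates, default=None,
                       key=lambda c: len(set(c[0].lower().split()) - covered))
            if best is None:
                break
            covered |= set(best[0].lower().split())
            selected.append(best)
            candidates = [c for c in candidates if c is not best]
        return selected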
Novel-View Acoustic Synthesis from 3D Reconstructed Rooms
We investigate the benefit of combining blind audio recordings with 3D scene
information for novel-view acoustic synthesis. Given audio recordings from 2-4
microphones and the 3D geometry and material of a scene containing multiple
unknown sound sources, we estimate the sound anywhere in the scene. We identify
the main challenges of novel-view acoustic synthesis as sound source
localization, separation, and dereverberation. While naively training an
end-to-end network fails to produce high-quality results, we show that
incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms
enables the same network to jointly tackle these tasks. Our method outperforms
existing methods designed for the individual tasks, demonstrating its
effectiveness at utilizing 3D visual information. In a simulated study on the
Matterport3D-NVAS dataset, our model achieves near-perfect accuracy on source
localization, a PSNR of 26.44 dB and an SDR of 14.23 dB for source separation
and dereverberation, resulting in a PSNR of 25.55 dB and an SDR of 14.20 dB on
novel-view acoustic synthesis. Code, pretrained model, and video results are
available on the project webpage (https://github.com/apple/ml-nvas3d).
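Once an RIR from each source to the novel listener position is in hand, rendering the sound there reduces to convolution, roughly as below (a sketch under our assumptions, not the released model's API; equal-length signals per list are assumed):

    # Illustrative RIR-based rendering at a novel viewpoint (assumptions only).
    import numpy as np
    from scipy.signal import fftconvolve

    def render_at_listener(dry_sources, rirs):
        # dry_sources: list of (n,) separated, dereverberated source signals
        # rirs:        list of (m,) impulse responses, one per source
        out = sum(fftconvolve(s, h) for s, h in zip(dry_sources, rirs))
        return out / (np.abs(out).max() + 1e-8)  # simple peak normalization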
Corpus Synthesis for Zero-shot ASR Domain Adaptation using Large Language Models
While Automatic Speech Recognition (ASR) systems are widely used in many
real-world applications, they often do not generalize well to new domains and
need to be finetuned on data from these domains. However, target-domain data
is usually not readily available in many scenarios. In this paper, we propose
a new strategy for adapting ASR models to new target domains without any text
or speech from those domains. To accomplish this, we propose a novel data
synthesis pipeline that uses a Large Language Model (LLM) to generate a target
domain text corpus, and a state-of-the-art controllable speech synthesis model
to generate the corresponding speech. We propose a simple yet effective
in-context instruction finetuning strategy to increase the effectiveness of the LLM
in generating text corpora for new domains. Experiments on the SLURP dataset
show that the proposed method achieves an average relative word error rate
improvement on unseen target domains without any performance drop in
source domains.
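The overall pipeline shape, as we read the abstract, is the three-step loop below; every object and method here is a hypothetical placeholder, not the paper's code:

    # Illustrative pipeline sketch; llm, tts, and asr are placeholder objects.
    def adapt_asr_zero_shot(llm, tts, asr, domain_prompt, n_utterances):
        texts = [llm.generate(domain_prompt) for _ in range(n_utterances)]
        speech = [tts.synthesize(t) for t in texts]   # paired synthetic audio
        asr.finetune(list(zip(speech, texts)))        # adapt with no real data
        return asr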
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models
Controllable generative sequence models with the capability to extract and
replicate the style of specific examples enable many applications, including
narrating audiobooks in different voices, auto-completing and auto-correcting
written handwriting, and generating missing training samples for downstream
recognition tasks. However, in an unsupervised setting where style labels are
unavailable, typical training algorithms for controllable generative sequence
models suffer from a training-inference mismatch: the same sample is used as
both the content and style input during training, but unpaired samples are
given during inference. In this
paper, we tackle the training-inference mismatch encountered during
unsupervised learning of controllable generative sequence models. The proposed
method is simple yet effective: we use a style transformation module to
transfer target style information into an unrelated style input. This method
enables training using unpaired content and style samples and thereby mitigates
the training-inference mismatch. We apply style equalization to text-to-speech
and text-to-handwriting synthesis on three datasets. We conduct a thorough
evaluation, including both quantitative and qualitative user studies. Our
results show that by mitigating the training-inference mismatch with the
proposed style equalization, we achieve style replication scores comparable to
real data in our user studies.
Comment: ICML 2022
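To make the idea concrete, here is one loose rendering of a style transformation module (our sketch, not the authors' architecture): it maps an unrelated style example toward the target style, so the generator never needs paired content/style inputs during training.

    # Illustrative style transformation module (assumed architecture).
    import torch
    import torch.nn as nn

    class StyleTransform(nn.Module):
        def __init__(self, style_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2 * style_dim, style_dim),
                                     nn.ReLU(),
                                     nn.Linear(style_dim, style_dim))

        def forward(self, unrelated_style, target_style):
            # Predict a style code carrying the target's style, while the
            # generator only ever conditions on unpaired style inputs.
            return self.net(torch.cat([unrelated_style, target_style], dim=-1))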