Animating Through Warping: an Efficient Method for High-Quality Facial Expression Animation
Advances in deep neural networks have considerably improved the art of
animating a still image without operating in the 3D domain. However, prior
works can only animate small images (typically no larger than 512x512) owing
to memory limitations, the difficulty of training, and the lack of
high-resolution (HD) training datasets, which significantly reduces their
potential for applications in movie production and interactive systems.
Motivated by the idea that HD images can be generated by adding high-frequency
residuals to low-resolution results produced by a neural network, we propose a
novel framework, Animating Through Warping (ATW), to enable efficient
animation of HD images.
Specifically, the proposed framework consists of two modules: a novel
two-stage neural-network generator and a novel post-processing module known as
ResWarp. The framework only requires the generator to be trained on small
images, yet it can perform inference on an image of any size. During
inference, an HD input image is decomposed into a low-resolution component
(128x128) and its corresponding high-frequency residuals. The generator
predicts the low-resolution result as well as the motion field that warps the
input face to the desired state (e.g., expression categories or action units).
Finally, the ResWarp module warps the residuals based on the motion field and
adds the warped residuals to the naively up-sampled low-resolution result to
generate the final HD result. Experiments show the effectiveness and
efficiency of our method in generating high-resolution animations. Our
proposed framework successfully animates a 4K facial image, which has never
been achieved by prior neural models. In addition, our method generally
guarantees the temporal coherency of the generated animations. Source code
will be made publicly available.
Comment: 18 pages, 13 figures, Accepted to ACM Multimedia 202
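The recombination step above is compact enough to sketch end to end. The
snippet below is a minimal illustration of the decompose/animate/recompose
flow, not the authors' released code: generator is a hypothetical stand-in
for the two-stage network and is assumed to return the animated 128x128 face
together with a dense motion field, and the warping uses generic OpenCV
resampling.

    # Hedged sketch of the ATW pipeline: decompose an HD frame, animate at
    # 128x128, then warp and re-add the high-frequency residuals.
    import cv2
    import numpy as np

    LOW_RES = 128

    def animate_hd(hd_image, target_expression, generator):
        h, w = hd_image.shape[:2]

        # 1. Decompose: low-resolution component + high-frequency residual.
        low = cv2.resize(hd_image, (LOW_RES, LOW_RES), interpolation=cv2.INTER_AREA)
        low_up = cv2.resize(low, (w, h), interpolation=cv2.INTER_LINEAR)
        residual = hd_image.astype(np.float32) - low_up.astype(np.float32)

        # 2. The generator (trained only on small images) predicts the animated
        #    low-res face and a motion field, both at 128x128.
        low_animated, flow_low = generator(low, target_expression)

        # 3. Upsample both to HD; scale the flow vectors with the resolution.
        out_up = cv2.resize(low_animated, (w, h), interpolation=cv2.INTER_LINEAR)
        flow = cv2.resize(flow_low, (w, h), interpolation=cv2.INTER_LINEAR)
        flow[..., 0] *= w / LOW_RES
        flow[..., 1] *= h / LOW_RES

        # 4. ResWarp: warp the residuals with the motion field and add them to
        #    the naively up-sampled low-resolution result.
        grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
        map_x = (grid_x + flow[..., 0]).astype(np.float32)
        map_y = (grid_y + flow[..., 1]).astype(np.float32)
        warped_residual = cv2.remap(residual, map_x, map_y, cv2.INTER_LINEAR)

        out = out_up.astype(np.float32) + warped_residual
        return np.clip(out, 0, 255).astype(np.uint8)

Only step 2 touches a neural network, which is why the network never has to
process anything larger than 128x128 even when the input is a 4K frame.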
Automatic Animation of Hair Blowing in Still Portrait Photos
We propose a novel approach to animate human hair in a still portrait photo.
Existing work has largely studied the animation of fluid elements such as
water and fire, whereas hair animation for a real image remains underexplored;
it is a challenging problem due to the high complexity of hair structure and
dynamics. Considering this complexity, we innovatively treat hair wisp
extraction as an instance segmentation problem, where each hair wisp is
regarded as an instance. With advanced instance segmentation networks, our
method extracts meaningful and natural hair wisps. Furthermore, we propose a
wisp-aware animation module that animates hair wisps with pleasing motions and
without noticeable artifacts. Extensive experiments show the superiority of
our method: it provides the most pleasing and compelling viewing experience in
the qualitative experiments and outperforms state-of-the-art still-image
animation methods by a large margin in the quantitative evaluation.
Project URL: https://nevergiveu.github.io/AutomaticHairBlowing/
Comment: Accepted to ICCV 202
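As a rough illustration of the wisps-as-instances formulation (not the
authors' model or training data), the sketch below runs a generic,
COCO-pretrained Mask R-CNN from torchvision and keeps the per-instance masks;
in the actual method a network trained on hair-wisp annotations would play
this role.

    # Hedged sketch: treat each detected instance mask as a "wisp" candidate.
    # A COCO-pretrained model is only a placeholder for a wisp-trained network.
    import torch
    import torchvision

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    def extract_instance_masks(portrait_rgb, score_thresh=0.5):
        """portrait_rgb: HxWx3 uint8 array; returns a list of boolean masks."""
        x = torch.from_numpy(portrait_rgb).permute(2, 0, 1).float() / 255.0
        with torch.no_grad():
            out = model([x])[0]
        masks = []
        for mask, score in zip(out["masks"], out["scores"]):
            if score >= score_thresh:
                masks.append((mask[0] > 0.5).cpu().numpy())
        return masks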
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation), takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.
Comment: To appear in CVPR 201
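The subject-label conditioning described above can be illustrated with a
deliberately simplified, hypothetical module (layer sizes and feature
dimensions are made up for the sketch and are not VOCA's architecture):
per-frame speech features are concatenated with a one-hot subject label and
mapped to per-vertex offsets added to a neutral template mesh, so swapping the
label at test time changes the speaking style.

    # Hedged sketch of audio-driven vertex offsets with one-hot style conditioning.
    import torch
    import torch.nn as nn

    class SpeechDrivenFace(nn.Module):
        def __init__(self, audio_dim=29, num_subjects=12, num_vertices=5023):
            super().__init__()
            self.num_vertices = num_vertices
            self.net = nn.Sequential(
                nn.Linear(audio_dim + num_subjects, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, num_vertices * 3),
            )

        def forward(self, audio_feat, subject_onehot, template_vertices):
            # Concatenating the subject label lets the network learn per-speaker
            # styles; a different one-hot vector at test time alters the style.
            x = torch.cat([audio_feat, subject_onehot], dim=-1)
            offsets = self.net(x).view(-1, self.num_vertices, 3)
            return template_vertices.unsqueeze(0) + offsets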
Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers
We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data, and performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g., a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than that of sequences generated by our synthesizers. This observation motivates further consideration of an often-ignored issue: to what extent are subjective measures correlated with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicators of viewer-perceived quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
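The objective measure advocated at the end of the abstract can be sketched as
a standard dynamic time warp cost between synthesized and ground-truth AAM
parameter trajectories (arrays of shape frames x parameters). The snippet
below uses a generic Euclidean frame distance and is only an approximation of
the authors' exact setup.

    # Hedged sketch: DTW alignment cost between two parameter trajectories.
    import numpy as np

    def dtw_cost(synth, truth):
        """synth: (n, d) array, truth: (m, d) array; returns accumulated cost."""
        n, m = len(synth), len(truth)
        # Pairwise Euclidean distances between parameter vectors of both sequences.
        dist = np.linalg.norm(synth[:, None, :] - truth[None, :, :], axis=-1)
        acc = np.full((n + 1, m + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                     acc[i, j - 1],      # deletion
                                                     acc[i - 1, j - 1])  # match
        return acc[n, m]

A lower warp cost means the synthesized trajectory can be aligned to the
ground truth with little distortion, which the study finds is a better
indicator of subjective quality than the most commonly used objective
measures.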
Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image
We study the problem of synthesizing a long-term dynamic video from only a
single image. This is challenging since it requires consistent visual content
movements given large camera motions. Existing methods either hallucinate
inconsistent perpetual views or struggle with long camera trajectories. To
address these issues, it is essential to estimate the underlying 4D (including
3D geometry and scene motion) and fill in the occluded regions. To this end, we
present Make-It-4D, a novel method that can generate a consistent long-term
dynamic video from a single image. On the one hand, we utilize layered depth
images (LDIs) to represent the scene, which are then unprojected to form a
feature point cloud. To animate the visual content, the feature point cloud is
displaced based on the scene flow derived from motion estimation and the
corresponding camera pose. Such a 4D representation enables our method to
maintain the global consistency of the generated dynamic video. On the other
hand, we fill in the occluded regions by using a pretrained diffusion model to
inpaint and outpaint the input image. This enables our method to work under
large camera motions. Benefiting from our design, our method is training-free,
which saves a significant amount of training time. Experimental results
demonstrate the effectiveness of our approach, which showcases compelling
rendering results.
Comment: accepted by ACM MM'2
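A minimal sketch of the 4D representation follows, under stated assumptions: a
depth map (one LDI layer) is unprojected with pinhole intrinsics K into a
point cloud, displaced by an estimated scene flow, and re-projected under a
new camera pose (R, t). Feature splatting/rendering and the diffusion-based
inpainting of disocclusions are omitted here.

    # Hedged sketch: unproject a depth layer, displace it, re-project it.
    import numpy as np

    def unproject(depth, K):
        """depth: HxW array, K: 3x3 intrinsics; returns (H*W, 3) camera-space points."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth.reshape(-1)
        x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
        y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
        return np.stack([x, y, z], axis=1)

    def reproject(points, scene_flow, R, t, K):
        """Displace points by the scene flow, then project into the new camera."""
        moved = points + scene_flow          # animate the visual content
        cam = moved @ R.T + t                # transform into the new camera frame
        uv = cam @ K.T
        return uv[:, :2] / uv[:, 2:3]        # pixel coordinates in the novel view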