Everybody Dance Now
This paper presents a simple method for "do as I do" motion transfer: given a
source video of a person dancing, we can transfer that performance to a novel
(amateur) target after only a few minutes of the target subject performing
standard moves. We approach this problem as video-to-video translation using
pose as an intermediate representation. To transfer the motion, we extract
poses from the source subject and apply the learned pose-to-appearance mapping
to generate the target subject. We predict two consecutive frames for
temporally coherent video results and introduce a separate pipeline for
realistic face synthesis. Although our method is quite simple, it produces
surprisingly compelling results (see video). This motivates us to also provide
a forensics tool for reliable synthetic content detection, which is able to
distinguish videos synthesized by our system from real data. In addition, we
release a first-of-its-kind open-source dataset of videos that can be legally
used for training and motion transfer.
Comment: In ICCV 2019.
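The pipeline the abstract describes reduces to a small inference loop: estimate the source subject's pose in each frame, render it as a pose image, and let a learned generator translate pairs of consecutive pose images into pairs of target-subject frames. The sketch below is an illustrative reconstruction under those assumptions; `estimate_pose`, `render_pose`, and the toy `Generator` are placeholder names, not the authors' actual components (the paper itself uses an off-the-shelf pose detector and a learned image-to-image translation network).

```python
# Minimal sketch of pose-guided motion transfer (illustrative, not the authors' code).
import torch

class Generator(torch.nn.Module):
    """Maps two stacked pose images to two consecutive target-person frames."""
    def __init__(self):
        super().__init__()
        # Placeholder for a learned image-to-image translation network.
        self.net = torch.nn.Conv2d(6, 6, kernel_size=3, padding=1)

    def forward(self, pose_pair):               # (B, 6, H, W): two RGB pose images
        return self.net(pose_pair)               # (B, 6, H, W): two output frames

def transfer_motion(source_frames, estimate_pose, render_pose, generator):
    """Generate target-subject frames two at a time from source-subject poses."""
    outputs = []
    for t in range(0, len(source_frames) - 1, 2):
        # Pose is the intermediate representation between source and target.
        poses = [render_pose(estimate_pose(source_frames[t + k])) for k in range(2)]
        pose_pair = torch.cat(poses, dim=1)       # stack the pair channel-wise
        frame_pair = generator(pose_pair)         # temporally coherent frame pair
        outputs.extend(frame_pair.chunk(2, dim=1))
    return outputs
```

Predicting the two frames jointly, as in the loop above, is what gives the generator a chance to keep consecutive outputs temporally consistent.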
Can Language Models Learn to Listen?
We present a framework for generating appropriate facial responses from a
listener in dyadic social interactions based on the speaker's words. Given an
input transcription of the speaker's words with their timestamps, our approach
autoregressively predicts the listener's response: a sequence of listener
facial gestures, quantized using a VQ-VAE. Since gesture is a language
component, we propose treating the quantized atomic motion elements as
additional language token inputs to a transformer-based large language model.
Initializing our transformer with the weights of a language model pre-trained
only on text results in significantly higher quality listener responses than
training a transformer from scratch. We show that our generated listener motion
is fluent and reflective of language semantics through quantitative metrics and
a qualitative user study. In our evaluation, we analyze the model's ability to
utilize temporal and semantic aspects of spoken text. Project page:
https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
Comment: In ICCV 2023.
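The core idea, treating quantized gesture codes as extra tokens appended to a language model's vocabulary, can be sketched in a few lines. The snippet below is an illustrative toy rather than the authors' implementation: the vocabulary sizes, the six-layer backbone, greedy decoding, and the omission of a causal attention mask are all assumptions made for brevity.

```python
# Sketch: VQ-VAE motion codes as additional "language" tokens (illustrative only).
import torch
import torch.nn as nn

TEXT_VOCAB = 32000        # size of the pretrained text LM vocabulary (assumed)
MOTION_CODES = 256        # number of entries in the gesture VQ-VAE codebook (assumed)

class ListenerLM(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        # One shared table: text token ids first, motion-code ids appended after them.
        self.embed = nn.Embedding(TEXT_VOCAB + MOTION_CODES, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Stand-in for the pretrained transformer; causal masking omitted for brevity.
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, MOTION_CODES)  # predict the next motion code

    def forward(self, tokens):                   # (B, T) mixed text + motion token ids
        h = self.backbone(self.embed(tokens))
        return self.head(h[:, -1])               # logits over the motion codebook

@torch.no_grad()
def generate_listener(model, speaker_tokens, steps=32):
    """Autoregressively append motion codes conditioned on the speaker's words."""
    seq = speaker_tokens.clone()
    for _ in range(steps):
        code = model(seq).argmax(dim=-1)                            # next gesture code
        seq = torch.cat([seq, code.unsqueeze(1) + TEXT_VOCAB], 1)   # offset into shared vocab
    # Return raw codebook indices, ready for a VQ-VAE decoder to turn into motion.
    return seq[:, speaker_tokens.shape[1]:] - TEXT_VOCAB
```

Because the gesture codes live in the same embedding table as the text tokens, a model initialized from text-only pretraining can be fine-tuned on the mixed sequences without architectural changes, which is the setup the abstract credits for the quality gain.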
Temporally Guided Music-to-Body-Movement Generation
This paper presents a neural network model that generates a virtual violinist's
3-D skeleton movements from music audio. Improving on the conventional
recurrent neural network models used to generate 2-D skeleton data in previous
work, the proposed model incorporates an encoder-decoder architecture, as well
as the self-attention mechanism to model the complicated dynamics in body
movement sequences. To facilitate the optimization of the self-attention model,
beat tracking is applied to determine effective sizes and boundaries of the
training examples. The decoder is accompanied by a refining network and a
bowing attack inference mechanism to emphasize the right-hand behavior and
bowing attack timing. Both objective and subjective evaluations reveal that the
proposed model outperforms the state-of-the-art methods. To the best of our
knowledge, this work represents the first attempt to generate a violinist's 3-D
body movements while considering key features of musical body movement.
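The beat-guided windowing of training examples is straightforward to picture in code. The sketch below uses librosa's off-the-shelf beat tracker as a stand-in for the paper's beat tracking step; the fixed number of beats per training example and the mel-spectrogram conditioning features are assumptions made for illustration.

```python
# Illustrative beat-aligned segmentation of audio features into training examples.
import librosa

def beat_aligned_segments(audio_path, beats_per_example=8, hop_length=512):
    y, sr = librosa.load(audio_path)
    # Audio features that would condition the movement decoder (assumed: mel spectrogram).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
    # Beat positions, in spectrogram frame indices (same hop length as the features).
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop_length)

    segments = []
    for i in range(0, len(beat_frames) - beats_per_example, beats_per_example):
        start, end = beat_frames[i], beat_frames[i + beats_per_example]
        segments.append(mel[:, start:end])  # one training example per beat window
    return segments
```

Cutting examples at beat boundaries keeps each training window musically coherent, which is the role the abstract assigns to beat tracking in stabilizing the self-attention model's optimization.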
Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns
The human visual system is highly adept at making use of the rich subtleties of the visual world such as non-verbal communication signals, style, emotion, and the fine-grained details of individuals. Computer vision systems, by contrast, excel at categorical tasks, such as classification and detection, where training often relies on single-word or simple bounding-box annotations. These simple annotations do not capture the richness of the visual world, which is often hard to describe in words or to localize in an image. Our current systems are thus left to make use of only the obvious, easily describable parts of the visual input. This dissertation investigates several initial directions toward modeling visual minutiae and endowing computer vision systems with rich perception.

Part I describes methods for learning directly from video data without the need for human-provided annotations. The section begins by discussing the use of multi-modal correlations between audio and motion for modeling conversational gestures, an essential part of human communication that is currently ignored by machine perception. The section then proposes a simple method for capturing the appearance details of individual people in motion, which can be used to implement a "do as I do" motion-transfer application.

Part II explores ways to discover temporal visual patterns in historical data. The section begins by discussing data-mining methods applied to a dataset of historical high school yearbook portraits in which fashion and behavioral styles change over time. The rest of the section proposes an unsupervised method for disentangling the time-varying visual factors from the permanent ones in a large dataset of urban scenes.

Part III discusses one possible avenue for testing whether our man-made systems have achieved human-like rich perception by comparing their performance to that of humans on a unique dataset of abstract art.