Everybody Dance Now
This paper presents a simple method for "do as I do" motion transfer: given a
source video of a person dancing, we can transfer that performance to a novel
(amateur) target after only a few minutes of the target subject performing
standard moves. We approach this problem as video-to-video translation using
pose as an intermediate representation. To transfer the motion, we extract
poses from the source subject and apply the learned pose-to-appearance mapping
to generate the target subject. We predict two consecutive frames for
temporally coherent video results and introduce a separate pipeline for
realistic face synthesis. Although our method is quite simple, it produces
surprisingly compelling results (see video). This motivates us to also provide
a forensics tool for reliable synthetic content detection, which is able to
distinguish videos synthesized by our system from real data. In addition, we
release a first-of-its-kind open-source dataset of videos that can be legally
used for training and motion transfer.
Comment: In ICCV 2019.
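The pipeline the abstract describes reduces to a small inference loop: estimate the source subject's pose in each frame, render it as a pose image, and let a learned generator translate pairs of consecutive pose images into pairs of target-subject frames. The sketch below is an illustrative reconstruction under those assumptions; `estimate_pose`, `render_pose`, and the toy `Generator` are placeholder names, not the authors' actual components (the paper itself uses an off-the-shelf pose detector and a learned image-to-image translation network).

```python
# Minimal sketch of pose-guided motion transfer (illustrative, not the authors' code).
import torch

class Generator(torch.nn.Module):
    """Maps two stacked pose images to two consecutive target-person frames."""
    def __init__(self):
        super().__init__()
        # Placeholder for a learned image-to-image translation network.
        self.net = torch.nn.Conv2d(6, 6, kernel_size=3, padding=1)

    def forward(self, pose_pair):               # (B, 6, H, W): two RGB pose images
        return self.net(pose_pair)               # (B, 6, H, W): two output frames

def transfer_motion(source_frames, estimate_pose, render_pose, generator):
    """Generate target-subject frames two at a time from source-subject poses."""
    outputs = []
    for t in range(0, len(source_frames) - 1, 2):
        # Pose is the intermediate representation between source and target.
        poses = [render_pose(estimate_pose(source_frames[t + k])) for k in range(2)]
        pose_pair = torch.cat(poses, dim=1)       # stack the pair channel-wise
        frame_pair = generator(pose_pair)         # temporally coherent frame pair
        outputs.extend(frame_pair.chunk(2, dim=1))
    return outputs
```

Predicting the two frames jointly, as in the loop above, is what gives the generator a chance to keep consecutive outputs temporally consistent.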
Can Language Models Learn to Listen?
We present a framework for generating appropriate facial responses from a
listener in dyadic social interactions based on the speaker's words. Given an
input transcription of the speaker's words with their timestamps, our approach
autoregressively predicts the listener's response: a sequence of listener
facial gestures, quantized using a VQ-VAE. Since gesture is a language
component, we propose treating the quantized atomic motion elements as
additional language token inputs to a transformer-based large language model.
Initializing our transformer with the weights of a language model pre-trained
only on text results in significantly higher quality listener responses than
training a transformer from scratch. We show that our generated listener motion
is fluent and reflective of language semantics through quantitative metrics and
a qualitative user study. In our evaluation, we analyze the model's ability to
utilize temporal and semantic aspects of spoken text. Project page:
https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
Comment: In ICCV 2023.
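The core idea, treating quantized gesture codes as extra tokens appended to a language model's vocabulary, can be sketched in a few lines. The snippet below is an illustrative toy rather than the authors' implementation: the vocabulary sizes, the six-layer backbone, greedy decoding, and the omission of a causal attention mask are all assumptions made for brevity.

```python
# Sketch: VQ-VAE motion codes as additional "language" tokens (illustrative only).
import torch
import torch.nn as nn

TEXT_VOCAB = 32000        # size of the pretrained text LM vocabulary (assumed)
MOTION_CODES = 256        # number of entries in the gesture VQ-VAE codebook (assumed)

class ListenerLM(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        # One shared table: text token ids first, motion-code ids appended after them.
        self.embed = nn.Embedding(TEXT_VOCAB + MOTION_CODES, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Stand-in for the pretrained transformer; causal masking omitted for brevity.
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, MOTION_CODES)  # predict the next motion code

    def forward(self, tokens):                   # (B, T) mixed text + motion token ids
        h = self.backbone(self.embed(tokens))
        return self.head(h[:, -1])               # logits over the motion codebook

@torch.no_grad()
def generate_listener(model, speaker_tokens, steps=32):
    """Autoregressively append motion codes conditioned on the speaker's words."""
    seq = speaker_tokens.clone()
    for _ in range(steps):
        code = model(seq).argmax(dim=-1)                            # next gesture code
        seq = torch.cat([seq, code.unsqueeze(1) + TEXT_VOCAB], 1)   # offset into shared vocab
    # Return raw codebook indices, ready for a VQ-VAE decoder to turn into motion.
    return seq[:, speaker_tokens.shape[1]:] - TEXT_VOCAB
```

Because the gesture codes live in the same embedding table as the text tokens, a model initialized from text-only pretraining can be fine-tuned on the mixed sequences without architectural changes, which is the setup the abstract credits for the quality gain.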
Temporally Guided Music-to-Body-Movement Generation
This paper presents a neural network model that generates a virtual violinist's
3-D skeleton movements from music audio. Improving on the conventional
recurrent neural network models used to generate 2-D skeleton data in previous
work, the proposed model incorporates an encoder-decoder architecture, as well
as the self-attention mechanism to model the complicated dynamics in body
movement sequences. To facilitate the optimization of the self-attention model,
beat tracking is applied to determine effective sizes and boundaries of the
training examples. The decoder is accompanied by a refining network and a
bowing attack inference mechanism to emphasize the right-hand behavior and
bowing attack timing. Both objective and subjective evaluations reveal that the
proposed model outperforms the state-of-the-art methods. To the best of our
knowledge, this work represents the first attempt to generate a violinist's 3-D
body movements while considering key features of musical body movement.
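The beat-guided windowing of training examples is straightforward to picture in code. The sketch below uses librosa's off-the-shelf beat tracker as a stand-in for the paper's beat tracking step; the fixed number of beats per training example and the mel-spectrogram conditioning features are assumptions made for illustration.

```python
# Illustrative beat-aligned segmentation of audio features into training examples.
import librosa

def beat_aligned_segments(audio_path, beats_per_example=8, hop_length=512):
    y, sr = librosa.load(audio_path)
    # Audio features that would condition the movement decoder (assumed: mel spectrogram).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
    # Beat positions, in spectrogram frame indices (same hop length as the features).
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr, hop_length=hop_length)

    segments = []
    for i in range(0, len(beat_frames) - beats_per_example, beats_per_example):
        start, end = beat_frames[i], beat_frames[i + beats_per_example]
        segments.append(mel[:, start:end])  # one training example per beat window
    return segments
```

Cutting examples at beat boundaries keeps each training window musically coherent, which is the role the abstract assigns to beat tracking in stabilizing the self-attention model's optimization.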
Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns
The human visual system is highly adept at making use of the rich subtleties of the visual world such as non-verbal communication signals, style, emotion, and the fine-grained details of individuals. Computer vision systems, by contrast, excel at categorical tasks, such as classification and detection, where training often relies on single-word or simple bounding-box annotations. These simple annotations do not capture the richness of the visual world, which is often hard to describe in words or to localize in an image. Our current systems are thus left to make use of only the obvious, easily describable parts of the visual input. This dissertation investigates several initial directions toward modeling visual minutiae and endowing computer vision systems with rich perception.

Part I describes methods for learning directly from video data without the need for human-provided annotations. The section begins by discussing the use of multi-modal correlations between audio and motion for modeling conversational gestures, an essential part of human communication that is currently ignored by machine perception. The section then proposes a simple method for capturing the appearance details of individual people in motion, which can be used to implement a "do as I do" motion-transfer application.

Part II explores ways to discover temporal visual patterns in historical data. The section begins by discussing data-mining methods applied to a dataset of historical high school yearbook portraits in which fashion and behavioral styles change over time. The rest of the section proposes an unsupervised method for disentangling the time-varying visual factors from the permanent ones in a large dataset of urban scenes.

Part III discusses one possible avenue for testing whether our man-made systems have achieved human-like rich perception by comparing their performance to that of humans on a unique dataset of abstract art.