Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e., localizing a target
object in a visual scene given a language description. Humans perceive
the world more as continuous video snippets than as static images, and describe
objects not only by their appearance, but also by their spatio-temporal context
and motion features. Humans also gaze at the object when they issue a referring
expression. Existing works on OR mostly focus on static images, which
fall short of providing many such cues. This paper addresses OR in videos with
language and human gaze. To that end, we present a new video dataset for OR,
with 30,000 objects over 5,000 stereo video sequences annotated with their
descriptions and gaze. We further propose a novel network model for OR in
videos, integrating appearance, motion, gaze, and spatio-temporal context
into one network. Experimental results show that our method effectively
utilizes motion cues, human gaze, and spatio-temporal context, and
outperforms previous OR methods. For the dataset and code, please refer to
https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
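A minimal sketch of the fusion idea this abstract describes, assuming per-candidate feature vectors matched against a language embedding; all module names, dimensions, and the cosine-matching head are assumptions, not the authors' architecture:

```python
# A minimal sketch (not the authors' code) of the fusion idea: per-candidate
# appearance, motion, gaze, and spatio-temporal context features are matched
# against a language embedding. All dimensions and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCueORScorer(nn.Module):
    def __init__(self, d_app=512, d_mot=256, d_gaze=64, d_ctx=256,
                 d_lang=300, d_joint=512):
        super().__init__()
        self.vis_proj = nn.Linear(d_app + d_mot + d_gaze + d_ctx, d_joint)
        self.lang_proj = nn.Linear(d_lang, d_joint)

    def forward(self, app, mot, gaze, ctx, lang):
        # app/mot/gaze/ctx: (B, N, d_*) cues for N candidate objects
        # lang: (B, d_lang) embedding of the referring expression
        v = torch.tanh(self.vis_proj(torch.cat([app, mot, gaze, ctx], dim=-1)))
        l = torch.tanh(self.lang_proj(lang)).unsqueeze(1)  # (B, 1, d_joint)
        # cosine similarity per candidate; argmax picks the referred object
        return (F.normalize(v, dim=-1) * F.normalize(l, dim=-1)).sum(-1)
```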
Modeling of Performance Creative Evaluation Driven by Multimodal Affective Data
Performance creative evaluation can be achieved through affective data, and using affective features to evaluate creative performance is a new research trend. This paper proposes a "Performance Creative—Multimodal Affective (PC-MulAff)" model based on multimodal affective features for performance creative evaluation. Multimedia acquisition equipment is used to collect physiological data from the audience, including multimodal affective data such as facial expression, heart rate, and eye movement. Affective features are computed from the multimodal data and combined with director annotations, and a "Performance Creative—Affective Acceptance (PC-Acc)" measure is defined on these features to evaluate the quality of the creative performance. We verify the PC-MulAff model on several performance datasets. The experimental results show that the PC-MulAff model achieves high evaluation quality across different performance forms. In the creative evaluation of dance performance, the accuracy of the model is 7.44% and 13.95% higher than that of single-text and single-video evaluation, respectively.
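The abstract does not specify how PC-Acc is computed; the following is one plausible reading, a weighted aggregation of time-averaged, normalized affective features. The function name, feature names, and weights are all hypothetical:

```python
# Illustrative only: the abstract does not specify how PC-Acc is computed.
# This sketch assumes a weighted average of time-averaged, [0, 1]-normalized
# affective features; the weights and feature names are hypothetical.
import numpy as np

def pc_acc(face_valence, heart_rate, gaze_fixation, weights=(0.4, 0.3, 0.3)):
    """Aggregate multimodal affective signals into one acceptance score.

    Each argument is a 1-D array sampled over the performance timeline.
    """
    feats = np.stack([face_valence, heart_rate, gaze_fixation])  # (3, T)
    per_modality = feats.mean(axis=1)       # time-average each modality
    return float(np.dot(weights, per_modality))
```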
Learning Speech-driven 3D Conversational Gestures from Video
We propose the first approach to automatically and jointly synthesize both
the synchronous 3D conversational body and hand gestures, as well as 3D face
and head animations, of a virtual character from speech input. Our algorithm
uses a CNN architecture that leverages the inherent correlation between facial
expression and hand gestures. Synthesis of conversational body gestures is a
multi-modal problem since many similar gestures can plausibly accompany the
same input speech. To synthesize plausible body gestures in this setting, we
train a Generative Adversarial Network (GAN) based model that measures the
plausibility of the generated sequences of 3D body motion when paired with the
input audio features. We also contribute a new way to create a large corpus of
more than 33 hours of annotated body, hand, and face data from in-the-wild
videos of talking people. To this end, we apply state-of-the-art monocular
approaches for 3D body and hand pose estimation as well as dense 3D face
performance capture to the video corpus. In this way, we can train on orders of
magnitude more data than previous algorithms that resort to complex in-studio
motion capture solutions, and thereby train more expressive synthesis
algorithms. Our experiments and user study show the state-of-the-art quality of
our speech-synthesized full 3D character animations.
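A minimal sketch of the plausibility idea described above: a discriminator scores a sequence of 3D body motion paired with the input audio features. The recurrent architecture and all dimensions here are assumptions, not the paper's exact model:

```python
# Sketch of a GAN discriminator that measures the plausibility of generated
# 3D body motion when paired with the input audio features, as described
# above. Architecture and dimensions are assumptions, not the paper's model.
import torch
import torch.nn as nn

class AudioMotionDiscriminator(nn.Module):
    def __init__(self, d_audio=128, d_motion=165, d_hidden=256):
        super().__init__()
        self.rnn = nn.GRU(d_audio + d_motion, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, audio, motion):
        # audio: (B, T, d_audio); motion: (B, T, d_motion) pose per frame
        _, h = self.rnn(torch.cat([audio, motion], dim=-1))
        return self.head(h[-1])  # real/fake logit for the paired sequence
```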
PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
It is common in everyday spoken communication that we look at the turning
head of a talker to listen to his/her voice. Humans watch the talker to listen
better, and so should machines. However, previous studies on audio-visual
speaker extraction have not effectively handled talking faces with varying
head pose. This paper studies how to take full advantage of the varying
talking face. We propose a
Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that
incorporates an additional pose-invariant view to improve audio-visual speaker
extraction. Specifically, we generate a pose-invariant view from each
original pose orientation, which enables the model to receive a consistent
frontal view of the talker regardless of head pose, thereby forming
a multi-view visual input for the speaker. Experiments on the multi-view MEAD
and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the
state of the art and is more robust to pose variations.
Comment: Interspeech 202
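A toy sketch of the multi-view input idea: a shared encoder processes both the original view and the generated pose-invariant (frontal) view, and their features are fused into one visual cue. The encoder and fusion layers are stand-ins, not PIAVE's actual modules:

```python
# Toy sketch of the multi-view visual input: one shared encoder processes
# both the original view and the generated pose-invariant (frontal) view,
# and their features are fused. These modules are stand-ins, not PIAVE's.
import torch
import torch.nn as nn

class MultiViewVisualFrontend(nn.Module):
    def __init__(self, d_feat=256):
        super().__init__()
        self.encoder = nn.Sequential(          # shared per-view face encoder
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, d_feat))
        self.fuse = nn.Linear(2 * d_feat, d_feat)

    def forward(self, original_view, frontal_view):
        # each view: (B, 3, T, H, W) face video; frontal_view is rendered
        # from the original pose so the talker is always seen frontally
        f = torch.cat([self.encoder(original_view),
                       self.encoder(frontal_view)], dim=-1)
        return self.fuse(f)  # fused visual cue for speaker extraction
```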
The role of HG in the analysis of temporal iteration and interaural correlation