6 research outputs found
Deep Cross-Modal Audio-Visual Generation
Cross-modal audio-visual perception has been a long-standing topic in
psychology and neurology, and various studies have discovered strong
correlations in human perception of auditory and visual stimuli. Despite prior
work in computational multimodal modeling, the problem of cross-modal audio-visual
generation has not been systematically studied in the literature. In this
paper, we make the first attempt to solve this cross-modal generation problem
leveraging the power of deep generative adversarial training. Specifically, we
use conditional generative adversarial networks to achieve cross-modal
audio-visual generation of musical performances. We explore different encoding
methods for audio and visual signals, and work on two scenarios:
instrument-oriented generation and pose-oriented generation. Being the first to
explore this new problem, we compose two new datasets with pairs of images and
sounds of musical performances of different instruments. Our experiments using
both classification and human evaluations demonstrate that our model has the
ability to generate one modality, i.e., audio/visual, from the other modality,
i.e., visual/audio, to a good extent. Our experiments on various design choices
along with the datasets will facilitate future research in this new problem
space.
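The abstract above describes conditioning a generative adversarial network on an
encoding of the other modality. The following is a minimal sketch of that idea in
PyTorch; the network sizes, the 64x64 image resolution, and the audio-embedding
dimension are assumptions for illustration, not the authors' architecture.

```python
# Sketch: generate an image conditioned on an audio embedding with a conditional GAN.
# Assumed sizes (AUDIO_DIM, NOISE_DIM, image resolution) are illustrative only.
import torch
import torch.nn as nn

AUDIO_DIM, NOISE_DIM, IMG_CH = 128, 100, 3

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Project noise + audio condition, then upsample to a 64x64 image.
        self.net = nn.Sequential(
            nn.ConvTranspose2d(NOISE_DIM + AUDIO_DIM, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.ConvTranspose2d(32, IMG_CH, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, noise, audio_emb):
        # Concatenate noise and condition, reshape to a 1x1 spatial map, upsample.
        z = torch.cat([noise, audio_emb], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(z)  # (B, 3, 64, 64)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IMG_CH, 64, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
        )
        # Fuse pooled image features with the audio condition for real/fake scoring.
        self.head = nn.Sequential(nn.Linear(256 + AUDIO_DIM, 1), nn.Sigmoid())

    def forward(self, img, audio_emb):
        h = self.conv(img).mean(dim=(2, 3))  # global average pool -> (B, 256)
        return self.head(torch.cat([h, audio_emb], dim=1))

# Usage: G(noise, audio_emb) yields a candidate image; D scores (image, audio) pairs.
```

The same structure works in the reverse direction (image embedding conditioning an
audio generator), which is the other half of the cross-modal setting.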
Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking
Public speaking is an important aspect of human communication and
interaction. The majority of computational work on public speaking concentrates
on analyzing the spoken content and the verbal behavior of the speakers. While
the success of public speaking largely depends on the content of the talk and
the verbal behavior, non-verbal (visual) cues, such as gestures and physical
appearance, also play a significant role. This paper investigates the importance
of visual cues by estimating their contribution towards predicting the
popularity of a public lecture. For this purpose, we constructed a large
database of more than TED talk videos. As a measure of popularity of the
TED talks, we leverage the corresponding (online) viewers' ratings from
YouTube. Visual cues related to facial and physical appearance, facial
expressions, and pose variations are extracted from the video frames using
convolutional neural network (CNN) models. Thereafter, an attention-based long
short-term memory (LSTM) network is proposed to predict the video popularity
from the sequence of visual features. The proposed network achieves
state-of-the-art prediction accuracy indicating that visual cues alone contain
highly predictive information about the popularity of a talk. Furthermore, our
network learns a human-like attention mechanism, which is particularly useful
for interpretability, i.e., how attention varies over time and across different
visual cues, indicating their relative importance.
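As a rough illustration of the architecture sketched in this abstract, the snippet
below shows an attention-weighted LSTM that pools a sequence of per-frame CNN
features into a single popularity score. The feature dimension, hidden size, and
regression head are assumptions, not the authors' exact configuration.

```python
# Sketch: attention-based LSTM over per-frame CNN features for popularity prediction.
# feat_dim/hidden sizes are assumed; the attention weights double as an
# interpretability signal over time.
import torch
import torch.nn as nn

class AttentionLSTMPredictor(nn.Module):
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # one attention score per time step
        self.out = nn.Linear(hidden, 1)    # popularity regression head

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) CNN features, one vector per sampled frame.
        h, _ = self.lstm(frame_feats)               # (B, T, hidden)
        alpha = torch.softmax(self.attn(h), dim=1)  # (B, T, 1) attention over time
        context = (alpha * h).sum(dim=1)            # weighted temporal pooling
        return self.out(context).squeeze(-1), alpha.squeeze(-1)

# Example: 8 videos, 120 sampled frames each, 2048-d CNN features per frame.
model = AttentionLSTMPredictor()
scores, attention = model(torch.randn(8, 120, 2048))
```

Inspecting `attention` per video shows which frames (and hence which visual cues)
the model weights most heavily, which is the interpretability property the abstract
highlights.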
SpeechMirror: A Multimodal Visual Analytics System for Personalized Reflection of Online Public Speaking Effectiveness
As communications are increasingly taking place virtually, the ability to
present well online is becoming an indispensable skill. Online speakers are
facing unique challenges in engaging with remote audiences. However, there has
been a lack of evidence-based analytical systems for people to comprehensively
evaluate online speeches and further discover possibilities for improvement.
This paper introduces SpeechMirror, a visual analytics system facilitating
reflection on a speech based on insights from a collection of online speeches.
The system estimates the impact of different speech techniques on effectiveness
and applies these estimates to a given speech, making users aware of how well
each technique is performed. A similarity recommendation approach based on speech factors
or script content supports guided exploration to expand knowledge of
presentation evidence and accelerate the discovery of speech delivery
possibilities. SpeechMirror provides intuitive visualizations and interactions
for users to understand speech factors. Among them, SpeechTwin, a novel
multimodal visual summary of a speech, supports rapid understanding of critical
speech factors and comparison of different speech samples, while SpeechPlayer
augments the speech video with an interactive visualization of the speaker's body
language for focused analysis. The system utilizes
visualizations suited to the distinct nature of different speech factors for
user comprehension. The proposed system and visualization techniques were
evaluated with domain experts and amateurs, demonstrating usability for users
with low visualization literacy and efficacy in helping users develop insights
for potential improvement.
Comment: Main paper (11 pages, 6 figures) and supplemental document (11 pages,
11 figures). Accepted by VIS 202
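The similarity recommendation described in this abstract can be thought of as
nearest-neighbor retrieval over per-speech factor vectors. The sketch below is a
minimal illustration using cosine similarity; the factor set (e.g., speaking rate,
gesture frequency) and the vector dimensionality are hypothetical, not drawn from
the SpeechMirror paper.

```python
# Sketch: recommend speeches with similar factor profiles via cosine similarity.
# The 12-dimensional "speech factor" vectors are a stand-in for whatever factors
# a real system would extract.
import numpy as np

def recommend_similar(query: np.ndarray, collection: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k speeches whose factor vectors are closest to the query."""
    q = query / (np.linalg.norm(query) + 1e-8)
    c = collection / (np.linalg.norm(collection, axis=1, keepdims=True) + 1e-8)
    sims = c @ q                   # cosine similarity to every speech in the collection
    return np.argsort(-sims)[:k]   # indices of the top-k most similar speeches

# Example: 1000 speeches, each described by 12 normalized speech factors.
rng = np.random.default_rng(0)
speeches = rng.random((1000, 12))
top5 = recommend_similar(speeches[0], speeches, k=5)  # includes the query itself at rank 1
```

Swapping the factor vectors for script-content embeddings gives the content-based
variant of the recommendation that the abstract mentions.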