Speaker-following Video Subtitles
We propose a new method for improving the presentation of subtitles in video
(e.g. TV and movies). With conventional subtitles, the viewer has to constantly
look away from the main viewing area to read the subtitles at the bottom of the
screen, which disrupts the viewing experience and causes unnecessary eyestrain.
Our method places on-screen subtitles next to the respective speakers to allow
the viewer to follow the visual content while simultaneously reading the
subtitles. We use novel identification algorithms to detect the speakers based
on audio and visual information. Then the placement of the subtitles is
determined using global optimization. A comprehensive usability study indicated
that our subtitle placement method outperformed both conventional
fixed-position subtitling and a previous dynamic subtitling method in
terms of enhancing the overall viewing experience and reducing eyestrain.
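The abstract does not say how the global optimization is set up. As an illustration only, the Python sketch below shows one way a placement could be optimized over a whole sequence rather than subtitle by subtitle: dynamic programming picks, for each subtitle, one position from a fixed candidate set, trading off distance to the speaker against jumps between consecutive subtitles. The cost terms, the candidate positions, the `jump_weight` parameter, and the function name are all assumptions, not the authors' method.

```python
import math


def place_subtitles(speaker_xy, candidates, jump_weight=0.5):
    """Choose one candidate position per subtitle (hypothetical cost model).

    speaker_xy : list of (x, y) speaker positions, one per subtitle
    candidates : list of (x, y) allowed subtitle positions (assumed given)
    jump_weight: penalty per unit of movement between consecutive subtitles
    """
    if not speaker_xy or not candidates:
        return []
    k = len(candidates)
    # cost[c]: best total cost of any placement sequence ending at candidate c
    cost = [math.dist(speaker_xy[0], c) for c in candidates]
    back = []  # back[t][c]: best predecessor of candidate c at step t + 1
    for speaker in speaker_xy[1:]:
        new_cost, pointers = [], []
        for c in candidates:
            # Cheapest way to arrive at c from any previous candidate.
            prev = min(
                range(k),
                key=lambda p: cost[p] + jump_weight * math.dist(candidates[p], c),
            )
            arrive = cost[prev] + jump_weight * math.dist(candidates[prev], c)
            new_cost.append(arrive + math.dist(speaker, c))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest final candidate to recover the full sequence.
    idx = min(range(k), key=lambda c: cost[c])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[i] for i in path]
```

With a small `jump_weight` the chosen positions follow the speaker; with a large one they stay put, which is the kind of trade-off a sequence-level objective can express and a per-subtitle rule cannot.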
Ventriloquism effect with sound stimuli varying in both azimuth and elevation
Copyright 2015 Acoustical Society of America. This article may be downloaded for personal use only. Any other use requires prior permission of the author and the Acoustical Society of America. The following article appeared in Etienne Hendrickx, Mathieu Paquier, Vincent Koehl and Julian Palacino, Ventriloquism effect with sound stimuli varying in both azimuth and elevation, The Journal of the Acoustical Society of America 2015, vol. 138, no. 6, pp. 3686–3697, and may be found at http://link.aip.org/link/?JAS/138/3686
When presented with a spatially discordant auditory-visual stimulus, subjects sometimes perceive the sound and the visual stimuli as coming from the same location. Such a phenomenon is often referred to as perceptual fusion or ventriloquism, as it evokes the illusion created by a ventriloquist when his voice seems to emanate from his puppet rather than from his mouth. While this effect has been extensively examined in the horizontal plane and to a lesser extent in distance, few psychoacoustic studies have focused on elevation. In the present experiment, sequences of a man talking were presented to subjects. His voice could be reproduced on different loudspeakers, which created disparities in both azimuth and elevation between the sound and the visual stimuli. For each presentation, subjects had to indicate whether the voice seemed to emanate from the mouth of the actor or not. Results showed that ventriloquism could be observed with larger audiovisual disparities in elevation than in azimuth.
Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which so far has received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Unlike prior work on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while it drastically improves over other recently proposed
ASR-free KWS methods. Comment: Accepted at ECCV-201
Vision-based Detection of Acoustic Timed Events: a Case Study on Clarinet Note Onsets
Acoustic events often have a visual counterpart. Knowledge of visual
information can aid the understanding of complex auditory scenes, even when
only a stereo mixdown is available in the audio domain, e.g. identifying which
musicians are playing in large musical ensembles. In this paper, we consider a
vision-based approach to note onset detection. As a case study we focus on
challenging, real-world clarinetist videos and carry out preliminary
experiments on a 3D convolutional neural network based on multiple streams and
purposely avoiding temporal pooling. We release an audiovisual dataset with 4.5
hours of clarinetist videos together with cleaned annotations which include
about 36,000 onsets and the coordinates for a number of salient points and
regions of interest. By performing several training trials on our dataset, we
learned that the problem is challenging. We found that the CNN model is highly
sensitive to the optimization algorithm and hyper-parameters, and that treating
the problem as binary classification may prevent the joint optimization of
precision and recall. To encourage further research, we publicly share our
dataset, annotations and all models and detail which issues we came across
during our preliminary experiments. Comment: Proceedings of the First International Conference on Deep Learning
and Music, Anchorage, US, May, 2017 (arXiv:1706.08675v1 [cs.NE])
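The precision/recall tension mentioned in this abstract can be made concrete with a small evaluation sketch in Python. This is not the authors' evaluation code: the 50 ms tolerance, the greedy one-to-one matching rule, and the function names are assumptions chosen for illustration.

```python
def match_onsets(reference, detected, tolerance=0.05):
    """Count detections matching a reference onset within +/- tolerance seconds.

    Matching is greedy and one-to-one over the time-sorted lists.
    """
    reference = sorted(reference)
    detected = sorted(detected)
    i = j = hits = 0
    while i < len(reference) and j < len(detected):
        diff = detected[j] - reference[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # detection is too early for this reference: false positive
        else:
            i += 1  # reference onset has no detection nearby: miss
    return hits


def precision_recall_f1(reference, detected, tolerance=0.05):
    """Return (precision, recall, F1) for detected onset times in seconds."""
    hits = match_onsets(reference, detected, tolerance)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

A detector that fires more often tends to raise recall and lower precision, so reporting both (or F1) on matched onsets shows the trade-off that a single frame-level accuracy figure hides.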
Egocentric Auditory Attention Localization in Conversations
In a noisy conversation environment such as a dinner party, people often
exhibit selective auditory attention, or the ability to focus on a particular
speaker while tuning out others. Recognizing who somebody is listening to in a
conversation is essential for developing technologies that can understand
social behavior and devices that can augment human hearing by amplifying
particular sound sources. The computer vision and audio research communities
have made great strides towards recognizing sound sources and speakers in
scenes. In this work, we take a step further by focusing on the problem of
localizing auditory attention targets in egocentric video, or detecting who in
a camera wearer's field of view they are listening to. To tackle the new and
challenging Selective Auditory Attention Localization problem, we propose an
end-to-end deep learning approach that uses egocentric video and multichannel
audio to predict the heatmap of the camera wearer's auditory attention. Our
approach leverages spatiotemporal audiovisual features and holistic reasoning
about the scene to make predictions, and outperforms a set of baselines on a
challenging multi-speaker conversation dataset. Project page:
https://fkryan.github.io/saa