120,390 research outputs found
Speech segmentation is facilitated by visual cues
Evidence from infant studies indicates that language learning can be facilitated by multimodal cues. We extended this observation to adult language learning by studying the effects of simultaneous visual cues (nonassociated object images) on speech segmentation performance. Our results indicate that segmentation of new words from a continuous speech stream is facilitated by simultaneous visual input that it is presented at or near syllables that exhibit the low transitional probability indicative of word boundaries. This indicates that temporal audio-visual contiguity helps in directing attention to word boundaries at the earliest stages of language learning. Off-boundary or arrhythmic picture sequences did not affect segmentation performance, suggesting that the language learning system can effectively disregard noninformative visual information. Detection of temporal contiguity between multimodal stimuli may be useful in both infants and second-language learners not only for facilitating speech segmentation, but also for detecting word-object relationships in natural environments
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015
Lip Reading Sentences in the Wild
The goal of this work is to recognise phrases and sentences being spoken by a
talking face, with or without the audio. Unlike previous works that have
focussed on recognising a limited number of words or phrases, we tackle lip
reading as an open-world problem - unconstrained natural language sentences,
and in the wild videos.
Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS)
network that learns to transcribe videos of mouth motion to characters; (2) a
curriculum learning strategy to accelerate training and to reduce overfitting;
(3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition,
consisting of over 100,000 natural sentences from British television.
The WLAS model trained on the LRS dataset surpasses the performance of all
previous work on standard lip reading benchmark datasets, often by a
significant margin. This lip reading performance beats a professional lip
reader on videos from BBC television, and we also demonstrate that visual
information helps to improve speech recognition performance even when the audio
is available
- …