Object Referring in Videos with Language and Human Gaze
We investigate the problem of object referring (OR), i.e. localizing a target object in a visual scene given a language description. Humans perceive the world more as continued video snippets than as static images, and describe objects not only by their appearance, but also by their spatio-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing works for OR mostly focus on static images only, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated with descriptions and gaze. We further propose a novel network model for OR in videos that integrates appearance, motion, gaze, and spatio-temporal context into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatio-temporal context, and outperforms previous OR methods. For the dataset and code, please refer to https://people.ee.ethz.ch/~arunv/ORGaze.html.
Comment: Accepted to CVPR 2018, 10 pages, 6 figures
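The abstract does not detail the network, but a late-fusion design is one plausible reading: project concatenated per-cue features and a language embedding into a shared space, then score each candidate object. The PyTorch sketch below is an illustrative assumption; `FusionScorer` and all dimensions are hypothetical, not the authors' code.

```python
# Hypothetical late-fusion scorer for object referring; an illustrative
# sketch, not the paper's architecture. All names/dims are assumptions.
import torch
import torch.nn as nn

class FusionScorer(nn.Module):
    """Scores candidate objects against a referring expression by fusing
    appearance, motion, gaze, and spatio-temporal context features."""
    def __init__(self, feat_dim=512, lang_dim=300, hidden=256):
        super().__init__()
        # One joint projection over the four concatenated visual cues.
        self.visual_proj = nn.Linear(4 * feat_dim, hidden)
        self.lang_proj = nn.Linear(lang_dim, hidden)

    def forward(self, appearance, motion, gaze, context, lang_emb):
        # Each cue: (num_candidates, feat_dim); lang_emb: (lang_dim,)
        visual = torch.cat([appearance, motion, gaze, context], dim=-1)
        v = torch.tanh(self.visual_proj(visual))  # (N, hidden)
        l = torch.tanh(self.lang_proj(lang_emb))  # (hidden,)
        return v @ l                              # one score per candidate

# Usage: pick the candidate with the highest language-visual score.
scorer = FusionScorer()
scores = scorer(torch.randn(8, 512), torch.randn(8, 512),
                torch.randn(8, 512), torch.randn(8, 512),
                torch.randn(300))
best_candidate = scores.argmax().item()
```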
Detecting Emotional Involvement in Professional News Reporters: An Analysis of Speech and Gestures
This study investigates the extent to which reporters' voice and body behaviour may betray different degrees of emotional involvement when reporting on emergency situations. The hypothesis is that emotional involvement is associated with an increase in body movements and in pitch and intensity variation. The object of investigation is a corpus of 21 ten-second videos of Italian news reports on flooding, taken from Italian nation-wide TV channels. The gestures and body movements of the reporters were first inspected visually. Then, measures of the reporters' pitch and intensity variations were calculated and related to the reporters' gestures. The effects of the variability in the reporters' voice and gestures were tested with an evaluation test. The results show that the reporters vary greatly in the extent to which they move their hands and body in their reports. Two gestures seem to characterise reporters' communication of emergencies: beats and deictics. The reporters' use of gestures partially parallels their variations in pitch and intensity. The evaluation study shows that increased gesturing is associated with greater emotional involvement and less professionalism. The data were used to create an ontology of gestures for the communication of emergencies.
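As a rough illustration of how such pitch and intensity variation measures could be computed, here is a sketch using librosa; the study's actual acoustic pipeline is not described in the abstract, so the function choices and frequency bounds below are assumptions.

```python
# Quantify pitch and intensity variability for one clip (illustrative).
import numpy as np
import librosa

def voice_variation(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    # Fundamental frequency (pitch) track via probabilistic YIN;
    # 75-400 Hz is an assumed adult speech range.
    f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
    pitch_sd = np.nanstd(f0[voiced])        # pitch variability (Hz)
    # Frame-level RMS energy in dB as an intensity proxy.
    rms_db = librosa.amplitude_to_db(librosa.feature.rms(y=y)[0])
    intensity_sd = np.std(rms_db)           # intensity variability (dB)
    return pitch_sd, intensity_sd
```

Higher values of either measure would correspond, under the study's hypothesis, to greater emotional involvement.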
Can you see what I am talking about? Human speech triggers referential expectation in four-month-old infants
Infants' sensitivity in selectively attending to human speech and processing it in a unique way has been widely reported in the past. However, in order to successfully acquire language, one should also understand that speech is referential, and that words can stand for other entities in the world. While there has been some evidence showing that young infants can make inferences about the communicative intentions of a speaker, whether they would also appreciate the direct relationship between a specific word and its referent is still unknown. In the present study we tested four-month-old infants to see whether they would expect to find a referent when they hear human speech. Our results showed that, compared to other auditory stimuli or to silence, when infants were listening to speech they were more prepared to find visual referents of the words, as signalled by their faster orienting towards the visual objects. Hence, our study is the first to report evidence that infants at a very young age already understand the referential relationship between auditory words and physical objects, thus showing a precursor to appreciating the symbolic nature of language, even if they do not yet understand the meanings of words.
Markers of Discourse Structure in Child-Directed Speech
Although the language we encounter is typically embedded in rich discourse contexts, existing models of sentence processing focus largely on phenomena that occur sentence-internally. Here we analyze a video corpus of child-caregiver interactions with the aim of characterizing how discourse structure is reflected in child-directed speech and in children's and caregivers' behavior. We use topic continuity as a measure of discourse structure, examining how caregivers introduce and discuss objects across sentences. We develop a variant on a Hidden Markov Model to identify coherent discourses, taking into account speakers' intended referent and the time delays between utterances. Using the discourses found by this model, we analyze how the lexical, syntactic, and social properties of caregiver-child interaction change over the course of a sequence of topically-related utterances. Our findings suggest that cues used to signal topicality in adult discourse are also available in child-directed speech, and that children's responses reflect joint attention in communication.
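The abstract does not specify the HMM variant, but a hedged sketch of the general idea follows: hidden states are intended referents (topics), and the probability of staying on the same topic decays with the time gap between utterances. Everything below, including the decay form, is an assumption about the flavor of the model, not the paper's implementation.

```python
# Viterbi decoding over discourse topics with gap-dependent transitions.
import numpy as np

def viterbi_topics(emissions, gaps, stay=0.9, half_life=5.0):
    """emissions: (T, K) array, P(utterance_t | topic k), K >= 2.
    gaps: (T,) seconds since the previous utterance.
    Returns the most likely topic index per utterance."""
    T, K = emissions.shape
    log_delta = np.log(emissions[0] / K)        # uniform topic prior
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # Longer silence -> lower probability of staying on the topic.
        p_stay = stay * np.exp(-gaps[t] / half_life)
        trans = np.full((K, K), (1 - p_stay) / (K - 1))
        np.fill_diagonal(trans, p_stay)
        scores = log_delta[:, None] + np.log(trans)
        back[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emissions[t])
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

Tying the self-transition probability to the inter-utterance gap is what would make this a "variant" on a plain HMM: long pauses make a topic switch more likely a priori, which matches the intuition that discourse episodes are bounded by silence.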
Learning to Localize and Align Fine-Grained Actions to Sparse Instructions
Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision. The task is challenging due to the difficulty of bridging the semantic gap between the visual and natural language domains. This paper addresses the task of automatically generating an alignment between a set of instructions and a first-person video demonstrating an activity. The sparse descriptions and ambiguity of written instructions create significant alignment challenges. The key to our approach is the use of egocentric cues to generate a concise set of action proposals, which are then matched to recipe steps using object recognition and computational linguistic techniques. We obtain promising results on both the Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions Dataset.
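One plausible way to realize the matching step is a monotonic alignment by dynamic programming over a proposal-to-step similarity matrix (e.g. overlap between detected objects and the nouns in each step). The sketch below is an assumption about the flavor of the method, not the paper's procedure.

```python
# Monotonic alignment of action proposals to instruction steps.
import numpy as np

def align_monotonic(sim):
    """sim[i, j]: similarity of temporal proposal i to step j.
    Returns one step index per proposal, constrained to be
    non-decreasing over time (steps are performed in order)."""
    N, S = sim.shape
    dp = np.full((N, S), -np.inf)
    back = np.zeros((N, S), dtype=int)
    dp[0] = sim[0]
    for i in range(1, N):
        for j in range(S):
            prev = dp[i - 1, :j + 1]     # steps may only advance
            back[i, j] = prev.argmax()
            dp[i, j] = prev.max() + sim[i, j]
    path = [int(dp[-1].argmax())]
    for i in range(N - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]
```

The monotonicity constraint encodes the assumption that a recipe is executed in written order, which is what makes sparse, ambiguous instructions tractable to align.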
Examining the role of social cues in early word learning
Infant word learning has become a popular field of study over the past decade. Research during this time has shown that infants can learn, in a short period of time, to attach words to objects. Two experiments on the role of social cues in early word learning are reported, using tightly controlled conditions. Fourteen- and 18-month-old infants were trained by viewing a video of an adult pointing and nodding towards one of two different novel objects appearing on a screen simultaneously, while novel labels were emitted through a speaker. Infants' looking times to each object were recorded during both training and test trials. Our analyses indicated that both 14- and 18-month-olds looked significantly longer at the object that the adult pointed to in the training trials. However, only 18-month-olds showed any evidence of looking longer at the target object during the test in the consistent condition than in the inconsistent (control) condition. These studies are important because they show, in a controlled laboratory study of infant word learning, that different types of social cues are available at different ages. Fourteen-month-olds are aware of adult pointing and head turning and can follow those cues to an object during training. However, it isn't until 18 months of age that infants seem able to use those cues in the service of actual word learning.