118 research outputs found
Object Referring in Visual Scene with Spoken Language
Object referring has important applications, especially for human-machine
interaction. While having received great attention, the task is mainly attacked
with written language (text) as input rather than spoken language (speech),
which is more natural. This paper investigates Object Referring with Spoken
Language (ORSpoken) by presenting two datasets and one novel approach. Objects
are annotated with their locations in images, text descriptions and speech
descriptions. This makes the datasets ideal for multi-modality learning. The
approach is developed by carefully taking down ORSpoken problem into three
sub-problems and introducing task-specific vision-language interactions at the
corresponding levels. Experiments show that our method outperforms competing
methods consistently and significantly. The approach is also evaluated in the
presence of audio noise, showing the efficacy of the proposed vision-language
interaction methods in counteracting background noise.Comment: 10 pages, Submitted to WACV 201
Learning to Ground Instructional Articles in Videos through Narrations
In this paper we present an approach for localizing steps of procedural
activities in narrated how-to videos. To deal with the scarcity of labeled data
at scale, we source the step descriptions from a language knowledge base
(wikiHow) containing instructional articles for a large variety of procedural
tasks. Without any form of manual supervision, our model learns to temporally
ground the steps of procedural articles in how-to videos by matching three
modalities: frames, narrations, and step descriptions. Specifically, our method
aligns steps to video by fusing information from two distinct pathways: i) {\em
direct} alignment of step descriptions to frames, ii) {\em indirect} alignment
obtained by composing steps-to-narrations with narrations-to-video
correspondences. Notably, our approach performs global temporal grounding of
all steps in an article at once by exploiting order information, and is trained
with step pseudo-labels which are iteratively refined and aggressively
filtered. In order to validate our model we introduce a new evaluation
benchmark -- HT-Step -- obtained by manually annotating a 124-hour subset of
HowTo100M\footnote{A test server is accessible at
\url{https://eval.ai/web/challenges/challenge-page/2082}.} with steps sourced
from wikiHow articles. Experiments on this benchmark as well as zero-shot
evaluations on CrossTask demonstrate that our multi-modality alignment yields
dramatic gains over several baselines and prior works. Finally, we show that
our inner module for matching narration-to-video outperforms by a large margin
the state of the art on the HTM-Align narration-video alignment benchmark.Comment: 17 pages, 4 figures and 10 table
Placing Objects in Gesture Space: Toward Real-Time Understanding of Spatial Descriptions
Han T, Kennington C, Schlangen D. Placing Objects in Gesture Space: Toward Real-Time Understanding of Spatial Descriptions. In: Proceedings of the thirty-second AAAI conference on artificial intelligence (AAAI18). New Orleans: The association for the advancement of artificial intelligence; 2018
Camera-based estimation of student's attention in class
Two essential elements of classroom lecturing are the teacher and the students. This human core can easily be lost in the overwhelming list of technological supplements aimed at improving the teaching/learning experience. We start from the question of whether we can formulate a technological intervention around the human connection, and find indicators which would tell us when the teacher is not reaching the audience. Our approach is based on principles of unobtrusive measurements and social signal processing. Our assumption is that students with different levels of attention will display different non-verbal behaviour during the lecture. Inspired by information theory, we formulated a theoretical background for our assumptions around the idea of synchronization between the sender and receiver, and between several receivers focused on the same sender. Based on this foundation we present a novel set of behaviour metrics as the main contribution. By using a camera-based system to observe lectures, we recorded an extensive dataset in order to verify our assumptions. In our first study on motion, we found that differences in attention are manifested on the level of audience movement synchronization. We formulated the measure of ``motion lag'' based on the idea that attentive students would have a common behaviour pattern. For our second set of metrics we explored ways to substitute intrusive eye-tracking equipment in order to record gaze information of the entire audience. To achieve this we conducted an experiment on the relationship between head orientation and gaze direction. Based on acquired results we formulated an improved model of gaze uncertainty than the ones currently used in similar studies. In combination with improvements on head detection and pose estimation, we extracted measures of audience head and gaze behaviour from our remote recording system. From the collected data we found that synchronization between student's head orientation and teacher's motion serves as a reliable indicator of the attentiveness of students. To illustrate the predictive power of our features, a supervised-learning model was trained achieving satisfactory results at predicting student's attention
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Large foundation models can exhibit unique capabilities depending on the
domain of data they are trained on. While these domains are generic, they may
only barely overlap. For example, visual-language models (VLMs) are trained on
Internet-scale image captions, but large language models (LMs) are further
trained on Internet-scale text with no images (e.g. from spreadsheets, to SAT
questions). As a result, these models store different forms of commonsense
knowledge across different domains. In this work, we show that this model
diversity is symbiotic, and can be leveraged to build AI systems with
structured Socratic dialogue -- in which new multimodal tasks are formulated as
a guided language-based exchange between different pre-existing foundation
models, without additional finetuning. In the context of egocentric perception,
we present a case study of Socratic Models (SMs) that can provide meaningful
results for complex tasks such as generating free-form answers to contextual
questions about egocentric video, by formulating video Q&A as short story Q&A,
i.e. summarizing the video into a short story, then answering questions about
it. Additionally, SMs can generate captions for Internet images, and are
competitive with state-of-the-art on zero-shot video-to-text retrieval with
42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models
zero-shot to capture new multimodal functionalities, without domain-specific
data collection. Prototypes are available at socraticmodels.github.io.Comment: https://socraticmodels.github.io
- …