Object Referring in Visual Scene with Spoken Language

Author: Dai Dengxin
Van Gool Luc
Vasudevan Arun Balajee
Publication date: 05/12/2017

Object referring has important applications, especially for human-machine interaction. Although it has received great attention, the task is mainly addressed with written language (text) as input rather than spoken language (speech), which is more natural. This paper investigates Object Referring with Spoken Language (ORSpoken) by presenting two datasets and one novel approach. Objects are annotated with their locations in images, text descriptions and speech descriptions. This makes the datasets ideal for multi-modality learning. The approach is developed by carefully breaking down the ORSpoken problem into three sub-problems and introducing task-specific vision-language interactions at the corresponding levels. Experiments show that our method outperforms competing methods consistently and significantly. The approach is also evaluated in the presence of audio noise, showing the efficacy of the proposed vision-language interaction methods in counteracting background noise. Comment: 10 pages, Submitted to WACV 201

arXiv.org e-Print Archive

Crossref

Integrating knowledge of hands and objects into egocentric action recognition

Author: Ma Jian
Publication date: 05/12/2023

Explore Bristol Research

Learning to Ground Instructional Articles in Videos through Narrations

Author: Afouras Triantafyllos
Mavroudi Effrosyni
Torresani Lorenzo
Publication date: 06/06/2023

In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels which are iteratively refined and aggressively filtered. In order to validate our model we introduce a new evaluation benchmark, HT-Step, obtained by manually annotating a 124-hour subset of HowTo100M with steps sourced from wikiHow articles (a test server is accessible at https://eval.ai/web/challenges/challenge-page/2082). Experiments on this benchmark as well as zero-shot evaluations on CrossTask demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner module for matching narration-to-video outperforms by a large margin the state of the art on the HTM-Align narration-video alignment benchmark. Comment: 17 pages, 4 figures and 10 tables

arXiv.org e-Print Archive

ML Datasets as Synthetic Cognitive Experience Records

Author: H. Castro
M. T. Andrade
Publication date: 01/01/2018

Repositório Aberto da Universidade do Porto

Lying to yourself and lying to others: Social desirability and language features

Author: Austin Elizabeth
Gill Alastair J
Hancock Jeffrey T
Oberlander Jon
Publication date: 01/01/2006

Edinburgh Research Explorer

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Author: Arbelaez Pablo
Crandall David
Damen Dima
Farinella Giovanni Maria
Fragomeni Adriano
Ghanem Bernard
Grauman Kristen
Jawahar C.V.
Kitani Kris
Malik Jitendra
Munro Jonathan P N
Oliva Aude
Park Hyun Soo
Price Will
Rehg James M.
Sato Yoichi
Shou Mike Zheng
Torralba Antonio
Wray Michael
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/06/2022

Explore Bristol Research

Placing Objects in Gesture Space: Toward Real-Time Understanding of Spatial Descriptions

Author: Han Ting
Kennington Casey
Schlangen David
Publication venue: The Association for the Advancement of Artificial Intelligence
Publication date: 01/01/2018

Han T, Kennington C, Schlangen D. Placing Objects in Gesture Space: Toward Real-Time Understanding of Spatial Descriptions. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI18). New Orleans: The Association for the Advancement of Artificial Intelligence; 2018.

Publications at Bielefeld University

Camera-based estimation of student's attention in class

Author: Raca Mirko
Publication venue: Lausanne, EPFL
Publication date: 13/10/2015

Two essential elements of classroom lecturing are the teacher and the students. This human core can easily be lost in the overwhelming list of technological supplements aimed at improving the teaching/learning experience. We start from the question of whether we can formulate a technological intervention around the human connection, and find indicators which would tell us when the teacher is not reaching the audience. Our approach is based on principles of unobtrusive measurements and social signal processing. Our assumption is that students with different levels of attention will display different non-verbal behaviour during the lecture. Inspired by information theory, we formulated a theoretical background for our assumptions around the idea of synchronization between the sender and receiver, and between several receivers focused on the same sender. Based on this foundation we present a novel set of behaviour metrics as the main contribution. By using a camera-based system to observe lectures, we recorded an extensive dataset in order to verify our assumptions. In our first study on motion, we found that differences in attention are manifested on the level of audience movement synchronization. We formulated the measure of "motion lag" based on the idea that attentive students would have a common behaviour pattern. For our second set of metrics we explored ways to substitute intrusive eye-tracking equipment in order to record gaze information of the entire audience. To achieve this we conducted an experiment on the relationship between head orientation and gaze direction. Based on the acquired results we formulated a model of gaze uncertainty that improves on the ones currently used in similar studies. In combination with improvements on head detection and pose estimation, we extracted measures of audience head and gaze behaviour from our remote recording system.
From the collected data we found that synchronization between students' head orientation and the teacher's motion serves as a reliable indicator of the attentiveness of students. To illustrate the predictive power of our features, a supervised-learning model was trained, achieving satisfactory results at predicting students' attention.

Infoscience - École polytechnique fédérale de Lausanne

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

Author: Choromanski Krzysztof
Florence Pete
Lee Johnny
Purohit Aveek
Ryoo Michael
Sindhwani Vikas
Tombari Federico
Vanhoucke Vincent
Welker Stefan
Wong Adrian
Zeng Andy
Publication date: 01/04/2022

Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g. from spreadsheets to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue, in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning. In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e. summarizing the video into a short story, then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with state-of-the-art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection. Prototypes are available at socraticmodels.github.io. Comment: https://socraticmodels.github.io

arXiv.org e-Print Archive