14 research outputs found

Eyes and ears together: new task for multimodal spoken content analysis

Authors: Jones, Gareth J.F.; Metze, Florian; Moriya, Yasufumi; Sanabria, Ramon
Publication venue: CEUR-WS
Publication date: 01/10/2018

Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline these tasks.

Irish Universities

DCU Online Research Access Service

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions

Authors: Alayrac, Jean-Baptiste; Hahn, Meera; Laptev, Ivan; Rehg, James M.; Ruiz, Nataniel
Publication date: 22/09/2018

Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision. The task is challenging due to the difficulty of bridging the semantic gap between the visual and natural language domains. This paper addresses the task of automatically generating an alignment between a set of instructions and a first-person video demonstrating an activity. The sparse descriptions and ambiguity of written instructions create significant alignment challenges. The key to our approach is the use of egocentric cues to generate a concise set of action proposals, which are then matched to recipe steps using object recognition and computational linguistic techniques. We obtain promising results on both the Extended GTEA Gaze+ dataset and the Bristol Egocentric Object Interactions Dataset.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Video Question Answering on Screencast Tutorials

Authors: Jin, Hailin; Kim, Seokhwan; Xu, Ning; Zhao, Wentian
Publication venue: International Joint Conferences on Artificial Intelligence
Publication date: 02/08/2020

This paper presents a new video question answering task on screencast tutorials. We introduce a dataset of question, answer, and context triples drawn from tutorial videos for a software product. Unlike other video question answering work, all the answers in our dataset are grounded in the domain knowledge base. A one-shot recognition algorithm is designed to extract visual cues, which helps improve video question answering performance. We also propose several baseline neural network architectures based on various aspects of the video contexts in the dataset. The experimental results demonstrate that our proposed models significantly improve question answering performance by incorporating multi-modal contexts and domain knowledge.

arXiv.org e-Print Archive

Crossref

Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation

Authors: Deng, Zhiwei; Narasimhan, Karthik; Russakovsky, Olga
Publication date: 01/01/2020

The ability to perform effective planning is crucial for building an instruction-following agent. When navigating through a new environment, an agent is challenged with (1) connecting the natural language instructions with its progressively growing knowledge of the world; and (2) performing long-range planning and decision making in the form of effective exploration and error correction. Current methods are still limited on both fronts despite extensive efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a model that performs global planning for navigation based on raw sensory input. The model dynamically constructs a graphical representation, generalizes the action space to allow for more flexible decision making, and performs efficient planning on a proxy graph representation. We evaluate our model on a challenging Vision-and-Language Navigation (VLN) task with photorealistic images and achieve superior performance compared to previous navigation architectures. For instance, we achieve a 53% success rate on the test split of the Room-to-Room navigation task through pure imitation learning, outperforming previous navigation architectures by up to 5%.

arXiv.org e-Print Archive

Princeton University Open Access Repository