Visual Semantic Parsing: From Images to Abstract Meaning Representation
The success of scene graphs for visual scene understanding has brought
attention to the benefits of abstracting a visual input (e.g., image) into a
structured representation, where entities (people and objects) are nodes
connected by edges specifying their relations. Building these representations,
however, requires expensive manual annotation in the form of images paired with
their scene graphs or frames. These formalisms remain limited in the nature of
entities and relations they can capture. In this paper, we propose to leverage
a widely-used meaning representation in the field of natural language
processing, the Abstract Meaning Representation (AMR), to address these
shortcomings. Compared to scene graphs, which largely emphasize spatial
relationships, our visual AMR graphs are more linguistically informed, with a
focus on higher-level semantic concepts extrapolated from visual input.
Moreover, they allow us to generate meta-AMR graphs to unify information
contained in multiple image descriptions under one representation. Through
extensive experimentation and analysis, we demonstrate that we can re-purpose
an existing text-to-AMR parser to parse images into AMRs. Our findings point to
important future research directions for improved scene understanding.
Comment: published in CoNLL 202
A semantic and language-based representation of an environmental scene
The modeling of a landscape environment is a cognitive activity that requires appropriate spatial representations. The research presented in this paper introduces a structural and semantic categorization of a landscape view based on panoramic photographs that act as a substitute for a given natural environment. Verbal descriptions of a landscape scene provide the modeling input of our approach. This structure-based model identifies the spatial, relational, and semantic constructs that emerge from these descriptions. Concepts in the environment are qualified according to a semantic classification, their proximity and direction to the observer, and the spatial relations that hold among them. The resulting model is represented in a way that constitutes a modeling support for the study of environmental scenes, and a contribution to further research oriented toward mapping a verbal description onto a geographical information system-based representation.
DeepStory: Video Story QA by Deep Embedded Memory Networks
Question-answering (QA) on video contents is a significant challenge for
achieving human-level intelligence as it involves both vision and language in
real-world settings. Here we demonstrate the possibility of an AI agent
performing video story QA by learning from a large amount of cartoon videos. We
develop a video-story learning model, i.e. Deep Embedded Memory Networks
(DEMN), to reconstruct stories from a joint scene-dialogue video stream using a
latent embedding space of observed data. The video stories are stored in a
long-term memory component. For a given question, an LSTM-based attention model
uses the long-term memory to recall the best question-story-answer triplet by
focusing on specific words containing key information. We trained the DEMN on a
novel QA dataset of children's cartoon video series, Pororo. The dataset
contains 16,066 scene-dialogue pairs of 20.5-hour videos, 27,328 fine-grained
sentences for scene description, and 8,913 story-related QA pairs. Our
experimental results show that the DEMN outperforms other QA models. This is
mainly due to 1) the reconstruction of video stories in a combined
scene-dialogue form that utilizes the latent embedding and 2) attention. DEMN also
achieved state-of-the-art results on the MovieQA benchmark.
Comment: 7 pages, accepted for IJCAI 201