Predicting Motivations of Actions by Leveraging Text
Understanding human actions is a key problem in computer vision. However,
recognizing actions is only the first step of understanding what a person is
doing. In this paper, we introduce the problem of predicting why a person has
performed an action in images. This problem has many applications in human
activity understanding, such as anticipating or explaining an action. To study
this problem, we introduce a new dataset of people performing actions annotated
with likely motivations. However, the information in an image alone may not be
sufficient to automatically solve this task. Since humans can rely on their
lifetime of experiences to infer motivation, we propose to give computer vision
systems access to some of these experiences by using recently developed natural
language models to mine knowledge stored in massive amounts of text. While we
are still far away from fully understanding motivation, our results suggest
that transferring knowledge from language into vision can help machines
understand why people in images might be performing an action.
Comment: CVPR 201
A region-based image caption generator with refined descriptions
Describing the content of an image is a challenging task. To enable detailed description, it requires the detection and recognition of objects, people, relationships and associated attributes. Currently, the majority of existing research relies on holistic techniques, which may lose details relating to important aspects of a scene. To address this challenge, we propose a novel region-based deep learning architecture for image description generation. It employs a regional object detector, recurrent neural network (RNN)-based attribute prediction, and an encoder–decoder language generator embedded with two RNNs to produce refined and detailed descriptions of a given image. Most importantly, the proposed system takes a local, region-based approach to improve upon existing holistic methods, attending specifically to image regions containing people and objects. Evaluated on the IAPR TC-12 dataset, the proposed system shows impressive performance and outperforms state-of-the-art methods on various evaluation metrics. In particular, the proposed system shows superiority over existing methods when dealing with cross-domain indoor scene images.
Enhancing image captioning with depth information using a Transformer-based framework
Captioning images is a challenging scene-understanding task that connects
computer vision and natural language processing. While image captioning models
have been successful in producing excellent descriptions, the field has
primarily focused on generating a single sentence for 2D images. This paper
investigates whether integrating depth information with RGB images can enhance
the captioning task and generate better descriptions. For this purpose, we
propose a Transformer-based encoder-decoder framework for generating a
multi-sentence description of a 3D scene. The RGB image and its corresponding
depth map are provided as inputs to our framework, which combines them to
produce a better understanding of the input scene. Depth maps could be ground
truth or estimated, which makes our framework widely applicable to any RGB
captioning dataset. We explored different fusion approaches to fuse RGB and
depth images. The experiments are performed on the NYU-v2 dataset and the
Stanford image paragraph captioning dataset. During our work with the NYU-v2
dataset, we found inconsistent labeling that prevents the benefit of using
depth information to enhance the captioning task. The results were even worse
than using RGB images only. As a result, we propose a cleaned version of the
NYU-v2 dataset that is more consistent and informative. Our results on both
datasets demonstrate that the proposed framework effectively benefits from
depth information, whether it is ground truth or estimated, and generates
better captions. Code, pre-trained models, and the cleaned version of the
NYU-v2 dataset will be made publicly available.
Comment: 19 pages, 5 figures, 13 tables
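One of the fusion approaches the abstract alludes to can be illustrated with early fusion: concatenating the depth map as a fourth channel alongside RGB before the encoder sees the input. The sketch below is a minimal, framework-free illustration under that assumption; the function name and data layout are ours, not the paper's.

```python
# Minimal sketch of early RGB-depth fusion: append the per-pixel depth
# value (ground truth or estimated) as a fourth channel. The fused
# H x W x 4 input would then feed the Transformer encoder.

def early_fusion(rgb, depth):
    """Concatenate a depth channel onto each RGB pixel.

    rgb:   H x W list of (r, g, b) tuples
    depth: H x W list of per-pixel depth values
    Returns an H x W list of (r, g, b, d) tuples.
    """
    if len(rgb) != len(depth):
        raise ValueError("RGB and depth maps must match in height")
    fused = []
    for rgb_row, d_row in zip(rgb, depth):
        if len(rgb_row) != len(d_row):
            raise ValueError("RGB and depth maps must match in width")
        fused.append([(r, g, b, d) for (r, g, b), d in zip(rgb_row, d_row)])
    return fused

# A 1x2 toy "image": two RGB pixels and their depth values.
rgb = [[(255, 0, 0), (0, 255, 0)]]
depth = [[0.5, 2.0]]
print(early_fusion(rgb, depth))
# [[(255, 0, 0, 0.5), (0, 255, 0, 2.0)]]
```

Because the fusion step only requires a depth value per pixel, it works identically whether the depth map is ground truth or estimated, which is what makes the framework applicable to any RGB captioning dataset.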