Action Classification and Highlighting in Videos
Inspired by recent advances in neural machine translation that jointly align
and translate using encoder-decoder networks equipped with attention, we
propose an attention-based LSTM model for human activity recognition. Our model
jointly learns to classify actions and highlight frames associated with the
action, by attending to salient visual information through jointly learned
soft-attention networks. We explore attention informed by various forms of
visual semantic features, including those encoding actions, objects and scenes.
We qualitatively show that soft-attention can learn to effectively attend to
important objects and scene information correlated with specific human actions.
Further, we show that, quantitatively, our attention-based LSTM outperforms the
vanilla LSTM and CNN models used by state-of-the-art methods. On a large-scale
YouTube video dataset, ActivityNet, our model outperforms competing methods in
action classification.
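As a rough illustration of the mechanism this abstract describes, the sketch
below implements temporal soft attention feeding an LSTM: at each step, frame
features are weighted by their relevance to the current hidden state, and the
attention weights double as per-frame highlight scores. The feature dimension,
single-layer attention scorer, and class count are illustrative assumptions,
not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTM(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=200):
        super().__init__()
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        # Scores each frame feature against the previous hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim, bias=False)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.att_out = nn.Linear(hidden_dim, 1, bias=False)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                        # frames: (B, T, feat_dim)
        B, T, D = frames.shape
        h = frames.new_zeros(B, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        alphas = []
        for _ in range(T):
            # Soft attention: weight all frames by relevance to the state.
            e = self.att_out(torch.tanh(self.att_feat(frames) +
                                        self.att_hid(h).unsqueeze(1)))
            alpha = F.softmax(e, dim=1)               # (B, T, 1)
            context = (alpha * frames).sum(dim=1)     # (B, feat_dim)
            h, c = self.lstm(context, (h, c))
            alphas.append(alpha.squeeze(-1))
        # Attention weights serve as per-frame highlight scores.
        return self.classifier(h), torch.stack(alphas, dim=1)
```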
Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth
channels is a fundamental yet practical problem in many machine vision
applications such as robot grasping and autonomous driving. In this paper, we
address this problem by developing a Cross-Modal Attentional Context (CMAC)
learning framework, which enables the full exploitation of the context
information from both RGB and depth data. Compared to existing RGB-D object
detection frameworks, our approach has several appealing properties. First, it
consists of an attention-based global context model for exploiting adaptive
contextual information and incorporating this information into a region-based
CNN (e.g., Fast R-CNN) framework to achieve improved object detection
performance. Second, our CMAC framework further contains a fine-grained object
part attention module to harness multiple discriminative object parts inside
each possible object region for superior local feature representation. While
greatly improving the accuracy of RGB-D object detection, the effective
cross-modal information fusion as well as attentional context modeling in our
proposed model provide an interpretable visualization scheme. Experimental
results demonstrate that the proposed method significantly improves upon the
state of the art on all public benchmarks.
Comment: Accepted as a regular paper to IEEE Transactions on Image Processing.
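A minimal sketch of the cross-modal idea in the spirit of this abstract, where
each modality gates the other's feature map before fusion. The module name,
gating scheme, channel count, and 1x1 convolutions are assumptions; the paper's
actual global-context and part-attention modules are more elaborate.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Each modality predicts a spatial attention map for the other.
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat, depth_feat):        # both: (B, C, H, W)
        # Depth tells RGB where to look, and vice versa.
        rgb_attended = rgb_feat * self.depth_gate(depth_feat)
        depth_attended = depth_feat * self.rgb_gate(rgb_feat)
        return self.fuse(torch.cat([rgb_attended, depth_attended], dim=1))

fused = CrossModalFusion()(torch.randn(2, 256, 32, 32),
                           torch.randn(2, 256, 32, 32))
```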
A Review on Deep Learning Techniques Applied to Semantic Segmentation
Image semantic segmentation is of increasing interest to computer vision and
machine learning researchers. Many applications on the rise need
accurate and efficient segmentation mechanisms: autonomous driving, indoor
navigation, and even virtual or augmented reality systems to name a few. This
demand coincides with the rise of deep learning approaches in almost every
field or application target related to computer vision, including semantic
segmentation or scene understanding. This paper provides a review on deep
learning methods for semantic segmentation applied to various application
areas. Firstly, we describe the terminology of this field as well as mandatory
background concepts. Next, the main datasets and challenges are presented to
help researchers decide which best suit their needs and targets. Then,
existing methods are reviewed, highlighting their contributions
and their significance in the field. Finally, quantitative results are given
for the described methods and the datasets in which they were evaluated,
following up with a discussion of the results. Lastly, we point out a set of
promising future directions and draw our own conclusions about the state of
the art of semantic segmentation using deep learning techniques.
Comment: Submitted to TPAMI on Apr. 22, 2017.
Mining Mid-level Visual Patterns with Deep CNN Activations
The purpose of mid-level visual element discovery is to find clusters of
image patches that are both representative and discriminative. Here we study
this problem from the perspective of pattern mining while relying on the
recently popularized Convolutional Neural Networks (CNNs). We observe that a
fully-connected CNN activation extracted from an image patch typically
possesses two appealing properties that enable its seamless integration with
pattern mining techniques. The marriage between CNN activations and association
rule mining, a well-known pattern mining technique in the literature, leads to
fast and effective discovery of representative and discriminative patterns from
a huge number of image patches. When we retrieve and visualize image patches
with the same pattern, surprisingly, they are not only visually similar but
also semantically consistent, and thus give rise to a mid-level visual element
in our work. Given the patterns and retrieved mid-level visual elements, we
propose two methods to generate image feature representations for each. The
first uses the patterns as codewords in a dictionary, similar to the
Bag-of-Visual-Words model, to compute a Bag-of-Patterns representation. The
second one relies on the retrieved mid-level visual elements to construct a
Bag-of-Elements representation. We evaluate the two encoding methods on scene
and object classification tasks, and demonstrate that our approach outperforms
or matches recent works using CNN activations for these tasks.
Comment: 20 pages.
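The pipeline the abstract describes can be sketched compactly: each patch's
fully-connected activations are binarized into a "transaction" of its top-k
activation indices, and frequent itemsets over those transactions act as
candidate mid-level patterns. The toy itemset counter below stands in for a
real miner such as Apriori; the values of k, the itemset size, and the support
threshold are illustrative.

```python
from itertools import combinations
from collections import Counter
import numpy as np

def activations_to_transactions(fc_acts, k=20):
    """fc_acts: (num_patches, dim) array of fc-layer CNN activations."""
    # Only the indices of the k strongest activations matter per patch.
    return [frozenset(np.argsort(row)[-k:]) for row in fc_acts]

def mine_patterns(transactions, itemset_size=3, min_support=0.01):
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), itemset_size):
            counts[itemset] += 1
    threshold = min_support * len(transactions)
    # Frequent itemsets = candidate mid-level patterns; patches containing
    # a pattern are retrieved to form a mid-level visual element.
    return [s for s, c in counts.items() if c >= threshold]

# Toy random activations standing in for real fc features.
patterns = mine_patterns(activations_to_transactions(np.random.rand(1000, 4096)))
```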
Multi-scale Volumes for Deep Object Detection and Localization
This study aims to analyze the benefits of improved multi-scale reasoning for
object detection and localization with deep convolutional neural networks. To
that end, an efficient and general object detection framework which operates on
scale volumes of a deep feature pyramid is proposed. In contrast to the
proposed approach, most current state-of-the-art object detectors operate on a
single scale during training, while testing involves independent evaluation across
scales. One benefit of the proposed approach is in better capturing of
multi-scale contextual information, resulting in significant gains in both
detection performance and localization quality of objects on the PASCAL VOC
dataset and a multi-view highway vehicles dataset. The joint detection and
localization scale-specific models are shown to especially benefit detection of
challenging object categories which exhibit large scale variation as well as
detection of small objects.
Comment: To appear in Pattern Recognition 2017.
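One plausible reading of a "scale volume", sketched below under assumed
shapes: features from every pyramid level are resampled to a common grid and
stacked along a scale axis, so a downstream model can reason jointly across
scales instead of evaluating each scale independently.

```python
import torch
import torch.nn.functional as F

def build_scale_volume(pyramid, out_size=(7, 7)):
    """pyramid: list of (B, C, H_i, W_i) feature maps at decreasing scales."""
    # Resize every level to a common grid so scales align spatially,
    # then stack along a new scale axis: (B, S, C, h, w).
    resized = [F.interpolate(level, size=out_size, mode='bilinear',
                             align_corners=False) for level in pyramid]
    return torch.stack(resized, dim=1)

pyr = [torch.randn(2, 256, s, s) for s in (56, 28, 14, 7)]
volume = build_scale_volume(pyr)   # shape: (2, 4, 256, 7, 7)
```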
Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships
Context is important for accurate visual recognition. In this work we propose
an object detection algorithm that not only considers object visual appearance,
but also makes use of two kinds of context including scene contextual
information and object relationships within a single image. Therefore, object
detection is regarded as both a cognition problem and a reasoning problem when
leveraging this structured information. Specifically, this paper formulates
object detection as a problem of graph structure inference, where given an
image the objects are treated as nodes in a graph and relationships between the
objects are modeled as edges in such a graph. To this end, we present the
Structure Inference Network (SIN), a detector that augments a typical
detection framework (e.g., Faster R-CNN) with a graphical model which aims to
infer object state. Comprehensive experiments on PASCAL VOC and MS COCO
datasets indicate that scene context and object relationships truly improve the
performance of object detection with more desirable and reasonable outputs.
Comment: Published in CVPR 2018.
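A minimal sketch of the graph view SIN takes, with assumed dimensions and an
assumed edge scorer: proposals are nodes, a learned pairwise score weights
messages between them, and a GRU folds scene-level context and neighbor
messages into each node state over a few inference steps.

```python
import torch
import torch.nn as nn

class GraphInference(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.edge_score = nn.Linear(2 * dim, 1)   # relationship strength
        self.node_gru = nn.GRUCell(dim, dim)

    def forward(self, nodes, scene, steps=2):     # nodes: (N, dim), scene: (dim,)
        for _ in range(steps):
            N = nodes.size(0)
            # All ordered node pairs, scored for relationship strength.
            pairs = torch.cat([nodes.unsqueeze(1).expand(N, N, -1),
                               nodes.unsqueeze(0).expand(N, N, -1)], dim=-1)
            w = torch.softmax(self.edge_score(pairs).squeeze(-1), dim=1)
            messages = w @ nodes                  # aggregate neighbor states
            # Scene context and object relationships drive the update.
            nodes = self.node_gru(messages + scene.unsqueeze(0), nodes)
        return nodes

states = GraphInference()(torch.randn(10, 512), torch.randn(512))
```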
Context-based Object Viewpoint Estimation: A 2D Relational Approach
The task of object viewpoint estimation has been a challenge since the early
days of computer vision. To estimate the viewpoint (or pose) of an object,
people have mostly looked at object intrinsic features, such as shape or
appearance. Surprisingly, informative features provided by other, extrinsic
elements in the scene have so far mostly been ignored. At the same time,
contextual cues have been proven to be of great benefit for related tasks such
as object detection or action recognition. In this paper, we explore how
information from other objects in the scene can be exploited for viewpoint
estimation. In particular, we look at object configurations by following a
relational neighbor-based approach for reasoning about object relations. We
show that, starting from noisy object detections and viewpoint estimates,
exploiting the estimated viewpoint and location of other objects in the scene
can lead to improved object viewpoint predictions. Experiments on the KITTI
dataset demonstrate that object configurations can indeed be used as a
complementary cue to appearance-based viewpoint estimation. Our analysis
reveals that the proposed context-based method can improve object viewpoint
estimation by reducing specific types of viewpoint estimation errors commonly
made by methods that only consider local information. Moreover, considering
contextual information produces superior performance in scenes where a high
number of object instances occur. Finally, our results suggest that a cautious
relational neighbor formulation brings improvements over its aggressive
counterpart for the task of object viewpoint estimation.
Comment: Computer Vision and Image Understanding (CVIU).
Learning Actor Relation Graphs for Group Activity Recognition
Modeling relation between actors is important for recognizing group activity
in a multi-person scene. This paper aims at learning discriminative relation
between actors efficiently using deep models. To this end, we propose to build
a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture
the appearance and position relation between actors. Thanks to the Graph
Convolutional Network, the connections in ARG can be automatically learned
from group activity videos in an end-to-end manner, and inference on ARG
can be efficiently performed with standard matrix operations. Furthermore, in
practice, we come up with two variants to sparsify ARG for more effective
modeling in videos: spatially localized ARG and temporal randomized ARG. We
perform extensive experiments on two standard group activity recognition
datasets, the Volleyball dataset and the Collective Activity dataset, and
achieve state-of-the-art performance on both. We also visualize
the learned actor graphs and relation features, which demonstrate that the
proposed ARG is able to capture the discriminative relation information for
group activity recognition.
Comment: Accepted by CVPR 2019.
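A minimal sketch of an actor relation graph in this spirit, with assumed
embedding sizes and an assumed distance mask: embedded dot-products between
actor features form a learned adjacency matrix, spatially distant pairs are
masked out (one way to realize the spatially localized variant), and a single
graph convolution with a residual connection mixes actor features.

```python
import torch
import torch.nn as nn

class ActorRelationGraph(nn.Module):
    def __init__(self, dim=1024, emb=256, max_dist=0.3):
        super().__init__()
        self.theta = nn.Linear(dim, emb)
        self.phi = nn.Linear(dim, emb)
        self.gconv = nn.Linear(dim, dim)
        self.max_dist = max_dist

    def forward(self, feats, centers):   # feats: (N, dim), centers: (N, 2)
        # Appearance relation: embedded dot-product between actor pairs.
        logits = self.theta(feats) @ self.phi(feats).t()        # (N, N)
        # Position relation: keep only spatially local actor pairs
        # (the diagonal always survives, so softmax stays well defined).
        dist = torch.cdist(centers, centers)
        logits = logits.masked_fill(dist > self.max_dist, float('-inf'))
        A = torch.softmax(logits, dim=1)                        # adjacency
        # One graph-convolution step with a residual connection.
        return feats + torch.relu(self.gconv(A @ feats))

out = ActorRelationGraph()(torch.randn(12, 1024), torch.rand(12, 2))
```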
Image-Level Attentional Context Modeling Using Nested-Graph Neural Networks
We introduce a new scene graph generation method called image-level
attentional context modeling (ILAC). Our model includes an attentional graph
network that effectively propagates contextual information across the graph
using image-level features. Whereas previous works use an object-centric
context, we build an image-level context agent to encode the scene properties.
The proposed method comprises a single-stream network that iteratively refines
the scene graph with a nested graph neural network. We demonstrate that our
approach achieves competitive performance with the state-of-the-art for scene
graph generation on the Visual Genome dataset, while requiring fewer parameters
than other methods. We also show that ILAC can improve regular object detectors
by incorporating relational image-level information.
Comment: NIPS 2018, Relational Representation Learning Workshop.
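Since the abstract leaves the architecture at a high level, the sketch below
is only one illustrative reading of an "image-level context agent": a single
context vector, initialized from global image features, exchanges messages
with all object nodes over a few refinement iterations. Names and dimensions
are assumptions, not ILAC's exact design.

```python
import torch
import torch.nn as nn

class ImageLevelContext(nn.Module):
    def __init__(self, dim=512, steps=3):
        super().__init__()
        self.to_ctx = nn.GRUCell(dim, dim)    # context agent reads objects
        self.to_obj = nn.GRUCell(dim, dim)    # objects read the context agent
        self.steps = steps

    def forward(self, obj_feats, img_feat):   # (N, dim), (dim,)
        ctx = img_feat.unsqueeze(0)           # (1, dim) context agent
        for _ in range(self.steps):
            # Context agent summarizes all objects, then broadcasts back.
            ctx = self.to_ctx(obj_feats.mean(dim=0, keepdim=True), ctx)
            obj_feats = self.to_obj(ctx.expand_as(obj_feats), obj_feats)
        return obj_feats, ctx.squeeze(0)

refined, ctx = ImageLevelContext()(torch.randn(10, 512), torch.randn(512))
```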
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
In this paper, we propose the Spatio-TEmporal Progressive (STEP) action
detector, a progressive learning framework for spatio-temporal action
detection in videos. Starting from a handful of coarse-scale proposal cuboids,
our approach progressively refines the proposals towards actions over a few
steps. In this way, high-quality proposals (i.e., ones that adhere to action movements)
can be gradually obtained at later steps by leveraging the regression outputs
from previous steps. At each step, we adaptively extend the proposals in time
to incorporate more related temporal context. Compared to the prior work that
performs action detection in one run, our progressive learning framework is
able to naturally handle the spatial displacement within action tubes and
therefore provides a more effective way for spatio-temporal modeling. We
extensively evaluate our approach on UCF101 and AVA, and demonstrate superior
detection results. Remarkably, we achieve mAPs of 75.0% and 18.6% on the two
datasets with 3 progressive steps, using only 11 and 34 initial proposals,
respectively.
Comment: CVPR 2019 (Oral).
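A minimal sketch of the progressive loop described above, with a placeholder
feature pooler and assumed dimensions: at each step the proposal cuboids are
extended in time, pooled into features, and regressed toward the action, so
later steps start from the previous step's refined proposals. The extension
rule and both heads are stand-ins for the paper's learned components.

```python
import torch
import torch.nn as nn

class ProgressiveRefiner(nn.Module):
    def __init__(self, feat_dim=512, num_classes=24, steps=3):
        super().__init__()
        self.regress = nn.Linear(feat_dim, 4)        # per-frame box offsets
        self.classify = nn.Linear(feat_dim, num_classes)
        self.steps = steps

    def forward(self, cuboids, pool_features, extend=2):
        """cuboids: (N, T, 4) boxes over T frames; pool_features is a
        placeholder for RoI feature pooling, returning (N, T, feat_dim)."""
        for _ in range(self.steps):
            # Extend proposals in time to pull in more temporal context
            # (here: naive replication of the boundary boxes).
            first, last = cuboids[:, :1], cuboids[:, -1:]
            cuboids = torch.cat([first.repeat(1, extend, 1), cuboids,
                                 last.repeat(1, extend, 1)], dim=1)
            feats = pool_features(cuboids)
            # Later steps start from this step's regression outputs.
            cuboids = cuboids + self.regress(feats)
        return cuboids, self.classify(feats.mean(dim=1))

# Toy usage with a stand-in pooler that returns random features.
pooler = lambda c: torch.randn(c.size(0), c.size(1), 512)
boxes, scores = ProgressiveRefiner()(torch.randn(4, 8, 4), pooler)
```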