
    Action Classification and Highlighting in Videos

    Inspired by recent advances in neural machine translation that jointly align and translate using encoder-decoder networks equipped with attention, we propose an attention-based LSTM model for human activity recognition. Our model jointly learns to classify actions and highlight the frames associated with each action by attending to salient visual information through a jointly learned soft-attention network. We explore attention informed by various forms of visual semantic features, including those encoding actions, objects and scenes. We qualitatively show that soft-attention can learn to effectively attend to important object and scene information correlated with specific human actions. Further, we show quantitatively that our attention-based LSTM outperforms the vanilla LSTM and CNN models used by state-of-the-art methods. On ActivityNet, a large-scale YouTube video dataset, our model outperforms competing methods in action classification.
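    For illustration, the sketch below shows how one soft temporal attention step over per-frame features might look in plain NumPy. It is not the paper's model; the shapes, the attention parameterization, and the parameter names (W_att, v_att) are assumptions made for the example.

        # Hedged sketch of soft temporal attention over per-frame features,
        # assuming hypothetical shapes: T frames, each a D-dimensional CNN feature.
        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def soft_attention_step(frame_feats, hidden, W_att, v_att):
            """Score each frame against the current recurrent state, then return
            the attention-weighted feature that would feed the next LSTM step."""
            # frame_feats: (T, D), hidden: (H,), W_att: (D + H, A), v_att: (A,)
            T = frame_feats.shape[0]
            scores = np.array([
                v_att @ np.tanh(np.concatenate([frame_feats[t], hidden]) @ W_att)
                for t in range(T)
            ])
            alpha = softmax(scores)            # per-frame attention weights
            context = alpha @ frame_feats      # weighted sum over frames
            return context, alpha

        rng = np.random.default_rng(0)
        T, D, H, A = 8, 16, 32, 24
        context, alpha = soft_attention_step(
            rng.normal(size=(T, D)), rng.normal(size=H),
            rng.normal(size=(D + H, A)), rng.normal(size=A))
        print(alpha.round(2), context.shape)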

    Cross-Modal Attentional Context Learning for RGB-D Object Detection

    Recognizing objects from simultaneously sensed photometric (RGB) and depth channels is a fundamental yet practical problem in many machine vision applications such as robot grasping and autonomous driving. In this paper, we address this problem by developing a Cross-Modal Attentional Context (CMAC) learning framework, which enables full exploitation of the context information in both RGB and depth data. Compared to existing RGB-D object detection frameworks, our approach has several appealing properties. First, it consists of an attention-based global context model that exploits adaptive contextual information and incorporates it into a region-based CNN (e.g., Fast R-CNN) framework to achieve improved object detection performance. Second, our CMAC framework further contains a fine-grained object part attention module to harness multiple discriminative object parts inside each possible object region for superior local feature representation. While greatly improving the accuracy of RGB-D object detection, the effective cross-modal information fusion and attentional context modeling in our proposed model also provide an interpretable visualization scheme. Experimental results demonstrate that the proposed method significantly improves upon the state of the art on all public benchmarks.
    Comment: Accepted as a regular paper to IEEE Transactions on Image Processing
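    As a rough illustration of cross-modal attentional fusion, the snippet below gates per-region RGB and depth features with a learned sigmoid gate. This is a simplified stand-in, not the CMAC module itself; the feature dimensions and the gating form are assumptions.

        # Minimal sketch of gated cross-modal fusion over hypothetical per-region
        # RGB and depth feature vectors; not the paper's exact CMAC module.
        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def cross_modal_fuse(f_rgb, f_depth, W_gate, b_gate):
            """Predict a per-channel gate from both modalities and blend them."""
            joint = np.concatenate([f_rgb, f_depth])         # (2D,)
            gate = sigmoid(W_gate @ joint + b_gate)          # (D,), values in (0, 1)
            return gate * f_rgb + (1.0 - gate) * f_depth     # attended fusion

        rng = np.random.default_rng(1)
        D = 8
        fused = cross_modal_fuse(rng.normal(size=D), rng.normal(size=D),
                                 rng.normal(size=(D, 2 * D)), np.zeros(D))
        print(fused.shape)  # (8,)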

    A Review on Deep Learning Techniques Applied to Semantic Segmentation

    Semantic image segmentation is attracting increasing interest from computer vision and machine learning researchers. Many applications on the rise need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems, to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation and scene understanding. This paper provides a review of deep learning methods for semantic segmentation applied to various application areas. Firstly, we describe the terminology of this field as well as the necessary background concepts. Next, the main datasets and challenges are described to help researchers decide which best suit their needs and targets. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. Finally, quantitative results are given for the described methods and the datasets on which they were evaluated, followed by a discussion of those results. Lastly, we point out a set of promising future directions and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.
    Comment: Submitted to TPAMI on Apr. 22, 2017

    Mining Mid-level Visual Patterns with Deep CNN Activations

    The purpose of mid-level visual element discovery is to find clusters of image patches that are both representative and discriminative. Here we study this problem from the perspective of pattern mining while relying on the recently popularized Convolutional Neural Networks (CNNs). We observe that a fully-connected CNN activation extracted from an image patch typically possesses two appealing properties that enable its seamless integration with pattern mining techniques. The marriage between CNN activations and association rule mining, a well-known pattern mining technique in the literature, leads to fast and effective discovery of representative and discriminative patterns from a huge number of image patches. When we retrieve and visualize image patches with the same pattern, surprisingly, they are not only visually similar but also semantically consistent, and thus give rise to a mid-level visual element in our work. Given the patterns and the retrieved mid-level visual elements, we propose two methods to generate image feature representations. The first uses the patterns as codewords in a dictionary, similar to the Bag-of-Visual-Words model, to compute a Bag-of-Patterns representation. The second relies on the retrieved mid-level visual elements to construct a Bag-of-Elements representation. We evaluate the two encoding methods on scene and object classification tasks and demonstrate that our approach outperforms or matches recent works using CNN activations for these tasks.
    Comment: 20 pages
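    The pairing of CNN activations with pattern mining can be pictured as follows: treat each patch's strongest activation dimensions as a transaction and count frequent item pairs across patches. The toy version below uses plain counting rather than a full association rule miner, and the top-k and support thresholds are illustrative.

        # Rough sketch: each patch's k largest activation dimensions become a
        # "transaction"; frequent item pairs stand in for mined patterns.
        import numpy as np
        from itertools import combinations
        from collections import Counter

        def patches_to_transactions(activations, k=3):
            """Keep the indices of the k largest activations per patch as items."""
            return [set(np.argsort(a)[-k:]) for a in activations]

        def frequent_pairs(transactions, min_support=0.3):
            counts = Counter()
            for items in transactions:
                counts.update(combinations(sorted(items), 2))
            n = len(transactions)
            return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

        rng = np.random.default_rng(2)
        acts = rng.random(size=(50, 10))          # 50 patches, 10 activation dims
        patterns = frequent_pairs(patches_to_transactions(acts))
        print(len(patterns), "frequent patterns")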

    Multi-scale Volumes for Deep Object Detection and Localization

    This study analyzes the benefits of improved multi-scale reasoning for object detection and localization with deep convolutional neural networks. To that end, an efficient and general object detection framework which operates on scale volumes of a deep feature pyramid is proposed. In contrast to the proposed approach, most current state-of-the-art object detectors operate on a single scale during training, while testing involves independent evaluation across scales. One benefit of the proposed approach is better capture of multi-scale contextual information, resulting in significant gains in both detection performance and localization quality on the PASCAL VOC dataset and a multi-view highway vehicles dataset. The joint, scale-specific detection and localization models are shown to especially benefit detection of challenging object categories that exhibit large scale variation, as well as detection of small objects.
    Comment: To appear in Pattern Recognition 2017
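    A scale volume in the spirit described above can be illustrated by resizing pyramid feature maps to a common grid and stacking them along a scale axis; the map sizes and the nearest-neighbor resize below are assumptions made to keep the sketch dependency-free.

        # Illustrative sketch of stacking pyramid feature maps into a scale volume.
        import numpy as np

        def resize_nearest(fmap, out_h, out_w):
            h, w = fmap.shape
            rows = np.arange(out_h) * h // out_h
            cols = np.arange(out_w) * w // out_w
            return fmap[np.ix_(rows, cols)]

        def build_scale_volume(pyramid, out_h, out_w):
            """Resize each scale's map to a common grid and stack: (S, out_h, out_w)."""
            return np.stack([resize_nearest(f, out_h, out_w) for f in pyramid])

        rng = np.random.default_rng(3)
        pyramid = [rng.random((s, s)) for s in (8, 16, 32)]   # three scales
        volume = build_scale_volume(pyramid, 16, 16)
        print(volume.shape)  # (3, 16, 16)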

    Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships

    Context is important for accurate visual recognition. In this work we propose an object detection algorithm that not only considers object visual appearance but also makes use of two kinds of context: scene contextual information and object relationships within a single image. Object detection is therefore regarded as both a cognition problem and a reasoning problem when leveraging this structured information. Specifically, this paper formulates object detection as a problem of graph structure inference: given an image, the objects are treated as nodes in a graph and the relationships between objects are modeled as edges. To this end, we present the Structure Inference Network (SIN), a detector that augments a typical detection framework (e.g., Faster R-CNN) with a graphical model which aims to infer object state. Comprehensive experiments on the PASCAL VOC and MS COCO datasets indicate that scene context and object relationships truly improve the performance of object detection, producing more desirable and reasonable outputs.
    Comment: Published in CVPR 2018
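    One way to picture graph structure inference over detections is a simple message-passing loop in which each object node mixes its state with neighbor messages and a scene context vector. The update rule below is a hand-wavy stand-in for SIN's recurrent updates; the node features, edge weights, and mixing coefficient are all assumptions.

        # Sketch of graph-style state refinement over detected objects, with
        # hypothetical node features, a scene vector, and fixed edge weights.
        import numpy as np

        def refine_states(node_feats, scene_vec, edge_w, steps=3, alpha=0.5):
            """Repeatedly mix each node's state with neighbor messages and scene context."""
            states = node_feats.copy()
            for _ in range(steps):
                messages = edge_w @ states                  # (N, D) neighbor aggregation
                states = np.tanh(alpha * states + (1 - alpha) * (messages + scene_vec))
            return states

        rng = np.random.default_rng(4)
        N, D = 5, 8
        E = rng.random((N, N)); np.fill_diagonal(E, 0.0)
        E = E / E.sum(axis=1, keepdims=True)                # row-normalised edge weights
        out = refine_states(rng.normal(size=(N, D)), rng.normal(size=D), E)
        print(out.shape)  # (5, 8)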

    Context-based Object Viewpoint Estimation: A 2D Relational Approach

    The task of object viewpoint estimation has been a challenge since the early days of computer vision. To estimate the viewpoint (or pose) of an object, people have mostly looked at object-intrinsic features such as shape or appearance. Surprisingly, informative features provided by other, extrinsic elements in the scene have so far mostly been ignored. At the same time, contextual cues have proven to be of great benefit for related tasks such as object detection and action recognition. In this paper, we explore how information from other objects in the scene can be exploited for viewpoint estimation. In particular, we look at object configurations by following a relational neighbor-based approach for reasoning about object relations. We show that, starting from noisy object detections and viewpoint estimates, exploiting the estimated viewpoint and location of other objects in the scene can lead to improved object viewpoint predictions. Experiments on the KITTI dataset demonstrate that object configurations can indeed be used as a complementary cue to appearance-based viewpoint estimation. Our analysis reveals that the proposed context-based method can improve object viewpoint estimation by reducing specific types of viewpoint estimation errors commonly made by methods that only consider local information. Moreover, considering contextual information produces superior performance in scenes containing a high number of object instances. Finally, our results suggest that following a cautious relational neighbor formulation brings improvements over its aggressive counterpart for the task of object viewpoint estimation.
    Comment: Computer Vision and Image Understanding (CVIU)
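    The cautious relational-neighbor idea can be illustrated with a toy update that blends each object's viewpoint toward the circular mean of its neighbors only when those neighbors agree strongly. The angles, confidences, and agreement threshold below are hypothetical.

        # Toy cautious relational-neighbor update for viewpoint angles (radians).
        import numpy as np

        def cautious_viewpoint_update(angles, conf, agree_thresh=0.8):
            """Blend each angle toward the confidence-weighted circular mean of the
            others, but only when those neighbors agree strongly (cautious update)."""
            refined = angles.copy()
            for i in range(len(angles)):
                mask = np.arange(len(angles)) != i
                w = conf[mask]
                vec = np.array([np.sum(w * np.cos(angles[mask])),
                                np.sum(w * np.sin(angles[mask]))]) / w.sum()
                agreement = np.linalg.norm(vec)          # 1.0 when neighbors coincide
                if agreement >= agree_thresh:
                    nb_mean = np.arctan2(vec[1], vec[0])
                    refined[i] = np.arctan2(
                        0.5 * np.sin(angles[i]) + 0.5 * np.sin(nb_mean),
                        0.5 * np.cos(angles[i]) + 0.5 * np.cos(nb_mean))
            return refined

        angles = np.array([0.10, 0.15, 0.12, 2.0])       # one outlier viewpoint
        print(cautious_viewpoint_update(angles, np.ones(4)).round(2))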

    Learning Actor Relation Graphs for Group Activity Recognition

    Modeling relations between actors is important for recognizing group activity in a multi-person scene. This paper aims at learning discriminative relations between actors efficiently using deep models. To this end, we propose a flexible and efficient Actor Relation Graph (ARG) to simultaneously capture appearance and position relations between actors. Thanks to the Graph Convolutional Network, the connections in the ARG can be automatically learned from group activity videos in an end-to-end manner, and inference on the ARG can be performed efficiently with standard matrix operations. Furthermore, in practice, we introduce two variants that sparsify the ARG for more effective modeling in videos: a spatially localized ARG and a temporally randomized ARG. We perform extensive experiments on two standard group activity recognition datasets, the Volleyball dataset and the Collective Activity dataset, and achieve state-of-the-art performance on both. We also visualize the learned actor graphs and relation features, which demonstrate that the proposed ARG is able to capture the discriminative relation information for group activity recognition.
    Comment: Accepted by CVPR 2019
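    A minimal sketch of an actor relation graph is given below: relation weights come from appearance dot-products damped by pairwise distance, followed by one graph-convolution step. This is an illustrative simplification, not the paper's exact ARG; the feature sizes and the distance kernel are assumptions.

        # Simplified actor relation graph plus one graph-convolution step.
        import numpy as np

        def row_softmax(x):
            e = np.exp(x - x.max(axis=1, keepdims=True))
            return e / e.sum(axis=1, keepdims=True)

        def actor_relation_graph(feats, centres, sigma=1.0):
            """Relation weights from appearance dot-products, damped by actor distance."""
            app = feats @ feats.T                                   # appearance affinity
            d2 = ((centres[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
            return row_softmax(app - d2 / (2 * sigma ** 2))         # (N, N) relation graph

        def gcn_layer(G, feats, W):
            return np.maximum(G @ feats @ W, 0.0)                   # ReLU(G X W)

        rng = np.random.default_rng(5)
        N, D = 6, 16
        G = actor_relation_graph(rng.normal(size=(N, D)), rng.random((N, 2)))
        print(gcn_layer(G, rng.normal(size=(N, D)), rng.normal(size=(D, D))).shape)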

    Image-Level Attentional Context Modeling Using Nested-Graph Neural Networks

    We introduce a new scene graph generation method called image-level attentional context modeling (ILAC). Our model includes an attentional graph network that effectively propagates contextual information across the graph using image-level features. Whereas previous works use an object-centric context, we build an image-level context agent to encode the scene properties. The proposed method comprises a single-stream network that iteratively refines the scene graph with a nested graph neural network. We demonstrate that our approach achieves competitive performance with the state of the art for scene graph generation on the Visual Genome dataset, while requiring fewer parameters than other methods. We also show that ILAC can improve regular object detectors by incorporating relational image-level information.
    Comment: NIPS 2018, Relational Representation Learning Workshop
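    The image-level context idea can be pictured as alternating between summarizing object nodes into a global context vector and pushing that context back into every node. The sketch below is a loose illustration under assumed feature sizes, not the ILAC architecture itself.

        # Loose sketch of an image-level context agent refining object node features.
        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def refine_with_image_context(nodes, img_feat, iters=2):
            """Alternate between summarising nodes into the image-level context and
            pushing that context back into every node."""
            context = img_feat.copy()
            for _ in range(iters):
                attn = softmax(nodes @ context)              # (N,) node relevance
                context = img_feat + attn @ nodes            # context absorbs node info
                nodes = np.tanh(nodes + context)             # nodes absorb context
            return nodes, context

        rng = np.random.default_rng(6)
        nodes, ctx = refine_with_image_context(rng.normal(size=(7, 12)),
                                               rng.normal(size=12))
        print(nodes.shape, ctx.shape)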

    STEP: Spatio-Temporal Progressive Learning for Video Action Detection

    In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos. Starting from a handful of coarse-scale proposal cuboids, our approach progressively refines the proposals towards actions over a few steps. In this way, high-quality proposals (i.e., proposals that adhere to action movements) can be gradually obtained at later steps by leveraging the regression outputs from previous steps. At each step, we adaptively extend the proposals in time to incorporate more related temporal context. Compared to prior work that performs action detection in one run, our progressive learning framework is able to naturally handle the spatial displacement within action tubes and therefore provides a more effective way for spatio-temporal modeling. We extensively evaluate our approach on UCF101 and AVA and demonstrate superior detection results. Remarkably, we achieve mAPs of 75.0% and 18.6% on the two datasets with 3 progressive steps and using only 11 and 34 initial proposals, respectively.
    Comment: CVPR 2019 (Oral)
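    Progressive proposal refinement can be sketched as a loop that temporally extends each cuboid and then applies a regression step; here a stand-in regressor simply nudges proposals toward a fixed target box, whereas the actual detector predicts deltas with a network. The cuboid layout (t_start, t_end, x, y, w, h) and all numbers are illustrative.

        # Toy progressive refinement over hypothetical proposal cuboids.
        import numpy as np

        def extend_in_time(props, margin=1):
            props = props.copy()
            props[:, 0] -= margin        # earlier start frame
            props[:, 1] += margin        # later end frame
            return props

        def refine_step(props, target, lr=0.5):
            """Stand-in for one regression step: move spatial fields toward the target."""
            props = props.copy()
            props[:, 2:] += lr * (target[2:] - props[:, 2:])
            return props

        target = np.array([0.0, 10.0, 50.0, 40.0, 20.0, 30.0])
        proposals = np.array([[2.0, 6.0, 10.0, 10.0, 40.0, 60.0],
                              [3.0, 7.0, 80.0, 70.0, 10.0, 15.0]])
        for step in range(3):
            proposals = refine_step(extend_in_time(proposals), target)
        print(proposals.round(1))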