22,558 research outputs found

    Activity Driven Weakly Supervised Object Detection

    Full text link
    Weakly supervised object detection aims at reducing the amount of supervision required to train detection models. Such models are traditionally learned from images/videos labelled only with the object class and not the object bounding box. In our work, we try to leverage not only the object class labels but also the action labels associated with the data. We show that the action depicted in the image/video can provide strong cues about the location of the associated object. We learn a spatial prior for the object dependent on the action (e.g. "ball" is closer to "leg of the person" in "kicking ball"), and incorporate this prior to simultaneously train a joint object detection and action classification model. We conducted experiments on both video datasets and image datasets to evaluate the performance of our weakly supervised object detection model. Our approach outperformed the current state-of-the-art (SOTA) method by more than 6% in mAP on the Charades video dataset.Comment: CVPR'19 camera read

    Target-Tailored Source-Transformation for Scene Graph Generation

    Get PDF
    Scene graph generation aims to provide a semantic and structural description of an image, denoting the objects (with nodes) and their relationships (with edges). The best performing works to date are based on exploiting the context surrounding objects or relations,e.g., by passing information among objects. In these approaches, to transform the representation of source objects is a critical process for extracting information for the use by target objects. In this work, we argue that a source object should give what tar-get object needs and give different objects different information rather than contributing common information to all targets. To achieve this goal, we propose a Target-TailoredSource-Transformation (TTST) method to efficiently propagate information among object proposals and relations. Particularly, for a source object proposal which will contribute information to other target objects, we transform the source object feature to the target object feature domain by simultaneously taking both the source and target into account. We further explore more powerful representations by integrating language prior with the visual context in the transformation for the scene graph generation. By doing so the target object is able to extract target-specific information from the source object and source relation accordingly to refine its representation. Our framework is validated on the Visual Genome bench-mark and demonstrated its state-of-the-art performance for the scene graph generation. The experimental results show that the performance of object detection and visual relation-ship detection are promoted mutually by our method

    Pixelwise Instance Segmentation with a Dynamically Instantiated Network

    Full text link
    Semantic segmentation and object detection research have recently achieved rapid progress. However, the former task has no notion of different instances of the same object, and the latter operates at a coarse, bounding-box level. We propose an Instance Segmentation system that produces a segmentation map where each pixel is assigned an object class and instance identity label. Most approaches adapt object detectors to produce segments instead of boxes. In contrast, our method is based on an initial semantic segmentation module, which feeds into an instance subnetwork. This subnetwork uses the initial category-level segmentation, along with cues from the output of an object detector, within an end-to-end CRF to predict instances. This part of our model is dynamically instantiated to produce a variable number of instances per image. Our end-to-end approach requires no post-processing and considers the image holistically, instead of processing independent proposals. Therefore, unlike some related work, a pixel cannot belong to multiple instances. Furthermore, far more precise segmentations are achieved, as shown by our state-of-the-art results (particularly at high IoU thresholds) on the Pascal VOC and Cityscapes datasets.Comment: CVPR 201

    Context-aware CNNs for person head detection

    Full text link
    Person detection is a key problem for many computer vision tasks. While face detection has reached maturity, detecting people under a full variation of camera view-points, human poses, lighting conditions and occlusions is still a difficult challenge. In this work we focus on detecting human heads in natural scenes. Starting from the recent local R-CNN object detector, we extend it with two types of contextual cues. First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image. Second, we explicitly model pairwise relations among objects and train a Pairwise CNN model using a structured-output surrogate loss. The Local, Global and Pairwise models are combined into a joint CNN framework. To train and test our full model, we introduce a large dataset composed of 369,846 human heads annotated in 224,740 movie frames. We evaluate our method and demonstrate improvements of person head detection against several recent baselines in three datasets. We also show improvements of the detection speed provided by our model.Comment: To appear in International Conference on Computer Vision (ICCV), 201
    • …
    corecore