4,159 research outputs found

    Weakly-supervised learning of visual relations

    Get PDF
    This paper introduces a novel approach for modeling visual relations between pairs of objects. We call relation a triplet of the form (subject, predicate, object) where the predicate is typically a preposition (eg. 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging as the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge comes from the difficulty to get annotations, especially at box-level, for all possible triplets, which makes both learning and evaluation difficult. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration for pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third we introduce a new challenging dataset of unusual relations (UnRel) together with an exhaustive annotation, that enables accurate evaluation of visual relation retrieval. We show experimentally that our model results in state-of-the-art results on the visual relationship dataset significantly improving performance on previously unseen relations (zero-shot learning), and confirm this observation on our newly introduced UnRel dataset

    Weakly-supervised learning of visual relations

    Full text link
    This paper introduces a novel approach for modeling visual relations between pairs of objects. We call relation a triplet of the form (subject, predicate, object) where the predicate is typically a preposition (eg. 'under', 'in front of') or a verb ('hold', 'ride') that links a pair of objects (subject, object). Learning such relations is challenging as the objects have different spatial configurations and appearances depending on the relation in which they occur. Another major challenge comes from the difficulty to get annotations, especially at box-level, for all possible triplets, which makes both learning and evaluation difficult. The contributions of this paper are threefold. First, we design strong yet flexible visual features that encode the appearance and spatial configuration for pairs of objects. Second, we propose a weakly-supervised discriminative clustering model to learn relations from image-level labels only. Third we introduce a new challenging dataset of unusual relations (UnRel) together with an exhaustive annotation, that enables accurate evaluation of visual relation retrieval. We show experimentally that our model results in state-of-the-art results on the visual relationship dataset significantly improving performance on previously unseen relations (zero-shot learning), and confirm this observation on our newly introduced UnRel dataset

    Automatic annotation for weakly supervised learning of detectors

    Get PDF
    PhDObject detection in images and action detection in videos are among the most widely studied computer vision problems, with applications in consumer photography, surveillance, and automatic media tagging. Typically, these standard detectors are fully supervised, that is they require a large body of training data where the locations of the objects/actions in images/videos have been manually annotated. With the emergence of digital media, and the rise of high-speed internet, raw images and video are available for little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive. As a result there has been a great interest in training detectors with weak supervision where only the presence or absence of object/action in image/video is needed, not the location. This thesis presents approaches for weakly supervised learning of object/action detectors with a focus on automatically annotating object and action locations in images/videos using only binary weak labels indicating the presence or absence of object/action in images/videos. First, a framework for weakly supervised learning of object detectors in images is presented. In the proposed approach, a variation of multiple instance learning (MIL) technique for automatically annotating object locations in weakly labelled data is presented which, unlike existing approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial annotation is then used to start an iterative process in which standard object detectors are used to refine the location annotation. Finally, to ensure that the iterative training of detectors do not drift from the object of interest, a scheme for detecting model drift is also presented. Furthermore, unlike most other methods, our weakly supervised approach is evaluated on data without manual pose (object orientation) annotation. Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues, is carried out. From the analysis, a new method based on negative mining (NegMine) is presented for the initial annotation of both object and action data. The NegMine based approach is a much simpler formulation using only inter-class measure and requires no complex combinatorial optimisation but can still meet or outperform existing approaches including the previously pre3 sented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing approaches to boost their performance. Finally, the thesis will take a step back and look at the use of generic object detectors as prior knowledge in weakly supervised learning of object detectors. These generic object detectors are typically based on sampling saliency maps that indicate if a pixel belongs to the background or foreground. A new approach to generating saliency maps is presented that, unlike existing approaches, looks beyond the current image of interest and into images similar to the current image. We show that our generic object proposal method can be used by itself to annotate the weakly labelled object data with surprisingly high accuracy

    Cross-task weakly supervised learning from instructional videos

    Get PDF
    In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: `pour egg' should be trained jointly with other tasks involving `pour' and `egg'. We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Past data does not permit systematic studying of sharing and so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level and that our component model can parse previously unseen tasks by virtue of its compositionality.Comment: 18 pages, 17 figures, to be published in proceedings of the CVPR, 201
    corecore