10 research outputs found

    A Deep Learning Approach to Object Affordance Segmentation

    Learning to understand and infer object functionalities is an important step towards robust visual intelligence. Significant research efforts have recently focused on segmenting the object parts that enable specific types of human-object interaction, the so-called "object affordances". However, most works treat affordance segmentation as a static semantic segmentation problem, focusing solely on object appearance and relying on strong supervision and object detection. In this paper, we propose a novel approach that exploits the spatio-temporal nature of human-object interaction for affordance segmentation. In particular, we design an autoencoder that is trained using ground-truth labels of only the last frame of the sequence, and is able to infer pixel-wise affordance labels in both videos and static images. Our model removes the need for object labels and bounding boxes by using a soft-attention mechanism that enables the implicit localization of the interaction hotspot. For evaluation purposes, we introduce the SOR3D-AFF corpus, which consists of human-object interaction sequences and provides pixel-wise annotations for 9 affordance types, covering typical manipulations of tool-like objects. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF, while being able to predict affordances for similar unseen objects in two affordance image-only datasets.
    Comment: 5 pages, 4 figures, ICASSP 202
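
    The abstract describes a simple pattern: per-frame features are pooled with a learned spatial soft-attention map so the interaction hotspot is localized implicitly, and a decoder emits pixel-wise affordance logits that can be supervised from the last frame only. The sketch below is a minimal, hypothetical illustration of that pattern in PyTorch; the module name, layer sizes and pooling scheme are assumptions, not the authors' actual SOR3D-AFF model.

```python
# Minimal, hypothetical sketch (not the authors' exact SOR3D-AFF model): per-frame
# features are pooled with a learned spatial soft-attention map, so the interaction
# hotspot is localized implicitly, without object labels or bounding boxes.
import torch
import torch.nn as nn

class SoftAttentionAffordanceNet(nn.Module):
    def __init__(self, num_affordances=9):
        super().__init__()
        # Encoder: downsample each RGB frame to a feature map (assumed layer sizes).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Soft attention: one scalar score per spatial location.
        self.attn = nn.Conv2d(64, 1, 1)
        # Decoder: upsample back to pixel-wise affordance logits.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_affordances, 4, stride=2, padding=1),
        )

    def forward(self, frames):
        # frames: (B, T, 3, H, W) video clip; a static image is simply T = 1.
        b, t, c, h, w = frames.shape
        feats = self.encoder(frames.reshape(b * t, c, h, w))       # (B*T, 64, h', w')
        attn = torch.softmax(self.attn(feats).flatten(2), dim=-1)  # spatial softmax per frame
        attn = attn.reshape(b, t, 1, *feats.shape[-2:])
        feats = feats.reshape(b, t, 64, *feats.shape[-2:])
        # Attention-weighted temporal average collapses the clip to a single feature
        # map, so the same decoder serves both videos and static images.
        pooled = (feats * attn).sum(dim=1) / attn.sum(dim=1).clamp_min(1e-6)
        return self.decoder(pooled)  # (B, num_affordances, H, W) for H, W divisible by 4

# Training would apply a pixel-wise cross-entropy loss against the ground-truth
# annotation of only the last frame of each sequence, as described in the abstract.
```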

    On the study of deep learning active vision systems

    This thesis presents a series of investigations into active vision algorithms. An experimental method for evaluating active vision memory is proposed and used to demonstrate the benefits of a novel memory variant called the WW-LSTM network. A method for training active vision attention using classification gradients is proposed, and a proof of concept of an attentional spotlight algorithm that converts spatially arranged gradients into coordinate space is demonstrated. The thesis makes a number of empirically supported recommendations as to the structure of future active vision architectures.
    Chapter 1 discusses the motivation for pursuing active vision and sets out the objectives of this thesis. The chapter contains the thesis statement, a brief overview of the relevant background, and a list of the main contributions of this thesis to the literature. Chapter 2 describes an investigation into the utility of the software retina algorithm within the active vision paradigm. It discusses the initial research approach and motivations behind studying the retina, as well as the results that prompted a shift in the focus of this thesis away from the retina and onto active vision. The retina was found to slow down training to an infeasible pace, and in a later experiment it performed worse than a simple image-cropping algorithm on an image classification task.
    Chapter 3 contains a comprehensive and empirically supported literature review highlighting a number of issues and knowledge gaps in the relevant active vision literature. The review found the literature to be incoherent due to inconsistent terminology and the pursuit of disjointed approaches that do not reinforce each other. The literature was also found to contain a large number of pressing knowledge gaps, some of which were demonstrated experimentally. The review is accompanied by the proposal of an investigative framework devised to address the identified problems by structuring future active vision research.
    Chapter 4 investigates the means by which active vision systems collate the information they obtain across multiple observations; this aspect of active vision is referred to as memory. An experimental method for evaluating active vision memory in an interpretable manner is devised and applied to the study of a novel approach to recurrent memory called the WW-LSTM. The WW-LSTM is a parameter-efficient variant of the LSTM network that outperformed all other recurrent memory variants evaluated on an image classification task. Additionally, spatial concatenation in the input space was found to outperform all recurrent memory variants, calling into question a commonly employed approach in the active vision literature.
    Chapter 5 contains an investigation into active vision attention, the means by which the system decides where to look. The investigations demonstrate the benefits of employing a curriculum for training attention that modifies sensor parameters, and present an empirically backed argument in favour of implementing attention in a separate processing stream from classification. The chapter closes with a proposal of a novel method for leveraging classification gradients in training attention, called predictive attention; a first step in its pursuit is taken with a proof-of-concept demonstration of the hardcoded attention spotlight algorithm. The spotlight is demonstrated to facilitate the localisation of a hotspot in a modelled feature map via an optimisation process.
    Chapter 6 concludes the thesis by restating its objectives and summarising its key contributions. It closes with a discussion of recommended future work that can further advance our understanding of active vision in deep learning.
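
    The spotlight idea lends itself to a small worked example. The sketch below assumes the spotlight is a differentiable 2D Gaussian mask whose centre coordinates are optimised by gradient descent until the mask covers a hotspot in a modelled feature map, i.e. spatially arranged gradients are converted into coordinate space. All names and parameter values are illustrative assumptions; the thesis' exact spotlight and WW-LSTM formulations may differ.

```python
# Hypothetical proof-of-concept in the spirit of the spotlight described above
# (the thesis' exact formulation may differ): a differentiable 2D Gaussian mask
# whose centre is optimised by gradient descent until it covers the hotspot of a
# modelled feature map, converting spatially arranged gradients into coordinates.
import torch

def gaussian_spotlight(center, size=28, sigma=3.0):
    """Differentiable Gaussian mask centred at `center` = (x, y) on a size x size grid."""
    ys, xs = torch.meshgrid(
        torch.arange(size, dtype=torch.float32),
        torch.arange(size, dtype=torch.float32),
        indexing="ij",
    )
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

# A modelled feature map with a single hotspot at (x, y) = (20, 7).
feature_map = gaussian_spotlight(torch.tensor([20.0, 7.0]), sigma=2.0)

# Start the spotlight at the grid centre and optimise its coordinates so that the
# mask overlaps the hotspot; gradients defined on the masked features flow back
# into the (x, y) parameters.
center = torch.tensor([14.0, 14.0], requires_grad=True)
opt = torch.optim.SGD([center], lr=0.5)
for _ in range(300):
    opt.zero_grad()
    overlap = (gaussian_spotlight(center) * feature_map).sum()
    (-overlap).backward()  # maximise overlap by minimising its negative
    opt.step()

print(center.detach())  # moves towards the hotspot location (20, 7)
```

    Because the mask is differentiable in its centre, any objective defined on the masked feature map pushes the coordinates towards the informative region; on this reading, that is what makes a hardcoded spotlight a workable proof of concept for training attention from classification gradients.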

    End-to-End Policy Learning for Active Visual Categorization
