
    Few-shot linguistic grounding of visual attributes and relations using Gaussian kernels

    Understanding complex visual scenes is one of the fundamental problems in computer vision, but learning in this domain is challenging due to the inherent richness of the visual world and the vast number of possible scene configurations. Current state-of-the-art approaches to scene understanding often employ deep networks, which require large and densely annotated datasets. This goes against the seemingly intuitive learning abilities of humans and our ability to generalise from few examples to unseen situations. In this paper, we propose a unified framework for learning visual representations of words denoting attributes such as “blue” and relations such as “left of”, based on Gaussian models operating in a simple, unified feature space. The strength of our model is that it only requires a small number of weak annotations and is able to generalise easily to unseen situations, such as recognising object relations in unusual configurations. We demonstrate the effectiveness of our model on the predicate detection task. Our model outperforms the state of the art on this task in both the normal and zero-shot scenarios, while training on a dataset an order of magnitude smaller.
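
    A minimal, hypothetical sketch of the grounding idea described above (not the authors' code): a word such as “blue” is modelled by fitting a single regularised Gaussian to a handful of feature vectors in an assumed feature space (mean RGB here), and new candidates are scored by log-likelihood.

```python
import numpy as np

class GaussianConcept:
    """Fit one Gaussian to the feature vectors of a word's few examples."""

    def __init__(self, reg=1e-3):
        self.reg = reg  # diagonal regulariser, needed with very few samples

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + self.reg * np.eye(X.shape[1])
        self.cov_inv = np.linalg.inv(cov)
        self.log_det = np.linalg.slogdet(cov)[1]
        return self

    def log_likelihood(self, x):
        d = np.asarray(x, dtype=float) - self.mean
        k = d.shape[0]
        return -0.5 * (d @ self.cov_inv @ d + self.log_det + k * np.log(2 * np.pi))

# Ground "blue" from three weakly annotated examples (assumed mean-RGB features)
# and score two new patches.
blue = GaussianConcept().fit([[0.10, 0.20, 0.90],
                              [0.15, 0.25, 0.85],
                              [0.05, 0.10, 0.95]])
print(blue.log_likelihood([0.12, 0.22, 0.88]))  # high: a bluish patch
print(blue.log_likelihood([0.90, 0.10, 0.10]))  # low: a reddish patch
```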

    PlaNet-ClothPick: effective fabric flattening based on latent dynamic planning

    Why do Recurrent State Space Models such as PlaNet fail at cloth manipulation tasks? Recent work has attributed this to the blurry prediction of the observation, which makes it difficult to plan directly in the latent space. This paper explores the reasons behind this by applying PlaNet in the pick-and-place fabric-flattening domain. We find that the sharp discontinuity of the transition function at the contour of the fabric makes it difficult to learn an accurate latent dynamics model, causing the MPC planner to produce pick actions slightly outside of the article. By limiting the picking space to the cloth mask and training on specially engineered trajectories, our mesh-free PlaNet-ClothPick surpasses visual planning and policy-learning methods on principal metrics in simulation, achieving similar performance to state-of-the-art mesh-based planning approaches. Notably, our model exhibits faster action inference and requires fewer transition-model parameters than the state-of-the-art robotic systems in this domain. Supplementary materials are available at: https://sites.google.com/view/planet-clothpick
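
    One concrete ingredient of the approach above is constraining pick actions to the cloth mask so the planner cannot grasp just outside the fabric. The snippet below is a toy illustration of that constraint only (the helper name and (row, col) pixel convention are assumptions; the actual system plans in PlaNet's latent space, which is not reproduced here).

```python
import numpy as np

def project_pick_to_mask(pick_rc, cloth_mask):
    """Snap a (row, col) pick proposal to the nearest pixel on the cloth mask.

    cloth_mask: 2-D boolean array, True where fabric is visible.
    Returns the proposal unchanged if it already lies on the cloth.
    """
    r, c = int(round(pick_rc[0])), int(round(pick_rc[1]))
    if cloth_mask[r, c]:
        return (r, c)
    rows, cols = np.nonzero(cloth_mask)        # coordinates of all on-cloth pixels
    d2 = (rows - r) ** 2 + (cols - c) ** 2     # squared distances to the proposal
    i = int(np.argmin(d2))
    return (int(rows[i]), int(cols[i]))

# A planner proposes a pick slightly off the fabric; snap it back onto the cloth.
mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                          # toy square of fabric
print(project_pick_to_mask((0, 0), mask))      # -> (2, 2)
```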

    Data-driven robotic manipulation of cloth-like deformable objects: the present, challenges and future prospects

    Manipulating cloth-like deformable objects (CDOs) is a long-standing problem in the robotics community. CDOs are flexible (non-rigid) objects that do not show a detectable level of compression strength when two points on the article are pushed towards each other; they include objects such as ropes (1D), fabrics (2D) and bags (3D). In general, CDOs’ many degrees of freedom (DoF) introduce severe self-occlusion and complex state–action dynamics as significant obstacles to perception and manipulation systems. These challenges exacerbate existing issues of modern robotic control methods such as imitation learning (IL) and reinforcement learning (RL). This review focuses on the application of data-driven control methods to four major task families in this domain: cloth shaping, knot tying/untying, dressing and bag manipulation. Furthermore, we identify specific inductive biases in these four domains that present challenges for more general IL and RL algorithms.

    Visualization as Intermediate Representations (VLAIR) for human activity recognition

    Ambient, binary, event-driven sensor data is useful for many human activity recognition applications such as smart homes and ambient-assisted living. These sensors are privacy-preserving, unobtrusive, inexpensive and easy to deploy in scenarios that require detection of simple activities such as going to sleep and leaving the house. However, classification performance is still a challenge, especially when multiple people share the same space or when different activities take place in the same areas. To improve classification performance, we develop what we call the Visualization as Intermediate Representations (VLAIR) approach. The main idea is to re-represent the data as visualizations (generated pixel images), in a similar way to how visualizations are created for humans to analyze and communicate data. We can then feed these images to a convolutional neural network, whose strength resides in extracting effective visual features. We have tested five variants (mappings) of the VLAIR approach and compared them to a collection of classifiers commonly used in classic human activity recognition. The best of the VLAIR approaches outperforms the best baseline, with a strong advantage in recognising less frequent activities and in distinguishing users and activities in common areas. We conclude the paper with a discussion of why and how VLAIR can be useful in human activity recognition scenarios and beyond.
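
    The sketch below illustrates the re-representation step with one hypothetical mapping (the sensor layout, image size and decay factor are assumptions, not taken from the paper): each binary activation is drawn at its sensor's floor-plan position with an intensity that encodes recency, and the resulting single-channel image would then be passed to a CNN (omitted here).

```python
import numpy as np

# Hypothetical floor-plan coordinates (row, col) for a few binary sensors.
SENSOR_POSITIONS = {
    "door_front": (2, 2),
    "motion_kitchen": (10, 20),
    "bed_pressure": (25, 5),
}

def events_to_image(events, shape=(32, 32), decay=0.9):
    """Render a window of (timestep, sensor_id) events as a grayscale image.

    Recency is encoded as brightness, so the picture summarises the window in a
    way a human could also read; the array can be fed to a CNN as one channel.
    """
    img = np.zeros(shape, dtype=float)
    t_last = max(t for t, _ in events)
    for t, sensor in events:
        r, c = SENSOR_POSITIONS[sensor]
        img[r, c] = max(img[r, c], decay ** (t_last - t))  # brighter = more recent
    return img

# Three events in one activity window: front door, then kitchen, then bed.
window = [(0, "door_front"), (3, "motion_kitchen"), (7, "bed_pressure")]
image = events_to_image(window)
print(image.shape, round(image.max(), 3))
```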

    Interpretable feature maps for robot attention

    Attention is crucial for autonomous agents interacting with complex environments. In a real scenario, our expectations drive attention, as we look for crucial objects to complete our understanding of the scene. But most visual attention models to date are designed to drive attention in a bottom-up fashion, without context, and the features they use are not always suitable for driving top-down attention. In this paper, we present an attentional mechanism based on semantically meaningful, interpretable features. We show how to generate a low-level semantic representation of the scene in real time, which can be used to search for objects based on specific features such as colour, shape, orientation, speed, and texture.
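
    As a hedged illustration of interpretable feature maps driving top-down attention (not the paper's implementation), the sketch below builds one map as per-pixel similarity to a target colour and fuses named maps with task-dependent weights that encode what the robot is currently looking for.

```python
import numpy as np

def colour_feature_map(rgb_image, target_rgb, sigma=0.2):
    """Per-pixel similarity to a target colour: one interpretable feature map."""
    diff = rgb_image.astype(float) - np.asarray(target_rgb, dtype=float)
    dist2 = (diff ** 2).sum(axis=-1)
    return np.exp(-dist2 / (2 * sigma ** 2))

def top_down_attention(feature_maps, weights):
    """Weighted sum of named feature maps; the weights express what we seek."""
    acc = np.zeros_like(next(iter(feature_maps.values())))
    for name, fmap in feature_maps.items():
        acc += weights.get(name, 0.0) * fmap
    return acc / max(acc.max(), 1e-9)              # normalise to [0, 1]

# Bias attention towards red, roughly round objects (shape map is a stub here).
img = np.random.rand(64, 64, 3)
maps = {
    "red": colour_feature_map(img, (1.0, 0.0, 0.0)),
    "round": np.random.rand(64, 64),               # placeholder for a shape map
}
attention = top_down_attention(maps, {"red": 0.7, "round": 0.3})
print(attention.shape, float(attention.max()))
```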

    BIMP: A real-time biological model of multi-scale keypoint detection in V1

    We present an improved, biologically inspired and multiscale keypoint operator. Models of single- and double-stopped hypercomplex cells in area V1 of the mammalian visual cortex are used to detect stable points of high complexity at multiple scales. Keypoints represent line and edge crossings, junctions and terminations at fine scales, and blobs at coarse scales. They are detected by applying first and second derivatives to responses of complex cells in combination with two inhibition schemes to suppress responses along lines and edges. A number of optimisations make our new algorithm much faster than previous biologically inspired models, achieving real-time performance on modern GPUs and competitive speeds on CPUs. In this paper we show that the keypoints exhibit state-of-the-art repeatability in standardised benchmarks, often yielding best-in-class performance. This makes them interesting both in biological models and as a useful detector in practice. We also show that keypoints can be used as a data selection step, significantly reducing the complexity in state-of-the-art object categorisation.
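
    The snippet below is only a crude stand-in for the line/edge inhibition idea (a Harris-style structure-tensor test, not the V1 cell model used by BIMP): points whose local structure varies in two directions, such as junctions and crossings, survive, while responses along plain edges are suppressed.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def complexity_keypoints(img, k=0.05, win=3, thresh=1e-4):
    """Keep points whose structure varies in two directions (junctions, crossings,
    blobs) and suppress points that vary in only one (plain lines and edges)."""
    gy, gx = np.gradient(img.astype(float))
    # Structure tensor entries, averaged over a small window.
    sxx = uniform_filter(gx * gx, win)
    syy = uniform_filter(gy * gy, win)
    sxy = uniform_filter(gx * gy, win)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    response = det - k * trace ** 2
    return response > thresh

# The corners of a bright square should respond; its straight edges should not.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
print(np.argwhere(complexity_keypoints(img))[:4])
```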

    A parametric spectral model for texture-based salience

    We present a novel saliency mechanism based on texture. Local texture at each pixel is characterised by the 2D spectrum obtained from oriented Gabor filters. We then apply a parametric model and describe the texture at each pixel by a combination of two 1D Gaussian approximations. This results in a simple model which consists of only four parameters. These four parameters are then used as feature channels and standard Difference-of-Gaussian blob detection is applied in order to detect salient areas in the image, similar to the Itti and Koch model. Finally, a diffusion process is used to sharpen the resulting regions. Evaluation on a large saliency dataset shows a significant improvement of our method over the baseline Itti and Koch model.
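
    A minimal sketch of the four-parameter description for a single pixel, assuming the oriented Gabor energies are already available and ignoring the circularity of orientation (an illustration of the parameterisation, not the paper's code):

```python
import numpy as np

def spectral_parameters(gabor_energy, orientations, frequencies):
    """Summarise one pixel's 2-D Gabor spectrum with two 1-D Gaussian fits:
    (mean, std) over orientation and (mean, std) over frequency -> 4 parameters.

    gabor_energy: array of shape (n_orientations, n_frequencies) for one pixel.
    """
    e = np.asarray(gabor_energy, dtype=float)
    w = e / max(e.sum(), 1e-9)                    # normalise to a distribution
    p_orient = w.sum(axis=1)                      # marginal over orientations
    p_freq = w.sum(axis=0)                        # marginal over frequencies
    mu_o = (p_orient * orientations).sum()
    sd_o = np.sqrt((p_orient * (orientations - mu_o) ** 2).sum())
    mu_f = (p_freq * frequencies).sum()
    sd_f = np.sqrt((p_freq * (frequencies - mu_f) ** 2).sum())
    return np.array([mu_o, sd_o, mu_f, sd_f])

# A pixel whose texture is strongly oriented near 45 degrees at mid frequency.
orients = np.array([0.0, 45.0, 90.0, 135.0])
freqs = np.array([0.1, 0.2, 0.4])
energy = np.zeros((4, 3)); energy[1, 1] = 1.0
print(spectral_parameters(energy, orients, freqs))  # -> [45.  0.  0.2 0. ]
```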

    Re-identification of individuals from images using spot constellations: a case study in Arctic charr (Salvelinus alpinus)

    The long-term monitoring of Arctic charr in lava caves is funded by the Icelandic Research Fund, RANNÍS (research grant nos. 120227 and 162893). E.A.M. was supported by the Icelandic Research Fund, RANNÍS (grant no. 162893) and a NERC research grant awarded to M.B.M. (grant no. NE/R011109/1). M.B.M. was supported by a University Research Fellowship from the Royal Society (London). C.A.L. and B.K.K. were supported by Hólar University, Iceland. The Titan Xp GPU used for this research was donated to K.T. by the NVIDIA Corporation.

    The ability to re-identify individuals is fundamental to the individual-based studies that are required to estimate many important ecological and evolutionary parameters in wild populations. Traditional methods of marking individuals and tracking them through time can be invasive and imperfect, which can affect these estimates and create uncertainties for population management. Here we present a photographic re-identification method that uses spot constellations in images to match specimens through time. Photographs of Arctic charr (Salvelinus alpinus) were used as a case study. Classical computer vision techniques were compared with new deep-learning techniques for mask and spot extraction. We found that a U-Net approach trained on a small set of human-annotated photographs performed substantially better than a baseline feature-engineering approach. For matching the spot constellations, two algorithms were adapted, and, depending on whether a fully or semi-automated set-up is preferred, we show how either one or a combination of these algorithms can be implemented. Within our case study, our pipeline both successfully identified unmarked individuals from photographs alone and re-identified individuals that had lost tags, resulting in an approximately 4[…]. Our multi-step pipeline involves little human supervision and could be applied to many organisms.
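
    The paper adapts two existing constellation-matching algorithms; the sketch below is neither of them, but conveys the general flavour of scoring a match between two spot sets: centre both constellations, compute an optimal one-to-one assignment, and count the spots that align within a distance threshold.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constellation_score(spots_a, spots_b, max_dist=5.0):
    """Fraction of spots matched between two constellations after centring,
    using an optimal one-to-one assignment; higher suggests the same individual."""
    a = np.asarray(spots_a, float) - np.mean(spots_a, axis=0)
    b = np.asarray(spots_b, float) - np.mean(spots_b, axis=0)
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    rows, cols = linear_sum_assignment(cost)
    matched = int((cost[rows, cols] < max_dist).sum())
    return matched / max(len(a), len(b))

# The same constellation observed again with a pure translation scores 1.0.
spots1 = np.array([[10.0, 10.0], [20.0, 15.0], [30.0, 30.0], [12.0, 28.0]])
spots2 = spots1 + 2.0
print(constellation_score(spots1, spots2))
```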

    Biologically inspired vision for human-robot interaction

    Human-robot interaction is an interdisciplinary research area that is becoming more and more relevant as robots start to enter our homes, workplaces, schools, etc. In order to navigate safely among us, robots must be able to understand human behavior, to communicate, and to interpret instructions from humans, either by recognizing their speech or by understanding their body movements and gestures. We present a biologically inspired vision system for human-robot interaction which integrates several components: visual saliency, stereo vision, face and hand detection and gesture recognition. Visual saliency is computed using color, motion and disparity. Both the stereo vision and gesture recognition components are based on keypoints coded by means of cortical V1 simple, complex and end-stopped cells. Hand and face detection is achieved by using a linear SVM classifier. The system was tested on a child-sized robot.
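
    A minimal, hypothetical sketch of the saliency-fusion step (channel weights and normalisation are assumptions, not the paper's choices): each conspicuity map is rescaled to [0, 1] and the channels are averaged with task-dependent weights.

```python
import numpy as np

def combine_saliency(colour, motion, disparity, weights=(1.0, 1.0, 1.0)):
    """Fuse per-pixel conspicuity maps into one saliency map: rescale each channel
    to [0, 1], then take a weighted average."""
    maps = [np.asarray(m, dtype=float) for m in (colour, motion, disparity)]
    maps = [(m - m.min()) / max(m.max() - m.min(), 1e-9) for m in maps]
    return sum(w * m for w, m in zip(weights, maps)) / sum(weights)

# Toy conspicuity maps: a region that is both moving and close should dominate.
h, w = 48, 64
colour = np.random.rand(h, w)
motion = np.zeros((h, w)); motion[20:30, 30:40] = 1.0
disparity = np.zeros((h, w)); disparity[20:30, 30:40] = 0.8
saliency = combine_saliency(colour, motion, disparity)
print(saliency.shape, np.unravel_index(saliency.argmax(), saliency.shape))
```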