1,764 research outputs found

    Multi-View Picking: Next-best-view Reaching for Improved Grasping in Clutter

    Camera viewpoint selection is an important aspect of visual grasp detection, especially in clutter where many occlusions are present. Where other approaches use a static camera position or fixed data collection routines, our Multi-View Picking (MVP) controller uses an active perception approach to choose informative viewpoints based directly on a distribution of grasp pose estimates in real time, reducing uncertainty in the grasp poses caused by clutter and occlusions. In trials of grasping 20 objects from clutter, our MVP controller achieves 80% grasp success, outperforming a single-viewpoint grasp detector by 12%. We also show that our approach is both more accurate and more efficient than approaches which consider multiple fixed viewpoints. Comment: ICRA 2019. Video: https://youtu.be/Vn3vSPKlaEk Code: https://github.com/dougsm/mvp_gras

    Facial Expression Analysis under Partial Occlusion: A Survey

    Automatic machine-based Facial Expression Analysis (FEA) has made substantial progress in the past few decades, driven by its importance for applications in psychology, security, health, entertainment and human-computer interaction. The vast majority of completed FEA studies are based on non-occluded faces collected in a controlled laboratory environment. Automatic expression recognition tolerant to partial occlusion remains less understood, particularly in real-world scenarios. In recent years, efforts to investigate techniques for handling partial occlusion in FEA have increased. The time is right for a comprehensive view of these developments and of the state of the art. This survey provides such a comprehensive review of recent advances in dataset creation, algorithm development, and investigations of the effects of occlusion critical for robust performance in FEA systems. It outlines existing challenges in overcoming partial occlusion and discusses possible opportunities in advancing the technology. To the best of our knowledge, it is the first FEA survey dedicated to occlusion and aimed at promoting better informed and benchmarked future work. Comment: Authors' pre-print of the article accepted for publication in ACM Computing Surveys (accepted on 02-Nov-2017

    EARL: Eye-on-Hand Reinforcement Learner for Dynamic Grasping with Active Pose Estimation

    In this paper, we explore the dynamic grasping of moving objects through active pose tracking and reinforcement learning for hand-eye coordination systems. Most existing vision-based robotic grasping methods implicitly assume target objects are stationary or moving predictably. Grasping unpredictably moving objects presents a unique set of challenges. For example, a pre-computed robust grasp can become unreachable or unstable as the target object moves, and motion planning must also be adaptive. In this work, we present a new approach, Eye-on-hAnd Reinforcement Learner (EARL), for enabling coupled Eye-on-Hand (EoH) robotic manipulation systems to perform real-time active pose tracking and dynamic grasping of novel objects without explicit motion prediction. EARL readily addresses many thorny issues in automated hand-eye coordination, including fast tracking of 6D object pose from vision, learning a control policy for a robotic arm to track a moving object while keeping the object in the camera's field of view, and performing dynamic grasping. We demonstrate the effectiveness of our approach in extensive experiments validated on multiple commercial robotic arms in both simulations and complex real-world tasks. Comment: Presented at IROS 2023. Corresponding author Siddarth Jai

    Robust fulfillment of constraints in robot visual servoing

    In this work, an approach based on sliding mode ideas is proposed to satisfy constraints in robot visual servoing. In particular, different types of constraints are defined in order to: fulfill the visibility constraints (camera field-of-view and occlusions) for the image features of the detected object; avoid exceeding the joint range limits and maximum joint speeds; and avoid forbidden areas in the robot workspace. Moreover, another, low-priority task is considered to track the target object. The main advantages of the proposed approach are low computational cost, robustness and full utilization of the allowed space for the constraints. The applicability and effectiveness of the proposed approach are demonstrated by simulation results for a simple 2D case and a complex 3D case study. Furthermore, the feasibility and robustness of the proposed approach are substantiated by experimental results using a conventional 6R industrial manipulator. This work was supported in part by the Spanish Government under grants BES-2010-038486 and Project DPI2013-42302-R, and the Generalitat Valenciana under grants VALi+d APOSTD/2016/044 and BEST/2017/029. Muñoz-Benavent, P.; Gracia Calandin, LI.; Solanes Galbis, JE.; Esparza Peidro, A.; Tornero Montserrat, J. (2018). Robust fulfillment of constraints in robot visual servoing. Control Engineering Practice. 71(1):79-95. https://doi.org/10.1016/j.conengprac.2017.10.017

    Saliency-based approaches for multidimensional explainability of deep networks

    In deep learning, visualization techniques extract the salient patterns exploited by deep networks to perform a task (e.g. image classification), focusing on single images. These methods allow a better understanding of these complex models, enabling the identification of the most informative parts of the input data. Beyond understanding deep networks, visual saliency is useful for many quantitative purposes and applications, both in the 2D and 3D domains, such as the analysis of the generalization capabilities of a classifier and autonomous navigation. In this thesis, we describe an approach to cope with the interpretability problem of a convolutional neural network and propose our ideas on how to exploit the visualization for applications like image classification and active object recognition. After a brief overview of common visualization methods producing attention/saliency maps, we address two separate points. First, we describe how visual saliency can be effectively used in the 2D domain (e.g. RGB images) to boost image classification performance: visual summaries, i.e. compact representations of an ensemble of saliency maps, can be used to improve the classification accuracy of a network through summary-driven specializations. Then, we present a 3D active recognition system that considers different views of a target object, overcoming the single-view hypothesis of classical object recognition and making the classification problem much easier in principle. Here we adopt such attention maps in a quantitative fashion, by building a 3D dense saliency volume which fuses together saliency maps obtained from different viewpoints, obtaining a continuous proxy for which parts of an object are more discriminative for a given classifier. Finally, we show how to inject these representations into a real-world application, so that an agent (e.g. a robot) can move knowing the capabilities of its classifier.

    An original framework for understanding human actions and body language by using deep neural networks

    The evolution of the fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. The processing of body movements, in turn, plays a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements; both are essential tasks in many computer vision applications, including event recognition and video surveillance. In this Ph.D. thesis, an original framework for understanding actions and body language is presented. The framework is composed of three main modules: in the first one, a method based on Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) for the recognition of sign language and semaphoric hand gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided. The performance of LSTM-RNNs is explored in depth, due to their ability to model the long-term contextual information of temporal sequences, making them suitable for analysing body movements. All the modules were tested on challenging datasets that are well known in the state of the art, showing remarkable results compared to the methods in the current literature.

    Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos

    Despite rapid advances in face recognition, there remains a clear gap between the performance of still image-based face recognition and video-based face recognition, due to the vast difference in visual quality between the domains and the difficulty of curating diverse large-scale video datasets. This paper addresses both of those challenges, through an image to video feature-level domain adaptation approach, to learn discriminative video frame representations. The framework utilizes large-scale unlabeled video data to reduce the gap between different domains while transferring discriminative knowledge from large-scale labeled still images. Given a face recognition network that is pretrained in the image domain, the adaptation is achieved by (i) distilling knowledge from the network to a video adaptation network through feature matching, (ii) performing feature restoration through synthetic data augmentation and (iii) learning a domain-invariant feature through a domain adversarial discriminator. We further improve performance through a discriminator-guided feature fusion that boosts high-quality frames while eliminating those degraded by video domain-specific factors. Experiments on the YouTube Faces and IJB-A datasets demonstrate that each module contributes to our feature-level domain adaptation framework and substantially improves video face recognition performance to achieve state-of-the-art accuracy. We demonstrate qualitatively that the network learns to suppress diverse artifacts in videos such as pose, illumination or occlusion without being explicitly trained for them. Comment: accepted for publication at International Conference on Computer Vision (ICCV) 201