171 research outputs found

    Enhancing Generalizable 6D Pose Tracking of an In-Hand Object with Tactile Sensing

    While holding and manipulating an object, humans track the object's state through vision and touch in order to accomplish complex tasks. However, most current robotics research perceives object state from visual signals alone, which severely limits robotic manipulation abilities. This work presents TEG-Track, a tactile-enhanced generalizable 6D pose tracking design for tracking previously unseen in-hand objects. TEG-Track extracts tactile kinematic cues of an in-hand object from consecutive tactile sensing signals. These cues are incorporated into a geometric-kinematic optimization scheme to enhance existing generalizable visual trackers. To test our method in real scenarios and enable future studies on generalizable visual-tactile tracking, we collect a real visual-tactile in-hand object pose tracking dataset. Experiments show that TEG-Track significantly improves state-of-the-art generalizable 6D pose trackers in both synthetic and real cases.

    TrackAgent: 6D Object Tracking via Reinforcement Learning

    Tracking an object's 6D pose, while either the object itself or the observing camera is moving, is important for many robotics and augmented reality applications. While exploiting temporal priors eases this problem, object-specific knowledge is required to recover when tracking is lost. Under the tight time constraints of the tracking task, RGB(D)-based methods are often conceptually complex or rely on heuristic motion models. In comparison, we propose to simplify object tracking to a reinforced point cloud (depth only) alignment task. This allows us to train a streamlined approach from scratch with limited amounts of sparse 3D point clouds, compared to the large datasets of diverse RGBD sequences required in previous works. We incorporate temporal frame-to-frame registration with object-based recovery by frame-to-model refinement using a reinforcement learning (RL) agent that jointly solves for both objectives. We also show that the RL agent's uncertainty and a rendering-based mask propagation are effective reinitialization triggers. Comment: International Conference on Computer Vision Systems (ICVS) 202

    Video based Object 6D Pose Estimation using Transformers

    We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, while being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason from long-range dependencies efficiently, thus iteratively refining over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with the state-of-the-art Transformer methods and performs significantly better than CNN-based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://github.com/ApoorvaBeedu/VideoPose. Comment: arXiv admin note: text overlap with arXiv:2111.1067

    Ambiguity-Aware Multi-Object Pose Optimization for Visually-Assisted Robot Manipulation

    6D object pose estimation aims to infer the relative pose between the object and the camera using a single image or multiple images. Most works have focused on predicting the object pose without associated uncertainty under occlusion and structural ambiguity (symmetry). However, these works demand prior information about shape attributes, and this condition is hardly satisfied in reality; even asymmetric objects may appear symmetric under a viewpoint change. In addition, acquiring and fusing diverse sensor data is challenging when extending them to robotics applications. Tackling these limitations, we present an ambiguity-aware 6D object pose estimation network, PrimA6D++, as a generic uncertainty prediction method. The major challenges in pose estimation, such as occlusion and symmetry, can be handled in a generic manner based on the measured ambiguity of the prediction. Specifically, we devise a network to reconstruct the three rotation axis primitive images of a target object and predict the underlying uncertainty along each primitive axis. Leveraging the estimated uncertainty, we then optimize multi-object poses using visual measurements and camera poses by treating it as an object SLAM problem. The proposed method shows a significant performance improvement on the T-LESS and YCB-Video datasets. We further demonstrate real-time scene recognition capability for visually-assisted robot manipulation. Our code and supplementary materials are available at https://github.com/rpmsnu/PrimA6D. Comment: IEEE Robotics and Automation Letters

    CSA6D: Channel-Spatial Attention Networks for 6D Object Pose Estimation

    6D object pose estimation plays a crucial role in robotic manipulation and grasping tasks. Estimating the 6D object pose from RGB or RGB-D images means detecting objects and estimating their orientations and translations relative to given canonical models. RGB-D cameras provide two sensory modalities, RGB and depth images, which can improve estimation accuracy, but exploiting two different modality sources remains challenging. In this paper, inspired by recent work on attention networks that focus on important regions and ignore unnecessary information, we propose a novel network, the Channel-Spatial Attention Network (CSA6D), to estimate the 6D object pose from an RGB-D camera. CSA6D includes a pre-trained 2D network that segments the objects of interest from the RGB image. It then uses two separate networks to extract appearance and geometric features from the RGB and depth images for each segmented object. The two feature vectors for each pixel are stacked together as a fusion vector, which is refined by an attention module to generate an aggregated feature vector. The attention module includes a channel attention block and a spatial attention block, which effectively leverage the concatenated embeddings for accurate 6D pose prediction on known objects. We evaluate the proposed network on two benchmark datasets, YCB-Video and LineMod, and the results show that it can outperform previous state-of-the-art methods under the ADD and ADD-S metrics. In addition, the attention map shows that the proposed network searches for unique geometric information as the most likely features for pose estimation. From these experiments, we conclude that the proposed network can accurately estimate the object pose by effectively leveraging multi-modality features.
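
The ADD and ADD-S metrics named in this abstract can be illustrated with a small sketch. It follows the commonly used definitions (ADD: mean distance between corresponding model points under the ground-truth and estimated poses; ADD-S: mean closest-point distance, intended for symmetric objects) and is not taken from the CSA6D code. The function names, the `(rotation, translation)` pose tuples, and the two-point toy model are assumptions made for illustration.

```python
import math


def transform(points, rotation, translation):
    """Apply x -> R x + t to each 3D point (rotation is a 3x3 nested list)."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def add_metric(points, pose_gt, pose_est):
    """ADD: mean distance between corresponding transformed model points."""
    gt = transform(points, *pose_gt)
    est = transform(points, *pose_est)
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)


def add_s_metric(points, pose_gt, pose_est):
    """ADD-S: mean distance from each ground-truth point to its closest estimated point."""
    gt = transform(points, *pose_gt)
    est = transform(points, *pose_est)
    return sum(min(math.dist(a, b) for b in est) for a in gt) / len(points)


# Hypothetical toy model: two points, symmetric under a half turn about the z axis.
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half_turn_z = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]]
origin = (0.0, 0.0, 0.0)

gt_pose = (identity, origin)
flipped_pose = (half_turn_z, origin)

print(add_metric(model, gt_pose, flipped_pose))    # 2.0
print(add_s_metric(model, gt_pose, flipped_pose))  # 0.0
```

In this toy case the estimated pose is a half turn away from the ground truth, so ADD reports a large error, while ADD-S reports zero because the rotated point set coincides with the original one.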

    Physics-Based Object 6D-Pose Estimation during Non-Prehensile Manipulation

    We propose a method to track the 6D pose of an object over time while the object is under non-prehensile manipulation by a robot. At any given time during the manipulation of the object, we assume access to the robot joint controls and an image from a camera. We use the robot joint controls to perform a physics-based prediction of how the object might be moving. We then combine this prediction with the observation coming from the camera to estimate the object pose as accurately as possible. We use a particle filtering approach to combine the control information with the visual information. We compare the proposed method with two baselines: (i) using only an image-based pose estimation system at each time-step, and (ii) a particle filter which does not perform the computationally expensive physics predictions, but assumes the object moves with constant velocity. Our results show that making physics-based predictions is worth the computational cost, resulting in more accurate tracking and in estimates of the object pose even when the object is not clearly visible to the camera.
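
The predict-then-correct structure described in this abstract can be sketched as a minimal bootstrap particle filter. This is a one-dimensional toy, not the authors' system: the state is a single position instead of a 6D pose, the camera observation is a noisy position reading, and the motion model is passed in as a hypothetical `predict_fn`. Here it is the constant-velocity assumption of baseline (ii); a physics-based prediction driven by robot controls would take its place in the proposed method. All names and noise values are assumptions.

```python
import math
import random


def particle_filter_step(particles, predict_fn, measurement, meas_std, motion_std, rng):
    """One predict / weight / resample cycle of a bootstrap particle filter."""
    # Predict: push every particle through the motion model and add process noise.
    predicted = [predict_fn(p) + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the measurement given each predicted particle.
    weights = [
        math.exp(-0.5 * ((measurement - p) / meas_std) ** 2) for p in predicted
    ]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(predicted)
    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def track(num_steps=20, velocity=1.0, seed=0):
    """Track a simulated object moving at constant velocity (toy scenario)."""
    rng = random.Random(seed)
    particles = [rng.uniform(-5.0, 5.0) for _ in range(500)]
    true_pos = 0.0
    for _ in range(num_steps):
        true_pos += velocity
        measurement = true_pos + rng.gauss(0.0, 0.2)  # simulated noisy observation
        particles = particle_filter_step(
            particles, lambda p: p + velocity, measurement, 0.2, 0.1, rng
        )
    estimate = sum(particles) / len(particles)
    return true_pos, estimate


true_pos, estimate = track()
print(f"true position: {true_pos:.2f}, estimate: {estimate:.2f}")
```

Swapping `predict_fn` is the only change needed to compare motion models, which mirrors the comparison between the physics-based predictions and the constant-velocity baseline.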

    A Unified Hybrid Formulation for Visual SLAM

    Visual Simultaneous Localization and Mapping (Visual SLAM, or VSLAM) is the process of estimating the six-degrees-of-freedom ego-motion of a camera from its video feed while simultaneously constructing a 3D model of the observed environment. Extensive research in the field over the past two decades has yielded real-time and efficient algorithms for VSLAM, allowing various interesting applications in augmented reality, cultural heritage, robotics and the automotive industry, to name a few. The underlying formula behind VSLAM is a mixture of image processing, geometry, graph theory, optimization and machine learning; the theoretical and practical development of these building blocks led to a wide variety of algorithms, each leveraging different assumptions to achieve superiority under the presumed conditions of operation. An exhaustive survey on the topic outlined seven main components in a generic VSLAM pipeline, namely: the matching paradigm, visual initialization, data association, pose estimation, topological/metric map generation, optimization, and global localization. Before VSLAM can be claimed a solved problem, numerous challenges pertaining to robustness in each of these components have to be addressed, namely: resilience to a wide variety of scenes (poorly textured or self-repeating scenarios), resilience to dynamic changes (moving objects), and scalability for long-term operation (awareness and management of computational resources). Furthermore, current state-of-the-art VSLAM pipelines are tailored towards static, basic point cloud reconstructions, an impediment to perception applications such as path planning, obstacle avoidance and object tracking. To address these limitations, this work proposes a hybrid scene representation, where different sources of information extracted solely from the video feed are fused in a hybrid VSLAM system.
The proposed pipeline allows for seamless integration of data from pixel-based intensity measurements and geometric entities to produce and make use of a coherent scene representation. The goal is threefold: 1) increase camera tracking accuracy under challenging motions, 2) improve robustness to challenging poorly textured environments and varying illumination conditions, and 3) ensure scalability and long-term operation by efficiently maintaining a global reusable map representation.

    Hybrid Architectures for Object Pose and Velocity Tracking at the Intersection of Kalman Filtering and Machine Learning

    The study of object perception algorithms is fundamental for the development of robotic platforms capable of planning and executing actions involving objects with high precision, reliability and safety. Indeed, this topic has been extensively explored in both the robotics and computer vision research communities using diverse techniques, ranging from classical Bayesian filtering to more modern Machine Learning techniques, and complementary sensing modalities such as vision and touch. Recently, the ever-growing availability of tools for synthetic data generation has substantially increased the adoption of Deep Learning for both 2D tasks, such as object detection and segmentation, and 6D tasks, such as object pose estimation and tracking. The proposed methods exhibit interesting performance on computer vision benchmarks and robotic tasks, e.g. using object pose estimation for grasp planning purposes. Nonetheless, they generally do not consider useful information connected with the physics of the object motion and the peculiarities and requirements of robotic systems. Examples are the necessity to provide well-behaved output signals for robot motion control, and the possibility to integrate modelling priors on the motion of the object and algorithmic priors. These help exploit the temporal correlation of the object poses, handle the pose uncertainties and mitigate the effect of outliers. Most of these concepts are considered in classical approaches, e.g. from the Bayesian and Kalman filtering literature, which however are not as powerful as Deep Learning in handling visual data. As a consequence, the development of hybrid architectures that combine the best features from both worlds is particularly appealing in a robotic setting. Motivated by these considerations, in this Thesis, I aimed at devising hybrid architectures for object perception, focusing on the task of object pose and velocity tracking.
The proposed architectures use Kalman filtering supported by state-of-the-art Deep Neural Networks to track the 6D pose and velocity of objects from images. The devised solutions exhibit state-of-the-art performance and increased modularity, and do not require training to implement the actual tracking behaviors. Furthermore, they can track even fast object motions despite the possibly non-negligible inference times of the adopted neural networks. Also, by relying on data-driven Kalman filtering, I explored a paradigm that makes it possible to track the state of systems that cannot be easily modeled analytically. Specifically, I used this approach to learn the measurement model of soft 3D tactile sensors and address the problem of tracking the sliding motion of hand-held objects.
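
The idea of recovering both pose and velocity from pose-only measurements can be illustrated with a textbook linear Kalman filter. The sketch below is a one-dimensional constant-velocity filter, not the architectures of the thesis: the state is `(position, velocity)`, the measurement is a position reading that stands in for the output of a pose network, and the function names and noise values `q` and `r` are assumptions chosen for illustration.

```python
def kalman_step(state, cov, z, dt, q, r):
    """One predict/update cycle for a 1D constant-velocity Kalman filter.

    state = (position, velocity); cov = 2x2 covariance as nested tuples;
    z = position measurement; q = process noise added to the covariance
    diagonal; r = measurement noise variance.
    """
    p, v = state
    (a, b), (c, d) = cov

    # Predict with F = [[1, dt], [0, 1]]: x' = F x, P' = F P F^T + Q.
    p_pred = p + dt * v
    v_pred = v
    a_p = a + dt * (b + c) + dt * dt * d + q
    b_p = b + dt * d
    c_p = c + dt * d
    d_p = d + q

    # Update with H = [1, 0]: only the position is measured.
    innovation = z - p_pred
    s = a_p + r                   # innovation variance
    k0, k1 = a_p / s, c_p / s     # Kalman gain for position and velocity
    new_state = (p_pred + k0 * innovation, v_pred + k1 * innovation)
    new_cov = (
        ((1.0 - k0) * a_p, (1.0 - k0) * b_p),
        (c_p - k1 * a_p, d_p - k1 * b_p),
    )
    return new_state, new_cov


def track_constant_velocity(num_steps=50, true_velocity=2.0, dt=1.0):
    """Feed noiseless position readings of a uniformly moving object to the filter."""
    state = (0.0, 0.0)
    cov = ((10.0, 0.0), (0.0, 10.0))
    for k in range(1, num_steps + 1):
        z = true_velocity * k * dt
        state, cov = kalman_step(state, cov, z, dt, q=1e-4, r=0.01)
    return state


position, velocity = track_constant_velocity()
print(f"position: {position:.3f}, velocity: {velocity:.3f}")
```

The velocity is never measured directly; the filter infers it through the cross-covariance between position and velocity, which is the mechanism that lets a filter of this kind output a velocity estimate alongside the pose.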

    A Multi-body Tracking Framework -- From Rigid Objects to Kinematic Structures

    Kinematic structures are very common in the real world. They range from simple articulated objects to complex mechanical systems. However, despite their relevance, most model-based 3D tracking methods only consider rigid objects. To overcome this limitation, we propose a flexible framework that allows the extension of existing 6DoF algorithms to kinematic structures. Our approach focuses on methods that employ Newton-like optimization techniques, which are widely used in object tracking. The framework considers both tree-like and closed kinematic structures and allows a flexible configuration of joints and constraints. To project equations from individual rigid bodies to a multi-body system, Jacobians are used. For closed kinematic chains, a novel formulation that features Lagrange multipliers is developed. In a detailed mathematical proof, we show that our constraint formulation leads to an exact kinematic solution and converges in a single iteration. Based on the proposed framework, we extend ICG, which is a state-of-the-art rigid object tracking algorithm, to multi-body tracking. For the evaluation, we create a highly realistic synthetic dataset that features a large number of sequences and various robots. Based on this dataset, we conduct a wide variety of experiments that demonstrate the excellent performance of the developed framework and our multi-body tracker. Comment: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

    Camera Re-Localization with Data Augmentation by Image Rendering and Image-to-Image Translation

    The self-localization of automobiles, robots, or unmanned aerial vehicles, as well as the self-localization of pedestrians, is and will remain of great interest for a wide range of applications. A main task is the autonomous navigation of such vehicles, in which localization within the surrounding scene is a key component. Since cameras are established, permanently installed sensors in automobiles, robots, and unmanned aerial vehicles, the additional effort of also using them for localization tasks is small to nonexistent. The same holds for the self-localization of pedestrians, where smartphones serve as mobile camera platforms. Camera re-localization, in which the pose of a camera is determined with respect to a fixed environment, is a valuable process for solving or supporting the localization of vehicles or pedestrians. Cameras are also low-cost sensors that are well established in the everyday life of people and machines. The support provided by camera re-localization is not limited to navigation applications; it can more generally support image analysis or image processing tasks such as scene reconstruction, detection, classification, or similar applications. To these ends, this thesis addresses the improvement of the camera re-localization process. Since convolutional neural networks (CNNs) and hybrid solutions for determining camera poses have in recent years come to compete with established hand-crafted methods, this thesis focuses on the former. The main contributions of this work include the design of a CNN for estimating camera poses, with an emphasis on a shallow architecture that meets the requirements of mobile platforms. This network achieves accuracies on par with deeper CNNs that have larger model sizes.
Furthermore, the performance of CNNs depends strongly on the quantity and quality of the underlying training data used for optimization. The further contributions of this thesis therefore deal with image rendering and image-to-image translation for extending such training data. Extending training data in this general way is called data augmentation (DA). 3D models are used to render images that usefully extend the training data. Generative adversarial networks (GANs) are used for image-to-image translation. While rendering images increases the quantity of an image dataset, image-to-image translation improves the quality of the rendered data. Experiments are conducted both with datasets augmented by rendered images and with translated images. Both DA approaches contribute to improving localization accuracy. Thus, this work improves camera re-localization with state-of-the-art methods through DA.
