498 research outputs found

    Real-time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor

    No full text
    We present an approach for real-time, robust and accurate hand pose estimation from moving egocentric RGB-D cameras in cluttered real environments. Existing methods typically fail for hand-object interactions in cluttered scenes imaged from egocentric viewpoints, common for virtual or augmented reality applications. Our approach uses two subsequently applied Convolutional Neural Networks (CNNs) to localize the hand and regress 3D joint locations. Hand localization is achieved by using a CNN to estimate the 2D position of the hand center in the input, even in the presence of clutter and occlusions. The localized hand position, together with the corresponding input depth value, is used to generate a normalized cropped image that is fed into a second CNN to regress relative 3D hand joint locations in real time. For added accuracy, robustness and temporal stability, we refine the pose estimates using a kinematic pose tracking energy. To train the CNNs, we introduce a new photorealistic dataset that uses a merged reality approach to capture and synthesize large amounts of annotated data of natural hand interaction in cluttered scenes. Through quantitative and qualitative evaluation, we show that our method is robust to self-occlusion and occlusions by objects, particularly in moving egocentric perspectives

    Egocentric Perception of Hands and Its Applications

    Get PDF

    GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB

    Full text link
    We address the highly challenging problem of real-time 3D hand tracking based on a monocular RGB-only sequence. Our tracking method combines a convolutional neural network with a kinematic 3D hand model, such that it generalizes well to unseen data, is robust to occlusions and varying camera viewpoints, and leads to anatomically plausible as well as temporally smooth hand motions. For training our CNN we propose a novel approach for the synthetic generation of training data that is based on a geometrically consistent image-to-image translation network. To be more specific, we use a neural network that translates synthetic images to "real" images, such that the so-generated images follow the same statistical distribution as real-world hand images. For training this translation network we combine an adversarial loss and a cycle-consistency loss with a geometric consistency loss in order to preserve geometric properties (such as hand pose) during translation. We demonstrate that our hand tracking system outperforms the current state-of-the-art on challenging RGB-only footage

    Real-time 3D hand reconstruction in challenging scenes from a single color or depth camera

    Get PDF
    Hands are one of the main enabling factors for performing complex tasks and humans naturally use them for interactions with their environment. Reconstruction and digitization of 3D hand motion opens up many possibilities for important applications. Hands gestures can be directly used for human–computer interaction, which is especially relevant for controlling augmented or virtual reality (AR/VR) devices where immersion is of utmost importance. In addition, 3D hand motion capture is a precondition for automatic sign-language translation, activity recognition, or teaching robots. Different approaches for 3D hand motion capture have been actively researched in the past. While being accurate, gloves and markers are intrusive and uncomfortable to wear. Hence, markerless hand reconstruction based on cameras is desirable. Multi-camera setups provide rich input, however, they are hard to calibrate and lack the flexibility for mobile use cases. Thus, the majority of more recent methods uses a single color or depth camera which, however, makes the problem harder due to more ambiguities in the input. For interaction purposes, users need continuous control and immediate feedback. This means the algorithms have to run in real time and be robust in uncontrolled scenes. These requirements, achieving 3D hand reconstruction in real time from a single camera in general scenes, make the problem significantly more challenging. While recent research has shown promising results, current state-of-the-art methods still have strong limitations. Most approaches only track the motion of a single hand in isolation and do not take background-clutter or interactions with arbitrary objects or the other hand into account. The few methods that can handle more general and natural scenarios run far from real time or use complex multi-camera setups. Such requirements make existing methods unusable for many aforementioned applications. This thesis pushes the state of the art for real-time 3D hand tracking and reconstruction in general scenes from a single RGB or depth camera. The presented approaches explore novel combinations of generative hand models, which have been used successfully in the computer vision and graphics community for decades, and powerful cutting-edge machine learning techniques, which have recently emerged with the advent of deep learning. In particular, this thesis proposes a novel method for hand tracking in the presence of strong occlusions and clutter, the first method for full global 3D hand tracking from in-the-wild RGB video, and a method for simultaneous pose and dense shape reconstruction of two interacting hands that, for the first time, combines a set of desirable properties previously unseen in the literature.Hände sind einer der Hauptfaktoren für die Ausführung komplexer Aufgaben, und Menschen verwenden sie auf natürliche Weise für Interaktionen mit ihrer Umgebung. Die Rekonstruktion und Digitalisierung der 3D-Handbewegung eröffnet viele Möglichkeiten für wichtige Anwendungen. Handgesten können direkt als Eingabe für die Mensch-Computer-Interaktion verwendet werden. Dies ist insbesondere für Geräte der erweiterten oder virtuellen Realität (AR / VR) relevant, bei denen die Immersion von größter Bedeutung ist. Darüber hinaus ist die Rekonstruktion der 3D Handbewegung eine Voraussetzung zur automatischen Übersetzung von Gebärdensprache, zur Aktivitätserkennung oder zum Unterrichten von Robotern. In der Vergangenheit wurden verschiedene Ansätze zur 3D-Handbewegungsrekonstruktion aktiv erforscht. Handschuhe und physische Markierungen sind zwar präzise, aber aufdringlich und unangenehm zu tragen. Daher ist eine markierungslose Handrekonstruktion auf der Basis von Kameras wünschenswert. Multi-Kamera-Setups bieten umfangreiche Eingabedaten, sind jedoch schwer zu kalibrieren und haben keine Flexibilität für mobile Anwendungsfälle. Daher verwenden die meisten neueren Methoden eine einzelne Farb- oder Tiefenkamera, was die Aufgabe jedoch schwerer macht, da mehr Ambiguitäten in den Eingabedaten vorhanden sind. Für Interaktionszwecke benötigen Benutzer kontinuierliche Kontrolle und sofortiges Feedback. Dies bedeutet, dass die Algorithmen in Echtzeit ausgeführt werden müssen und robust in unkontrollierten Szenen sein müssen. Diese Anforderungen, 3D-Handrekonstruktion in Echtzeit mit einer einzigen Kamera in allgemeinen Szenen, machen das Problem erheblich schwieriger. Während neuere Forschungsarbeiten vielversprechende Ergebnisse gezeigt haben, weisen aktuelle Methoden immer noch Einschränkungen auf. Die meisten Ansätze verfolgen die Bewegung einer einzelnen Hand nur isoliert und berücksichtigen keine alltäglichen Umgebungen oder Interaktionen mit beliebigen Objekten oder der anderen Hand. Die wenigen Methoden, die allgemeinere und natürlichere Szenarien verarbeiten können, laufen nicht in Echtzeit oder verwenden komplexe Multi-Kamera-Setups. Solche Anforderungen machen bestehende Verfahren für viele der oben genannten Anwendungen unbrauchbar. Diese Dissertation erweitert den Stand der Technik für die Echtzeit-3D-Handverfolgung und -Rekonstruktion in allgemeinen Szenen mit einer einzelnen RGB- oder Tiefenkamera. Die vorgestellten Algorithmen erforschen neue Kombinationen aus generativen Handmodellen, die seit Jahrzehnten erfolgreich in den Bereichen Computer Vision und Grafik eingesetzt werden, und leistungsfähigen innovativen Techniken des maschinellen Lernens, die vor kurzem mit dem Aufkommen neuronaler Netzwerke entstanden sind. In dieser Arbeit werden insbesondere vorgeschlagen: eine neuartige Methode zur Handbewegungsrekonstruktion bei starken Verdeckungen und in unkontrollierten Szenen, die erste Methode zur Rekonstruktion der globalen 3D Handbewegung aus RGB-Videos in freier Wildbahn und die erste Methode zur gleichzeitigen Rekonstruktion von Handpose und -form zweier interagierender Hände, die eine Reihe wünschenwerter Eigenschaften komibiniert

    Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception

    Full text link
    We introduce the Aria Digital Twin (ADT) - an egocentric dataset captured using Aria glasses with extensive object, environment, and human level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers in two real indoor scenes with 398 object instances (324 stationary and 74 dynamic). Each sequence consists of: a) raw data of two monochrome camera streams, one RGB camera stream, two IMU streams; b) complete sensor calibration; c) ground truth data including continuous 6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye gaze vectors, 3D human poses, 2D image segmentations, image depth maps; and d) photo-realistic synthetic renderings. To the best of our knowledge, there is no existing egocentric dataset with a level of accuracy, photo-realism and comprehensiveness comparable to ADT. By contributing ADT to the research community, our mission is to set a new standard for evaluation in the egocentric machine perception domain, which includes very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, human pose prediction - while also inspiring new machine perception tasks for augmented reality (AR) applications. To kick start exploration of the ADT research use cases, we evaluated several existing state-of-the-art methods for object detection, segmentation and image translation tasks that demonstrate the usefulness of ADT as a benchmarking dataset

    Understanding egocentric human actions with temporal decision forests

    Get PDF
    Understanding human actions is a fundamental task in computer vision with a wide range of applications including pervasive health-care, robotics and game control. This thesis focuses on the problem of egocentric action recognition from RGB-D data, wherein the world is viewed through the eyes of the actor whose hands describe the actions. The main contributions of this work are its findings regarding egocentric actions as described by hands in two application scenarios and a proposal of a new technique that is based on temporal decision forests. The thesis first introduces a novel framework to recognise fingertip writing in mid-air in the context of human-computer interaction. This framework detects whether the user is writing and tracks the fingertip over time to generate spatio-temporal trajectories that are recognised by using a Hough forest variant that encourages temporal consistency in prediction. A problem with using such forest approach for action recognition is that the learning of temporal dynamics is limited to hand-crafted temporal features and temporal regression, which may break the temporal continuity and lead to inconsistent predictions. To overcome this limitation, the thesis proposes transition forests. Besides any temporal information that is encoded in the feature space, the forest automatically learns the temporal dynamics during training, and it is exploited in inference in an online and efficient manner achieving state-of-the-art results. The last contribution of this thesis is its introduction of the first RGB-D benchmark to allow for the study of egocentric hand-object actions with both hand and object pose annotations. This study conducts an extensive evaluation of different baselines, state-of-the art approaches and temporal decision forest models using colour, depth and hand pose features. Furthermore, it extends the transition forest model to incorporate data from different modalities and demonstrates the benefit of using hand pose features to recognise egocentric human actions. The thesis concludes by discussing and analysing the contributions and proposing a few ideas for future work.Open Acces
    • …
    corecore