
    VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera

    We present the first real-time method to capture the full global 3D skeletal pose of a human in a stable, temporally consistent manner using a single RGB camera. Our method combines a new convolutional neural network (CNN) based pose regressor with kinematic skeleton fitting. Our novel fully convolutional pose formulation regresses 2D and 3D joint positions jointly in real time and does not require tightly cropped input frames. A real-time kinematic skeleton fitting method uses the CNN output to yield temporally stable 3D global pose reconstructions on the basis of a coherent kinematic skeleton. This makes our approach the first monocular RGB method usable in real-time applications such as 3D character control; thus far, the only monocular methods for such applications employed specialized RGB-D cameras. Our method's accuracy is quantitatively on par with the best offline 3D monocular RGB pose estimation methods. Our results are qualitatively comparable to, and sometimes better than, results from monocular RGB-D approaches, such as the Kinect. However, we show that our approach is more broadly applicable than RGB-D solutions, i.e., it works for outdoor scenes, community videos, and low-quality commodity RGB cameras. Comment: Accepted to SIGGRAPH 2017.
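
    As a rough illustration of the kinematic skeleton fitting idea (a minimal sketch under assumed simplifications, not the paper's actual energy or skeleton model): per-frame joint angles of a fixed-bone-length chain are fitted to the CNN-predicted 3D joint positions, with a penalty that keeps the solution close to the previous frame for temporal stability. The planar three-joint chain, all names, and the off-the-shelf optimizer are illustrative assumptions.

```python
# Minimal sketch: fit joint angles of a fixed-length kinematic chain to
# noisy per-frame 3D joint predictions, with a temporal smoothness term.
import numpy as np
from scipy.optimize import minimize

BONE_LENGTHS = np.array([0.5, 0.45, 0.40])  # assumed fixed bone lengths (m)

def forward_kinematics(angles):
    """Planar 3-joint chain: joint angles -> 3D joint positions (z = 0)."""
    positions, pos, total = [], np.zeros(3), 0.0
    for length, angle in zip(BONE_LENGTHS, angles):
        total += angle
        pos = pos + length * np.array([np.cos(total), np.sin(total), 0.0])
        positions.append(pos)
    return np.stack(positions)

def fit_frame(predicted_joints, prev_angles, smooth_weight=0.1):
    """Fit angles so FK matches the CNN-predicted joints, while staying
    close to the previous frame's solution for temporal stability."""
    def energy(angles):
        residual = forward_kinematics(angles) - predicted_joints
        return np.sum(residual ** 2) + smooth_weight * np.sum((angles - prev_angles) ** 2)
    return minimize(energy, prev_angles, method="Nelder-Mead").x

# Usage: noisy per-frame predictions -> temporally stable joint angles.
true_angles = np.array([0.3, 0.2, -0.1])
noisy = forward_kinematics(true_angles) + 0.01 * np.random.randn(3, 3)
print(fit_frame(noisy, prev_angles=np.zeros(3)))
```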

    Estimating Human Body Pose from a Single Image via the Specialized Mappings Architecture

    A non-linear supervised learning architecture, the Specialized Mappings Architecture (SMA), and its application to articulated body pose reconstruction from single monocular images are described. The architecture is formed by a number of specialized mapping functions, each responsible for mapping certain portions (connected or not) of the input space, together with a feedback matching process. A probabilistic model for the architecture is described, along with a mechanism for learning its parameters. The learning problem is approached using a maximum likelihood estimation framework; we present Expectation Maximization (EM) algorithms for two different instances of the likelihood probability. Performance is characterized by estimating human body postures from low-level visual features, showing promising results.
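
    To make the EM idea concrete, here is a hedged sketch of maximum likelihood fitting for a mixture of specialized linear mappings from image features to pose. This is an illustrative simplification: the SMA's specialized functions and its feedback matching process are richer than this, and all names and shapes are assumptions.

```python
# Sketch: EM for a mixture of linear mappings x -> y with isotropic noise.
import numpy as np

def em_mixture_of_mappings(X, Y, n_experts=3, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d_in = X.shape
    d_out = Y.shape[1]
    Xb = np.hstack([X, np.ones((n, 1))])          # add bias column
    W = 0.1 * rng.normal(size=(n_experts, d_in + 1, d_out))
    log_pi = np.full(n_experts, -np.log(n_experts))
    sigma2 = np.ones(n_experts)
    for _ in range(n_iters):
        # E-step: responsibility of each mapping for each sample.
        log_r = np.empty((n, n_experts))
        for k in range(n_experts):
            sq = np.sum((Y - Xb @ W[k]) ** 2, axis=1)
            log_r[:, k] = log_pi[k] - 0.5 * (
                sq / sigma2[k] + d_out * np.log(2 * np.pi * sigma2[k]))
        log_r -= log_r.max(axis=1, keepdims=True)
        R = np.exp(log_r)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted least squares per mapping, then noise and mixing.
        for k in range(n_experts):
            w = R[:, k:k + 1]
            A = Xb * w                            # rows scaled by weights
            W[k] = np.linalg.solve(A.T @ Xb + 1e-6 * np.eye(d_in + 1), A.T @ Y)
            err2 = np.sum((Y - Xb @ W[k]) ** 2, axis=1)
            sigma2[k] = max((w[:, 0] * err2).sum() / (d_out * w.sum() + 1e-9), 1e-6)
        log_pi = np.log(R.mean(axis=0) + 1e-12)
    return W, np.exp(log_pi), sigma2
```

    At inference time, the SMA generates a candidate output from each specialized function and selects among them via the feedback matching process, which this sketch omits.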

    Decomposing biological motion: A framework for analysis and synthesis of human gait patterns.

    Biological motion contains information about the identity of an agent as well as about his or her actions, intentions, and emotions. The human visual system is highly sensitive to biological motion and capable of extracting socially relevant information from it. Here we investigate the question of how such information is encoded in biological motion patterns and how it can be retrieved. A framework is developed that transforms biological motion into a representation allowing for analysis using linear methods from statistics and pattern recognition. Using gender classification as an example, simple classifiers are constructed and compared to psychophysical data from human observers. The analysis reveals that the dynamic part of the motion contains more information about gender than motion-mediated structural cues. The proposed framework can be used not only for the analysis of biological motion but also to synthesize new motion patterns. A simple motion modeler is presented that can be used to visualize and exaggerate the differences between male and female walking patterns.
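
    A hedged sketch of the linear analysis-and-synthesis idea (the paper's actual motion representation and classifier may differ; the data layout and all names here are assumptions): each walker is encoded as a fixed-length vector, e.g. Fourier coefficients of joint trajectories, a linear gender axis is estimated, and new walks are synthesized by exaggerating a gait along that axis.

```python
import numpy as np

def fit_gender_axis(gaits, labels):
    """gaits: (n_walkers, d) vectors, e.g. Fourier coefficients of joint
    trajectories; labels: +1 (male) / -1 (female).
    Returns the mean gait and the unit male-female direction."""
    mean = gaits.mean(axis=0)
    axis = gaits[labels > 0].mean(axis=0) - gaits[labels < 0].mean(axis=0)
    return mean, axis / np.linalg.norm(axis)

def classify_gender(gait, mean, axis):
    """Simple linear classifier: sign of the projection onto the axis."""
    return 1 if (gait - mean) @ axis > 0 else -1

def exaggerate(gait, mean, axis, factor=2.0):
    """Synthesis: scale the gait's component along the gender axis to
    produce an exaggerated (caricatured) walking pattern."""
    coeff = (gait - mean) @ axis
    return gait + (factor - 1.0) * coeff * axis
```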

    Vision-Based Observation Models for Lower Limb 3D Tracking with a Moving Platform

    Tracking and understanding human gait is an important step towards improving elderly mobility and safety. This thesis presents a vision-based tracking system that estimates the 3D pose of a wheeled walker user's lower limbs with cameras mounted on the moving walker. The tracker estimates 3D poses from images of the lower limbs in the coronal plane in a dynamic, uncontrolled environment. It employs a probabilistic approach based on particle filtering with three different camera setups: a monocular RGB camera, binocular RGB cameras, and a depth camera. For the RGB cameras, observation likelihoods are designed to compare the colors and gradients of each frame with manually extracted initial templates. Two strategies are also investigated for handling appearance changes of the tracking target: increasing the number of templates and using different color representations. For the depth camera, two observation likelihoods are developed: the first works directly in 3D space, while the second works in the projected image space. Experiments are conducted to evaluate the performance of the tracking system with different users for all three camera setups. The trackers with the RGB cameras are shown to produce results with higher error than the depth camera, and the strategies for handling appearance change generally improve tracking accuracy. The tracker with the depth sensor, on the other hand, successfully tracks the 3D poses of users over entire video sequences and is robust against unfavorable conditions such as partial occlusion, missing observations, and deformable tracking targets.
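
    A minimal sketch of the particle filtering loop underlying such a tracker (the interfaces are assumptions; the thesis' observation likelihoods compare template colors/gradients or depth, which the stub `likelihood` argument stands in for):

```python
import numpy as np

def particle_filter(frames, likelihood, init_pose, n_particles=500,
                    motion_std=0.05, seed=0):
    """Track a pose vector through frames.
    likelihood(frame, poses) -> (n_particles,) unnormalized weights."""
    rng = np.random.default_rng(seed)
    d = init_pose.shape[0]
    particles = init_pose + motion_std * rng.normal(size=(n_particles, d))
    estimates = []
    for frame in frames:
        # Predict: diffuse particles with a random-walk motion model.
        particles += motion_std * rng.normal(size=particles.shape)
        # Update: weight particles by the observation likelihood.
        w = likelihood(frame, particles)
        w = w / (w.sum() + 1e-12)
        estimates.append(w @ particles)           # posterior mean pose
        # Resample: multinomial resampling to avoid weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.stack(estimates)
```

    In this framing, the three camera setups would differ mainly in the likelihood function supplied to the same filtering loop.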

    Accurate Human Motion Capture and Modeling using Low-cost Sensors

    Motion capture technologies, especially those combined with multiple kinds of sensors to capture both kinematic and dynamic information, are widely used in fields such as biomechanics, robotics, and health. However, many existing systems suffer from being intrusive, restrictive, and expensive. This dissertation explores two aspects of motion capture systems that are low-cost, non-intrusive, highly accurate, and easy for common users to use: full-body kinematics and dynamics capture, and user-specific hand modeling. More specifically, we present a new method for full-body motion capture that uses input data captured by three depth cameras and a pair of pressure-sensing shoes. Our system is appealing because it is fully automatic and can accurately reconstruct both full-body kinematic and dynamic data. We introduce a highly accurate tracking process that automatically reconstructs 3D skeletal poses using depth data, foot pressure data, and detailed full-body geometry. We also develop an efficient physics-based motion reconstruction algorithm for solving internal joint torques and contact forces based on contact pressure information and 3D poses from the kinematic tracking process. In addition, we present a novel low-dimensional parametric model for 3D hand modeling and synthesis. We construct a low-dimensional parametric model to compactly represent hand shape variations across individuals and enhance it by adding Linear Blend Skinning (LBS) for pose deformation. We also introduce an efficient iterative approach to learn the parametric model from a large unaligned scan database. Our model is compact, expressive, and produces a natural-looking LBS model for pose deformation, which allows for a variety of applications ranging from user-specific hand modeling to skinning-weight transfer and model-based hand tracking.
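
    Since the hand model builds on Linear Blend Skinning, a minimal LBS sketch may help (shapes and names are assumptions; the dissertation's parametric model adds learned shape parameters on top of this): each vertex is deformed by a per-vertex weighted blend of homogeneous bone transforms.

```python
import numpy as np

def linear_blend_skinning(rest_vertices, weights, bone_transforms):
    """rest_vertices: (V, 3); weights: (V, B), rows sum to 1;
    bone_transforms: (B, 4, 4) homogeneous bone matrices."""
    V = rest_vertices.shape[0]
    homo = np.hstack([rest_vertices, np.ones((V, 1))])       # (V, 4)
    # Per-bone transformed positions: (B, V, 4).
    per_bone = np.einsum('bij,vj->bvi', bone_transforms, homo)
    # Blend the candidates with the per-vertex skinning weights: (V, 4).
    blended = np.einsum('vb,bvi->vi', weights, per_bone)
    return blended[:, :3]
```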

    Real-time 3D human body pose estimation from monocular RGB input

    Human motion capture finds extensive application in movies, games, sports, and biomechanical analysis. However, existing motion capture solutions require cumbersome external and/or on-body instrumentation, or use active sensors whose possible capture volume is limited by power consumption. The ubiquity and ease of deployment of RGB cameras makes monocular RGB based human motion capture an extremely useful problem to solve: it would lower the barrier to entry for content creators to employ motion capture tools and enable new applications of human motion capture. This thesis demonstrates the first real-time monocular RGB based motion capture solutions that work in general scene settings. They are based on neural network approaches to the ill-posed problem of estimating 3D human pose from a single RGB image, in combination with model-based fitting. In particular, the contributions of this work advance three key aspects of real-time monocular RGB based motion capture: speed, accuracy, and the ability to work for general scenes. New training datasets are proposed for single-person and multi-person scenarios which, together with the proposed transfer-learning based training pipeline, allow learning based approaches to be appearance invariant. The training datasets are accompanied by evaluation benchmarks with multiple avenues of fine-grained evaluation. The evaluation benchmarks differ visually from the training datasets, so as to promote efforts towards solutions that generalize to in-the-wild scenes. The proposed task formulations for the single-person and multi-person cases allow higher accuracy, and incorporate additional qualities, such as occlusion robustness, that are helpful in the context of a full motion capture solution. The multi-person formulations are designed to have a nearly constant inference time regardless of the number of subjects in the scene and, combined with contributions towards fast neural network inference, enable real-time 3D pose estimation for multiple subjects. Combining the proposed learning-based approaches with a model-based kinematic skeleton fitting step provides temporally stable joint angle estimates, which can be readily employed for driving virtual characters.

    Human perception capabilities for socially intelligent domestic service robots

    The daily living activities of an increasing number of frail elderly people represent a continuous struggle, both for them and for their extended families. These people have difficulty coping at home alone but are still sufficiently fit not to need the round-the-clock care provided by a nursing home. Their struggle can be alleviated by the deployment of a mechanical helper in their home, i.e. a service robot that can execute a range of simple object manipulation tasks. Such a robotic application promises to extend the period of independent home living for elderly people while providing them with a better quality of life. However, despite recent technological advances in robotics, some challenges remain, mainly related to human factors. Arguably, the lack of consistently dependable human detection, localisation, and pose tracking information, together with insufficiently refined processing of sensor information, makes close-range physical interaction between a robot and a human a high-risk task. The work described in this thesis addresses these deficiencies in how today's service robots process human information. It does so by proposing a new paradigm for the robot's situational awareness in regard to people, together with a collection of methods and techniques operating at the lower levels of the paradigm, i.e. the perception of new human information. The collection includes methods for obtaining and processing information about the presence, location, and body pose of people. In addition to the availability of reliable human perception information, the integration between the separate levels of the paradigm is considered a critically important factor for achieving human-aware control of the robot. Improving the cognition, judgment, and decision-making links between the paradigm's layers enhances the robot's capability to engage in a natural and more meaningful interaction with people and, therefore, leads to a more enjoyable user experience. The proposed paradigm and methodology are thus envisioned to contribute to making prolonged assisted living of elderly people at home a more feasible and realistic task. In particular, this thesis proposes a set of methods for human presence detection, localisation, and body pose tracking that operate at the perception level of the paradigm. The problem of having only limited visibility of a person from the robot's on-board sensors is addressed by a proposed classifier fusion method that combines information from several types of sensors. A method for improved real-time human body pose tracking is also investigated. Additionally, a method is proposed for estimating multiple human tracks from noisy detections, as well as for analysing the computed tracks to reason about the social interactions within a group, operating at the comprehension level of the robot's situational awareness paradigm. Finally, at the human-aware planning layer, a method is proposed that utilises the human-related information generated by the perception and comprehension layers to compute a minimally intrusive navigation path to a target person within a human group. This method demonstrates how the robot's improved human perception capabilities can, through its judgement activity, be utilised by the highest level of the paradigm, i.e. the decision-making layer, to achieve user-friendly human-robot interactions.
    Overall, the research presented in this work, drawing on recent innovations in statistical learning, data fusion, and optimisation methods, improves the robot's overall situational awareness in regard to people, with the main focus placed on the human sensing capabilities of service robots. This improved situational awareness, as defined by the proposed paradigm, enables more meaningful human-robot interactions.
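
    As one concrete (assumed) instance of classifier fusion for human detection, per-sensor probabilities can be combined in log-odds space under a conditional-independence assumption; the thesis' actual fusion method may differ.

```python
import numpy as np

def fuse_detections(probs, prior=0.5):
    """probs: per-sensor P(human | sensor i) for the same image region.
    Returns the fused posterior P(human | all sensors), assuming the
    sensors are conditionally independent given the true class."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-6, 1 - 1e-6)
    logit = lambda p: np.log(p / (1 - p))
    # Posterior log-odds = prior log-odds + sum of per-sensor evidence.
    fused = logit(prior) + np.sum(logit(probs) - logit(prior))
    return 1 / (1 + np.exp(-fused))

# Example: RGB says 0.7, depth says 0.9, a weak thermal cue says 0.55.
print(fuse_detections([0.7, 0.9, 0.55]))
```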