172,047 research outputs found

    Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

    To facilitate the analysis of human actions, interactions and emotions, we compute a 3D model of human body pose, hand pose, and facial expression from a single monocular image. To achieve this, we use thousands of 3D scans to train a new, unified 3D model of the human body, SMPL-X, that extends SMPL with fully articulated hands and an expressive face. Learning to regress the parameters of SMPL-X directly from images is challenging without paired images and 3D ground truth. Consequently, we follow the approach of SMPLify, which estimates 2D features and then optimizes model parameters to fit those features. We improve on SMPLify in several significant ways: (1) we detect 2D features corresponding to the face, hands, and feet and fit the full SMPL-X model to these; (2) we train a new neural network pose prior using a large MoCap dataset; (3) we define a new interpenetration penalty that is both fast and accurate; (4) we automatically detect gender and select the appropriate body model (male, female, or neutral); (5) our PyTorch implementation achieves a speedup of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to both controlled images and images in the wild. We evaluate 3D accuracy on a new curated dataset comprising 100 images with pseudo ground truth. This is a step towards automatic expressive human capture from monocular RGB data. The models, code, and data are available for research purposes at https://smpl-x.is.tue.mpg.de. Comment: To appear in CVPR 201
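The optimize-to-fit idea in the abstract above can be sketched as follows. This is a toy, hypothetical example of the SMPLify-style loop: given detected 2D keypoints, iteratively adjust model parameters so the projected model joints match them. The one-parameter linear "body model" and orthographic projection are illustrative assumptions; the real SMPL-X has full pose, shape, and expression parameters and uses richer priors and penalties.

```python
import numpy as np

def model_joints(theta, rest_joints, blend_dirs):
    """Toy linear body model: 3D joints = rest pose + theta * blend direction."""
    return rest_joints + theta * blend_dirs

def project(joints3d):
    """Orthographic projection: drop the depth coordinate."""
    return joints3d[:, :2]

def fit(observed_2d, rest_joints, blend_dirs, lr=0.005, steps=2000):
    """Gradient descent on the squared 2D reprojection error."""
    theta = 0.0
    for _ in range(steps):
        residual = project(model_joints(theta, rest_joints, blend_dirs)) - observed_2d
        grad = 2.0 * np.sum(residual * blend_dirs[:, :2])  # d(loss)/d(theta)
        theta -= lr * grad
    return theta

rng = np.random.default_rng(0)
rest = rng.normal(size=(5, 3))
dirs = rng.normal(size=(5, 3))
obs = project(model_joints(0.7, rest, dirs))   # synthetic 2D detections
print(round(fit(obs, rest, dirs), 3))          # recovers theta = 0.7
```

In the paper this optimization runs in PyTorch with automatic differentiation, which is where the reported speedup over Chumpy comes from.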

    Recognition of human interactions using limb-level feature points

    Human activity recognition is an emerging area of research in computer vision, with applications in video surveillance, human-computer interaction, robotics, and video annotation. Despite a number of recent advances, there are still many opportunities for new developments, especially in the area of person-person and person-object interaction. Many proposed algorithms focus on recognizing only single-person, person-person, or person-object activities; an algorithm that can recognize all three types would be a significant step toward the real-world application of this technology. This thesis investigates the design and implementation of such an algorithm. It uses background subtraction to extract the subjects in the scene, and pixel clustering to segment their image into body parts. A location-based feature identification algorithm extracts feature points from these segments and feeds them to a classifier that labels each video with an activity. Together these techniques comprise an algorithm that can recognize single-person, person-person, and person-object interactions. The algorithm's performance was evaluated on interactions in a new video dataset, demonstrating the effectiveness of limb-level feature points as a method of identifying human interactions.
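The background-subtraction step mentioned above can be sketched minimally: a static background frame is subtracted from the current frame and the difference thresholded into a foreground mask. The threshold value and the tiny synthetic frames are illustrative assumptions, not details from the thesis.

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Boolean mask of pixels that differ from the background by more than thresh."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > thresh

background = np.zeros((4, 4), dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200                 # a bright "subject" enters the scene
mask = foreground_mask(frame, background)
print(mask.sum())                     # 4 foreground pixels
```

Real systems maintain an adaptive background model rather than a single static frame, but the per-pixel difference-and-threshold core is the same.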

    Boosting Image-based Mutual Gaze Detection using Pseudo 3D Gaze

    Mutual gaze detection, i.e., predicting whether or not two people are looking at each other, plays an important role in understanding human interactions. In this work, we focus on the task of image-based mutual gaze detection and propose a simple and effective approach that boosts performance by adding an auxiliary 3D gaze estimation task during the training phase. We achieve the performance boost without additional labeling cost by training the 3D gaze estimation branch on pseudo 3D gaze labels deduced from mutual gaze labels. By sharing the head image encoder between the 3D gaze estimation and mutual gaze detection branches, we obtain better head features than those learned by training the mutual gaze detection branch alone. Experimental results on three image datasets show that the proposed approach improves detection performance significantly without additional annotations. This work also introduces a new image dataset consisting of 33.1K pairs of humans annotated with mutual gaze labels across 29.2K images.
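The pseudo-label idea can be sketched simply: if a pair is labeled as mutual gaze, person A's gaze direction is approximately the unit vector from A's head toward B's head. The 3D head positions below are hypothetical inputs; the exact geometry the paper uses to deduce its pseudo labels may differ.

```python
import numpy as np

def pseudo_gaze(head_a, head_b):
    """Pseudo 3D gaze label for A in a mutual-gaze pair: unit vector from A to B."""
    v = np.asarray(head_b, dtype=float) - np.asarray(head_a, dtype=float)
    return v / np.linalg.norm(v)

g = pseudo_gaze([0.0, 0.0, 0.0], [3.0, 0.0, 4.0])
print(g)  # [0.6 0.  0.8]
```

Training the shared encoder to regress these free labels is what gives the auxiliary branch its supervision at no extra annotation cost.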

    LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos

    Analyzing the interactions between humans and objects in a video includes identifying the relationships between the humans and the objects present in it. This can be thought of as a specialized version of Visual Relationship Detection in which one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, that learns visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground-truth data such as depth maps or 3D human pose, which improves generalization to non-RGBD datasets. Furthermore, it does so using only visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results on the human-object interaction detection and anticipation tasks of CAD-120 (88.9% and 92.6%) and competitive results on image-based HOI detection on the V-COCO dataset, setting a new benchmark for approaches based on visual features. Code for LIGHTEN is available at https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI. Comment: 9 pages, 6 figures, ACM Multimedia Conference 202
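The graph component of such a model can be sketched as one round of message passing: human and object detections become nodes, and each node updates its feature by averaging its neighbours' features. The adjacency and scalar features below are made up for illustration and are not LIGHTEN's actual architecture, which stacks learned graph and temporal layers.

```python
import numpy as np

def message_pass(features, adjacency):
    """One message-passing step: each node's new feature is the mean of its neighbours'."""
    deg = adjacency.sum(axis=1, keepdims=True)
    return adjacency @ features / deg

# 3 nodes: one human (node 0) connected to two objects (nodes 1, 2)
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
feats = np.array([[1.0], [3.0], [5.0]])
print(message_pass(feats, adj).ravel())  # [4. 1. 1.]
```

Stacking such steps, then aggregating over time, is what lets a hierarchical model capture spatio-temporal cues at multiple granularities.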

    A Multimodal Human-Robot Interaction Dataset

    This work presents a multimodal dataset for Human-Robot Interactive Learning. The dataset contains synchronized recordings of several human users, captured from a stereo microphone and three cameras mounted on the robot. The focus of the dataset is incremental object learning, oriented to human-robot assistance and interaction. To learn new object models from interactions with a human user, the robot needs to be able to perform multiple tasks: (a) recognize the type of interaction (pointing, showing or speaking), (b) segment regions of interest from acquired data (hands and objects), and (c) learn and recognize object models. We illustrate the advantages of multimodal data over camera-only datasets by presenting an approach that recognizes the user interaction by combining simple image and language features.
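Combining the two modalities can be sketched as a late-fusion classifier over the three interaction types. The feature vectors and identity weight matrices below are entirely made up; a real system would learn these weights from the recorded interactions.

```python
import numpy as np

LABELS = ["pointing", "showing", "speaking"]

def classify(image_feat, language_feat, w_img, w_lang):
    """Late fusion: add per-class scores from each modality, take the argmax."""
    scores = w_img @ image_feat + w_lang @ language_feat
    return LABELS[int(np.argmax(scores))]

w_img = np.eye(3)                   # toy weights: each feature votes for one class
w_lang = np.eye(3)
img = np.array([0.1, 0.2, 0.9])     # hypothetical visual cue scores
lang = np.array([0.0, 0.1, 0.8])    # hypothetical language cue scores
print(classify(img, lang, w_img, w_lang))  # speaking
```

The point of the fusion is that either modality alone may be ambiguous, while their summed evidence disambiguates the interaction type.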

    Weakly supervised learning of interactions between humans and objects

    We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation to the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action dataset from Gupta et al., the PASCAL Action 2010 dataset, and a new human-object interaction dataset.
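The human-centric spatial relation can be sketched as the object's position expressed relative to the detected human's bounding box: the offset of the box centres, normalised by the human box size. The `(x_min, y_min, x_max, y_max)` box convention and the example boxes are assumptions for illustration, not the paper's exact parameterisation.

```python
def relative_position(human_box, object_box):
    """Object-centre offset from the human centre, in units of human box width/height."""
    hx = (human_box[0] + human_box[2]) / 2.0
    hy = (human_box[1] + human_box[3]) / 2.0
    ox = (object_box[0] + object_box[2]) / 2.0
    oy = (object_box[1] + object_box[3]) / 2.0
    w = human_box[2] - human_box[0]
    h = human_box[3] - human_box[1]
    return ((ox - hx) / w, (oy - hy) / h)

# e.g. a tennis racket held to the right of and above the player
print(relative_position((0, 0, 100, 200), (120, 20, 160, 60)))  # (0.9, -0.3)
```

Collecting such offsets across weakly labeled images and fitting a distribution over them is one way to arrive at a probabilistic model of the human-object spatial relation.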