Expressive Body Capture: 3D Hands, Face, and Body from a Single Image
To facilitate the analysis of human actions, interactions and emotions, we
compute a 3D model of human body pose, hand pose, and facial expression from a
single monocular image. To achieve this, we use thousands of 3D scans to train
a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with
fully articulated hands and an expressive face. Learning to regress the
parameters of SMPL-X directly from images is challenging without paired images
and 3D ground truth. Consequently, we follow the approach of SMPLify, which
estimates 2D features and then optimizes model parameters to fit the features.
We improve on SMPLify in several significant ways: (1) we detect 2D features
corresponding to the face, hands, and feet and fit the full SMPL-X model to
these; (2) we train a new neural network pose prior using a large MoCap
dataset; (3) we define a new interpenetration penalty that is both fast and
accurate; (4) we automatically detect gender and the appropriate body models
(male, female, or neutral); (5) our PyTorch implementation achieves a speedup
of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to
both controlled images and images in the wild. We evaluate 3D accuracy on a new
curated dataset comprising 100 images with pseudo ground-truth. This is a step
towards automatic expressive human capture from monocular RGB data. The models,
code, and data are available for research purposes at
https://smpl-x.is.tue.mpg.de.
Comment: To appear in CVPR 2019.
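The SMPLify-style loop described above — detect 2D features, then optimize model parameters to fit them — can be sketched in miniature. Everything concrete below is an assumption for illustration: a fixed linear map `W` stands in for the SMPL-X body model plus camera projection, and plain gradient descent stands in for the paper's optimizer.

```python
import numpy as np

# Stand-in "body model": 2 pose parameters -> 2 joints in 2D (4 coordinates).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, -1.0]])
theta_true = np.array([0.5, -0.3])
targets = W @ theta_true        # "detected" 2D keypoints (flattened)

def loss_and_grad(theta):
    """Squared reprojection error and its gradient w.r.t. the parameters."""
    residual = W @ theta - targets
    return float(residual @ residual), 2.0 * W.T @ residual

theta = np.zeros(2)             # start from the rest pose
for _ in range(100):
    loss, grad = loss_and_grad(theta)
    theta -= 0.1 * grad         # step the parameters toward the 2D evidence

print(theta)  # converges to theta_true = [0.5, -0.3]
```

In the real system the map from parameters to 2D keypoints is a differentiable body model plus camera, which is why a PyTorch implementation (with autograd) can replace Chumpy's optimizer wholesale.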
Recognition of human interactions using limb-level feature points
Human activity recognition is an emerging area of research in computer vision with applications in video surveillance, human-computer interaction, robotics, and video annotation. Despite a number of recent advances, there are still many opportunities for new developments, especially in the area of person-person and person-object interaction. Many proposed algorithms focus solely on recognizing single-person, person-person, or person-object activities; an algorithm that can recognize all three types would be a significant step toward the real-world application of this technology. This thesis investigates the design and implementation of such an algorithm. It uses background subtraction to extract the subjects in the scene, and pixel clustering to segment their images into body parts. A location-based feature identification algorithm extracts feature points from these segments and feeds them to a classifier that labels each video with an activity. Together, these techniques comprise an algorithm that can recognize single-person, person-person, and person-object interactions. The algorithm's performance was evaluated on interactions in a new video dataset, demonstrating the effectiveness of limb-level feature points as a method of identifying human interactions.
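A minimal sketch of the pipeline's first stage — background subtraction to extract the subjects — assuming a static camera and a known background frame (both assumptions of ours, not details from the thesis):

```python
import numpy as np

def subtract_background(frame, background, threshold=30):
    """Foreground mask: pixels whose intensity differs from the background by more than `threshold`."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    return diff > threshold

background = np.zeros((4, 4), dtype=np.uint8)   # empty scene
frame = background.copy()
frame[1:3, 1:3] = 200                           # a bright "subject" enters
mask = subtract_background(frame, background)
print(mask.sum())  # 4 foreground pixels (the 2x2 subject region)
```

The resulting foreground mask is what the later stages (pixel clustering into body parts, then limb-level feature extraction) would operate on.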
Boosting Image-based Mutual Gaze Detection using Pseudo 3D Gaze
Mutual gaze detection, i.e., predicting whether or not two people are looking
at each other, plays an important role in understanding human interactions. In
this work, we focus on the task of image-based mutual gaze detection, and
propose a simple and effective approach to boost the performance by using an
auxiliary 3D gaze estimation task during the training phase. We achieve the
performance boost without additional labeling cost by training the 3D gaze
estimation branch using pseudo 3D gaze labels deduced from mutual gaze labels.
By sharing the head image encoder between the 3D gaze estimation and the mutual
gaze detection branches, we achieve better head features than those learned by
training the mutual gaze detection branch alone. Experimental results on three
image datasets show that the proposed approach improves the detection
performance significantly without additional annotations. This work also
introduces a new image dataset that consists of 33.1K pairs of humans annotated
with mutual gaze labels in 29.2K images.
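The pseudo-label construction can be illustrated with a small geometric sketch. The rule below is an assumption of ours about how such labels might be deduced (the abstract does not spell it out): when a pair is annotated as mutual gaze, each person's 3D gaze can be approximated by the unit vector from their own head position toward the other's.

```python
import numpy as np

def pseudo_gaze_label(head_a, head_b):
    """Pseudo 3D gaze label for person A: unit vector toward person B's head."""
    d = np.asarray(head_b, dtype=float) - np.asarray(head_a, dtype=float)
    return d / np.linalg.norm(d)

# A mutual-gaze pair yields opposite pseudo labels for the two people.
gaze_a = pseudo_gaze_label([0, 0, 0], [3, 0, 4])
gaze_b = pseudo_gaze_label([3, 0, 4], [0, 0, 0])
print(gaze_a)  # [0.6 0.  0.8]
```

Training a shared head-image encoder to regress such labels is what gives the mutual-gaze branch its "free" auxiliary supervision.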
LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos
Analyzing the interactions between humans and objects from a video includes
identification of the relationships between humans and the objects present in
the video. It can be thought of as a specialized version of Visual Relationship
Detection, wherein one of the objects must be a human. While traditional
methods formulate the problem as inference on a sequence of video segments, we
present a hierarchical approach, LIGHTEN, to learn visual features to
effectively capture spatio-temporal cues at multiple granularities in a video.
Unlike current approaches, LIGHTEN avoids using ground truth data like depth
maps or 3D human pose, thus increasing generalization across non-RGBD datasets
as well. Furthermore, we achieve the same using only the visual features,
instead of the commonly used hand-crafted spatial features. We achieve
state-of-the-art results (88.9% and 92.6%) on the human-object interaction
detection and anticipation tasks of CAD-120, and competitive results on
image-based HOI detection on the V-COCO dataset, setting a new benchmark for
visual-feature-based approaches. Code for LIGHTEN is available at
https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI
Comment: 9 pages, 6 figures, ACM Multimedia Conference 2020.
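As a rough illustration of the graph side of such an approach (not LIGHTEN's actual architecture — the node features, graph, and update rule below are invented for the sketch), one message-passing step over a human-object graph lets each node absorb the averaged features of its neighbors:

```python
import numpy as np

feats = np.array([[1.0, 0.0],   # node 0: the human
                  [0.0, 1.0],   # node 1: object A
                  [0.5, 0.5]])  # node 2: object B
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)   # human connected to each object

deg = adj.sum(axis=1, keepdims=True)       # neighbor counts, for averaging
updated = feats + (adj @ feats) / deg      # residual average-neighbor update
print(updated)
```

Stacking such spatial updates with a temporal network over frames is the general shape of "graph + hierarchical temporal" models for video HOI.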
A Multimodal Human-Robot Interaction Dataset
This work presents a multimodal dataset for Human-Robot Interactive Learning. The dataset contains synchronized recordings of several human users, from a stereo microphone and three cameras mounted on the robot. The focus of the dataset is incremental object learning, oriented to human-robot assistance and interaction. To learn new object models from interactions with a human user, the robot needs to be able to perform multiple tasks: (a) recognize the type of interaction (pointing, showing or speaking), (b) segment regions of interest from acquired data (hands and objects), and (c) learn and recognize object models. We illustrate the advantages of multimodal data over camera-only datasets by presenting an approach that recognizes the user interaction by combining simple image and language features.
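The interaction-recognition idea at the end — combining simple image and language features — can be sketched as feature concatenation followed by a nearest-centroid classifier. The feature definitions and classifier below are purely our assumptions, chosen to make the fusion idea concrete:

```python
import numpy as np

# Hypothetical per-clip features: [image cue (hand motion), language cue (speech activity)].
train = {
    "pointing": np.array([[0.9, 0.1], [0.8, 0.0]]),
    "showing":  np.array([[0.5, 0.2], [0.6, 0.1]]),
    "speaking": np.array([[0.1, 0.9], [0.0, 0.8]]),
}
centroids = {k: v.mean(axis=0) for k, v in train.items()}

def classify(feat):
    """Nearest-centroid interaction label for a combined feature vector."""
    return min(centroids, key=lambda k: np.linalg.norm(feat - centroids[k]))

print(classify(np.array([0.05, 0.85])))  # speaking
```

The point of the fusion is that clips ambiguous in one modality (e.g. little visible motion) can still be separated by the other.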
Weakly supervised learning of interactions between humans and objects
We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e. the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action dataset from Gupta et al., the PASCAL Action 2010 dataset, and a new human-object interaction dataset.
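The final probabilistic model of the spatial relation can be sketched under assumptions of our own (a single Gaussian over the object's offset from the human; the abstract does not give the exact parameterization) as fitting a mean and covariance to the offsets observed across the weakly labeled images:

```python
import numpy as np

# Hypothetical object offsets relative to the detected human, in
# human-bounding-box units: here the object sits roughly one unit to the right.
offsets = np.array([[0.9, -0.1],
                    [1.1,  0.1],
                    [1.0,  0.0],
                    [1.0,  0.2]])

mu = offsets.mean(axis=0)               # typical object location
cov = np.cov(offsets, rowvar=False)     # spread of that location

def relation_score(offset):
    """Unnormalized Gaussian score for a candidate object location."""
    d = np.asarray(offset, dtype=float) - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))
```

At test time, such a score lets the model prefer object hypotheses that sit where the action's object usually sits relative to the human.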