4 research outputs found

    Pictorial Human Spaces : A Computational Study on the Human Perception of 3D Articulated Poses

    No full text
    Human motion analysis in images and video, with its deeply inter-related 2D and 3D inference components, is a central computer vision problem. Yet there are no studies that reveal how humans perceive other people in images or how accurate they are. In this paper we aim to unveil some of the processing, as well as the levels of accuracy, involved in the 3D perception of people from images by assessing human performance. Moreover, we reveal the quantitative and qualitative differences between human and computer performance when presented with the same visual stimuli, and show that metrics incorporating human perception can produce more meaningful results when integrated into automatic pose prediction algorithms. Our contributions are: (1) the construction of an experimental apparatus that relates perception and measurement, in particular visual and kinematic performance with respect to 3D ground truth when a human subject is presented with an image of a person in a given pose; (2) the creation of a dataset containing images, articulated 2D and 3D pose ground truth, and synchronized eye movement recordings of human subjects shown a variety of human body configurations, both easy and difficult, together with their 're-enacted' 3D poses; (3) a quantitative analysis revealing human performance in 3D pose re-enactment tasks, the degree of stability in the visual fixation patterns of human subjects, and the way these correlate with different poses; (4) an extensive analysis of the differences between human re-enactments and poses produced by an automatic system when presented with the same visual stimuli; (5) an approach to learning perceptual metrics that, when integrated into visual sensing systems, produces more stable and meaningful results.
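
    As a concrete illustration of contribution (5), the following is a minimal sketch of one way a perceptual pose metric could be learned; it is an assumption for illustration, not the paper's implementation. Non-negative per-joint weights are fit so that a weighted per-joint 3D error approximates human dissimilarity judgments; all names, shapes, and the skeleton size are illustrative.

    # Minimal sketch (an assumption, not the authors' implementation) of
    # learning a perceptual pose metric: fit non-negative per-joint weights
    # so a weighted per-joint 3D error approximates human ratings.
    import numpy as np

    NUM_JOINTS = 17  # assumed skeleton size

    def per_joint_errors(pred, gt):
        # pred, gt: (N, NUM_JOINTS, 3) arrays of 3D joint positions.
        # Returns an (N, NUM_JOINTS) matrix of Euclidean distances.
        return np.linalg.norm(pred - gt, axis=-1)

    def fit_perceptual_weights(pred, gt, human_scores):
        # Least-squares fit of joint weights so the weighted error tracks
        # human dissimilarity judgments; clip to keep weights non-negative.
        E = per_joint_errors(pred, gt)                  # (N, NUM_JOINTS)
        w, *_ = np.linalg.lstsq(E, human_scores, rcond=None)
        return np.clip(w, 0.0, None)

    def perceptual_error(pred, gt, w):
        # Weighted per-joint error: larger weights mark joints whose
        # displacement human observers are more sensitive to.
        return per_joint_errors(pred, gt) @ w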

    3D Human Sensing, Action and Emotion Recognition in Robot Assisted Therapy of Children with Autism

    No full text
    We introduce new, fine-grained action and emotion recognition tasks defined on non-staged videos recorded during robot-assisted therapy sessions of children with autism. The tasks present several challenges: a large dataset with long videos; a large number of highly variable actions; children who are only partially visible, have different ages, and may show unpredictable behaviour; and non-standard camera viewpoints. We investigate how state-of-the-art 3D human pose reconstruction methods perform on the newly introduced tasks and propose extensions that adapt them to these challenges. We also analyze multiple approaches to action and emotion recognition from 3D human pose data, establish several baselines, and discuss the results and their implications in the context of child-robot interaction.
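
    To make the baseline discussion concrete, here is a minimal sketch, under assumed data shapes, of one simple pose-based recognition baseline: each clip's 3D joint trajectories are summarized with coarse temporal statistics and fed to a linear classifier. This is an illustrative assumption, not one of the paper's actual baselines.

    # Minimal baseline sketch (illustrative, not the paper's method) for
    # action recognition from 3D pose sequences: summarize each clip with
    # simple temporal statistics and train a linear classifier.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def clip_features(poses):
        # poses: (T, J, 3) sequence of 3D joint positions for one clip.
        # Descriptor: per-coordinate mean and std over time, plus mean
        # frame-to-frame joint speed as a coarse motion cue.
        flat = poses.reshape(len(poses), -1)                     # (T, J*3)
        speed = np.linalg.norm(np.diff(poses, axis=0), axis=-1)  # (T-1, J)
        return np.concatenate([flat.mean(0), flat.std(0), speed.mean(0)])

    def train_baseline(clips, labels):
        # clips: list of (T_i, J, 3) arrays; labels: one action id per clip.
        X = np.stack([clip_features(c) for c in clips])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, labels)
        return clf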

    Deep network for the integrated 3D sensing of multiple people in natural images

    No full text
    We present MubyNet, a feed-forward, multitask, bottom-up system for the integrated localization, 3D pose, and shape estimation of multiple people in monocular images. The challenge is the formal modeling of a problem that intrinsically requires both discrete and continuous computation, e.g. grouping people vs. predicting 3D pose. The model identifies human body structures (joints and limbs) in images, groups them based on 2D and 3D information fused using learned scoring functions, and optimally aggregates such responses into partial or complete 3D human skeleton hypotheses under kinematic tree constraints, without knowing in advance the number of people in the scene or their visibility relations. We design a multi-task deep neural network with differentiable stages in which the person grouping problem is formulated as an integer program based on learned body part scores parameterized by both 2D and 3D information. This avoids the suboptimality of separate 2D and 3D reasoning, since grouping is performed on the combined representation. The final stage of 3D pose and shape prediction is based on a learned attention process in which information from different human body parts is optimally integrated. State-of-the-art results are obtained on large-scale datasets such as Human3.6M and Panoptic, and qualitatively by reconstructing the 3D shape and pose of multiple people, under occlusion, in difficult monocular images.
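
    The grouping step can be made concrete with a small sketch. Below, each candidate limb carries a learned score (in the abstract's spirit, one combining 2D and 3D cues), and a binary integer program selects limbs so that each joint detection is used by at most one selected limb. This is a simplified illustration of the idea, not MubyNet's actual formulation; all names and the toy constraint structure are assumptions.

    # Simplified sketch of person grouping as a binary integer program:
    # maximize the total learned limb score subject to each joint detection
    # being used by at most one selected limb. Requires SciPy >= 1.9.
    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    def group_limbs(scores, limb_joints, num_detections):
        # scores: (L,) learned score per candidate limb (higher is better).
        # limb_joints: list of (det_a, det_b) detection indices per limb.
        # Returns the indices of the selected limbs.
        L = len(scores)
        A = np.zeros((num_detections, L))
        for l, (a, b) in enumerate(limb_joints):
            A[a, l] = 1.0
            A[b, l] = 1.0
        res = milp(
            c=-np.asarray(scores),            # minimize negated total score
            constraints=LinearConstraint(A, ub=np.ones(num_detections)),
            integrality=np.ones(L),           # all variables integer
            bounds=Bounds(0, 1),              # hence binary
        )
        return np.flatnonzero(res.x > 0.5)

    # Toy usage: limbs (0,1) and (1,2) share detection 1, so the solver
    # keeps the higher-scoring one and adds the compatible limb (2,3).
    picked = group_limbs([0.9, 0.8, 0.3], [(0, 1), (1, 2), (2, 3)], 4)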