
    Unsupervised learning of human motion

    We present an unsupervised learning algorithm that automatically obtains a probabilistic model of an object composed of a collection of parts (a moving human body in our examples) from unlabeled training data. The training data include both useful "foreground" features and features that arise from irrelevant background clutter; the correspondence between parts and detected features is unknown. The joint probability density function of the parts is represented by a mixture of decomposable triangulated graphs, which allow for fast detection. To learn the model structure as well as the model parameters, an EM-like algorithm is developed in which the labeling of the data (part assignments) is treated as hidden variables. The unsupervised learning technique is not limited to decomposable triangulated graphs. The efficiency and effectiveness of our algorithm are demonstrated by applying it to generate models of human motion automatically from unlabeled image sequences and by testing the learned models on a variety of sequences.
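
    As a rough illustration of the EM idea described above, the sketch below treats part-to-feature assignments as hidden variables and alternates soft assignment with model refitting. Isotropic Gaussian parts and the toy data are simplifying assumptions; the paper itself learns mixtures of decomposable triangulated graphs.

```python
import numpy as np

def em_parts(features, n_parts, n_iter=50, seed=0):
    """Toy EM loop: part-to-feature assignments are hidden variables.
    The E-step soft-assigns detected features to parts; the M-step refits
    each part model. Isotropic Gaussian parts stand in for the paper's
    decomposable triangulated graphs."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    means = features[rng.choice(n, n_parts, replace=False)]
    var = np.full(n_parts, features.var())
    weights = np.full(n_parts, 1.0 / n_parts)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each part for each feature
        sq = ((features[:, None, :] - means[None]) ** 2).sum(-1)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: refit part positions, spreads, and mixing weights
        nk = resp.sum(axis=0) + 1e-9
        means = (resp.T @ features) / nk[:, None]
        sq = ((features[:, None, :] - means[None]) ** 2).sum(-1)
        var = (resp * sq).sum(axis=0) / (d * nk) + 1e-6
        weights = nk / n
    return means, resp

# Illustrative data: three "parts" plus uniform background clutter
rng = np.random.default_rng(1)
parts = np.concatenate([rng.normal(c, 0.3, size=(40, 2))
                        for c in [(0, 0), (3, 0), (1.5, 2)]])
clutter = rng.uniform(-2, 5, size=(30, 2))
means, resp = em_parts(np.vstack([parts, clutter]), n_parts=3)
print(np.round(means, 2))  # recovered part centers
```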

    Towards Detection of Human Motion

    Detecting humans in images is a useful application of computer vision. Loose and textured clothing, occlusion, and scene clutter make it a difficult problem because bottom-up segmentation and grouping do not always work. We address the problem of detecting humans from their motion pattern in monocular image sequences; extraneous motions and occlusion may be present. We assume that we cannot rely on segmentation or grouping, and that the vision front-end is limited to observing the motion of key points and textured patches between pairs of frames. We do not assume that we are able to track features for more than two frames. Our method is based on learning an approximate probabilistic model of the joint position and velocity of different body features. Detection is performed by hypothesis testing on the maximum a posteriori estimate of the pose and motion of the body. Our experiments on a dozen walking sequences indicate that our algorithm is accurate and efficient.
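
    The hypothesis-testing step can be sketched with a single joint Gaussian over stacked feature positions and velocities. The flat Gaussian model, the fixed background score, and the synthetic data below are all illustrative assumptions, simpler than the approximate probabilistic model the paper learns.

```python
import numpy as np

def fit_body_model(examples):
    """Fit a joint Gaussian over stacked feature positions and velocities
    measured on training frame pairs of walking people."""
    mu = examples.mean(axis=0)
    cov = np.cov(examples, rowvar=False) + 1e-6 * np.eye(examples.shape[1])
    return mu, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

def log_gaussian(x, mu, cov_inv, logdet):
    """Log-density of x under the fitted Gaussian body model."""
    diff = x - mu
    return -0.5 * (diff @ cov_inv @ diff + logdet + len(x) * np.log(2 * np.pi))

def person_present(candidate, model, log_bg, threshold=0.0):
    """Hypothesis test: accept 'person present' when the log-likelihood
    ratio of the candidate under the body model versus the background
    model exceeds the threshold."""
    mu, cov_inv, logdet = model
    return log_gaussian(candidate, mu, cov_inv, logdet) - log_bg > threshold

# Illustrative run with synthetic position+velocity vectors
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 8))  # 2 features x (x, y, vx, vy)
model = fit_body_model(train)
print(person_present(train[0], model, log_bg=-20.0))
```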

    Object and pattern detection in video sequences

    Thesis (M.S.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (p. 60-62). By Constantine Phaedon Papageorgiou.

    Vision for Social Robots: Human Perception and Pose Estimation

    In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene. The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.

    First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.

    Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.

    Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
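
    One way to picture the weak supervision described in the second part is a training loss that combines a 2D reprojection term with a prior on limb lengths, both available without 3D ground truth. The sketch below is hypothetical (orthographic projection, a hand-picked weight, made-up joint data), not the thesis's actual objective.

```python
import numpy as np

def weak_supervision_loss(pred_3d, joints_2d, bones, bone_lengths, w_bone=1.0):
    """Sum of (i) the reprojection error between the orthographically
    projected 3D prediction and the observed 2D joints, and (ii) a
    body-prior term keeping predicted limb lengths near known
    proportions. Neither term requires 3D ground truth."""
    reproj = ((pred_3d[:, :2] - joints_2d) ** 2).sum()
    lengths = np.array([np.linalg.norm(pred_3d[a] - pred_3d[b]) for a, b in bones])
    prior = ((lengths - bone_lengths) ** 2).sum()
    return reproj + w_bone * prior

# Illustrative call: 4 joints forming a 3-bone chain
pred_3d = np.array([[0, 0, 0], [0, 1, 0.1], [0, 2, 0.2], [0, 3, 0.1]], float)
joints_2d = pred_3d[:, :2] + 0.01  # slightly perturbed 2D observations
bones = [(0, 1), (1, 2), (2, 3)]
print(weak_supervision_loss(pred_3d, joints_2d, bones, np.ones(3)))
```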

    View-Invariance in Visual Human Motion Analysis

    This thesis makes contributions towards the solutions to two problems in the area of visual human motion analysis: human action recognition and human body pose estimation. Although there has been a substantial amount of research addressing these two problems in the past, the important issue of viewpoint invariance in the representation and recognition of poses and actions has received relatively scarce attention, and forms a key goal of this thesis. Drawing on results from 2D projective invariance theory and 3D mutual invariants, we present three different approaches, of varying degrees of generality, for human action representation and recognition. A detailed analysis of the approaches reveals key challenges, which are circumvented by enforcing spatial and temporal coherency constraints. An extensive performance evaluation of the approaches on 2D projections of motion capture data and manually segmented real image sequences demonstrates that, in addition to viewpoint changes, the approaches cope well with varying speeds of execution of actions (and hence different frame rates of the video), different subjects, and minor variabilities in the spatiotemporal dynamics of the action. Next, we present a method for recovering the body-centric coordinates of key joints and parts of a canonically scaled human body, given an image of the body and point correspondences of specific body joints in that image. This problem is difficult to solve because of body articulation and perspective effects. To make the problem tractable, previous researchers have resorted to restricting the camera model or requiring an unrealistic number of point correspondences, both of which are more restrictive than necessary. We present a solution for the general case of a perspective uncalibrated camera. Our method requires that the torso not twist considerably, an assumption that is usually satisfied for many poses of the body. We evaluate the quantitative performance of the method on synthetic data and the qualitative performance of the method on real images taken with unknown cameras and viewpoints. Both evaluations show the effectiveness of the method at recovering the pose of the human body.
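
    The classic 2D projective invariant behind this line of work is the cross-ratio of four collinear points, which is preserved by every homography and hence by every change of viewpoint. The sketch below (with an arbitrary example homography) illustrates the property such representations build on; it is not the thesis's representation itself.

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio of four collinear 2D points: invariant under any
    projective transformation, so it is unchanged by viewpoint."""
    d = lambda a, b: np.linalg.norm(a - b)
    return (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))

# Illustrative check: apply an arbitrary homography and compare
pts = np.array([[t, 2 * t] for t in [0.0, 1.0, 2.5, 4.0]])  # collinear points
H = np.array([[1.1, 0.2, 0.3], [0.1, 0.9, -0.2], [0.01, 0.02, 1.0]])
homog = np.c_[pts, np.ones(4)] @ H.T
warped = homog[:, :2] / homog[:, 2:]
print(cross_ratio(*pts), cross_ratio(*warped))  # nearly identical values
```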

    Modelling human pose and shape based on a database of human 3D scans

    Generating realistic human shapes and motion is an important task both in the motion picture industry and in computer games. In feature films, high quality and believability are the most important characteristics. Additionally, when creating virtual doubles, the generated characters have to match given real persons as closely as possible. In contrast, in computer games the level of realism does not need to be as high, but real-time performance is essential. It is desirable to meet all these requirements with a general model of human pose and shape. In addition, many markerless human tracking methods applied, e.g., in biomedicine or sports science can benefit greatly from the availability of such a model, because most methods require a 3D model of the tracked subject as input, which can be generated on the fly given a suitable shape and pose model. In this thesis, a comprehensive procedure is presented to generate different general models of human pose and shape. A database of 3D scans spanning the space of human pose and shape variations is introduced. Then, four different approaches for transforming the database into a general model of human pose and shape are presented, which improve the current state of the art. Experiments are performed to evaluate and compare the proposed models on real-world problems: characters are generated from semantic constraints, and the underlying shape and pose of humans is estimated from 3D scans, multi-view video, or uncalibrated monocular images.
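
    The simplest instance of turning a registered scan database into a general shape model is PCA over corresponding vertices. The sketch below assumes every scan is already in full vertex correspondence and uses random stand-in data; it illustrates only the linear-shape-space idea, not the four models developed in the thesis.

```python
import numpy as np

def build_shape_space(scans, n_components=10):
    """Toy linear shape space: PCA over registered scan vertices.
    `scans` has shape (n_scans, n_vertices * 3); all scans must be in
    full vertex correspondence, as in a registered scan database."""
    mean = scans.mean(axis=0)
    _, _, Vt = np.linalg.svd(scans - mean, full_matrices=False)
    basis = Vt[:n_components]           # principal directions of shape variation
    coeffs = (scans - mean) @ basis.T   # low-dimensional coordinates per scan
    return mean, basis, coeffs

def synthesize(mean, basis, coeffs):
    """Generate a new body shape from low-dimensional coefficients."""
    return mean + coeffs @ basis

# Illustrative round trip on random stand-in "scans"
scans = np.random.default_rng(0).normal(size=(20, 300))  # 20 scans, 100 vertices
mean, basis, coeffs = build_shape_space(scans, n_components=5)
print(synthesize(mean, basis, coeffs[0]).shape)  # (300,)
```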

    A Trainable System for Object Detection in Images and Video Sequences

    This thesis presents a general, trainable system for object detection in static images and video sequences. The core system finds a certain class of objects in static images of completely unconstrained, cluttered scenes without using motion, tracking, or handcrafted models, and without making any assumptions about the scene structure or the number of objects in the scene. The system takes a set of positive and negative example images as training data, transforms the pixel images into a Haar wavelet representation, and uses a support vector machine classifier to learn the difference between in-class and out-of-class patterns. To detect objects in out-of-sample images, we do a brute-force search over all the subwindows in the image. This system is applied to face, people, and car detection with excellent results. For our extensions to video sequences, we augment the core static detection system in several ways: 1) extending the representation to five frames, 2) implementing an approximation to a Kalman filter, and 3) modeling detections in an image as a density and propagating this density through time according to measured features. In addition, we present a real-time version of the system that is currently running in a DaimlerChrysler experimental vehicle. As part of this thesis, we also present a system that, instead of detecting full patterns, uses a component-based approach; we find it to be more robust to occlusions, rotations in depth, and severe lighting conditions for people detection than the full-body version. We also experiment with various other representations, including pixels and principal components, and show results that quantify how the number of features, color, and gray level affect performance.
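
    The core pipeline (wavelet features, an SVM classifier, and a brute-force scan over subwindows) can be sketched as follows. The 2x2 Haar-like responses and the dummy decision function below are simplifying stand-ins for the dense Haar wavelet representation and a trained SVM.

```python
import numpy as np

def haar_features(window):
    """Crude 2x2 Haar-like responses pooled over the window; a stand-in
    for the dense Haar wavelet transform used as the representation."""
    a, b = window[0::2, 0::2], window[0::2, 1::2]  # top-left, top-right of 2x2 cells
    c, d = window[1::2, 0::2], window[1::2, 1::2]  # bottom-left, bottom-right
    return np.concatenate([(a + b - c - d).ravel(),   # top vs bottom: horizontal edges
                           (a + c - b - d).ravel(),   # left vs right: vertical edges
                           (a + d - b - c).ravel()])  # diagonal structure

def detect(image, decision, win=(32, 16), stride=4):
    """Brute-force search: classify every subwindow with a trained
    classifier's decision function (e.g., a linear SVM's score)."""
    h, w = win
    hits = []
    for y in range(0, image.shape[0] - h + 1, stride):
        for x in range(0, image.shape[1] - w + 1, stride):
            if decision(haar_features(image[y:y + h, x:x + w])) > 0:
                hits.append((x, y))
    return hits

# Illustrative run with a random image and a dummy decision function
img = np.random.default_rng(0).normal(size=(64, 48))
print(len(detect(img, decision=lambda f: f.sum())))
```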