15 research outputs found
Unsupervised learning of human motion
We present an unsupervised learning algorithm that can automatically obtain a probabilistic model of an object composed of a collection of parts (a moving human body, in our examples) from unlabeled training data. The training data include both useful "foreground" features and features that arise from irrelevant background clutter; the correspondence between parts and detected features is unknown. The joint probability density function of the parts is represented by a mixture of decomposable triangulated graphs, which allow for fast detection. To learn the model structure as well as the model parameters, an EM-like algorithm is developed in which the labeling of the data (the part assignments) is treated as hidden variables. The unsupervised learning technique is not limited to decomposable triangulated graphs. The efficiency and effectiveness of our algorithm are demonstrated by applying it to generate models of human motion automatically from unlabeled image sequences, and by testing the learned models on a variety of sequences.
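The EM-like loop the abstract describes can be sketched in miniature. The toy below runs EM on a one-dimensional Gaussian mixture, where the E-step's soft assignment of observations to components plays the role of the hidden part labeling; the function names and the restriction to 1-D Gaussians are illustrative assumptions, not the paper's actual mixture of decomposable triangulated graphs.

```python
import numpy as np

def em_gaussian_mixture(x, k, iters=50):
    """EM for a 1-D Gaussian mixture. The E-step soft-assigns each
    observation to a component (the hidden labeling); the M-step
    re-estimates the component parameters from those assignments."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # spread-out init
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return pi, mu, var
```

The real algorithm additionally searches over model structure (which triangulated graph to use); the loop above only fits parameters for a fixed structure.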
Towards Detection of Human Motion
Detecting humans in images is a useful application of computer vision. Loose and textured clothing, occlusion, and scene clutter make it a difficult problem, because bottom-up segmentation and grouping do not always work. We address the problem of detecting humans from their motion pattern in monocular image sequences, where extraneous motions and occlusion may be present. We assume that we may rely on neither segmentation nor grouping, and that the vision front-end is limited to observing the motion of key points and textured patches between pairs of frames. We do not assume that we are able to track features for more than two frames. Our method is based on learning an approximate probabilistic model of the joint position and velocity of different body features. Detection is performed by hypothesis testing on the maximum a posteriori estimate of the pose and motion of the body. Our experiments on a dozen walking sequences indicate that our algorithm is accurate and efficient.
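A minimal sketch of the kind of probabilistic detection the abstract describes: fit a Gaussian to joint position/velocity feature vectors from example sequences, then accept a detection when the log-likelihood exceeds a background threshold. The single-Gaussian model and all names are illustrative assumptions, not the paper's full model or its MAP search over feature-to-body assignments.

```python
import numpy as np

def fit_gaussian(samples):
    """Fit mean and covariance of joint position/velocity vectors.
    `samples` is (n_examples, n_features)."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False) + 1e-6 * np.eye(samples.shape[1])
    return mu, cov

def log_likelihood(x, mu, cov):
    """Multivariate Gaussian log-density of one observation."""
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    k = len(mu)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + k * np.log(2 * np.pi))

def detect(x, mu, cov, log_threshold):
    """Hypothesis test: declare 'person present' when the model
    log-likelihood beats a background (null-hypothesis) threshold."""
    return log_likelihood(x, mu, cov) > log_threshold
```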
System for tracking changes in the posture of a person performing an activity in an enclosed space
Undergraduate thesis, Electronic Engineering
Object and pattern detection in video sequences
Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997. Includes bibliographical references (p. 60-62). By Constantine Phaedon Papageorgiou.
Vision for Social Robots: Human Perception and Pose Estimation
In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene.
The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.
First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.
Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.
Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
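One common way to "leverage knowledge about the physical properties of the human body" under weak supervision is a bone-length consistency penalty: predicted 3D bone lengths are pushed toward canonical human proportions even when no 3D ground truth exists. The sketch below is a generic illustration of that idea, not the thesis's actual loss; the skeleton and all names are hypothetical.

```python
import numpy as np

# Hypothetical 4-joint chain; bones given as (parent, child) index pairs.
BONES = [(0, 1), (1, 2), (2, 3)]

def bone_length_loss(pred_3d, canonical_lengths):
    """Mean squared deviation of predicted bone lengths from canonical
    lengths. `pred_3d` is (n_joints, 3); `canonical_lengths` has one
    reference length per bone in BONES."""
    losses = []
    for (a, b), ref in zip(BONES, canonical_lengths):
        length = np.linalg.norm(pred_3d[b] - pred_3d[a])
        losses.append((length - ref) ** 2)
    return float(np.mean(losses))
```

In a weakly supervised setup this term would be added to whatever 2D reprojection loss the training data does support.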
View-Invariance in Visual Human Motion Analysis
This thesis contributes to the solutions of two problems in visual human motion analysis: human action recognition and human body pose estimation. Although there has been a substantial amount of past research on both problems, the important issue of viewpoint invariance in the representation and recognition of poses and actions has received relatively little attention, and forms a key goal of this thesis.

Drawing on results from 2D projective invariance theory and 3D mutual invariants, we present three approaches, of varying degrees of generality, for human action representation and recognition. A detailed analysis of the approaches reveals key challenges, which are circumvented by enforcing spatial and temporal coherency constraints. An extensive performance evaluation on 2D projections of motion capture data and on manually segmented real image sequences demonstrates that, in addition to viewpoint changes, the approaches handle varying speeds of action execution (and hence different video frame rates), different subjects, and minor variability in the spatiotemporal dynamics of the action.

Next, we present a method for recovering the body-centric coordinates of key joints and parts of a canonically scaled human body, given an image of the body and the point correspondences of specific body joints in the image. This problem is difficult to solve because of body articulation and perspective effects. To make it tractable, previous researchers have resorted to restricting the camera model or to requiring an unrealistically large number of point correspondences, both of which are more restrictive than necessary. We present a solution for the general case of a perspective, uncalibrated camera. Our method requires only that the torso not twist considerably, an assumption that is satisfied by many poses of the body. We evaluate the method quantitatively on synthetic data and qualitatively on real images taken with unknown cameras and viewpoints. Both evaluations show the effectiveness of the method at recovering the pose of the human body.
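The classic quantity behind 2D projective invariance is the cross-ratio of four collinear points, which is preserved by any projective transformation. A small illustration of that building block (not the thesis's specific approach):

```python
import numpy as np

def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio (d13 * d24) / (d14 * d23) of four collinear 2-D
    points -- invariant under projective transformations of the line."""
    def d(a, b):
        return np.linalg.norm(np.asarray(a) - np.asarray(b))
    return (d(p1, p3) * d(p2, p4)) / (d(p1, p4) * d(p2, p3))
```

For points at parameters 0, 1, 2, 4 on a line, the cross-ratio is 1.5, and it stays 1.5 after applying a projective map of the line such as t ↦ (3t + 1)/(t/2 + 2).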
Modelling human pose and shape based on a database of human 3D scans
Generating realistic human shapes and motion is an important task both in the motion picture industry and in computer games. In feature films, high quality and believability are the most important characteristics; additionally, when creating virtual doubles, the generated characters have to match given real persons as closely as possible. In contrast, in computer games the level of realism does not need to be as high, but real-time performance is essential. It is desirable to meet all these requirements with a general model of human pose and shape. In addition, many markerless human tracking methods applied, e.g., in biomedicine or sports science can benefit greatly from the availability of such a model, because most methods require a 3D model of the tracked subject as input, which can be generated on the fly given a suitable shape and pose model.
In this thesis, a comprehensive procedure is presented to generate different general models of human pose and shape. A database of 3D scans spanning the space of human pose and shape variations is introduced. Then, four different approaches for transforming the database into a general model of human pose and shape are presented, which improve on the current state of the art. Experiments are performed to evaluate and compare the proposed models on real-world problems: characters are generated given semantic constraints, and the underlying shape and pose of humans is estimated from 3D scans, multi-view video, or uncalibrated monocular images.

The generation of realistic human models is an important application in the film industry and in computer games. In games, real-time synthesis is indispensable, but the level of detail need not be as high as in films. For virtual doubles, as used, e.g., in films, the generated character must resemble the given real person as closely as possible. With a general model of human pose and body shape, all these requirements can be met. In addition, many markerless motion capture methods, as used, e.g., in biomedicine or sports science, can benefit from a general model of pose and body shape, since such methods require a 3D model of the captured person, which can then be generated at runtime. In this doctoral thesis, a comprehensive approach for computing different models of pose and body shape is presented. First, a database of 3D scans covering human pose and body shape variations is built. Then, four different methods are introduced that compute general models of pose and body shape from it, remedying problems in the state of the art. The proposed models are tested on realistic problems: human models are generated from a few constraints, and the pose and body shape of subjects is estimated from 3D scans, multi-camera video data, and single images of the clothed persons.
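A common way to turn a database of corresponded 3D scans into a general shape model is PCA: a mean shape plus principal modes of variation, with new shapes synthesized from mode coefficients. The sketch below illustrates that generic idea; the thesis's four models may differ substantially, and all names here are assumptions.

```python
import numpy as np

def build_shape_model(scans, n_components):
    """PCA shape model from registered scans. `scans` is
    (n_scans, n_vertices * 3), with vertices in correspondence."""
    mean = scans.mean(axis=0)
    centered = scans - mean
    # SVD of the centered data yields the principal modes directly.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components], s[:n_components]

def synthesize(mean, modes, coeffs):
    """Generate a new shape as the mean plus a weighted sum of modes."""
    return mean + coeffs @ modes
```

Fitting the model to a new scan then reduces to projecting it onto the modes, i.e. `coeffs = (scan - mean) @ modes.T`.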
A Trainable System for Object Detection in Images and Video Sequences
This thesis presents a general, trainable system for object detection in static images and video sequences. The core system finds a certain class of objects in static images of completely unconstrained, cluttered scenes without using motion, tracking, or handcrafted models, and without making any assumptions about the scene structure or the number of objects in the scene. The system takes a training set of positive and negative example images as input, transforms the pixel images to a Haar wavelet representation, and uses a support vector machine classifier to learn the difference between in-class and out-of-class patterns. To detect objects in out-of-sample images, we perform a brute-force search over all subwindows in the image. This system is applied to face, people, and car detection with excellent results. For our extensions to video sequences, we augment the core static detection system in several ways: 1) extending the representation to five frames, 2) implementing an approximation to a Kalman filter, and 3) modeling detections in an image as a density and propagating this density through time according to measured features. In addition, we present a real-time version of the system that is currently running in a DaimlerChrysler experimental vehicle. As part of this thesis, we also present a system that, instead of detecting full patterns, uses a component-based approach; for people detection, we find it more robust to occlusions, rotations in depth, and severe lighting conditions than the full-body version. We also experiment with various other representations, including raw pixels and principal components, and show results that quantify how the number of features, color, and gray level affect performance.
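The brute-force subwindow search over a rectangle-difference feature can be sketched as follows; the single Haar-like feature and the bare threshold stand in for the thesis's full Haar wavelet representation and trained SVM, so names and thresholds here are illustrative.

```python
import numpy as np

def integral_image(img):
    """Summed-area table for O(1) rectangle sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from the integral image (exclusive ends)."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0: total -= ii[r0 - 1, c1 - 1]
    if c0 > 0: total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: total += ii[r0 - 1, c0 - 1]
    return total

def haar_vertical(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return (rect_sum(ii, r, c, r + h, c + half)
            - rect_sum(ii, r, c + half, r + h, c + w))

def detect(img, h, w, score_fn, threshold):
    """Brute-force search over every h-by-w subwindow; return the
    positions that the (already trained) scorer rates above threshold."""
    ii = integral_image(img.astype(float))
    hits = []
    for r in range(img.shape[0] - h + 1):
        for c in range(img.shape[1] - w + 1):
            if score_fn(ii, r, c, h, w) > threshold:
                hits.append((r, c))
    return hits
```

A real detector would replace `score_fn` with a classifier over a full feature vector and repeat the scan at multiple scales.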