
    Model-based viewpoint invariant human activity recognition from uncalibrated monocular video sequence

    There is growing interest in human activity recognition systems, motivated by their numerous promising applications in many domains. Despite much progress, most researchers have narrowed the problem to a fixed camera viewpoint, owing to the inherent difficulty of training their systems across all possible viewpoints. Fixed-viewpoint systems are impractical in real scenarios. Therefore, we attempt to relax the fixed-viewpoint assumption and present a novel and simple framework to recognize and classify human activities from an uncalibrated monocular video source from any viewpoint. The proposed framework comprises two stages: 3D human pose estimation and human activity recognition. In the pose estimation stage, we estimate 3D human pose by a simple search-based and tracking-based technique. In the activity recognition stage, we use Nearest Neighbor, with Dynamic Time Warping as a distance measure, to classify multivariate time series which emanate from streams of pose vectors from multiple video frames. We have performed some experiments to evaluate the accuracy of the two stages separately. The encouraging experimental results demonstrate the effectiveness of our framework.
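The activity recognition stage described above (Nearest Neighbor with Dynamic Time Warping over streams of pose vectors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `dtw_distance` and `classify_1nn`, the Euclidean per-frame cost, and the toy one-dimensional "pose" sequences are all assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two multivariate
    sequences of shape (T, D), e.g. T frames of D-dimensional pose vectors.
    The per-frame cost is Euclidean distance (an assumption)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = best cumulative cost aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # repeat frame of b
                                 cost[i, j - 1],      # repeat frame of a
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]

def classify_1nn(query, train_seqs, train_labels):
    """Return the label of the training sequence closest to `query` under DTW."""
    dists = [dtw_distance(query, s) for s in train_seqs]
    return train_labels[int(np.argmin(dists))]

if __name__ == "__main__":
    # Hypothetical toy data: 1-D "pose" streams of different lengths.
    train = [np.array([[0.0], [1.0], [2.0]]), np.array([[2.0], [1.0], [0.0]])]
    labels = ["up", "down"]
    slow_up = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
    print(classify_1nn(slow_up, train, labels))
```

Because DTW allows frames to be repeated during alignment, a slower performance of the same activity can still match its template closely, which is why it suits sequences of unequal length.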

    Vision for Social Robots: Human Perception and Pose Estimation

    In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene. The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention. First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error. Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. 
We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images. Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.

    How Does the Cerebral Cortex Work? Development, Learning, Attention, and 3D Vision by Laminar Circuits of Visual Cortex

    A key goal of behavioral and cognitive neuroscience is to link brain mechanisms to behavioral functions. The present article describes recent progress towards explaining how the visual cortex sees. Visual cortex, like many parts of perceptual and cognitive neocortex, is organized into six main layers of cells, as well as characteristic sub-lamina. Here it is proposed how these layered circuits help to realize the processes of development, learning, perceptual grouping, attention, and 3D vision through a combination of bottom-up, horizontal, and top-down interactions. A key theme is that the mechanisms which enable development and learning to occur in a stable way imply properties of adult behavior. These results thus begin to unify three fields: infant cortical development, adult cortical neurophysiology and anatomy, and adult visual perception. The identified cortical mechanisms promise to generalize to explain how other perceptual and cognitive processes work. Air Force Office of Scientific Research (F49620-01-1-0397); Office of Naval Research (N00014-01-1-0624)

    Exploiting projective geometry for view-invariant monocular human motion analysis in man-made environments

    Example-based approaches have been very successful for human motion analysis, but their accuracy strongly depends on the similarity of the viewpoint in testing and training images. In practice, roof-top cameras are widely used for video surveillance and are usually placed at a significant angle from the floor, which is different from typical training viewpoints. We present a methodology for view-invariant monocular human motion analysis in man-made environments in which we exploit some properties of projective geometry and the presence of numerous easy-to-detect straight lines. We also assume that observed people move on a known ground plane. First, we model body poses and silhouettes using a reduced set of training views. Then, during the online stage, the homography that relates the selected training plane to the input image points is calculated using the dominant 3D directions of the scene, the location on the ground plane and the camera view in both training and testing images. This homographic transformation is used to compensate for the changes in silhouette due to the novel viewpoint. In our experiments, we show that it can be employed in a bottom-up manner to align the input image to the training plane and process it with the corresponding view-based silhouette model, or top-down to project a candidate silhouette and match it in the image. We present qualitative and quantitative results on the CAVIAR dataset using both bottom-up and top-down types of framework and demonstrate the significant improvements of the proposed homographic alignment over a commonly used similarity transform.
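As a rough illustration of the homographic alignment step, the sketch below shows only how a given 3x3 homography maps 2D silhouette points from one plane to another. How the paper derives the homography from the dominant 3D directions and the ground plane is not reproduced here; the function name `apply_homography` and the example matrix are hypothetical.

```python
import numpy as np

def apply_homography(H, points):
    """Map 2D points of shape (N, 2) through a 3x3 homography H.
    Points are lifted to homogeneous coordinates, transformed,
    then divided by the third coordinate."""
    H = np.asarray(H, dtype=float)
    points = np.asarray(points, dtype=float)
    ones = np.ones((points.shape[0], 1))
    homog = np.hstack([points, ones])      # (N, 3) homogeneous points
    mapped = homog @ H.T                   # apply H to each point
    return mapped[:, :2] / mapped[:, 2:3]  # back to Cartesian coordinates

if __name__ == "__main__":
    # Hypothetical homography: a pure translation by (10, 5).
    H = np.array([[1.0, 0.0, 10.0],
                  [0.0, 1.0, 5.0],
                  [0.0, 0.0, 1.0]])
    silhouette = np.array([[0.0, 0.0], [1.0, 2.0]])
    print(apply_homography(H, silhouette))
```

Used bottom-up, such a mapping would warp input image points toward the training plane; used top-down, the inverse matrix (`np.linalg.inv(H)`) would project a candidate silhouette into the image.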