9 research outputs found

    Understanding First-Person and Third-Person Videos in Computer Vision

    Due to advancements in technology and social media, a large amount of visual information is created. There is a lot of interesting research in Computer Vision that considers visual information generated by either first-person (egocentric) or third-person (exocentric) cameras. Video data generated by YouTubers, surveillance cameras, and drones is referred to as third-person or exocentric video data, whereas first-person or egocentric data is generated by devices such as GoPro cameras and Google Glass. The exocentric view captures wide, global views, whereas the egocentric view captures the activities an actor performs with respect to objects. These two perspectives seem independent yet related. In Computer Vision, the two perspectives have been studied independently in domains such as Activity Recognition, Object Detection, Action Recognition, and Summarization, but their relationship and comparison are less discussed in the literature. This paper tries to bridge this gap by presenting a systematic study of first-person and third-person videos. Further, we implement an algorithm to classify videos as first-person or third-person, achieving a validation accuracy of 88.4% and an F1-score of 86.10% on the Charades dataset.
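    The abstract does not describe the classifier itself, so the sketch below is only an illustration of how such a first-person/third-person video classifier might be set up: per-frame features from a standard backbone (ResNet-18 here, an assumption) are mean-pooled over sampled frames and passed to a binary head. The backbone choice, frame sampling, and training details are all hypothetical.

```python
# Minimal, illustrative first-person vs. third-person video classifier.
# The paper's actual architecture is not given in the abstract; this assumes
# frame-level ResNet-18 features mean-pooled over time as a plausible baseline.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EgoExoClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)          # per-frame feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(512, 2)              # 0 = third-person, 1 = first-person

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)) # (B*T, 512)
        feats = feats.view(b, t, -1).mean(dim=1)   # temporal average pooling
        return self.head(feats)                    # (B, 2) logits

# Example: a batch of 2 clips, 8 sampled frames each, 224x224 RGB.
logits = EgoExoClassifier()(torch.randn(2, 8, 3, 224, 224))
```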

    4D Human Body Capture from Egocentric Video via 3D Scene Grounding

    We introduce a novel task of reconstructing a time series of second-person 3D human body meshes from monocular egocentric videos. The unique viewpoint and rapid embodied camera motion of egocentric videos raise additional technical barriers for human body capture. To address those challenges, we propose a simple yet effective optimization-based approach that leverages 2D observations of the entire video sequence and human-scene interaction constraints to estimate second-person human poses, shapes, and global motion grounded in the 3D environment captured from the egocentric view. We conduct detailed ablation studies to validate our design choices. Moreover, we compare our method with the previous state-of-the-art method for human motion capture from monocular video, and show that our method estimates more accurate human body poses and shapes under the challenging egocentric setting. In addition, we demonstrate that our approach produces more realistic human-scene interaction.
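    As an illustration of the optimization-based idea, the hedged sketch below fits 3D joint positions to 2D detections while a toy scene term discourages implausible placement. The actual method optimizes body-model parameters against a reconstructed 3D scene; the floor-plane penalty, pinhole projection helper, and all names here are assumptions.

```python
# Illustrative optimization loop: match 2D observations while penalizing
# violation of a simple scene constraint (a y-up floor plane stand-in).
import torch

def project(points_3d, K):
    """Pinhole projection of (N, 3) camera-frame points with intrinsics K."""
    uv = points_3d @ K.T
    return uv[:, :2] / uv[:, 2:3]

def fit_pose(joints_init, joints_2d, K, floor_y=0.0, steps=300, lr=1e-2, w_scene=10.0):
    joints = joints_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([joints], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        reproj = ((project(joints, K) - joints_2d) ** 2).mean()  # 2D evidence term
        scene = torch.relu(floor_y - joints[:, 1]).mean()        # toy scene constraint
        (reproj + w_scene * scene).backward()
        opt.step()
    return joints.detach()

# Example with random targets and simple intrinsics.
K = torch.tensor([[500.0, 0.0, 112.0], [0.0, 500.0, 112.0], [0.0, 0.0, 1.0]])
fitted = fit_pose(torch.rand(17, 3) + torch.tensor([0.0, 0.0, 2.0]),
                  torch.rand(17, 2) * 224, K)
```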

    EgoHumans: An Egocentric 3D Multi-Human Benchmark

    We present EgoHumans, a new multi-view, multi-human video benchmark to advance the state of the art in egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture a single subject or cover indoor-only scenarios, which limits the generalization of computer vision algorithms to real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild, with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities such as playing tennis, fencing, and volleyball. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images spanning diverse scenes, with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. To address these limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the EgoHumans dataset. (Accepted to ICCV 2023, Oral.)
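    The EgoFormer architecture is only summarized in the abstract; the sketch below shows a generic multi-stream transformer of the kind described, where per-camera token streams receive a learned stream embedding and are fused by a shared encoder before a pose head. Dimensions, layer counts, and names are assumptions, not the published design.

```python
# Generic multi-stream transformer sketch (not EgoFormer itself): fuse
# per-view feature streams and regress 3D joints from the pooled tokens.
import torch
import torch.nn as nn

class MultiStreamPoseTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_streams=4, num_joints=17):
        super().__init__()
        self.stream_embed = nn.Embedding(num_streams, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)

    def forward(self, streams):                    # streams: (B, S, T, D)
        b, s, t, d = streams.shape
        ids = torch.arange(s, device=streams.device)
        tokens = streams + self.stream_embed(ids)[None, :, None, :]
        tokens = tokens.reshape(b, s * t, d)       # one token sequence per sample
        fused = self.encoder(tokens).mean(dim=1)   # pool over all tokens
        return self.pose_head(fused).view(b, -1, 3)

poses = MultiStreamPoseTransformer()(torch.randn(2, 4, 8, 256))  # (2, 17, 3)
```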

    Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions

    We present a novel deep learning approach for addressing the problem of interaction recognition from a first-person perspective. The approach uses a pair of convolutional neural networks, whose parameters are shared, for extracting frame-level features from successive frames of the video. The frame-level features are then aggregated using a convolutional long short-term memory, whose final hidden state is used for classification into the respective categories. In our network, the spatio-temporal structure of the input is preserved until the very final processing stage. Experimental results show that our method outperforms the state of the art on recent first-person interaction datasets that involve complex ego-motion. On UTKinect, it competes with methods that use depth images and skeletal joint information along with RGB images, while it surpasses previous methods that use only RGB images by more than 20% in recognition accuracy.
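    A minimal sketch of the described pipeline follows: a shared frame-level CNN, a small generic convolutional LSTM cell (PyTorch has no built-in one), and a classifier on the final hidden state. The toy backbone and channel sizes are assumptions rather than the authors' configuration.

```python
# Sketch: shared per-frame CNN features -> convolutional LSTM -> classify
# the final hidden state. The ConvLSTM cell is a minimal generic version.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g
        h = o * c.tanh()
        return h, c

class FirstPersonInteractionNet(nn.Module):
    def __init__(self, num_classes=8, hid_ch=64):
        super().__init__()
        # Shared frame-level feature extractor; spatial structure is preserved.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.convlstm = ConvLSTMCell(64, hid_ch)
        self.classifier = nn.Linear(hid_ch, num_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        h = c = None
        for step in range(t):
            feat = self.cnn(video[:, step])        # (B, 64, H/4, W/4)
            if h is None:
                h = feat.new_zeros(b, self.convlstm.hid_ch, *feat.shape[2:])
                c = torch.zeros_like(h)
            h, c = self.convlstm(feat, (h, c))
        pooled = h.mean(dim=(2, 3))                # pool the final hidden state
        return self.classifier(pooled)

logits = FirstPersonInteractionNet()(torch.randn(2, 6, 3, 64, 64))
```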

    Moving-Camera Video Content Analysis via Action Recognition and Homography Transformation

    Moving-camera video content analysis aims at interpreting useful information in videos taken by moving cameras, including wearable cameras and handheld cameras. It is an essential problem in computer vision and plays an important role in many real-life applications, including understanding social difficulties and enhancing public security. In this work, we study three sub-problems of moving-camera video content analysis: two concern wearable-camera videos, a special type of moving-camera video, namely recognizing general actions and recognizing microactions; the third is estimating homographies along moving-camera videos.

    Recognizing general actions in wearable-camera videos is challenging because the motion features extracted from videos of the same action may show large variation and inconsistency, mixed with the complex and non-stop motion of the camera. It is very difficult to collect sufficient videos to cover all such variations and use them to train action classifiers with good generalization ability. To address this, we develop a new approach that trains action classifiers on a relatively smaller set of fixed-camera videos with different views and then applies them to recognize actions in wearable-camera videos. We conduct experiments by training on a set of fixed-camera videos and testing on a set of wearable-camera videos, with very promising results.

    Microactions, such as small hand or head movements, can be difficult to recognize in practice, especially in wearable-camera videos, because only subtle body motion is present. To address this, we propose a new deep-learning-based method to effectively learn mid-layer CNN features for enhancing microaction recognition. More specifically, we develop a dual-branch network for microaction recognition: one branch uses high-layer CNN features for classification, and the second branch, with a novel subtle motion detector, further explores mid-layer CNN features for classification. In the experiments, we build a new microaction video dataset in which the micromotions of interest are mixed with other larger general motions such as walking. Comprehensive experimental results verify that the proposed method yields new state-of-the-art performance on two microaction video datasets, while its performance on two general-action video datasets is also very promising.

    Homography is the invertible mapping between two images of the same planar surface. When estimating homographies along moving-camera videos, estimation between non-adjacent frames can be very challenging when their camera view angles differ greatly. To handle this, we propose a new deep-learning-based method for homography estimation along videos that exploits temporal dynamics across frames. More specifically, we develop a recurrent convolutional regression network consisting of a convolutional neural network and a recurrent neural network with long short-term memory cells, followed by a regression layer that estimates the parameters of the homography. In the experiments, we introduce a new approach to synthesize videos with known ground-truth homographies, and we evaluate the proposed method on both synthesized and real-world videos with good results.
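    To make the third sub-problem concrete, here is a hedged sketch of a recurrent convolutional regressor for homography parameters: a small CNN encodes each stacked frame pair, an LSTM carries temporal context along the video, and a regression layer predicts the eight free homography parameters. All layer sizes and names are assumptions, not the thesis configuration.

```python
# Illustrative recurrent convolutional regression network for per-step
# homography parameters along a video (8 values, with H[2,2] fixed to 1).
import torch
import torch.nn as nn

class RecurrentHomographyRegressor(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(                # input: stacked frame pair (6 ch)
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> (B*T, 64)
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.regress = nn.Linear(hidden, 8)          # 8 free homography parameters

    def forward(self, frame_pairs):                  # (B, T, 6, H, W)
        b, t = frame_pairs.shape[:2]
        feats = self.encoder(frame_pairs.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)                    # temporal context per step
        return self.regress(out)                     # (B, T, 8)

params = RecurrentHomographyRegressor()(torch.randn(2, 5, 6, 128, 128))
```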

    Action and Interaction Recognition in First-Person Videos
