51 research outputs found
Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image
We propose a unified formulation for the problem of 3D human pose estimation
from a single raw RGB image that reasons jointly about 2D joint estimation and
3D pose reconstruction to improve both tasks. We take an integrated approach
that fuses probabilistic knowledge of 3D human pose with a multi-stage CNN
architecture and uses the knowledge of plausible 3D landmark locations to
refine the search for better 2D locations. The entire process is trained
end-to-end, is extremely efficient, and obtains state-of-the-art results on
Human3.6M, outperforming previous approaches on both 2D and 3D errors. Comment: Paper presented at CVPR 1
Exploiting temporal information for 3D pose estimation
In this work, we address the problem of 3D human pose estimation from a
sequence of 2D human poses. Although the recent success of deep networks has
led many state-of-the-art methods for 3D pose estimation to train deep networks
end-to-end to predict from images directly, the top-performing approaches have
shown the effectiveness of dividing the task of 3D pose estimation into two
steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from
images and then mapping them into 3D space. They also showed that a
low-dimensional representation like 2D locations of a set of joints can be
discriminative enough to estimate 3D pose with high accuracy. However,
estimation of 3D pose for individual frames leads to temporally incoherent
estimates due to independent error in each frame causing jitter. Therefore, in
this work we utilize the temporal information across a sequence of 2D joint
locations to estimate a sequence of 3D poses. We designed a
sequence-to-sequence network composed of layer-normalized LSTM units with
shortcut connections connecting the input to the output on the decoder side and
imposed a temporal smoothness constraint during training. We found that the
knowledge of temporal consistency improves on the best reported result on the
Human3.6M dataset and helps our network to recover temporally consistent 3D
poses over a sequence of images even when the 2D pose detector fails.
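The temporal smoothness constraint described above can be sketched as a penalty on frame-to-frame differences in the predicted 3D joint sequence. This is a minimal illustration, not the authors' exact loss; the array shapes are assumptions:

```python
import numpy as np

def temporal_smoothness_loss(poses_3d):
    """Penalize frame-to-frame jitter in a sequence of 3D poses.

    poses_3d: array of shape (T, J, 3) -- T frames, J joints, xyz.
    Returns the mean squared difference between consecutive frames.
    """
    diffs = poses_3d[1:] - poses_3d[:-1]  # (T-1, J, 3) per-frame motion
    return float(np.mean(diffs ** 2))

# A perfectly static sequence incurs no penalty; a jittery one does.
static = np.zeros((5, 17, 3))
jitter = np.random.default_rng(0).normal(size=(5, 17, 3))
print(temporal_smoothness_loss(static))        # 0.0
print(temporal_smoothness_loss(jitter) > 0.0)  # True
```

Added to the main pose loss with a small weight, such a term trades a little per-frame accuracy for temporally coherent output.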
Motion capture based on RGBD data from multiple sensors for avatar animation
With recent advances in technology and emergence of affordable RGB-D sensors for a
wider range of users, markerless motion capture has become an active field of research
both in computer vision and computer graphics.
In this thesis, we designed a POC (Proof of Concept) for a new tool that enables us
to perform motion capture by using a variable number of commodity RGB-D sensors of
different brands and technical specifications on constraint-less layout environments. The
main goal of this work is to provide a tool with motion capture capabilities by using a
handful of RGB-D sensors, without imposing strong requirements in terms of lighting,
background or extension of the motion capture area. Of course, the number of RGB-D
sensors needed is inversely proportional to their resolution, and directly proportional to
the size of the area to track.
Built on top of the OpenNI 2 library, we made this POC compatible with most of the
non-high-end RGB-D sensors currently available on the market. Because a single
computer lacks the resources to support more than a couple of sensors working
simultaneously, we need a setup composed of multiple computers. To keep data coherency
and synchronization across sensors and computers, our tool makes use of a semi-automatic
calibration method and a message-oriented network protocol.
From color and depth data given by a sensor, we can also obtain a 3D pointcloud representation
of the environment. By combining pointclouds from multiple sensors, we can
collect a complete and animated 3D pointcloud that can be visualized from any viewpoint.
Given a 3D avatar model and its corresponding attached skeleton, we can use an
iterative optimization method (e.g. Simplex) to find a fit between each pointcloud frame
and a skeleton configuration, resulting in 3D avatar animation when using such skeleton
configurations as key frames.
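The fitting step above can be illustrated with SciPy's Nelder-Mead (downhill simplex) optimizer. Here, for brevity, only a 3D root translation of a toy skeleton is fitted to a point cloud; the thesis optimizes a full skeleton configuration, and the function and variable names below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def fit_root_translation(pointcloud, template_joints):
    """Find the translation that best aligns template joints to the cloud.

    pointcloud: (N, 3) points fused from the RGB-D sensors.
    template_joints: (J, 3) joint positions of the avatar skeleton.
    Returns the optimal (3,) translation vector.
    """
    def cost(t):
        joints = template_joints + t  # translated skeleton
        # Distance from each joint to its nearest cloud point.
        d = np.linalg.norm(joints[:, None, :] - pointcloud[None, :, :], axis=2)
        return d.min(axis=1).sum()

    res = minimize(cost, x0=np.zeros(3), method="Nelder-Mead")
    return res.x

# Toy example: the cloud is the template shifted by (1, 2, 0).
template = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]])
cloud = template + np.array([1.0, 2.0, 0.0])
t = fit_root_translation(cloud, template)
print(np.round(t, 2))  # approximately [1. 2. 0.]
```

Running this per point-cloud frame yields the sequence of skeleton configurations used as animation key frames.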
Automatic visual detection of human behavior: a review from 2000 to 2014
Due to advances in information technology (e.g., digital video cameras, ubiquitous sensors), the automatic detection of human behaviors from video is a very recent research topic. In this paper, we perform a systematic and recent literature review on this topic, from 2000 to 2014, covering a selection of 193 papers that were searched from six major scientific publishers. The selected papers were classified into three main subjects: detection techniques, datasets and applications. The detection techniques were divided into four categories (initialization, tracking, pose estimation and recognition). The list of datasets includes eight examples (e.g., Hollywood action). Finally, several application areas were identified, including human detection, abnormal activity detection, action recognition, player modeling and pedestrian detection. Our analysis provides a road map to guide future research for designing automatic visual human behavior detection systems. This work is funded by the Portuguese Foundation for Science and Technology (FCT - Fundacao para a Ciencia e a Tecnologia) under research Grant SFRH/BD/84939/2012.
Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body Reconstruction
While 3D body reconstruction methods have made remarkable progress recently,
it remains difficult to acquire the sufficiently accurate and numerous 3D
supervisions required for training. In this paper, we propose \textbf{KNOWN}, a
framework that effectively utilizes body \textbf{KNOW}ledge and
u\textbf{N}certainty modeling to compensate for insufficient 3D supervisions.
KNOWN exploits a comprehensive set of generic body constraints derived from
well-established body knowledge. These generic constraints precisely and
explicitly characterize the reconstruction plausibility and enable 3D
reconstruction models to be trained without any 3D data. Moreover, existing
methods typically use images from multiple datasets during training, which can
result in data noise (\textit{e.g.}, inconsistent joint annotation) and data
imbalance (\textit{e.g.}, minority images representing unusual poses or
captured from challenging camera views). KNOWN solves these problems through a
novel probabilistic framework that models both aleatoric and epistemic
uncertainty. Aleatoric uncertainty is encoded in a robust Negative
Log-Likelihood (NLL) training loss, while epistemic uncertainty is used to
guide model refinement. Experiments demonstrate that KNOWN's body
reconstruction outperforms prior weakly-supervised approaches, particularly on
the challenging minority images. Comment: ICCV 202
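A robust NLL loss of the kind described above can be sketched as follows: the network predicts both a mean and a (log-)variance per target, and the Gaussian negative log-likelihood down-weights residuals on high-uncertainty targets such as noisy annotations. This is a generic illustration of an aleatoric-uncertainty NLL, not KNOWN's exact formulation:

```python
import numpy as np

def gaussian_nll(pred_mean, pred_log_var, target):
    """Gaussian negative log-likelihood with predicted aleatoric variance.

    pred_log_var is the predicted log-variance; a large value divides
    down the squared residual, so noisy annotations contribute less,
    at the cost of the additive log-variance penalty.
    """
    var = np.exp(pred_log_var)
    return float(np.mean(0.5 * (np.log(2 * np.pi) + pred_log_var
                                + (target - pred_mean) ** 2 / var)))

mean = np.zeros(2)
target = np.full(2, 3.0)  # large residual, e.g. an inconsistent annotation
confident = gaussian_nll(mean, np.zeros(2), target)
uncertain = gaussian_nll(mean, np.full(2, 2.0), target)
print(uncertain < confident)  # True: high variance softens a large error
```

The log-variance penalty keeps the network from predicting large variance everywhere: inflating the variance only pays off where the residual is genuinely large.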
3D Face Reconstruction: the Road to Forensics
3D face reconstruction algorithms from images and videos are applied to many
fields, from plastic surgery to the entertainment sector, thanks to their
advantageous features. However, when looking at forensic applications, 3D face
reconstruction must observe strict requirements that still make its possible
role in bringing evidence to a lawsuit unclear. An extensive investigation of
the constraints, potential, and limits of its application in forensics is still
missing. Shedding some light on this matter is the goal of the present survey,
which starts by clarifying the relation between forensic applications and
biometrics, with a focus on face recognition. Therefore, it provides an
analysis of the achievements of 3D face reconstruction algorithms from
surveillance videos and mugshot images and discusses the current obstacles that
separate 3D face reconstruction from an active role in forensic applications.
Finally, it examines the underlying data sets, with their advantages and
limitations, while proposing alternatives that could substitute or complement
them. Comment: The manuscript has been accepted for publication in ACM Computing
Surveys. arXiv admin note: text overlap with arXiv:2303.1116
A theoretical eye model for uncalibrated real-time eye gaze estimation
Computer vision systems that monitor human activity can be utilized for many diverse applications. Some general applications stemming from such activity monitoring are surveillance, human-computer interfaces, aids for the handicapped, and virtual reality environments. For most of these applications, a non-intrusive system is desirable, either for reasons of covertness or comfort. Also desirable is generality across users, especially for human-computer interfaces and surveillance. This thesis presents a method of gaze estimation that, without calibration, determines a relatively unconstrained user’s overall horizontal eye gaze. Utilizing anthropometric data and physiological models, a simple, yet general eye model is presented. The equations that describe the gaze angle of the eye in this model are presented. The procedure for choosing the proper features for gaze estimation is detailed and the algorithms utilized to find these points are described. Results from manual and automatic feature extraction are presented and analyzed. The error observed from this model is around 3° and the error observed from the implementation is around 6°. This amount of error is comparable to that of previous eye gaze estimation algorithms, which validates the model. The results presented across a set of subjects display consistency, which supports the generality of this model. A real-time implementation that operates at around 17 frames per second demonstrates the efficiency of the algorithms implemented. While there are many interesting directions for future work, the goals of this thesis were achieved.
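An anthropometric eye model of this general kind can be sketched geometrically: treating the eyeball as a sphere of fixed radius, the horizontal gaze angle follows from the iris center's lateral offset from its rest position. This is a hypothetical illustration; the thesis' actual model, features, and constants may differ:

```python
import math

# Approximate anthropometric eyeball radius in millimetres (an assumption).
EYEBALL_RADIUS_MM = 12.0

def horizontal_gaze_angle(iris_offset_mm):
    """Horizontal gaze angle in degrees on a spherical eyeball model.

    The iris center moves on the eyeball sphere, so its projected
    lateral offset from the rest position is r * sin(theta).
    """
    return math.degrees(math.asin(iris_offset_mm / EYEBALL_RADIUS_MM))

print(horizontal_gaze_angle(0.0))            # 0.0 -> looking straight ahead
print(round(horizontal_gaze_angle(6.0), 1))  # 30.0 degrees
```

Because the mapping depends only on population-level anatomy rather than per-user parameters, a model like this needs no calibration, at the cost of a few degrees of error.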