1,973 research outputs found
A Review and Analysis of Eye-Gaze Estimation Systems, Algorithms and Performance Evaluation Methods in Consumer Platforms
In this paper, a review is presented of the research on eye gaze estimation
techniques and applications, which has progressed in diverse ways over the past
two decades. Several generic eye gaze use-cases are identified: desktop, TV,
head-mounted, automotive and handheld devices. Analysis of the literature leads
to the identification of several platform specific factors that influence gaze
tracking accuracy. A key outcome from this review is the realization of a need
to develop standardized methodologies for performance evaluation of gaze
tracking systems and achieve consistency in their specification and comparative
evaluation. To address this need, the concept of a methodological framework for
practical evaluation of different gaze tracking systems is proposed.
Comment: 25 pages, 13 figures, Accepted for publication in IEEE Access in July 201
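A standardized evaluation framework of the kind the review calls for would need an agreed accuracy metric. The most common one in the gaze literature is the angular error between estimated and ground-truth gaze directions; a minimal sketch of that metric (the function name is illustrative, not taken from the paper):

```python
import numpy as np

def angular_error_deg(gaze_est, gaze_true):
    """Mean angular error in degrees between estimated and ground-truth
    gaze vectors, given as (N, 3) arrays of 3D directions."""
    # Normalize both sets of vectors to unit length.
    est = gaze_est / np.linalg.norm(gaze_est, axis=1, keepdims=True)
    true = gaze_true / np.linalg.norm(gaze_true, axis=1, keepdims=True)
    # Clamp the dot product to [-1, 1] to guard against rounding error.
    cos = np.clip(np.sum(est * true, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

Reporting this metric alongside the platform-specific factors identified in the review (head pose range, distance, screen size) is one way to make systems comparable across desktop, TV, head-mounted, automotive and handheld use-cases.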
Person Recognition in Personal Photo Collections
Recognising persons in everyday photos presents major challenges (occluded
faces, different clothing, locations, etc.) for machine vision. We propose a
convnet-based person recognition system and provide an in-depth analysis of
the informativeness of different body cues, the impact of training data,
and the common failure modes of the system. In addition, we discuss the
limitations of existing benchmarks and propose more challenging ones. Our
method is simple and is built on open source and open data, yet it improves the
state-of-the-art results on a large dataset of social media photos (PIPA).
Comment: Accepted to ICCV 2015, revised
RGBD Datasets: Past, Present and Future
Since the launch of the Microsoft Kinect, scores of RGBD datasets have been
released. These have propelled advances in areas from reconstruction to gesture
recognition. In this paper we explore the field, reviewing datasets across
eight categories: semantics, object pose estimation, camera tracking, scene
reconstruction, object tracking, human actions, faces and identification. By
extracting relevant information in each category we help researchers to find
appropriate data for their needs, and we consider which datasets have succeeded
in driving computer vision forward and why.
Finally, we examine the future of RGBD datasets. We identify key areas which
are currently underexplored, and suggest that future directions may include
synthetic data and dense reconstructions of static and dynamic scenes.
Comment: 8 pages excluding references (CVPR style)
VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera
We present the first real-time method to capture the full global 3D skeletal
pose of a human in a stable, temporally consistent manner using a single RGB
camera. Our method combines a new convolutional neural network (CNN) based pose
regressor with kinematic skeleton fitting. Our novel fully-convolutional pose
formulation regresses 2D and 3D joint positions jointly in real time and does
not require tightly cropped input frames. A real-time kinematic skeleton
fitting method uses the CNN output to yield temporally stable 3D global pose
reconstructions on the basis of a coherent kinematic skeleton. This makes our
approach the first monocular RGB method usable in real-time applications such
as 3D character control---thus far, the only monocular methods for such
applications employed specialized RGB-D cameras. Our method's accuracy is
quantitatively on par with the best offline 3D monocular RGB pose estimation
methods. Our results are qualitatively comparable to, and sometimes better
than, results from monocular RGB-D approaches, such as the Kinect. However, we
show that our approach is more broadly applicable than RGB-D solutions, i.e., it
works for outdoor scenes, community videos, and low quality commodity RGB
cameras.
Comment: Accepted to SIGGRAPH 201
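The kinematic skeleton fitting step described above can be pictured, in highly simplified form, as minimizing an energy that pulls the fitted pose toward the CNN's 3D prediction while penalizing frame-to-frame jumps. A toy sketch of the temporal-smoothness part only (the full method also includes 2D reprojection and bone-length terms; `lam` and the closed form below are illustrative assumptions, not the paper's exact energy):

```python
import numpy as np

def fit_skeleton_step(p3d_cnn, x_prev, lam=0.5):
    """One temporally smoothed fitting step: closed-form minimizer of
        ||x - p3d_cnn||^2 + lam * ||x - x_prev||^2
    over joint positions x, i.e. a weighted blend of the CNN's 3D
    prediction and the previous frame's fitted pose."""
    return (p3d_cnn + lam * x_prev) / (1.0 + lam)
```

Differentiating the quadratic energy and setting the gradient to zero gives the blend directly, which is why such a step can run in real time inside the tracking loop.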
Towards higher sense of presence: a 3D virtual environment adaptable to confusion and engagement
Virtual Reality scenarios where emitters convey information to receptors can be used as a tool for distance learning and to enable virtual visits to company physical headquarters. However, immersive Virtual Reality setups usually require visualization interfaces such as Head-mounted Displays, Powerwalls or CAVE systems, supported by interaction devices (Microsoft Kinect, Wii Motion, among others) that foster natural interaction but are often inaccessible to users. We propose a virtual presentation scenario, supported by a framework, that provides emotion-driven interaction through ubiquitous devices.

An experiment with three conditions was designed, involving: a control condition; a text script made less confusing based on its lexical, syntactical, and bigram features; and a third condition where an adaptive lighting system dynamically acted on the user's engagement. Results show that users exposed to the less confusing script reported a higher sense of presence, albeit without statistical significance. Users from the last condition reported a lower sense of presence, contradicting our hypothesis, also without statistical significance. We theorize that, because the presentation was given orally while the adaptive lighting system impacts the visual channel, this conflict may have overloaded the users' cognitive capacity and thus reduced the resources available to process the presentation content.
Computationally efficient deformable 3D object tracking with a monocular RGB camera
182 p.
Monocular RGB cameras are present in most scopes and devices, including embedded environments like robots, cars and home automation. Most of these environments have in common a significant presence of human operators with whom the system has to interact. This context provides the motivation to use the captured monocular images to improve the understanding of the operator and the surrounding scene for more accurate results and applications.

However, monocular images do not have depth information, which is a crucial element in understanding the 3D scene correctly. Estimating the three-dimensional information of an object in the scene using a single two-dimensional image is already a challenge. The challenge grows if the object is deformable (e.g., a human body or a human face) and there is a need to track its movements and interactions in the scene. Several methods attempt to solve this task, including modern regression methods based on Deep Neural Networks. However, despite their great results, most are computationally demanding and therefore unsuitable for several environments. Computational efficiency is a critical feature for computationally constrained setups like the embedded or onboard systems present in robotics and automotive applications, among others.

This study proposes computationally efficient methodologies to reconstruct and track three-dimensional deformable objects, such as human faces and human bodies, using a single monocular RGB camera. To model the deformability of faces and bodies, it considers two types of deformations: non-rigid deformations for face tracking, and rigid multi-body deformations for body pose tracking. Furthermore, it studies their performance on computationally restricted devices like smartphones and the onboard systems used in the automotive industry. The information extracted from such devices gives valuable insight into human behaviour, a crucial element in improving human-machine interaction. We tested the proposed approaches in different challenging application fields like onboard driver monitoring systems, human behaviour analysis from monocular videos, and human face tracking on embedded devices.
Scene-aware Egocentric 3D Human Pose Estimation
Egocentric 3D human pose estimation with a single head-mounted fisheye camera
has recently attracted attention due to its numerous applications in virtual
and augmented reality. Existing methods still struggle in challenging poses
where the human body is highly occluded or is closely interacting with the
scene. To address this issue, we propose a scene-aware egocentric pose
estimation method that guides the prediction of the egocentric pose with scene
constraints. To this end, we propose an egocentric depth estimation network to
predict the scene depth map from a wide-view egocentric fisheye camera while
mitigating the occlusion of the human body with a depth-inpainting network.
Next, we propose a scene-aware pose estimation network that projects the 2D
image features and estimated depth map of the scene into a voxel space and
regresses the 3D pose with a V2V network. The voxel-based feature
representation provides the direct geometric connection between 2D image
features and scene geometry, and further facilitates the V2V network to
constrain the predicted pose based on the estimated scene geometry. To enable
the training of the aforementioned networks, we also generated a synthetic
dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called
EgoPW-Scene. The experimental results of our new evaluation sequences show that
the predicted 3D egocentric poses are accurate and physically plausible in
terms of human-scene interaction, demonstrating that our method outperforms the
state-of-the-art methods both quantitatively and qualitatively.
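The voxel-based lifting described above rests on back-projecting per-pixel information into a 3D grid using the camera intrinsics. A toy sketch of that geometric step for a depth map and a pinhole model (the grid extent, resolution, and function name are illustrative assumptions; the paper uses a fisheye model and projects learned 2D features, not raw occupancy):

```python
import numpy as np

def depth_to_voxels(depth, fx, fy, cx, cy, grid_size=32, extent=2.0):
    """Back-project a depth map (H x W, metres, 0 = invalid) into a coarse
    boolean occupancy grid, assuming a pinhole camera with intrinsics
    (fx, fy, cx, cy) and a scene volume of x, y in [-extent/2, extent/2]
    and z (depth) in [0, extent]."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth.ravel()
    valid = z > 0
    # Pinhole back-projection of each pixel to a 3D point.
    x = (us.ravel() - cx) * z / fx
    y = (vs.ravel() - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)[valid]
    # Map metric coordinates into [0, 1) per axis, then to voxel indices.
    norm = np.empty_like(pts)
    norm[:, 0] = pts[:, 0] / extent + 0.5
    norm[:, 1] = pts[:, 1] / extent + 0.5
    norm[:, 2] = pts[:, 2] / extent
    idx = np.floor(norm * grid_size).astype(int)
    keep = np.all((idx >= 0) & (idx < grid_size), axis=1)
    grid = np.zeros((grid_size,) * 3, dtype=bool)
    grid[tuple(idx[keep].T)] = True
    return grid
```

A V2V-style 3D network operating on such a grid sees image evidence and scene geometry in the same coordinate frame, which is what lets it constrain the regressed pose against the estimated scene surface.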