19 research outputs found
Fully Automatic Multi-Object Articulated Motion Tracking
Fully automatic, real-time tracking of articulated motion with a monocular RGB camera is a challenging problem that is essential for many virtual reality (VR) and human-computer interaction applications. In this paper, we present an algorithm for tracking multiple articulated objects from a monocular RGB image sequence. Our algorithm can be employed directly in practical applications because it is fully automatic, real-time, and temporally stable. It consists of the following stages: dynamic object counting, object-specific 3D skeleton generation, initial 3D pose estimation, and 3D skeleton fitting, which fits each 3D skeleton to the corresponding 2D body-part locations. In the skeleton fitting stage, the 3D pose of every object is estimated by maximizing an objective function that combines a skeleton fitting term with motion and pose priors. To illustrate the importance of our algorithm for practical applications, we present competitive results for real-time tracking of multiple humans. Our algorithm detects objects that enter or leave the scene and dynamically generates or deletes their 3D skeletons, which makes our monocular RGB method well suited to real-time applications. We show that our algorithm is applicable to tracking multiple objects in outdoor scenes, community videos, and low-quality videos captured with mobile-phone cameras.
Keywords: Multi-object motion tracking, Articulated motion capture, Deep learning, Anthropometric data, 3D pose estimation.
DOI: 10.7176/CEIS/12-1-01
Publication date: March 31st 202
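The skeleton fitting stage described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pinhole projection, the squared-error fitting term, the constant-position motion prior, the weight `w_motion`, and all function names are assumptions made for illustration, and the pose prior mentioned in the abstract is omitted.

```python
def project(point, focal=1.0):
    """Pinhole projection of a camera-space 3D point (x, y, z) to 2D."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def fitting_objective(joints3d, keypoints2d, prev_joints3d,
                      focal=1.0, w_motion=0.1):
    """Score a candidate 3D pose; higher is better (0 is the maximum).

    Assumed form: negative sum of a 2D reprojection error (fitting term)
    and a penalty on movement since the previous frame (motion prior).
    """
    fit = 0.0
    for joint, keypoint in zip(joints3d, keypoints2d):
        u, v = project(joint, focal)
        fit += (u - keypoint[0]) ** 2 + (v - keypoint[1]) ** 2

    motion = 0.0
    for joint, prev in zip(joints3d, prev_joints3d):
        motion += sum((a - b) ** 2 for a, b in zip(joint, prev))

    return -(fit + w_motion * motion)


# Usage: a pose that reprojects exactly onto the 2D detections and does
# not move scores higher than a perturbed pose.
previous = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
detections = [project(j) for j in previous]
perturbed = [(0.1, 0.0, 2.0), (1.0, 0.0, 2.0)]
best = fitting_objective(previous, detections, previous)
worse = fitting_objective(perturbed, detections, previous)
```

An optimizer would search over pose parameters to maximize this score for each tracked skeleton; the sketch only shows how the terms are combined.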
Simultaneous Hand Pose and Skeleton Bone-Lengths Estimation from a Single Depth Image
Articulated hand pose estimation is a challenging task for human-computer
interaction. The state-of-the-art hand pose estimation algorithms work only
with one or a few subjects for which they have been calibrated or trained.
Particularly, the hybrid methods based on learning followed by model fitting or
model based deep learning do not explicitly consider varying hand shapes and
sizes. In this work, we introduce a novel hybrid algorithm for estimating the
3D hand pose as well as bone-lengths of the hand skeleton at the same time,
from a single depth image. The proposed CNN architecture learns hand pose
parameters and scale parameters associated with the bone-lengths
simultaneously. Subsequently, a new hybrid forward kinematics layer employs
both parameters to estimate 3D joint positions of the hand. For end-to-end
training, we combine three public datasets (NYU, ICVL and MSRA-2015) into one
unified format to achieve large variation in hand shapes and sizes. Among
hybrid methods, our method shows improved accuracy over the state-of-the-art on
the combined dataset and the ICVL dataset, which contain multiple subjects. Also,
our algorithm is demonstrated to work well with unseen images.
Comment: This paper was accepted and presented at the 3DV-2017 conference
held at Qingdao, China. http://irc.cs.sdu.edu.cn/3dv
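The idea of a forward kinematics layer that consumes both pose parameters and bone-length scale parameters can be sketched as follows. This is a simplified planar chain for a single finger, not the paper's layer: the 2D setting, the template lengths, and the names `forward_kinematics`, `template_lengths` and `scales` are assumptions for illustration.

```python
import math


def forward_kinematics(angles, template_lengths, scales, root=(0.0, 0.0)):
    """Return joint positions of a planar kinematic chain.

    angles           -- relative joint angles in radians (pose parameters)
    template_lengths -- bone lengths of a reference skeleton
    scales           -- per-bone scale factors (shape parameters)
    """
    x, y = root
    heading = 0.0
    joints = [(x, y)]
    for angle, length, scale in zip(angles, template_lengths, scales):
        heading += angle                 # accumulate rotation along the chain
        bone = length * scale            # subject-specific bone length
        x += bone * math.cos(heading)
        y += bone * math.sin(heading)
        joints.append((x, y))
    return joints


# Usage: the same pose with a 1.5x larger hand moves the fingertip outward.
template = [4.0, 3.0, 2.0]
straight = [0.0, 0.0, 0.0]
small_hand = forward_kinematics(straight, template, [1.0, 1.0, 1.0])
large_hand = forward_kinematics(straight, template, [1.5, 1.5, 1.5])
```

Because every operation is differentiable with respect to both the angles and the scales, a layer of this kind can sit at the end of a network and be trained end to end on joint-position losses.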
Marker-less motion capture in general scenes with sparse multi-camera setups
Human motion-capture from videos is one of the fundamental problems in computer vision and computer graphics. Its applications can be found in a wide range of industries. Even with all the developments in the past years, industry and academia alike still rely on complex and expensive marker-based systems. Many state-of-the-art marker-less motion-capture methods come close to the performance of marker-based algorithms, but only when recording in highly controlled studio environments with exactly synchronized, static and sufficiently many cameras. Although this yields a simpler apparatus and a shorter setup time than marker-based systems, the hurdles towards practical application are still large and the costs are considerable. By being constrained to a controlled studio, marker-less methods fail to fully play out their advantage of being able to capture scenes without actively modifying them. In the area of marker-less human motion-capture, this thesis proposes several novel algorithms that simplify motion-capture so that it becomes applicable in new general outdoor scenes. The first is an optical multi-video synchronization method which estimates the synchronization parameters of multiple videos with subframe accuracy in general scenes. Then, we propose a spatio-temporal motion-capture method which uses the synchronization parameters for accurate motion-capture with unsynchronized cameras. Afterwards, we propose a motion-capture method that works with moving cameras and tracks multiple people even in front of cluttered and dynamic backgrounds. Finally, we reduce the number of cameras employed by proposing a novel motion-capture method which uses as few as two cameras to capture high-quality motion in general environments, even outdoors.
The methods proposed in this thesis can be adopted in many practical applications to achieve performance similar to that of complex motion-capture studios with a few consumer-grade cameras, such as mobile phones or GoPros, even for uncontrolled outdoor scenes.
Structure from Articulated Motion: Accurate and Stable Monocular 3D Reconstruction without Training Data
Recovery of articulated 3D structure from 2D observations is a challenging
computer vision problem with many applications. Current learning-based
approaches achieve state-of-the-art accuracy on public benchmarks but are
restricted to specific types of objects and motions covered by the training
datasets. Model-based approaches do not rely on training data but show lower
accuracy on these datasets. In this paper, we introduce a model-based method
called Structure from Articulated Motion (SfAM), which can recover multiple
object and motion types without training on extensive data collections. At the
same time, it performs on par with learning-based state-of-the-art approaches
on public benchmarks and outperforms previous non-rigid structure from motion
(NRSfM) methods. SfAM is built upon a general-purpose NRSfM technique while
integrating a soft spatio-temporal constraint on the bone lengths. We use an
alternating optimization strategy to recover optimal geometry (i.e., bone
proportions) together with 3D joint positions by enforcing bone-length
consistency over a series of frames. SfAM is highly robust to noisy 2D
annotations, generalizes to arbitrary objects and does not rely on training
data, which is shown in extensive experiments on public benchmarks and real
video sequences. We believe that it brings a new perspective on the domain of
monocular 3D recovery of articulated structures, including human motion
capture.
Comment: 21 pages, 8 figures, 2 tables
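The bone-length consistency constraint can be made concrete with a small sketch. It is not the SfAM formulation: the quadratic penalty, the mean-length update, and the names `update_bone_lengths` and `consistency_penalty` are assumptions chosen to show one half of an alternating scheme, in which the geometry is re-estimated while the per-frame 3D joints are held fixed.

```python
import math


def bone_lengths(frame, bones):
    """Length of every bone (pair of joint indices) in one frame."""
    return [math.dist(frame[i], frame[j]) for i, j in bones]


def update_bone_lengths(frames, bones):
    """Geometry step: the per-bone mean over all frames.

    For a quadratic penalty, the mean is the length that minimizes the
    deviation summed over frames.
    """
    per_frame = [bone_lengths(frame, bones) for frame in frames]
    return [sum(column) / len(column) for column in zip(*per_frame)]


def consistency_penalty(frames, bones, target_lengths):
    """Soft constraint: squared deviation from the shared bone lengths."""
    return sum(
        (length - target) ** 2
        for frame in frames
        for length, target in zip(bone_lengths(frame, bones), target_lengths)
    )


# Usage: one bone whose reconstructed length drifts from 1 to 3.
bones = [(0, 1)]
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 3.0, 0.0)],
]
shared = update_bone_lengths(frames, bones)          # [2.0]
penalty = consistency_penalty(frames, bones, shared)  # 2.0
```

The other half of the alternation would update the 3D joint positions with the shared lengths fixed, trading this penalty off against the reprojection error.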
ShapeGraFormer: GraFormer-Based Network for Hand-Object Reconstruction from a Single Depth Map
3D reconstruction of hand-object manipulations is important for emulating
human actions. Most methods dealing with challenging object manipulation
scenarios focus on hand reconstruction in isolation, ignoring physical and
kinematic constraints due to object contact. Some approaches produce more
realistic results by jointly reconstructing 3D hand-object interactions.
However, they focus on coarse pose estimation or rely upon known hand and
object shapes. We propose the first approach for realistic 3D hand-object shape
and pose reconstruction from a single depth map. Unlike previous work, our
voxel-based reconstruction network regresses the vertex coordinates of a hand
and an object and reconstructs more realistic interaction. Our pipeline
additionally predicts voxelized hand-object shapes, having a one-to-one mapping
to the input voxelized depth. Thereafter, we exploit the graph nature of the
hand and object shapes by utilizing the recent GraFormer network with
positional embedding to reconstruct shapes from template meshes. In addition,
we show the impact of adding another GraFormer component that refines the
reconstructed shapes based on the hand-object interactions and its ability to
reconstruct more accurate object shapes. We perform an extensive evaluation on
the HO-3D and DexYCB datasets and show that our method outperforms existing
approaches in hand reconstruction and produces plausible reconstructions for
the object.
HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation from a Single Depth Map
3D hand shape and pose estimation from a single depth map is a new and
challenging computer vision problem with many applications. The
state-of-the-art methods directly regress 3D hand meshes from 2D depth images
via 2D convolutional neural networks, which leads to artefacts in the
estimations due to perspective distortions in the images. In contrast, we
propose a novel architecture with 3D convolutions trained in a
weakly-supervised manner. The input to our method is a 3D voxelized depth map,
and we rely on two hand shape representations. The first one is the 3D
voxelized grid of the shape which is accurate but does not preserve the mesh
topology and the number of mesh vertices. The second representation is the 3D
hand surface which is less accurate but does not suffer from the limitations of
the first representation. We combine the advantages of these two
representations by registering the hand surface to the voxelized hand shape. In
extensive experiments, the proposed approach improves over the state of the
art by 47.8% on the SynHand5M dataset. Moreover, our augmentation policy for
voxelized depth maps further enhances the accuracy of 3D hand pose estimation
on real data. Our method produces visually more reasonable and realistic hand
shapes on the NYU and BigHand2.2M datasets than the existing approaches.
Comment: 10 pages, 8 figures, 5 tables, CVP
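To make the notion of a 3D voxelized depth map concrete, the sketch below back-projects a depth image to a point cloud and marks the voxels that contain points. It is an illustrative occupancy-grid voxelization under assumed pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) and an assumed cubic grid; the paper's exact voxelization and augmentation are not reproduced here.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a depth image (rows of depths, 0 = no reading) into 3D points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def voxelize(points, origin, voxel_size, resolution):
    """Binary occupancy grid of size resolution^3 anchored at origin."""
    grid = [[[0] * resolution for _ in range(resolution)]
            for _ in range(resolution)]
    for point in points:
        index = [int((c - o) // voxel_size) for c, o in zip(point, origin)]
        if all(0 <= i < resolution for i in index):   # drop points outside
            grid[index[0]][index[1]][index[2]] = 1
    return grid


# Usage: a 2x2 depth image with two valid pixels.
depth = [[0.0, 1.0],
         [2.0, 0.0]]
points = backproject(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
grid = voxelize(points, origin=(0.0, 0.0, 0.0), voxel_size=1.0, resolution=3)
occupied = sum(cell for plane in grid for row in plane for cell in row)
```

Working in this metric 3D grid rather than on the 2D image is what lets 3D convolutions sidestep the perspective distortions the abstract refers to.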
WHSP-Net: A Weakly-Supervised Approach for 3D Hand Shape and Pose Recovery from a Single Depth Image
Hand shape and pose recovery is essential for many computer vision applications such as animation of a personalized hand mesh in a virtual environment. Although there are many hand pose estimation methods, only a few deep-learning-based algorithms target 3D hand shape and pose from a single RGB or depth image. Jointly estimating hand shape and pose is very challenging because none of the existing real benchmarks provides ground-truth hand shape. For this reason, we propose a novel weakly-supervised approach for 3D hand shape and pose recovery (named WHSP-Net) from a single depth image by learning shapes from unlabeled real data and labeled synthetic data. To this end, we propose a framework which consists of three novel components. The first is a Convolutional Neural Network (CNN) based deep network which produces 3D joint positions from learned 3D bone vectors using a new layer. The second is a shape decoder that recovers a dense 3D hand mesh from sparse joints. The third is a depth synthesizer which reconstructs a 2D depth image from the 3D hand mesh. The whole pipeline is fine-tuned in an end-to-end manner. We demonstrate that our approach recovers reasonable hand shapes from real-world datasets as well as from the live stream of a depth camera in real time. Our algorithm outperforms state-of-the-art methods that output more than the joint positions and shows competitive performance on the 3D pose estimation task.
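One plausible reading of "3D joint positions from learned 3D bone vectors" is an accumulation of bone vectors along the kinematic tree. The sketch below shows that reading only; the parent-index encoding, the ordering assumption, and the name `joints_from_bone_vectors` are hypothetical and not taken from the paper.

```python
def joints_from_bone_vectors(root, parents, bone_vectors):
    """Accumulate bone vectors from the root to get joint positions.

    parents[k] is the index of joint k's parent (-1 for the root), and
    joints are assumed to be ordered so that parents[k] < k.
    bone_vectors[k - 1] points from joint k's parent to joint k.
    """
    joints = [tuple(root)]
    for parent, bone in zip(parents[1:], bone_vectors):
        base = joints[parent]
        joints.append(tuple(b + d for b, d in zip(base, bone)))
    return joints


# Usage: a wrist (joint 0) with a two-bone finger and a one-bone thumb.
parents = [-1, 0, 1, 0]
bone_vectors = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
joints = joints_from_bone_vectors((0.0, 0.0, 0.0), parents, bone_vectors)
```

Predicting bone vectors instead of absolute positions keeps each output local to one bone, and the summation is differentiable, so a loss on the joints can still train the network.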
HandVoxNet++: 3D Hand Shape and Pose Estimation using Voxel-Based Neural Networks
3D hand shape and pose estimation from a single depth map is a new and
challenging computer vision problem with many applications. Existing methods
addressing it directly regress hand meshes via 2D convolutional neural
networks, which leads to artefacts due to perspective distortions in the
images. To address the limitations of the existing methods, we develop
HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions
trained in a fully supervised manner. The input to our network is a 3D
voxelized depth map based on the truncated signed distance function (TSDF).
HandVoxNet++ relies on two hand shape representations. The first one is the 3D
voxelized grid of hand shape, which is the most accurate representation but
does not preserve the mesh topology. The second representation is the
hand surface that preserves the mesh topology. We combine the advantages of
both representations by aligning the hand surface to the voxelized hand shape
either with a new neural Graph-Convolutions-based Mesh Registration
(GCN-MeshReg) or classical segment-wise Non-Rigid Gravitational Approach
(NRGA++) which does not rely on training data. In extensive evaluations on
three public benchmarks, i.e., SynHand5M, depth-based HANDS19 challenge and
HO-3D, the proposed HandVoxNet++ achieves state-of-the-art performance. In this
journal extension of our previous approach presented at CVPR 2020, we gain
41.09% and 13.7% higher shape alignment accuracy on SynHand5M and HANDS19
datasets, respectively. Our method is ranked first on the HANDS19 challenge
dataset (Task 1: Depth-Based 3D Hand Pose Estimation) at the moment of the
submission of our results to the portal in August 2020.
Comment: 13 pages, 6 tables, 7 figures; project webpage:
http://4dqv.mpi-inf.mpg.de/HandVoxNet++/. arXiv admin note: text overlap with
arXiv:2004.0158
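A truncated signed distance function stores, for each voxel, a clamped distance to the observed surface. The sketch below uses the common projective convention along a single camera ray (positive in front of the surface, negative behind it, normalized to [-1, 1]); the truncation value and function names are assumptions, and the exact TSDF variant used by HandVoxNet++ may differ.

```python
def truncated_sdf(surface_depth, voxel_depth, truncation):
    """Projective TSDF value of one voxel on a camera ray."""
    signed_distance = surface_depth - voxel_depth
    return max(-1.0, min(1.0, signed_distance / truncation))


def tsdf_along_ray(surface_depth, voxel_depths, truncation):
    """TSDF values for voxels sampled at increasing depth along one ray."""
    return [truncated_sdf(surface_depth, z, truncation) for z in voxel_depths]


# Usage: a surface observed at depth 1.0 with a truncation band of 0.2.
values = tsdf_along_ray(1.0, [0.5, 0.9, 1.0, 1.1, 2.0], truncation=0.2)
# Voxels far in front clamp to +1, far behind clamp to -1, and the value
# crosses zero at the surface.
```

Compared with a binary occupancy grid, the signed values tell a 3D network how far each voxel is from the surface and on which side it lies.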