Learning 3D Human Pose from Structure and Motion

Authors: A Newell; C Ionescu; C Sminchisescu; D Mehta; F Bogo; J Chen; L Herda; M Loper; MJ Park; N Sarafianos; R Urtasun; S Li; T Alldieck; V Ramakrishna; X Wei; X Zhou; Z-H Zhou
Publication date: 03/07/2018

3D human pose estimation from a single image is a challenging problem, especially for in-the-wild settings, due to the lack of 3D annotated data. We propose two anatomically inspired loss functions and use them with a weakly-supervised learning framework to jointly learn from large-scale in-the-wild 2D and indoor/synthetic 3D data. We also present a simple temporal network that exploits temporal and structural cues present in predicted pose sequences to temporally harmonize the pose estimations. We carefully analyze the proposed contributions through loss surface visualizations and sensitivity analysis to facilitate deeper understanding of their working mechanism. Our complete pipeline improves the state-of-the-art by 11.8% and 12% on Human3.6M and MPI-INF-3DHP, respectively, and runs at 30 FPS on a commodity graphics card.

Comment: ECCV 2018. Project page: https://www.cse.iitb.ac.in/~rdabral/3DPose
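
The abstract does not spell out its two anatomically inspired losses. As an illustration only, the sketch below shows one plausible loss of this kind: a penalty on left/right bone-length asymmetry in a predicted 3D pose. The joint names, the choice of bone pairs, and the squared-difference form are assumptions made for this example, not details taken from the paper.

```python
import math

# Hypothetical left/right bone pairs; each bone is a (parent, child) joint pair.
# These names are illustrative and do not come from the paper.
SYMMETRIC_BONES = [
    (("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow")),
    (("l_elbow", "l_wrist"), ("r_elbow", "r_wrist")),
    (("l_hip", "l_knee"), ("r_hip", "r_knee")),
    (("l_knee", "l_ankle"), ("r_knee", "r_ankle")),
]


def bone_length(pose, bone):
    """Euclidean distance between the two joints of a bone."""
    parent, child = bone
    return math.dist(pose[parent], pose[child])


def symmetry_loss(pose):
    """Mean squared difference between left and right bone lengths.

    `pose` maps a joint name to an (x, y, z) tuple. The loss is zero
    when every left bone is exactly as long as its right counterpart.
    """
    diffs = [
        bone_length(pose, left) - bone_length(pose, right)
        for left, right in SYMMETRIC_BONES
    ]
    return sum(d * d for d in diffs) / len(diffs)
```

A term like this needs no 3D ground truth, which is why it could plausibly be applied to in-the-wild images that only carry 2D annotations.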

arXiv.org e-Print Archive

Crossref

Monocular visual scene analysis: saliency detection and 3D face reconstruction using GAN

Author: Cai Xiaoxu
Publication date: 01/01/2021

Portsmouth University Research Portal (Pure)

Improved deep depth estimation for environments with sparse visual cues

Authors: Autiosalo Juuso; Joswig Niclas; Ruotsalainen Laura
Publication date: 01/01/2023

Most deep learning-based depth estimation models that learn scene structure in a self-supervised manner from monocular video base their estimates on visual cues such as vanishing points. In the established depth estimation benchmarks, which depict, for example, street navigation or indoor offices, these cues appear consistently, which enables neural networks to predict depth maps from single images. In this work, we address the challenge of depth estimation from a real-world bird’s-eye perspective in an industrial environment which, owing to its special geometry, contains a minimal amount of visual cues and hence requires incorporating the temporal domain for structure-from-motion estimation. To enable the system to infer structure from motion from pixel translation when facing context-sparse (i.e., visual-cue-sparse) scenery, we propose a novel architecture built upon the structure from motion learner, which uses temporal pairs of jointly unrotated and stacked images for depth prediction. To increase overall performance and to avoid blurred depth edges that lie between the edges of the two input images, we integrate a geometric consistency loss into our pipeline. We assess the model’s ability to learn structure from motion by introducing a novel industry dataset whose perspective, orthogonal to the floor, contains only minimal visual cues. Through evaluation with ground-truth depth, we show that our proposed method outperforms the state of the art in difficult context-sparse environments.

Peer reviewed
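
The abstract does not define its geometric consistency loss. One common form in the self-supervised depth literature is a normalized difference between two depth maps of the same scene points, and the sketch below illustrates that form only. It assumes the two maps have already been warped into pixel-to-pixel alignment and that all depths are strictly positive. The function name and the plain nested-list representation are choices made for this example, not the authors' implementation.

```python
def geometric_consistency_loss(depth_a, depth_b):
    """Mean normalized absolute difference between two aligned depth maps.

    Both arguments are 2D nested lists of strictly positive depths with
    identical shapes. Each per-pixel term |a - b| / (a + b) lies in [0, 1),
    so near and far regions contribute on a comparable scale.
    """
    total = 0.0
    count = 0
    for row_a, row_b in zip(depth_a, depth_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b) / (a + b)
            count += 1
    return total / count
```

Normalizing by the sum of the two depths keeps the penalty scale-invariant, which matters when depth is only recovered up to an unknown scale from monocular video.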

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto