5 research outputs found

    A Constrained Latent Variable Model

    Latent variable models provide valuable compact representations for learning and inference in many computer vision tasks. However, most existing models cannot directly encode prior knowledge about the specific problem at hand. In this paper, we introduce a constrained latent variable model whose generated output inherently accounts for such knowledge. To this end, we propose an approach that explicitly imposes equality and inequality constraints on the model's output during learning, thus avoiding the computational burden of having to account for these constraints at inference. Our learning mechanism can exploit non-linear kernels, while involving only sequential closed-form updates of the model parameters. We demonstrate the effectiveness of our constrained latent variable model on the problem of non-rigid 3D reconstruction from monocular images, and show that it yields qualitative and quantitative improvements over several baselines.
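
    The abstract gives no implementation details, but the core idea of handling constraints during learning rather than at inference can be illustrated with a simple penalty method. The NumPy sketch below is built on assumptions of my own: a linear decoder stands in for the paper's kernel-based model, and the example constraints (outputs sum to one and stay non-negative) and penalty weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data on the probability simplex, so the example constraints
# (entries sum to 1 and are non-negative) are actually satisfiable.
N, D, d = 200, 10, 3
X = rng.dirichlet(np.ones(D), size=N)

# Linear decoder y = z @ W + b stands in for the paper's kernel-based
# mapping; Z holds one latent code per training sample.
W = rng.normal(scale=0.1, size=(d, D))
b = np.zeros(D)
Z = rng.normal(scale=0.1, size=(N, d))

lr, lam_eq, lam_in = 0.05, 10.0, 10.0
for it in range(2000):
    Y = Z @ W + b
    h = Y.sum(axis=1) - 1.0          # equality constraint h(y) = sum(y) - 1 = 0
    g = np.maximum(0.0, -Y)          # hinge on inequality constraint y >= 0
    # Gradient of mean reconstruction error plus squared constraint penalties.
    dY = 2.0 * (Y - X) / N
    dY += lam_eq * 2.0 * h[:, None] / N   # d/dY of h^2 (dh/dY = 1)
    dY += lam_in * -2.0 * g / N           # d/dY of g^2 (dg/dY = -1 where active)
    dW = Z.T @ dY
    dZ = dY @ W.T + 1e-3 * Z              # small Gaussian prior on the latent code
    W -= lr * dW
    b -= lr * dY.sum(axis=0)
    Z -= lr * dZ

Y = Z @ W + b
print(f"reconstruction MSE:          {np.mean((Y - X) ** 2):.4f}")
print(f"mean |sum(y) - 1|:           {np.abs(Y.sum(axis=1) - 1).mean():.4f}")
print(f"fraction of negative output: {(Y < 0).mean():.4f}")
```

    Once trained this way, the decoder's outputs approximately satisfy the constraints by construction, so no constrained optimization is needed at inference; the paper's actual learning mechanism uses closed-form kernelized updates rather than this gradient-descent penalty loop.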

    3차원 μ‚¬λžŒ μžμ„Έ 좔정을 μœ„ν•œ 3차원 볡원, μ•½μ§€λ„ν•™μŠ΅, μ§€λ„ν•™μŠ΅ 방법

    Doctoral dissertation, Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Systems Program), February 2019. Advisor: Nojun Kwak. Estimating human poses from images is one of the fundamental tasks in computer vision, underpinning applications such as action recognition, human-computer interaction, and virtual reality. In particular, estimating 3D human poses from 2D inputs is a challenging problem since it is inherently under-constrained. In addition, obtaining 3D ground truth data for human poses is only possible in limited, restricted environments such as motion capture studios. In this dissertation, 3D human pose estimation is studied from several angles, depending on the type of training data available. To this end, three methods for recovering 3D human poses from 2D observations or from RGB images---3D reconstruction, weakly-supervised learning, and supervised learning---are proposed. First, a non-rigid structure from motion (NRSfM) algorithm is proposed that reconstructs the 3D structures of non-rigid objects, such as human bodies, from 2D observations. In the proposed framework, named Procrustean Regression, the 3D shapes are regularized based on their aligned shapes. We show that the cost function of Procrustean Regression can be cast as an unconstrained problem, or as a problem with simple bound constraints, which can be solved efficiently by existing gradient descent solvers. The framework can easily integrate numerous existing models and assumptions, which makes it practical for a variety of real situations. The experimental results show that the proposed method is competitive with state-of-the-art methods for orthographic projection, with much lower time complexity and memory requirements, and outperforms existing methods for perspective projection. Second, a weakly-supervised learning method is presented that can learn 3D structures when only 2D ground truth data is available for training. Extending the Procrustean Regression framework, we propose the Procrustean Regression Network, a method that trains neural networks to recover 3D structures using training data with only 2D ground truths. This is the first attempt to directly integrate an NRSfM algorithm into neural network training, and the first use of a cost function containing a low-rank term to train neural networks that reconstruct 3D shapes. At test time, the 3D structures of human bodies are obtained via a feed-forward pass, which makes inference much faster than running a 3D reconstruction algorithm. Third, a supervised learning method is proposed that infers 3D poses from 2D inputs using neural networks. The method exploits a relational unit that captures the relations between different body parts: each pair of body parts generates a relational feature, and the average of the features over all pairs is used for 3D pose estimation. We also propose relational dropout, a dropout method that can be used in relational modules to impose robustness to occlusions. The experimental results validate that the performance of the proposed algorithm does not degrade much in the presence of missing points, while maintaining state-of-the-art performance when every point is visible.
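
    As a rough illustration of the Procrustean Regression idea (a data term plus a regularizer evaluated on shapes aligned to their mean), here is a NumPy sketch. The orthographic data term, the nuclear norm as the low-rank regularizer, and all names are assumptions made for illustration; the dissertation optimizes the shapes by gradient descent through a cost of this form, whereas this snippet only evaluates it, and it assumes the shapes are already centered (translations removed).

```python
import numpy as np

def procrustes_rotation(A, B):
    """Rotation R minimizing ||R @ A - B||_F (orthogonal Procrustes / Kabsch)."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    return U @ S @ Vt

def pr_cost(shapes, obs2d, lam):
    """Hypothetical Procrustean-Regression-style cost: an orthographic
    reprojection data term plus a low-rank regularizer (nuclear norm)
    computed on the shapes after alignment to their mean shape."""
    mean = shapes.mean(axis=0)
    aligned = np.stack([procrustes_rotation(S, mean) @ S for S in shapes])
    # Data term: orthographic projection keeps the first two coordinate rows.
    f = sum(np.sum((S[:2] - W) ** 2) for S, W in zip(shapes, obs2d))
    # Regularizer: nuclear norm of the (F x 3p) matrix of aligned shapes.
    g = np.sum(np.linalg.svd(aligned.reshape(len(shapes), -1), compute_uv=False))
    return f + lam * g

# Toy usage: F frames of a p-point shape observed under orthographic projection.
rng = np.random.default_rng(0)
F, p = 8, 15
shapes = rng.normal(size=(F, 3, p))
obs2d = shapes[:, :2] + 0.01 * rng.normal(size=(F, 2, p))
print(pr_cost(shapes, obs2d, lam=0.1))
```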
μ‚¬λžŒ μžμ„Έ 좔정은 λ™μž‘ 인식, 인간-컴퓨터 μƒν˜Έμž‘μš©, 가상 ν˜„μ‹€, 증강 ν˜„μ‹€ λ“± κ΄‘λ²”μœ„ν•œ λΆ„μ•Όμ—μ„œ 기반 기술둜 μ‚¬μš©λ  수 μžˆλ‹€. 특히, 2차원 μž…λ ₯μœΌλ‘œλΆ€ν„° 3차원 μ‚¬λžŒ μžμ„Έλ₯Ό μΆ”μ •ν•˜λŠ” λ¬Έμ œλŠ” 무수히 λ§Žμ€ ν•΄λ₯Ό κ°€μ§ˆ 수 μžˆλŠ” 문제이기 λ•Œλ¬Έμ— ν’€κΈ° μ–΄λ €μš΄ 문제둜 μ•Œλ €μ Έ μžˆλ‹€. λ˜ν•œ, 3차원 μ‹€μ œ λ°μ΄ν„°μ˜ μŠ΅λ“μ€ λͺ¨μ…˜μΊ‘처 μŠ€νŠœλ””μ˜€ λ“± μ œν•œλœ ν™˜κ²½ν•˜μ—μ„œλ§Œ κ°€λŠ₯ν•˜κΈ° λ•Œλ¬Έμ— 얻을 수 μžˆλŠ” λ°μ΄ν„°μ˜ 양이 ν•œμ •μ μ΄λ‹€. λ³Έ λ…Όλ¬Έμ—μ„œλŠ”, 얻을 수 μžˆλŠ” ν•™μŠ΅ λ°μ΄ν„°μ˜ μ’…λ₯˜μ— 따라 μ—¬λŸ¬ 방면으둜 3차원 μ‚¬λžŒ μžμ„Έλ₯Ό μΆ”μ •ν•˜λŠ” 방법을 μ—°κ΅¬ν•˜μ˜€λ‹€. ꡬ체적으둜, 2차원 κ΄€μΈ‘κ°’ λ˜λŠ” RGB μ˜μƒμ„ λ°”νƒ•μœΌλ‘œ 3차원 μ‚¬λžŒ μžμ„Έλ₯Ό μΆ”μ •, λ³΅μ›ν•˜λŠ” μ„Έ 가지 방법--3차원 볡원, μ•½μ§€λ„ν•™μŠ΅, μ§€λ„ν•™μŠ΅--을 μ œμ‹œν•˜μ˜€λ‹€. 첫 번째둜, μ‚¬λžŒμ˜ 신체와 같이 λΉ„μ •ν˜• 객체의 2차원 κ΄€μΈ‘κ°’μœΌλ‘œλΆ€ν„° 3차원 ꡬ쑰λ₯Ό λ³΅μ›ν•˜λŠ” λΉ„μ •ν˜• μ›€μ§μž„ 기반 ꡬ쑰 (Non-rigid structure from motion) μ•Œκ³ λ¦¬μ¦˜μ„ μ œμ•ˆν•˜μ˜€λ‹€. ν”„λ‘œν¬λ£¨μŠ€ν…ŒμŠ€ νšŒκ·€ (Procrustean regression)으둜 λͺ…λͺ…ν•œ μ œμ•ˆλœ ν”„λ ˆμž„μ›Œν¬μ—μ„œ, 3차원 ν˜•νƒœλ“€μ€ κ·Έλ“€μ˜ μ •λ ¬λœ ν˜•νƒœμ— λŒ€ν•œ ν•¨μˆ˜λ‘œ μ •κ·œν™”λœλ‹€. μ œμ•ˆλœ ν”„λ‘œν¬λ£¨μŠ€ν…ŒμŠ€ νšŒκ·€μ˜ λΉ„μš© ν•¨μˆ˜λŠ” 3차원 ν˜•νƒœ μ •λ ¬κ³Ό κ΄€λ ¨λœ μ œμ•½μ„ λΉ„μš© ν•¨μˆ˜μ— ν¬ν•¨μ‹œμΌœ 경사 ν•˜κ°•λ²•μ„ μ΄μš©ν•œ μ΅œμ ν™”κ°€ κ°€λŠ₯ν•˜λ‹€. μ œμ•ˆλœ 방법은 λ‹€μ–‘ν•œ λͺ¨λΈκ³Ό 가정을 ν¬ν•¨μ‹œν‚¬ 수 μžˆμ–΄ μ‹€μš©μ μ΄κ³  μœ μ—°ν•œ ν”„λ ˆμž„μ›Œν¬μ΄λ‹€. λ‹€μ–‘ν•œ μ‹€ν—˜μ„ 톡해 μ œμ•ˆλœ 방법은 세계 졜고 μˆ˜μ€€μ˜ 방법듀과 비ꡐ해 μœ μ‚¬ν•œ μ„±λŠ₯을 λ³΄μ΄λ©΄μ„œ, λ™μ‹œμ— μ‹œκ°„, 곡간 λ³΅μž‘λ„ λ©΄μ—μ„œ κΈ°μ‘΄ 방법에 λΉ„ν•΄ μš°μˆ˜ν•¨μ„ λ³΄μ˜€λ‹€. 두 번째둜 μ œμ•ˆλœ 방법은, 2차원 ν•™μŠ΅ λ°μ΄ν„°λ§Œ μ£Όμ–΄μ‘Œμ„ λ•Œ 2차원 μž…λ ₯μ—μ„œ 3차원 ꡬ쑰λ₯Ό λ³΅μ›ν•˜λŠ” μ•½μ§€λ„ν•™μŠ΅ 방법이닀. ν”„λ‘œν¬λ£¨μŠ€ν…ŒμŠ€ νšŒκ·€ 신경망 (Procrustean regression network)둜 λͺ…λͺ…ν•œ μ œμ•ˆλœ ν•™μŠ΅ 방법은 신경망 λ˜λŠ” μ»¨λ³Όλ£¨μ…˜ 신경망을 톡해 μ‚¬λžŒμ˜ 2차원 μžμ„Έλ‘œλΆ€ν„° 3차원 μžμ„Έλ₯Ό μΆ”μ •ν•˜λŠ” 방법을 ν•™μŠ΅ν•œλ‹€. ν”„λ‘œν¬λ£¨μŠ€ν…ŒμŠ€ νšŒκ·€μ— μ‚¬μš©λœ λΉ„μš© ν•¨μˆ˜λ₯Ό μˆ˜μ •ν•˜μ—¬ 신경망을 ν•™μŠ΅μ‹œν‚€λŠ” λ³Έ 방법은, λΉ„μ •ν˜• μ›€μ§μž„ 기반 ꡬ쑰에 μ‚¬μš©λœ λΉ„μš© ν•¨μˆ˜λ₯Ό 신경망 ν•™μŠ΅μ— μ μš©ν•œ 졜초의 μ‹œλ„μ΄λ‹€. λ˜ν•œ λΉ„μš©ν•¨μˆ˜μ— μ‚¬μš©λœ μ €κ³„μˆ˜ ν•¨μˆ˜ (low-rank function)λ₯Ό 신경망 ν•™μŠ΅μ— 처음으둜 μ‚¬μš©ν•˜μ˜€λ‹€. ν…ŒμŠ€νŠΈ 데이터에 λŒ€ν•΄μ„œ 3차원 μ‚¬λžŒ μžμ„ΈλŠ” μ‹ κ²½λ§μ˜ 전방전달(feed forward)연산에 μ˜ν•΄ μ–»μ–΄μ§€λ―€λ‘œ, 3차원 볡원 방법에 λΉ„ν•΄ 훨씬 λΉ λ₯Έ 3차원 μžμ„Έ 좔정이 κ°€λŠ₯ν•˜λ‹€. λ§ˆμ§€λ§‰μœΌλ‘œ, 신경망을 μ΄μš©ν•΄ 2차원 μž…λ ₯μœΌλ‘œλΆ€ν„° 3차원 μ‚¬λžŒ μžμ„Έλ₯Ό μΆ”μ •ν•˜λŠ” μ§€λ„ν•™μŠ΅ 방법을 μ œμ‹œν•˜μ˜€λ‹€. λ³Έ 방법은 관계 신경망 λͺ¨λ“ˆ(relational modules)을 ν™œμš©ν•΄ μ‹ μ²΄μ˜ λ‹€λ₯Έ λΆ€μœ„κ°„μ˜ 관계λ₯Ό ν•™μŠ΅ν•œλ‹€. μ„œλ‘œ λ‹€λ₯Έ λΆ€μœ„μ˜ μŒλ§ˆλ‹€ 관계 νŠΉμ§•μ„ μΆ”μΆœν•΄ λͺ¨λ“  관계 νŠΉμ§•μ˜ 평균을 μ΅œμ’… 3차원 μžμ„Έ 좔정에 μ‚¬μš©ν•œλ‹€. λ˜ν•œ κ΄€κ³„ν˜• λ“œλžμ•„μ›ƒ(relational dropout)μ΄λΌλŠ” μƒˆλ‘œμš΄ ν•™μŠ΅ 방법을 μ œμ‹œν•΄ 가렀짐에 μ˜ν•΄ λ‚˜νƒ€λ‚˜μ§€ μ•Šμ€ 2차원 관츑값이 μžˆλŠ” μƒν™©μ—μ„œ, κ°•μΈν•˜κ²Œ λ™μž‘ν•  수 μžˆλŠ” 3차원 μžμ„Έ μΆ”μ • 방법을 μ œμ‹œν•˜μ˜€λ‹€. μ‹€ν—˜μ„ 톡해 ν•΄λ‹Ή 방법이 2차원 관츑값이 μΌλΆ€λ§Œ 주어진 μƒν™©μ—μ„œλ„ 큰 μ„±λŠ₯ ν•˜λ½μ΄ 없이 효과적으둜 3차원 μžμ„Έλ₯Ό 좔정함을 증λͺ…ν•˜μ˜€λ‹€.Abstract i Contents iii List of Tables vi List of Figures viii 1 Introduction 1 1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4.1 3D Reconstruction of Human Bodies . . . . . . . . . . 9 1.4.2 Weakly-Supervised Learning for 3D HPE . 
. . . . . . . 11 1.4.3 Supervised Learning for 3D HPE . . . . . . . . . . . . 11 1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2 Related Works 14 2.1 2D Human Pose Estimation . . . . . . . . . . . . . . . . . . . . 14 2.2 3D Human Pose Estimation . . . . . . . . . . . . . . . . . . . . 16 2.3 Non-rigid Structure from Motion . . . . . . . . . . . . . . . . . 18 2.4 Learning to Reconstruct 3D Structures via Neural Networks . . 23 3 3D Reconstruction of Human Bodies via Procrustean Regression 25 3.1 Formalization of NRSfM . . . . . . . . . . . . . . . . . . . . . 27 3.2 Procrustean Regression . . . . . . . . . . . . . . . . . . . . . . 28 3.2.1 The Cost Function of Procrustean Regression . . . . . . 29 3.2.2 Derivatives of the Cost Function . . . . . . . . . . . . . 32 3.2.3 Example Functions for f and g . . . . . . . . . . . . . . 38 3.2.4 Handling Missing Points . . . . . . . . . . . . . . . . . 43 3.2.5 Optimization . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.6 Initialization . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 45 3.3.1 Orthographic Projection . . . . . . . . . . . . . . . . . 46 3.3.2 Perspective Projection . . . . . . . . . . . . . . . . . . 56 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4 Weakly-Supervised Learning of 3D Human Pose via Procrustean Regression Networks 69 4.1 The Cost Function for Procrustean Regression Network . . . . . 70 4.2 Choosing f and g for Procrustean Regression Network . . . . . 74 4.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 75 4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 77 4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5 Supervised Learning of 3D Human Pose via Relational Networks 86 5.1 Relational Networks . . . . . . . . . . . . . . . . . . . . . . . 88 5.2 Relational Networks for 3D HPE . . . . . . . . . . . . . . . . . 88 5.3 Extensions to Multi-Frame Inputs . . . . . . . . . . . . . . . . 91 5.4 Relational Dropout . . . . . . . . . . . . . . . . . . . . . . . . 93 5.5 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 94 5.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 95 5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6 Concluding Remarks 105 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 108 Abstract (In Korean) 128Docto
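    The relational unit and relational dropout from the third part can be sketched compactly as well. Assumptions here: a two-layer ReLU MLP as the shared pairwise function, 17 joints, mean pooling over pairs, and randomly initialized weights standing in for learned ones; removing every pair that touches an occluded joint is one plausible reading of how relational dropout confers robustness at test time.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
J, feat_dim = 17, 64                      # joints, relational feature size

# Shared MLP g_theta applied to every joint pair (weights random here;
# in training they would be learned jointly with the output head W_out).
W1 = rng.normal(scale=0.1, size=(4, feat_dim))
W2 = rng.normal(scale=0.1, size=(feat_dim, feat_dim))
W_out = rng.normal(scale=0.1, size=(feat_dim, 3 * J))

def relational_forward(joints2d, dropped=()):
    """Average of pairwise relational features -> 3D pose.
    `dropped` lists joints treated as occluded: every pair containing
    one of them is excluded from the average (relational dropout)."""
    feats = []
    for i, j in combinations(range(J), 2):
        if i in dropped or j in dropped:
            continue
        pair = np.concatenate([joints2d[i], joints2d[j]])          # (4,)
        feats.append(np.maximum(0, np.maximum(0, pair @ W1) @ W2))  # MLP
    pooled = np.mean(feats, axis=0)
    return (pooled @ W_out).reshape(J, 3)

pose2d = rng.normal(size=(J, 2))
full = relational_forward(pose2d)
occluded = relational_forward(pose2d, dropped=(3, 7))  # missing joints 3 and 7
print(full.shape, occluded.shape)
```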

    Sensing Highly Non-Rigid Objects with RGBD Sensors for Robotic Systems

    The goal of this research is to enable a robotic system to manipulate clothing and other highly non-rigid objects using an RGBD sensor. The focus of this thesis is to define and test various algorithms/models that are used to solve parts of the laundry process (i.e., handling, classifying, sorting, unfolding, and folding). First, a system is presented for automatically extracting and classifying items in a pile of laundry. Using only visual sensors, the robot identifies and extracts items sequentially from the pile. When an item is removed and isolated, a model of the shape and appearance of the object is captured and compared against a dataset of known items. The contributions of this part of the laundry process are a novel method for extracting articles of clothing from a pile of laundry, a novel method of classifying clothing using interactive perception, and a multi-layer approach termed L-M-H (more specifically, L-C-S-H) for clothing classification. This thesis describes two different approaches to classifying clothing into categories. The first approach relies upon silhouettes, edges, and other low-level image measurements of the articles of clothing. Experiments with the first approach demonstrate the ability of the system to efficiently classify and label items into one of six categories (pants, shorts, short-sleeve shirt, long-sleeve shirt, socks, or underwear). These results show that, on average, classification rates using robot interaction are 59% higher than those without interaction. The second approach relies upon color, texture, shape, and edge information from 2D and 3D data, within both a local and a global perspective. The multi-layer approach compartmentalizes the problem into a high (H) layer, multiple mid-level layers (characteristics (C) and selection masks (S)), and a low (L) layer, producing 'local' solutions to solve the global classification problem. Experiments demonstrate the ability of the system to efficiently classify each article of clothing into one of seven categories (pants, shorts, shirts, socks, dresses, cloths, or jackets). The results presented in this thesis show that, on average, the classification rates improve by +27.47% for three categories, +17.90% for four categories, and +10.35% for seven categories over the baseline system, using support vector machines (a pipeline of this sort is sketched after this abstract). Second, an algorithm is presented for automatically unfolding a piece of clothing. A piece of cloth is pulled in different directions at various points in order to flatten it. Features of the cloth---the peak region, corner locations, and continuity/discontinuity of the cloth---are extracted and used to determine a valid location and orientation at which to interact with it. In this thesis, a two-stage algorithm is presented that introduces a novel solution to the unfolding/flattening problem using interactive perception. Simulations using 3D simulation software and experiments with robot hardware demonstrate the ability of the algorithm to flatten pieces of laundry from different starting configurations. These results show that, at best, the algorithm flattens a piece of cloth from 11.1% to 95.6% of its canonical configuration. Third, an energy minimization algorithm is presented that estimates the configuration of a deformable object. This approach uses an RGBD image to calculate feature correspondences (using SURF features), depth values, and boundary locations.
    Input from a Kinect sensor is used to segment the deformable surface from the background using an alpha-beta swap algorithm. Using this segmentation, the system creates an initial mesh model without prior information about the surface geometry, and it reinitializes the configuration of the mesh model after a loss of input data. The approach is able to handle in-plane rotation, out-of-plane rotation, and varying changes in translation and scale. Results are presented for the proposed algorithm on a dataset consisting of seven shirts, two pairs of shorts, two posters, and a pair of pants. The approach is evaluated against a simulated shirt model by calculating the mean squared error of the distance from the vertices of the mesh model to the ground truth provided by the simulation model.
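
    The thesis reports SVM-based classification into seven categories; the sketch below shows the shape of such a pipeline with scikit-learn. The synthetic class-dependent features are pure stand-ins for the color/texture/shape/edge descriptors described above, so the printed accuracy demonstrates only the plumbing, not the thesis results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
CATEGORIES = ["pants", "shorts", "shirts", "socks", "dresses", "cloths", "jackets"]

# Stand-in feature vectors: in the thesis these would be color, texture,
# shape, and edge descriptors computed from 2D/3D (RGBD) data; here we
# draw class-dependent synthetic features just to exercise the pipeline.
n_per_class, dim = 40, 32
X = np.vstack([rng.normal(loc=k, scale=2.0, size=(n_per_class, dim))
               for k in range(len(CATEGORIES))])
y = np.repeat(np.arange(len(CATEGORIES)), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_tr, y_tr)
print(f"7-category accuracy on synthetic features: {clf.score(X_te, y_te):.2%}")
```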

    Resolving Ambiguities in Monocular 3D Reconstruction of Deformable Surfaces

    In this thesis, we focus on the problem of recovering the 3D shapes of deformable surfaces from a single camera. This problem is known to be ill-posed, since for a given 2D input image there exist many 3D shapes that give visually identical projections. We present three methods that make headway towards resolving these ambiguities, and we believe that our work represents a significant step towards making surface reconstruction methods practical. First, we propose a surface reconstruction method that overcomes the limitations of state-of-the-art template-based and non-rigid structure from motion methods. We neither track points over many frames, nor require a sophisticated deformation model, nor depend on a reference image. In our method, we establish correspondences between pairs of frames in which the shape is different and unknown. We then estimate homographies between corresponding local planar patches in both images. These yield approximate 3D reconstructions of the points within each patch, up to a scale factor. Since the patches overlap, we can enforce them to be consistent over the whole surface. Finally, a local deformation model is used to fit a triangulated mesh to the 3D point cloud, which makes the reconstruction robust to both noise and outliers in the image data. Second, we propose a novel approach to recovering the 3D shape of a deformable surface from monocular input by taking advantage of shading information in more generic contexts than conventional Shape-from-Shading (SfS) methods, including surfaces that may be fully or partially textured and lit by arbitrarily many light sources. To this end, given a lighting model, we learn the relationship between a shading pattern and the corresponding local surface shape. At run time, we first use this knowledge to recover the shape of surface patches and then enforce spatial consistency between the patches to produce a global 3D shape. Instead of treating texture as noise, as many SfS approaches do, we exploit it as an additional source of information. We validate our approach quantitatively and qualitatively using both synthetic and real data. Third, we introduce a constrained latent variable model that inherently accounts for geometric constraints, such as inextensibility, defined on the mesh model. To this end, we learn a non-linear mapping from the latent space to the output space (the vertex positions of a mesh model) such that the generated outputs comply with equality and inequality constraints expressed in terms of the problem variables. Since its output is encouraged to satisfy such constraints inherently, our model removes the need for computationally expensive methods that enforce these constraints at run time. Moreover, our approach is completely generic and could be used in many other contexts as well, such as image classification, to impose separation of the classes, and articulated tracking, to constrain the space of possible poses.
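
    For the third contribution, the inextensibility constraints mentioned above are naturally expressed as inequality constraints on mesh edge lengths. Below is a minimal NumPy sketch of how a violation score for such constraints might be computed; the toy mesh, edge list, and hinge-squared form are assumptions for illustration, not the thesis's formulation.

```python
import numpy as np

def inextensibility_penalty(vertices, edges, rest_lengths):
    """Hinge penalty on mesh edges that stretch beyond their rest length,
    i.e. the inequality constraints ||v_i - v_j|| <= L_ij. Compression is
    allowed, which is the usual assumption for cloth-like surfaces."""
    diffs = vertices[edges[:, 0]] - vertices[edges[:, 1]]
    lengths = np.linalg.norm(diffs, axis=1)
    violation = np.maximum(0.0, lengths - rest_lengths)
    return np.sum(violation ** 2)

# Toy 2x2 planar mesh with unit rest lengths; one vertex is pulled away,
# so two edges are stretched and the penalty is positive.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1.6, 1.2, 0.3]], float)
edges = np.array([[0, 1], [0, 2], [1, 3], [2, 3]])
rest = np.ones(len(edges))
print(inextensibility_penalty(verts, edges, rest))  # > 0: constraints violated
```

    In the constrained latent variable model, a score of this kind would be driven to zero during learning, so meshes decoded from the latent space satisfy the constraints without any run-time enforcement.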