320 research outputs found

    GANVO: Unsupervised Deep Monocular Visual Odometry and Depth Estimation with Generative Adversarial Networks

    In the last decade, supervised deep learning approaches have been extensively employed in visual odometry (VO) applications, an approach that is not feasible in environments where labelled data are scarce. Unsupervised deep learning approaches for localization and mapping in unknown environments from unlabelled data, on the other hand, have received comparatively little attention in VO research. In this study, we propose a generative unsupervised learning framework that predicts 6-DoF camera pose and a monocular depth map of the scene from unlabelled RGB image sequences, using deep convolutional Generative Adversarial Networks (GANs). We create a supervisory signal by warping view sequences and assigning the re-projection minimization to the objective loss function, which is adopted in the multi-view pose estimation and single-view depth generation networks. Detailed quantitative and qualitative evaluations of the proposed framework on the KITTI and Cityscapes datasets show that the proposed method outperforms both traditional and existing unsupervised deep VO methods, providing better results for both pose estimation and depth recovery. Comment: ICRA 2019 - accepted
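The supervisory signal described above (warping source views into the target frame and minimizing the photometric re-projection error) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, nearest-neighbour sampling, and plain L1 loss are simplifying assumptions (trainable networks would use differentiable bilinear sampling).

```python
import numpy as np

def warp_photometric_loss(target, source, depth, K, T_t2s):
    """Photometric re-projection loss for unsupervised VO/depth training.

    target, source : (H, W) grayscale images
    depth          : (H, W) predicted depth of the target view
    K              : (3, 3) camera intrinsics
    T_t2s          : (4, 4) predicted pose, target frame -> source frame
    """
    H, W = target.shape
    # Pixel grid of the target view in homogeneous coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # Back-project target pixels to 3-D using the predicted depth.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Transform the points into the source frame and project with K.
    proj = K @ (T_t2s @ cam_h)[:3]
    pu = proj[0] / proj[2]
    pv = proj[1] / proj[2]
    # Nearest-neighbour sampling of the source image (bilinear in practice).
    pu = np.clip(np.round(pu).astype(int), 0, W - 1)
    pv = np.clip(np.round(pv).astype(int), 0, H - 1)
    warped = source[pv, pu].reshape(H, W)
    # The mean L1 photometric error is the training signal.
    return np.abs(target - warped).mean()
```

With an identity pose, unit depth, and identical target and source images, the warp is the identity and the loss is zero, which is the sanity check such view-synthesis objectives are usually verified with.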

    Unsupervised Odometry and Depth Learning for Endoscopic Capsule Robots

    In the last decade, many medical companies and research groups have tried to convert passive capsule endoscopes, an emerging and minimally invasive diagnostic technology, into actively steerable endoscopic capsule robots that will enable more intuitive disease detection, targeted drug delivery, and biopsy-like operations in the gastrointestinal (GI) tract. In this study, we introduce a fully unsupervised, real-time odometry and depth learner for monocular endoscopic capsule robots. We establish supervision by warping view sequences and assigning the re-projection minimization to the loss function, which we adopt in the multi-view pose estimation and single-view depth estimation networks. Detailed quantitative and qualitative analyses of the proposed framework, performed on non-rigidly deformable ex-vivo porcine stomach datasets, prove the effectiveness of the method in terms of motion estimation and depth recovery. Comment: submitted to IROS 201

    An Effective Deep Learning-Based Method for Improving Visual Odometry

    Doctoral dissertation -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2020. Advisor: Beom Hee Lee. Understanding the three-dimensional environment is one of the most important problems in robotics and computer vision. For this purpose, sensors such as lidar, ultrasound, infrared devices, inertial measurement units (IMUs), and cameras are used, either individually or simultaneously through sensor fusion. Among these sensors, research on visual sensors, which can obtain a large amount of information at low cost, has been particularly active in recent years. Understanding the 3D environment using cameras includes depth recovery, optical/scene flow estimation, and visual odometry (VO). Among these, VO estimates the location of a camera and maps the surrounding environment while a camera-equipped robot or person travels. This technology must precede other tasks such as path planning and collision avoidance, and it can be applied to practical problems such as autonomous driving, augmented reality (AR), unmanned aerial vehicle (UAV) control, and 3D modeling. Various VO algorithms have been proposed so far. Early VO research filtered the robot's pose together with map features; because the computational cost was too high and errors accumulated, keyframe-based methods were studied instead. Traditional VO can be divided into feature-based methods and direct methods. Feature-based methods obtain the pose transformation between two images through feature extraction and matching, while direct methods compare image pixel intensities directly to find the pose that minimizes the sum of photometric errors. Recently, owing to advances in deep learning, many studies have applied deep learning to VO.
Deep learning-based VO, like other fields that apply deep learning to images, first extracts convolutional neural network (CNN) features and then calculates the pose transformation between images. It can be divided into supervised and unsupervised approaches: supervised methods train a neural network using ground-truth poses, while unsupervised methods learn poses from image sequences alone, without ground-truth values. While existing papers show decent performance, the image datasets used in these studies all consist of high-quality, sharp images obtained with expensive cameras. There are also algorithms that can operate only if non-image information such as exposure times, nonlinear response functions, and camera parameters is provided. For VO to be applied more widely to real-world problems, odometry estimation should work even when the dataset is imperfect. Therefore, in this dissertation, two methods are proposed to improve VO performance using deep learning. First, I adopt a super-resolution (SR) technique to improve the performance of VO on low-resolution, noisy images. Existing SR techniques have mainly focused on increasing image resolution rather than on execution time; for VO, however, real-time operation is essential, so the SR network must be designed with execution time, resolution increase, and noise reduction in mind together. Running VO after passing images through this SR network yields higher performance than using the original images. Experimental results on the TUM dataset show that the proposed method outperforms conventional VO and other SR methods.
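The SR-before-VO pipeline described above can be sketched structurally as follows. This is only a skeleton under stated assumptions: the bilinear upscaler stands in for the learned SR network, and `vo_step` is a hypothetical placeholder for whatever two-frame pose estimator follows; neither appears in the dissertation in this form.

```python
import numpy as np
import time

def bilinear_upscale(img, scale=2):
    """Placeholder for the learned SR network: plain bilinear upsampling."""
    H, W = img.shape
    ys = np.arange(H * scale) / scale
    xs = np.arange(W * scale) / scale
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    wy = np.clip(ys - y0, 0.0, 1.0)[:, None]
    wx = np.clip(xs - x0, 0.0, 1.0)[None, :]
    tl = img[y0][:, x0]      # top-left neighbours
    tr = img[y0][:, x0 + 1]  # top-right
    bl = img[y0 + 1][:, x0]  # bottom-left
    br = img[y0 + 1][:, x0 + 1]
    return (1 - wy) * (1 - wx) * tl + (1 - wy) * wx * tr \
         + wy * (1 - wx) * bl + wy * wx * br

def sr_vo_pipeline(frames, vo_step):
    """Run VO on SR-enhanced frames; per-frame latency is what the
    execution-time constraint in the text is about."""
    poses, prev = [], None
    for f in frames:
        t0 = time.perf_counter()
        hi = bilinear_upscale(f)             # SR stage (learned network in the thesis)
        if prev is not None:
            poses.append(vo_step(prev, hi))  # pose from consecutive enhanced frames
        prev = hi
        latency = time.perf_counter() - t0   # must stay within the frame budget
    return poses
```

The point of the structure is that the SR stage sits inside the per-frame loop, which is why its execution time matters as much as its reconstruction quality.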
Second, I propose a fully unsupervised learning-based VO that performs odometry estimation, single-view depth estimation, and camera intrinsic parameter estimation simultaneously, using a dataset consisting only of image sequences. Existing unsupervised learning-based VO algorithms operate on the images together with the intrinsic parameters of the camera that captured them. Building on this technique, I propose a method that additionally estimates the camera parameters with a deep intrinsics network. The intrinsic parameters, which would otherwise converge to zero or easily diverge, are estimated under two assumptions that exploit the properties of camera parameters. Experiments on the KITTI dataset show performance comparable to the conventional method, which is given the intrinsic parameters.

1 INTRODUCTION
  1.1 Background and Motivation
  1.2 Literature Review
  1.3 Contributions
  1.4 Thesis Structure
2 Mathematical Preliminaries of Visual Odometry
  2.1 Feature-based VO
  2.2 Direct VO
  2.3 Learning-based VO
    2.3.1 Supervised learning-based VO
    2.3.2 Unsupervised learning-based VO
3 Error Improvement in Visual Odometry Using Super-resolution
  3.1 Introduction
  3.2 Related Work
    3.2.1 Visual Odometry
    3.2.2 Super-resolution
  3.3 SR-VO
    3.3.1 VO performance analysis according to changing resolution
    3.3.2 Super-Resolution Network
  3.4 Experiments
    3.4.1 Super-Resolution Procedure
    3.4.2 VO with SR images
  3.5 Summary
4 Visual Odometry Enhancement Method Using Fully Unsupervised Learning
  4.1 Introduction
  4.2 Related Work
    4.2.1 Traditional Visual Odometry
    4.2.2 Single-view Depth Recovery
    4.2.3 Supervised Learning-based Visual Odometry
    4.2.4 Unsupervised Learning-based Visual Odometry
    4.2.5 Architecture Overview
  4.3 Methods
    4.3.1 Predicting the Target Image using Source Images
    4.3.2 Intrinsic Parameters Regressor
  4.4 Experiments
    4.4.1 Monocular Depth Estimation
    4.4.2 Visual Odometry
    4.4.3 Intrinsic Parameters Estimation
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Bibliography
Abstract (In Korean)
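The deep-intrinsics idea in the second contribution replaces the usual known-K input with network-predicted scalars. The following is a hypothetical sketch, assuming a plain pinhole model, of how predicted focal lengths and a principal point would be assembled into K and consumed by the same back-projection used for view synthesis; the dissertation's two constraining assumptions and the regressor architecture are not reproduced here, and the numeric values are merely KITTI-like examples.

```python
import numpy as np

def build_intrinsics(fx, fy, cx, cy):
    """Assemble the pinhole intrinsics matrix from (network-predicted) scalars."""
    return np.array([[fx, 0.0, cx],
                     [0.0, fy, cy],
                     [0.0, 0.0, 1.0]])

def backproject(K, u, v, depth):
    """Lift pixel (u, v) with known depth to a 3-D point in the camera frame."""
    pix = np.array([u, v, 1.0])
    return depth * (np.linalg.inv(K) @ pix)

# Round trip: projecting the lifted point with K recovers the original pixel,
# which is the geometric consistency the warping loss relies on.
K = build_intrinsics(fx=718.856, fy=718.856, cx=607.193, cy=185.216)  # KITTI-like values
p = backproject(K, 100.0, 50.0, 10.0)
uvw = K @ p
```

Because K enters the training objective only through this projection chain, degenerate predictions (e.g. focal lengths collapsing toward zero) still produce a valid warp, which is why the extra constraints mentioned in the abstract are needed to keep the estimates well-behaved.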