
    Learning to Interpret Fluid Type Phenomena via Images

    Get PDF
    Learning to interpret fluid-type phenomena via images is a long-standing and challenging problem in computer vision. The problem becomes even more challenging when the fluid medium is highly dynamic and refractive due to its transparent nature. Here, we consider imaging through refractive fluid media such as water and air. For water, we design novel supervised learning-based algorithms to recover its 3D surface as well as the highly distorted patterns beneath its surface. For air, we design a state-of-the-art unsupervised learning algorithm to predict the distortion-free image given a short sequence of turbulent images. Specifically, we design a deep neural network that estimates the depth and normal maps of a fluid surface by analyzing the refractive distortion of a reference background pattern. To recover underwater images severely degraded by the refractive distortions caused by water surface fluctuations, we present the distortion-guided network (DG-Net) for restoring distortion-free underwater images. The key idea is to use a distortion map, which models the pixel displacement caused by water refraction, to guide network training. Furthermore, we present a novel unsupervised network to recover the latent distortion-free image. The key idea is to model non-rigid distortions as deformable grids. Our network consists of a grid deformer that estimates the distortion field and an image generator that outputs the distortion-free image. By leveraging the positional encoding operator, we can simplify the network structure while maintaining fine spatial details in the recovered images. We also develop a combinational deep neural network that simultaneously recovers the latent distortion-free image and reconstructs the transparent, dynamic 3D fluid surface. Through extensive experiments on simulated and real captured fluid images, we demonstrate that our proposed deep neural networks outperform the current state of the art on these tasks.
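    The abstract above describes a distortion map that encodes per-pixel displacement caused by water refraction. As a purely illustrative sketch (my own assumption of how such a map could be applied, not the authors' DG-Net implementation; the function name and the meaning of the flow field are hypothetical), the snippet below undoes a known displacement field with a backward warp using PyTorch's grid_sample.

```python
# Minimal sketch: backward-warp a distorted frame using a per-pixel displacement
# map. Assumption: flow[p] tells where the distortion-free pixel p landed in the
# distorted frame, so we sample the distorted image at (p + flow[p]).
import torch
import torch.nn.functional as F

def unwarp_with_distortion_map(distorted, flow):
    """distorted: (N, C, H, W) image; flow: (N, 2, H, W) displacement in pixels (x, y)."""
    n, _, h, w = distorted.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow                                  # where each pixel originated
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0         # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)       # (N, H, W, 2)
    return F.grid_sample(distorted, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Sanity check: a zero distortion map returns the input unchanged.
img = torch.rand(1, 3, 64, 64)
restored = unwarp_with_distortion_map(img, torch.zeros(1, 2, 64, 64))
assert torch.allclose(restored, img, atol=1e-5)
```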

    Human robot interaction in a crowded environment

    No full text
    Human Robot Interaction (HRI) is the primary means of establishing natural and affective communication between humans and robots. HRI enables robots to act in a way similar to humans in order to assist in activities that are considered to be laborious, unsafe, or repetitive. Vision based human robot interaction is a major component of HRI, with which visual information is used to interpret how human interaction takes place. Common tasks of HRI include finding pre-trained static or dynamic gestures in an image, which involves localising different key parts of the human body such as the face and hands. This information is subsequently used to extract different gestures. After the initial detection process, the robot is required to comprehend the underlying meaning of these gestures [3]. Thus far, most gesture recognition systems can only detect gestures and identify a person in relatively static environments. This is not realistic for practical applications as difficulties may arise from people's movements and changing illumination conditions. Another issue to consider is that of identifying the commanding person in a crowded scene, which is important for interpreting the navigation commands. To this end, it is necessary to associate the gesture to the correct person and automatic reasoning is required to extract the most probable location of the person who has initiated the gesture. In this thesis, we have proposed a practical framework for addressing the above issues. It attempts to achieve a coarse level understanding about a given environment before engaging in active communication. This includes recognizing human robot interaction, where a person has the intention to communicate with the robot. In this regard, it is necessary to differentiate if people present are engaged with each other or their surrounding environment. The basic task is to detect and reason about the environmental context and different interactions so as to respond accordingly. For example, if individuals are engaged in conversation, the robot should realize it is best not to disturb or, if an individual is receptive to the robot's interaction, it may approach the person. Finally, if the user is moving in the environment, it can analyse further to understand if any help can be offered in assisting this user. The method proposed in this thesis combines multiple visual cues in a Bayesian framework to identify people in a scene and determine potential intentions. For improving system performance, contextual feedback is used, which allows the Bayesian network to evolve and adjust itself according to the surrounding environment. The results achieved demonstrate the effectiveness of the technique in dealing with human-robot interaction in a relatively crowded environment [7].
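    The abstract above fuses multiple visual cues in a Bayesian framework to identify the commanding person. Below is a deliberately simplified naive-Bayes style illustration of such cue fusion (my own sketch with hypothetical cue names; the thesis' full Bayesian network, including contextual feedback, is more elaborate).

```python
# Simplified cue fusion: combine independent per-person cue likelihoods into a
# normalized score of "this person is addressing the robot".
import numpy as np

def fuse_cues(likelihoods, priors=None):
    """likelihoods: (num_people, num_cues) array of P(cue observed | person is commanding).
    Returns a normalized posterior-like score per person, assuming cue independence."""
    n = likelihoods.shape[0]
    priors = np.full(n, 1.0 / n) if priors is None else np.asarray(priors, dtype=float)
    scores = priors * np.prod(likelihoods, axis=1)   # prior times product of cue likelihoods
    return scores / scores.sum()

# Three people; hypothetical cues = [facing the camera, hand raised, close to robot].
cues = np.array([
    [0.9, 0.8, 0.7],   # person 0: strong evidence on all cues
    [0.6, 0.1, 0.5],   # person 1: no gesture detected
    [0.2, 0.3, 0.9],   # person 2: nearby but looking away
])
print(fuse_cues(cues))  # person 0 receives the highest score
```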

    Smart environment monitoring through micro unmanned aerial vehicles

    Get PDF
    In recent years, the improvements of small-scale Unmanned Aerial Vehicles (UAVs) in terms of flight time, automatic control, and remote transmission are promoting the development of a wide range of practical applications. In aerial video surveillance, the monitoring of broad areas still presents many challenges due to the need to perform different tasks in real-time, including mosaicking, change detection, and object detection. In this thesis work, a small-scale UAV-based vision system to maintain regular surveillance over target areas is proposed. The system works in two modes. The first mode monitors an area of interest over several flights. During the first flight, it creates an incremental geo-referenced mosaic of the area of interest and classifies all known elements (e.g., persons) found on the ground using a previously trained, improved Faster R-CNN architecture. In subsequent reconnaissance flights, the system searches for any changes (e.g., the disappearance of persons) that may have occurred in the mosaic using a histogram equalization and RGB-Local Binary Pattern (RGB-LBP) based algorithm; if changes are found, the mosaic is updated. The second mode performs real-time classification with the same improved Faster R-CNN model, which is useful for time-critical operations. Thanks to different design features, the system works in real-time and performs mosaicking and change detection at low altitude, thus allowing the classification of even small objects. The proposed system was tested on the whole set of challenging video sequences contained in the UAV Mosaicking and Change Detection (UMCD) dataset and on other public datasets. The evaluation of the system using well-known performance metrics has shown remarkable results in terms of mosaic creation and updating, as well as change detection and object detection.
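    The change-detection step above combines histogram equalization with an RGB-LBP descriptor. The following sketch shows one plausible way to compare a mosaic patch against the same patch from a later flight (an assumption on my part; the function names, the chi-square distance, and the threshold are illustrative and not taken from the thesis).

```python
# Sketch: per-channel histogram equalization + 8-neighbour LBP histograms,
# compared with a chi-square distance to flag changed patches.
import numpy as np
import cv2

def rgb_lbp_histogram(img_bgr):
    """img_bgr: uint8 BGR patch. Returns concatenated normalized LBP histograms."""
    hists = []
    for c in range(3):
        ch = cv2.equalizeHist(np.ascontiguousarray(img_bgr[:, :, c]))
        padded = np.pad(ch.astype(np.int16), 1, mode="edge")
        center = padded[1:-1, 1:-1]
        code = np.zeros_like(center, dtype=np.uint8)
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        for bit, (dy, dx) in enumerate(offsets):
            neigh = padded[1 + dy:padded.shape[0] - 1 + dy,
                           1 + dx:padded.shape[1] - 1 + dx]
            code |= ((neigh >= center).astype(np.uint8) << bit)
        hist = np.bincount(code.ravel(), minlength=256).astype(np.float64)
        hists.append(hist / hist.sum())
    return np.concatenate(hists)

def changed(patch_a, patch_b, threshold=0.25):
    """Flag a change when the chi-square distance between LBP histograms is large."""
    ha, hb = rgb_lbp_histogram(patch_a), rgb_lbp_histogram(patch_b)
    chi2 = 0.5 * np.sum((ha - hb) ** 2 / (ha + hb + 1e-12))
    return chi2 > threshold

# Toy usage: identical patches are not flagged; a large zeroed block typically is.
a = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
b = a.copy(); b[16:48, 16:48] = 0
print(changed(a, a), changed(a, b))
```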

    Underwater image and video dehazing with pure haze region segmentation

    Get PDF
    Underwater scenes captured by cameras are plagued by poor contrast and spectral distortion, which are the result of the scattering and absorptive properties of water. In this paper we present a novel dehazing method that improves visibility in images and videos by detecting and segmenting image regions that contain only water. The colour of these regions, which we refer to as pure haze regions, is similar to the haze that is removed during the dehazing process. Moreover, we propose a semantic white balancing approach for illuminant estimation that uses the dominant colour of the water to address the spectral distortion present in underwater scenes. To validate the results of our method and compare them to those obtained with state-of-the-art approaches, we perform extensive subjective evaluation tests using images captured in a variety of water types and underwater videos captured onboard an underwater vehicle.
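    The semantic white balancing described above estimates the illuminant from the dominant colour of the water. Here is a minimal sketch of that idea under my own assumptions (gray-world style channel gains computed from pixels inside a pure haze mask); it is not the authors' exact algorithm, and the function name and mask source are hypothetical.

```python
# Sketch: estimate the illuminant from water-only ("pure haze") pixels, then
# rescale the channels of the whole frame toward a neutral grey.
import numpy as np

def white_balance_from_haze(img, haze_mask):
    """img: float32 RGB in [0, 1]; haze_mask: boolean (H, W) marking water-only pixels."""
    illuminant = img[haze_mask].mean(axis=0)          # dominant water colour
    gains = illuminant.mean() / (illuminant + 1e-6)   # per-channel correction gains
    return np.clip(img * gains, 0.0, 1.0)

# Toy usage: a blue-green cast (typical of water) is reduced after balancing.
img = np.full((4, 4, 3), (0.2, 0.6, 0.7), dtype=np.float32)
mask = np.ones((4, 4), dtype=bool)
print(white_balance_from_haze(img, mask)[0, 0])  # roughly equal channels
```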

    Optical Imaging and Image Restoration Techniques for Deep Ocean Mapping: A Comprehensive Survey

    Get PDF
    Visual systems are receiving increasing attention in underwater applications. While the photogrammetric and computer vision literature so far has largely targeted shallow water applications, deep sea mapping research has recently also come into focus. The majority of the seafloor, and of Earth's surface, is located in the deep ocean below 200 m depth, and is still largely uncharted. Here, on top of the general image quality degradation caused by water absorption and scattering, additional artificial illumination of the survey areas is mandatory, since they otherwise reside in permanent darkness where no sunlight reaches. This creates unintended non-uniform lighting patterns in the images and non-isotropic scattering effects close to the camera. If not compensated properly, such effects dominate seafloor mosaics and can obscure the actual seafloor structures. Moreover, cameras must be protected from the high water pressure, e.g. by housings with thick glass ports, which can lead to refractive distortions in images. Additionally, no satellite navigation is available to support localization. All these issues render deep sea visual mapping a challenging task, and most of the developed methods and strategies cannot be directly transferred to the seafloor at several kilometers depth. In this survey we provide a state-of-the-art review of deep ocean mapping, starting from existing systems and challenges, and discussing shallow and deep water models and corresponding solutions. Finally, we identify open issues for future lines of research.
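    To make the absorption and scattering degradation mentioned above concrete, the snippet below simulates a commonly used simplified underwater image formation model (direct signal attenuated exponentially with range plus a backscatter term). This is a sketch under my own assumptions; the coefficients are illustrative and are not values taken from this survey.

```python
# Simplified model: I = J * exp(-beta * d) + B * (1 - exp(-beta * d)),
# where J is the scene reflectance, d the range, beta the per-channel
# attenuation, and B the veiling (backscatter) light.
import numpy as np

def underwater_image(scene, depth_m, beta=(0.15, 0.06, 0.04), backscatter=(0.05, 0.25, 0.30)):
    """scene: float RGB reflectance in [0, 1]; depth_m: (H, W) camera-to-scene range in metres."""
    beta = np.asarray(beta, dtype=np.float32)
    b = np.asarray(backscatter, dtype=np.float32)
    transmission = np.exp(-beta * depth_m[..., None])       # per-channel transmission
    return scene * transmission + b * (1.0 - transmission)  # attenuated signal + veiling light

# Toy usage: a grey target at 10 m looks dimmer and blue-green shifted.
scene = np.full((2, 2, 3), 0.5, dtype=np.float32)
depth = np.full((2, 2), 10.0, dtype=np.float32)
print(underwater_image(scene, depth)[0, 0])
```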

    Effective Visual Odometry Improvement Methods Based on Deep Learning

    Get PDF
    Ph.D. dissertation, Seoul National University Graduate School, Department of Electrical and Information Engineering, August 2020 (advisor: Beom Hee Lee).
    Understanding the three-dimensional environment is one of the most important issues in robotics and computer vision. For this purpose, sensors such as lidar, ultrasound, infrared devices, inertial measurement units (IMUs), and cameras are used, either individually or combined through sensor fusion. Among these sensors, research on visual sensors, which can obtain a lot of information at low cost, has been particularly active in recent years. Understanding of the 3D environment using cameras includes depth restoration, optical/scene flow estimation, and visual odometry (VO). Among them, VO estimates the location of a camera and maps the surrounding environment while a camera-equipped robot or person travels. This technology must precede other tasks such as path planning and collision avoidance, and it can be applied to practical applications such as autonomous driving, augmented reality (AR), unmanned aerial vehicle (UAV) control, and 3D modeling. So far, various VO algorithms have been proposed. Early VO research filtered robot poses and map features; because such filtering is computationally heavy and accumulates errors, keyframe-based methods were studied next. Traditional VO can be divided into feature-based methods and direct methods. Feature-based methods obtain the pose transformation between two images through feature extraction and matching, while direct methods compare pixel intensities directly and seek the pose that minimizes the sum of photometric errors. Recently, owing to the development of deep learning, many studies have applied deep learning to VO. Deep learning-based VO, like other fields applying deep learning to images, first extracts convolutional neural network (CNN) features and then calculates the pose transformation between images. It can be divided into supervised and unsupervised approaches: in supervised learning-based VO, a neural network is trained using ground-truth poses, whereas unsupervised learning-based methods learn poses using only image sequences, without ground-truth values. While existing papers show decent performance, the image datasets used in these studies are all composed of high-quality, clear images obtained with expensive cameras. There are also algorithms that can operate only if non-image information such as exposure time, nonlinear response functions, and camera parameters is provided. For VO to be more widely applied to real-world problems, odometry estimation should still be possible even when the datasets are imperfect. Therefore, in this dissertation, two methods are proposed to improve VO performance using deep learning. First, I adopt a super-resolution (SR) technique to improve the performance of VO on low-resolution, noisy images. Existing SR techniques have mainly focused on increasing image resolution rather than on execution time; however, real-time operation is very important for VO, so the SR network is designed considering execution time, resolution increase, and noise reduction. Running VO on images passed through this SR network yields higher performance than using the original images. Experimental results on the TUM dataset show that the proposed method outperforms conventional VO and other SR methods. Second, I propose a fully unsupervised learning-based VO that simultaneously performs odometry estimation, single-view depth estimation, and camera intrinsic parameter estimation using a dataset consisting only of image sequences. Existing unsupervised learning-based VO requires both the images and the intrinsic parameters of the camera that captured them. Building on this line of work, I propose a method that additionally estimates the camera parameters with a deep intrinsic network; the intrinsic parameters are estimated under two assumptions derived from the properties of camera parameters. Experiments on the KITTI dataset show results comparable to those of the conventional method that is given the intrinsic parameters.
    Contents: 1 Introduction; 2 Mathematical Preliminaries of Visual Odometry (feature-based, direct, and learning-based VO); 3 Error Improvement in Visual Odometry Using Super-resolution; 4 Visual Odometry Enhancement Method Using Fully Unsupervised Learning; 5 Conclusion and Future Work; Bibliography; Abstract (in Korean).
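    Both parts of the dissertation above build on the photometric reprojection error used in direct and unsupervised VO. As a hedged sketch of that core computation (a generic view-synthesis loss assuming a pinhole camera; the function name, shapes, and toy intrinsics are mine and not taken from the dissertation), the code below warps a source frame into the target view using depth, relative pose, and intrinsics and measures the L1 photometric difference.

```python
# Sketch of a view-synthesis photometric loss: back-project target pixels with
# predicted depth, transform them by the relative pose, project into the source
# frame, bilinearly sample it, and compare with the target image.
import torch
import torch.nn.functional as F

def photometric_loss(target, source, depth, pose, K):
    """target, source: (1, 3, H, W); depth: (1, 1, H, W); pose: (4, 4) target->source; K: (3, 3)."""
    _, _, h, w = target.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack((xs, ys, torch.ones_like(xs)), dim=0).reshape(3, -1)   # homogeneous pixels
    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)                    # back-project to 3D
    cam_h = torch.cat((cam, torch.ones(1, h * w)), dim=0)                     # homogeneous 3D points
    proj = K @ (pose @ cam_h)[:3]                                             # transform and project
    uv = proj[:2] / proj[2].clamp(min=1e-6)
    grid = torch.stack((2 * uv[0] / (w - 1) - 1, 2 * uv[1] / (h - 1) - 1), dim=-1)
    warped = F.grid_sample(source, grid.reshape(1, h, w, 2),
                           mode="bilinear", padding_mode="border", align_corners=True)
    return (warped - target).abs().mean()                                     # L1 photometric error

# With identity pose and unit depth, warping a frame onto itself gives ~zero loss.
frame = torch.rand(1, 3, 32, 32)
print(photometric_loss(frame, frame, torch.ones(1, 1, 32, 32), torch.eye(4),
                       torch.tensor([[30.0, 0.0, 16.0], [0.0, 30.0, 16.0], [0.0, 0.0, 1.0]])))
```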

    Survey on video anomaly detection in dynamic scenes with moving cameras

    Full text link
    The increasing popularity of compact and inexpensive cameras, e.g., dash cameras, body cameras, and cameras equipped on robots, has sparked a growing interest in detecting anomalies within dynamic scenes recorded by moving cameras. However, existing reviews primarily concentrate on Video Anomaly Detection (VAD) methods assuming static cameras. The VAD literature with moving cameras remains fragmented, lacking comprehensive reviews to date. To address this gap, we endeavor to present the first comprehensive survey on Moving Camera Video Anomaly Detection (MC-VAD). We delve into the research papers related to MC-VAD, critically assessing their limitations and highlighting associated challenges. Our exploration encompasses three application domains: security, urban transportation, and marine environments, which in turn cover six specific tasks. We compile an extensive list of 25 publicly available datasets spanning four distinct environments: underwater, water surface, ground, and aerial. We summarize the types of anomalies these datasets correspond to or contain, and present five main categories of approaches for detecting such anomalies. Lastly, we identify future research directions and discuss novel contributions that could advance the field of MC-VAD. With this survey, we aim to offer a valuable reference for researchers and practitioners striving to develop and advance state-of-the-art MC-VAD methods.