7 research outputs found

    RPNet: an End-to-End Network for Relative Camera Pose Estimation

    Full text link
    This paper addresses the task of relative camera pose estimation from raw image pixels by means of deep neural networks. The proposed RPNet network takes pairs of images as input and directly infers the relative poses, without the need for camera intrinsics or extrinsics. While state-of-the-art systems based on SIFT + RANSAC are able to recover the translation vector only up to scale, RPNet is trained to produce the full translation vector in an end-to-end way. Experimental results on the Cambridge Landmark dataset show very promising results regarding the recovery of the full translation vector. They also show that RPNet produces more accurate and more stable results than traditional approaches, especially for hard images (repetitive textures, textureless images, etc.). To the best of our knowledge, RPNet is the first attempt to recover full translation vectors in relative pose estimation.
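
    The abstract does not detail RPNet's architecture, so the following is only a minimal PyTorch-style sketch of the kind of siamese regression network it describes: a shared CNN encoder applied to both images, followed by a small head that regresses a metric translation vector and a rotation quaternion. The ResNet-18 backbone, layer sizes, quaternion parameterisation, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RelativePoseNet(nn.Module):
    """Siamese CNN that regresses a relative pose (t, q) from an image pair.

    Backbone choice, feature sizes, and the quaternion parameterisation are
    illustrative assumptions; the paper's RPNet may differ.
    """

    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Drop the classification head, keep the 512-d global feature.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.head = nn.Sequential(
            nn.Linear(2 * 512, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 7),  # 3 for translation, 4 for a unit quaternion
        )

    def forward(self, img_a, img_b):
        fa = self.encoder(img_a).flatten(1)   # shared weights for both images
        fb = self.encoder(img_b).flatten(1)
        out = self.head(torch.cat([fa, fb], dim=1))
        t, q = out[:, :3], out[:, 3:]
        q = q / q.norm(dim=1, keepdim=True)   # normalise to a unit quaternion
        return t, q

def pose_loss(t_pred, q_pred, t_gt, q_gt, beta=1.0):
    # The full (metric) translation is supervised directly, unlike up-to-scale methods.
    return nn.functional.mse_loss(t_pred, t_gt) + beta * nn.functional.mse_loss(q_pred, q_gt)
```

    Supervising the translation with its full metric value, as in the loss above, is what separates this setup from SIFT + RANSAC pipelines, which recover translation only up to scale.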

    DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos

    Full text link
    This paper presents DriveTrack, a new benchmark and data generation framework for long-range keypoint tracking in real-world videos. DriveTrack is motivated by the observation that the accuracy of state-of-the-art trackers depends strongly on visual attributes around the selected keypoints, such as texture and lighting. The problem is that these effects are especially pronounced in real-world videos, yet trackers are unable to train on such scenes due to a dearth of annotations. DriveTrack bridges this gap by building a framework to automatically annotate point tracks on autonomous driving datasets. We release a dataset consisting of 1 billion point tracks across 24 hours of video, which is seven orders of magnitude greater than prior real-world benchmarks and on par with the scale of synthetic benchmarks. DriveTrack unlocks new use cases for point tracking in real-world videos. First, we show that fine-tuning keypoint trackers on DriveTrack improves accuracy on real-world scenes by up to 7%. Second, we analyze the sensitivity of trackers to visual artifacts in real scenes and motivate the idea of running assistive keypoint selectors alongside trackers. Comment: 16 pages, 13 figures, 5 tables.
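
    As an illustration of how such annotations are typically consumed, here is a small, hypothetical scoring helper in Python. The (T, 2) point array, the visibility flags, and the threshold-averaged position-accuracy metric are assumptions in the spirit of common long-range point-tracking benchmarks, not DriveTrack's actual data format or official metric.

```python
import numpy as np

# Hypothetical record layout for one annotated track:
#   points:  (T, 2) pixel positions over T frames
#   visible: (T,)   boolean visibility flags
# The real DriveTrack format may differ; this only illustrates how
# long-range tracks are commonly scored.

def position_accuracy(pred_points, gt_points, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible ground-truth points predicted within each pixel
    threshold, averaged over the thresholds."""
    err = np.linalg.norm(pred_points - gt_points, axis=-1)
    err = err[gt_visible]                       # score only visible points
    return float(np.mean([(err < t).mean() for t in thresholds]))

# Usage with toy data: a 10-second track at 24 fps and a noisy "tracker" output.
gt = np.cumsum(np.random.randn(240, 2), axis=0)
pred = gt + np.random.randn(240, 2) * 3.0
vis = np.ones(240, dtype=bool)
print(position_accuracy(pred, gt, vis))
```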

    Precise pose estimation of the NASA Mars 2020 Perseverance rover through a stereo-vision-based approach

    Get PDF
    Visual Odometry (VO) is a fundamental technique to enhance the navigation capabilities of planetary exploration rovers. By processing the images acquired during the motion, VO methods provide estimates of the relative position and attitude between navigation steps through the detection and tracking of two-dimensional (2D) image keypoints. This mitigates the trajectory inconsistencies that dead-reckoning techniques incur under slippage conditions. We present here an independent analysis of the high-resolution stereo images of the NASA Mars 2020 Perseverance rover to retrieve its accurate localization on sols 65, 66, 72, and 120. The stereo pairs are processed by using a 3D-to-3D stereo-VO approach that is based on well-established techniques and accounts for the main nonlinear optical effects characterizing real cameras. The algorithm is first validated through the analysis of rectified stereo images acquired by the NASA Mars Exploration Rover Opportunity, and then applied to the determination of Perseverance's path. The results suggest that our reconstructed path is consistent with the telemetered trajectory, which was retrieved directly from the rover's onboard system. The estimated pose is in full agreement with the rover's archived position and attitude after short navigation steps. Significant differences (~10–30 cm) between our reconstructed and telemetered trajectories are observed when Perseverance traveled distances larger than 1 m between the acquisition of stereo pairs.
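
    A minimal numpy sketch of the 3D-to-3D step referred to above is given below: matched keypoints in a rectified stereo pair are back-projected to 3D via their disparity, and the rigid motion between two such point clouds is solved in closed form (Kabsch/SVD). The equal-focal-length assumption, the absence of distortion handling, and the lack of outlier rejection are simplifications; a real pipeline such as the one described must model the cameras' nonlinear optical effects and reject mismatches (e.g. with RANSAC).

```python
import numpy as np

def triangulate(pts_left, pts_right, fx, baseline, cx, cy):
    """Back-project matched left/right pixel coordinates of a rectified stereo
    pair into 3D camera-frame points using the disparity (assumes fx == fy)."""
    disparity = pts_left[:, 0] - pts_right[:, 0]
    z = fx * baseline / disparity
    x = (pts_left[:, 0] - cx) * z / fx
    y = (pts_left[:, 1] - cy) * z / fx
    return np.stack([x, y, z], axis=1)

def rigid_transform_3d(P, Q):
    """Least-squares rotation R and translation t such that Q ~ R @ P + t
    (Kabsch algorithm via SVD)."""
    cp, cq = P.mean(0), Q.mean(0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the least-squares solution.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = cq - R @ cp
    return R, t
```

    Applying rigid_transform_3d to the 3D points triangulated before and after a drive step yields the relative rotation and translation between navigation steps; chaining these estimates reconstructs the rover's path.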

    Scene understanding through semantic image segmentation in augmented reality

    Get PDF
    Abstract. Semantic image segmentation, the task of assigning a label to each pixel in an image, is a major challenge in the field of computer vision. Semantic image segmentation using fully convolutional neural networks (FCNNs) offers an online solution to scene understanding, with a simple training procedure and fast inference speed if designed efficiently. The semantic information provided by semantic segmentation is a detailed understanding of the current context, and this scene understanding is vital for scene modification in augmented reality (AR), especially if one aims to perform destructive scene augmentation. Augmented reality systems, by nature, aim to modify the context in real time through head-mounted see-through or video see-through displays, and thus require efficiency at each step. Although there are many solutions to semantic image segmentation in the literature, such as DeeplabV3+ and Deeplab DPC, they fail to offer low-latency inference because of the complex architectures they adopt in pursuit of the best accuracy. As part of this thesis work, we provide an efficient architecture for semantic image segmentation using an FCNN model and achieve real-time performance on smartphones: 19.65 frames per second (fps) with a mean intersection over union (mIOU) of 67.7% on the Cityscapes validation set with our "Basic" variant, and 15.41 fps with 70.3% mIOU on the Cityscapes test set with our "DPC" variant. The implementation is open-sourced and compatible with TensorFlow Lite, and is thus able to run on embedded and mobile devices. Furthermore, the thesis work demonstrates an augmented reality implementation in which semantic segmentation masks are tracked online in a 3D environment using Google ARCore. We show that frequent recalculation of semantic information is not necessary: by tracking the computed semantic information in 3D space using the visual-inertial odometry provided by the ARCore framework, we can achieve savings in battery and CPU usage while maintaining a high mIOU. We further demonstrate a possible use case of the system by inpainting, in 3D space, the objects found by the semantic image segmentation network. The implemented Android application performs real-time augmented reality at 30 fps while running, in parallel, the computationally efficient network proposed as part of this thesis work.
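
    Since the thesis exports its network to TensorFlow Lite, a minimal sketch of on-device inference might look like the following. The model filename, the input normalisation, the float NHWC input layout, and the per-pixel argmax decoding are assumptions for illustration, not the thesis' actual export settings.

```python
import numpy as np
import tensorflow as tf

# "segmenter.tflite" is a placeholder path; the exported model from the thesis
# may use a different name, input size, and quantisation scheme.
interpreter = tf.lite.Interpreter(model_path="segmenter.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def segment(frame_rgb):
    """Run one RGB frame through the TFLite model and return a per-pixel class map."""
    x = tf.image.resize(frame_rgb, inp["shape"][1:3]) / 255.0     # resize + normalise
    interpreter.set_tensor(inp["index"], x[tf.newaxis, ...].numpy().astype(np.float32))
    interpreter.invoke()
    logits = interpreter.get_tensor(out["index"])[0]              # (H, W, num_classes)
    return np.argmax(logits, axis=-1)                             # (H, W) label map
```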

    A deep learning approach for automatically generating descriptions of images containing people

    Get PDF
    Generating image descriptions is a challenging Artificial Intelligence problem with many interesting applications, such as robot communication or helping visually impaired people. However, it is a complex task for computers: it requires Computer Vision algorithms to understand what the image depicts, and Natural Language Processing algorithms to generate a well-formed sentence. Nowadays, deep neural networks are the state of the art in these two Artificial Intelligence fields. Furthermore, we believe that images that contain people are described in a slightly different manner, and that restricting an image description generator to such images may produce better descriptions. Therefore, the main objective of this project is to develop a Deep Learning model that automatically produces descriptions of images containing people, and to determine whether restricting the model to this kind of image is good practice. For this purpose, we have reviewed and studied the literature in the field, and we have built, trained, and compared four different models using Deep Learning techniques and a GPU to speed up the computation, as well as a large and complete dataset.
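
    The abstract does not describe the four models' architectures, so as a purely generic illustration of the encoder-decoder approach commonly used for this task, here is a small Keras sketch of a "merge"-style captioner (CNN image encoder plus LSTM language model). The InceptionV3 backbone, all layer sizes, and the vocabulary settings are hypothetical.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10_000, 30, 256   # illustrative values only

# Image encoder: a pretrained CNN reduced to one feature vector per image.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
cnn.trainable = False
image_in = layers.Input(shape=(299, 299, 3))
img_feat = layers.Dense(EMBED_DIM, activation="relu")(cnn(image_in))

# Language model: an LSTM over the partial caption generated so far.
caption_in = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(caption_in)
lstm_out = layers.LSTM(EMBED_DIM)(emb)

# "Merge" decoder: combine both representations and predict the next word.
merged = layers.add([img_feat, lstm_out])
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(
    layers.Dense(EMBED_DIM, activation="relu")(merged))

model = tf.keras.Model([image_in, caption_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```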

    A domain-extensible compiler with controllable automation of optimisations

    Get PDF
    In high-performance domains like image processing, physics simulation, or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain-specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain extensibility and controllable automation, and to generate high-performance code. Domain extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine applies 6 well-known image processing optimisations, 2 of which are not supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation. The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations to matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB of RAM.
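
    To convey the idea of a sketch as "a program pattern that leaves details unspecified", here is a toy Python illustration: terms are nested tuples, a sketch uses "?" holes, and a plain breadth-first rewrite search stops as soon as some rewriting of the input fits the sketch. This is deliberately not equality saturation or e-graphs as used by Shine; it only shows how a goal pattern can guide rewriting.

```python
from collections import deque

# Terms are nested tuples, e.g. ("+", ("*", "A", "B"), ("*", "A", "C")).
# Rule variables are single lowercase letters; "?" in a sketch matches anything.

def match(pat, term, env):
    """Match a rewrite-rule pattern against a term, binding lowercase variables."""
    if isinstance(pat, str) and len(pat) == 1 and pat.islower():
        if pat in env and env[pat] != term:
            return None
        return {**env, pat: term}
    if isinstance(pat, tuple) and isinstance(term, tuple) and len(pat) == len(term):
        for p, t in zip(pat, term):
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return env if pat == term else None

def subst(pat, env):
    """Instantiate a rule's right-hand side with the bindings found by match."""
    if isinstance(pat, tuple):
        return tuple(subst(p, env) for p in pat)
    return env.get(pat, pat)

def rewrites(term, rules):
    """Yield every term reachable by applying one rule at one position."""
    for lhs, rhs in rules:
        env = match(lhs, term, {})
        if env is not None:
            yield subst(rhs, env)
    if isinstance(term, tuple):
        for i, sub in enumerate(term):
            for new_sub in rewrites(sub, rules):
                yield term[:i] + (new_sub,) + term[i + 1:]

def fits_sketch(sketch, term):
    """A sketch is a goal pattern whose '?' holes leave details unspecified."""
    if sketch == "?":
        return True
    if isinstance(sketch, tuple) and isinstance(term, tuple) and len(sketch) == len(term):
        return all(fits_sketch(s, t) for s, t in zip(sketch, term))
    return sketch == term

def guided_search(start, rules, sketch, limit=10_000):
    """Breadth-first rewriting that stops at the first term fitting the sketch."""
    seen, queue = {start}, deque([start])
    while queue and len(seen) < limit:
        term = queue.popleft()
        if fits_sketch(sketch, term):
            return term
        for nxt in rewrites(term, rules):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None

# One rule: factor out a common multiplicand, (x*y) + (x*z) -> x * (y + z).
rules = [(("+", ("*", "x", "y"), ("*", "x", "z")), ("*", "x", ("+", "y", "z")))]
start = ("+", ("*", "A", "B"), ("*", "A", "C"))
sketch = ("*", "A", "?")   # goal: "A is factored out; the rest is unspecified"
print(guided_search(start, rules, sketch))   # ('*', 'A', ('+', 'B', 'C'))
```

    In the thesis, a sketch plays the analogous role of naming the shape the rewritten program should reach, so the performance engineer states the goal of an optimisation without spelling out every rewrite step.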