586 research outputs found

    Probabilistic RGB-D Odometry based on Points, Lines and Planes Under Depth Uncertainty

    This work proposes a robust visual odometry method for structured environments that combines point features with line and plane segments extracted with an RGB-D camera. Noisy depth maps are processed by a probabilistic depth fusion framework based on Mixtures of Gaussians, which denoises the depth and derives its uncertainty; that uncertainty is then propagated throughout the visual odometry pipeline. Probabilistic 3D plane and line fitting solutions are used to model the uncertainties of the feature parameters, and pose is estimated by combining the three types of primitives based on their uncertainties. Performance evaluation on RGB-D sequences collected in this work and on two public RGB-D datasets (TUM and ICL-NUIM) shows the benefit of using the proposed depth fusion framework and of combining the three feature types, particularly in scenes with low-textured surfaces, dynamic objects and missing depth measurements. Comment: Major update: more results, depth filter released as open source, 34 pages

    Video matching using DC-image and local features

    This paper presents a framework for video matching based on local features extracted from the DC-image of MPEG compressed videos, without decompression. The relevant arguments and supporting evidence are discussed for developing video similarity techniques that work directly on compressed videos, without decompression, and in particular on small images. Two experiments are carried out to support this. The first compares the DC-image with the I-frame in terms of matching performance and the corresponding computational complexity. The second compares local features with global features in video matching, especially in the compressed domain and with small images. The results confirm that the DC-image, despite its highly reduced size, is promising: it produces matching precision at least similar to (if not better than) that of the full I-frame. In addition, SIFT, as a local feature, outperforms most of the standard global features in precision. Its computational complexity is relatively higher, but it remains within the real-time margin, and various optimisations could reduce this complexity further.
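As a rough illustration of why the DC-image is so small: the DC coefficient of each 8x8 DCT block is proportional to the mean of that block, so a DC-image can be approximated by block averaging. The sketch below works on an already decoded grayscale frame, which is an assumption made for readability; a compressed-domain system such as the one described would read the DC coefficients from the bitstream instead. The function name `dc_image` is hypothetical.

```python
def dc_image(frame, block=8):
    """Approximate a DC-image: one value per block, equal to the block mean.

    `frame` is a 2D list of pixel intensities (rows of equal length).
    Partial blocks at the right and bottom edges are ignored.
    """
    rows = len(frame) // block
    cols = len(frame[0]) // block
    out = []
    for by in range(rows):
        out_row = []
        for bx in range(cols):
            total = 0
            # Sum every pixel inside the current block.
            for y in range(by * block, (by + 1) * block):
                for x in range(bx * block, (bx + 1) * block):
                    total += frame[y][x]
            out_row.append(total / (block * block))
        out.append(out_row)
    return out
```

With 8x8 blocks, each dimension shrinks by a factor of 8, so the result has 1/64 of the original pixel count.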

    Direct Monocular Odometry Using Points and Lines

    Most visual odometry algorithms for a monocular camera focus on points, either by feature matching or by direct alignment of pixel intensity, while ignoring a common but important geometric entity: edges. In this paper, we propose an odometry algorithm that combines points and edges to benefit from the advantages of both direct and feature-based methods. It works better in texture-less environments and is also more robust to lighting changes and fast motion because it increases the convergence basin. We maintain a depth map for the keyframe; in the tracking part, the camera pose is then recovered by minimizing both the photometric error and the geometric error to the matched edge in a probabilistic framework. In the mapping part, edges are used to speed up stereo matching and increase its accuracy. On various public datasets, our algorithm achieves performance better than or comparable to state-of-the-art monocular odometry methods. In some challenging texture-less environments, our algorithm reduces the state estimation error by over 50%. Comment: ICRA 201

    Real-Time Accurate Visual SLAM with Place Recognition

    The Simultaneous Localization and Mapping (SLAM) problem consists of localizing a sensor in a map that is built online. SLAM technology makes it possible to localize a robot in an environment unknown to it by processing the information from its on-board sensors, and therefore without depending on external infrastructure. A map allows localization at all times without accumulating drift, unlike odometry, where incremental motions are integrated. This type of technology is critical for the navigation of service robots and autonomous vehicles, and for localizing the user in augmented or virtual reality applications. The main contribution of this thesis is ORB-SLAM, a feature-based monocular SLAM system that works in real time in small and large, indoor and outdoor environments. The system is robust to dynamic elements in the scene, can close loops and relocalize the camera even when the viewpoint has changed significantly, and includes a fully automatic initialization method. ORB-SLAM is currently the most complete, accurate and reliable monocular SLAM solution that uses a camera as the only sensor. Being based on features and bundle adjustment, the system has demonstrated unprecedented accuracy and robustness on standard public sequences. In addition, ORB-SLAM has been extended to reconstruct the environment semi-densely. Our solution decouples the semi-dense reconstruction from the estimation of the camera trajectory, resulting in a system that combines the accuracy and robustness of feature-based SLAM with the more complete reconstructions of direct methods. The monocular solution has also been extended to exploit information from stereo cameras, RGB-D cameras and inertial sensors, achieving higher accuracy than other state-of-the-art solutions.
    In order to contribute to the scientific community, we have released as open source an implementation of our SLAM solution for monocular, stereo and RGB-D cameras, which is the first open-source solution able to work with these three types of camera. Bibliography: R. Mur-Artal and J. D. Tardós. Fast Relocalisation and Loop Closing in Keyframe-Based SLAM. IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China, June 2014. R. Mur-Artal and J. D. Tardós. ORB-SLAM: Tracking and Mapping Recognizable Features. RSS Workshop on Multi VIew Geometry in RObotics (MVIGRO). Berkeley, USA, July 2014. R. Mur-Artal and J. D. Tardós. Probabilistic Semi-Dense Mapping from Highly Accurate Feature-Based Monocular SLAM. Robotics: Science and Systems (RSS). Rome, Italy, July 2015. R. Mur-Artal, J. M. M. Montiel and J. D. Tardós. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147-1163, October 2015 (2015 IEEE Transactions on Robotics Best Paper Award). R. Mur-Artal and J. D. Tardós. Visual-Inertial Monocular SLAM with Map Reuse. IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796-803, April 2017 (to be presented at ICRA 17). R. Mur-Artal and J. D. Tardós. ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras. ArXiv preprint arXiv:1610.06475, 2016 (under review).

    Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos

    When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modalities of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.

    Extracting textual overlays from social media videos using neural networks

    Textual overlays are often used in social media videos, as people who watch them without sound would otherwise miss essential information conveyed in the audio stream. Extracting those overlays can therefore serve as an important source of metadata, e.g. for content classification or retrieval tasks. In this work, we present a robust method for extracting textual overlays from videos that builds upon multiple neural network architectures. The proposed solution relies on several processing steps: keyframe extraction, text detection and text recognition. The main component of our system, the text recognition module, is inspired by a convolutional recurrent neural network architecture, and we improve its performance using a synthetically generated dataset of over 600,000 images with text, prepared by the authors specifically for this task. We also develop a filtering method that uses Levenshtein distance to reduce the number of overlapping text phrases and further boosts the system's performance. The final accuracy of our solution reaches over 80% and is on par with state-of-the-art methods. Comment: International Conference on Computer Vision and Graphics (ICCVG) 201
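The abstract does not give the details of the Levenshtein-based filter, so the following is only a minimal sketch of one way such a filter could work: a phrase is dropped when its edit distance to an already kept phrase is small relative to the longer of the two. The function names and the `max_ratio` threshold are hypothetical, not taken from the paper.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def filter_overlapping(phrases, max_ratio=0.2):
    """Keep a phrase only if it is not a near-duplicate of one already kept.

    max_ratio is an assumed threshold on distance / longer length.
    """
    kept = []
    for phrase in phrases:
        is_duplicate = False
        for other in kept:
            longest = max(len(phrase), len(other), 1)
            if levenshtein(phrase, other) / longest <= max_ratio:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(phrase)
    return kept
```

Normalising by length keeps a single misread character in a long overlay from being treated the same as one in a three-letter word.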

    Edited nearest neighbour for selecting keyframe summaries of egocentric videos

    A keyframe summary of a video must be concise, comprehensive and diverse. Current video summarisation methods may not be able to enforce diversity of the summary if the events have highly similar visual content, as is the case for egocentric videos. We cast the problem of selecting a keyframe summary as a problem of prototype (instance) selection for the nearest neighbour classifier (1-nn). Assuming that the video is already segmented into events of interest (classes) and represented as a dataset in some feature space, we propose a Greedy Tabu Selector algorithm (GTS) which picks one frame to represent each class. An experiment with the UT (Egocentric) video database and seven feature representations illustrates the proposed keyframe summarisation method. GTS leads to an improved match to the user ground truth compared to the closest-to-centroid baseline summarisation method. The best results were obtained with feature spaces derived from a convolutional neural network (CNN). Funding: Leverhulme Trust, UK (RPG-2015-188); Sao Paulo Research Foundation - FAPESP (2016/06441-7). Affiliations: Bangor Univ, Sch Comp Sci, Dean St, Bangor LL57 1UT, Gwynedd, Wales; Fed Univ Sao Paulo UNIFESP, Inst Sci & Technol, BR-12247014 Sao Jose Dos Campos, SP, Brazil. Source: Web of Science
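For orientation, the closest-to-centroid baseline mentioned above can be sketched as follows. This is the baseline only, not the Greedy Tabu Selector, whose details the abstract does not give. The sketch assumes each frame is a plain list of numeric features and each frame carries an event (class) label; the function name is hypothetical.

```python
def closest_to_centroid(features, labels):
    """Return {label: frame index}, picking the frame nearest its class mean.

    features: list of equal-length numeric feature vectors, one per frame.
    labels:   list of event labels, one per frame.
    """
    # Group frame indices by event label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    summary = {}
    for label, indices in by_class.items():
        dim = len(features[indices[0]])
        centroid = [sum(features[i][d] for i in indices) / len(indices)
                    for d in range(dim)]

        def sq_dist(i):
            # Squared Euclidean distance from frame i to the class centroid.
            return sum((features[i][d] - centroid[d]) ** 2 for d in range(dim))

        summary[label] = min(indices, key=sq_dist)
    return summary
```

Because each class is summarised independently, two visually similar events can end up with nearly identical keyframes, which is the diversity problem the abstract raises for egocentric video.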