
    End-to-End Learning of Representations for Asynchronous Event-Based Data

    Event cameras are vision sensors that record asynchronous streams of per-pixel brightness changes, referred to as "events". They have appealing advantages over frame-based cameras for computer vision, including high temporal resolution, high dynamic range, and no motion blur. Due to the sparse, non-uniform spatiotemporal layout of the event signal, pattern-recognition algorithms typically aggregate events into a grid-based representation and subsequently process it with a standard vision pipeline, e.g., a Convolutional Neural Network (CNN). In this work, we introduce a general framework to convert event streams into grid-based representations through a sequence of differentiable operations. Our framework has two main advantages: (i) it allows learning the input event representation together with the task-dedicated network in an end-to-end manner, and (ii) it lays out a taxonomy that unifies the majority of extant event representations in the literature and identifies novel ones. Empirically, we show that learning the event representation end-to-end yields an improvement of approximately 12% on optical flow estimation and object recognition over state-of-the-art methods. Comment: To appear at ICCV 2019.
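    The grid conversion described above can be illustrated with a small sketch: events given as arrays of pixel coordinates and polarities are scattered into a two-channel per-pixel histogram, one of the simplest members of the taxonomy of grid-based representations. This is a hedged NumPy illustration, not the paper's learned differentiable kernels; the function name and array shapes are assumptions.

    import numpy as np

    def event_histogram(xs, ys, ps, height, width):
        """Accumulate events into a 2-D grid with one channel per polarity.

        xs, ys : integer pixel coordinates of the events
        ps     : polarities in {-1, +1}
        Returns an array of shape (2, height, width) holding per-pixel event counts.
        """
        grid = np.zeros((2, height, width), dtype=np.float32)
        pol = (ps > 0).astype(np.int64)        # channel 0 = negative, 1 = positive
        np.add.at(grid, (pol, ys, xs), 1.0)    # scatter-add each event into its cell
        return grid

    # toy usage with three synthetic events
    xs = np.array([3, 3, 10]); ys = np.array([5, 5, 2]); ps = np.array([1, -1, 1])
    frame = event_histogram(xs, ys, ps, height=16, width=16)
    print(frame.sum())  # 3.0 -- one count per event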

    Unsupervised Event-based Learning of Optical Flow, Depth, and Egomotion

    In this work, we propose a novel framework for unsupervised learning with event cameras that learns motion information from only the event stream. In particular, we propose an input representation of the events in the form of a discretized volume that maintains the temporal distribution of the events, which we pass through a neural network to predict the motion of the events. This motion is used to attempt to remove any motion blur in the event image. We then propose a loss function, applied to the motion-compensated event image, that measures the remaining motion blur in this image. We train two networks with this framework, one to predict optical flow and one to predict egomotion and depth, and evaluate these networks on the Multi Vehicle Stereo Event Camera dataset, along with qualitative results from a variety of different scenes. Comment: 9 pages, 7 figures.
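    The discretized volume mentioned above can be sketched as follows: events are binned over a fixed number of temporal slices, and each event's polarity is split linearly between its two nearest bins so that the temporal distribution of the stream is preserved. This is an illustrative NumPy sketch under those assumptions; the normalization and interpolation details of the paper's representation may differ.

    import numpy as np

    def event_volume(xs, ys, ts, ps, num_bins, height, width):
        """Discretize events into a (num_bins, height, width) spatio-temporal volume."""
        vol = np.zeros((num_bins, height, width), dtype=np.float32)
        # normalize timestamps to [0, num_bins - 1]
        t_norm = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9) * (num_bins - 1)
        lo = np.floor(t_norm).astype(np.int64)
        frac = t_norm - lo
        # split each event's polarity between the two nearest temporal bins
        np.add.at(vol, (lo, ys, xs), ps * (1.0 - frac))
        hi = np.clip(lo + 1, 0, num_bins - 1)
        np.add.at(vol, (hi, ys, xs), ps * frac)
        return vol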

    DSEC: A Stereo Event Camera Dataset for Driving Scenarios

    Once an academic venture, autonomous driving has received unparalleled corporate funding in the last decade. Still, the operating conditions of current autonomous cars are mostly restricted to ideal scenarios. This means that driving in challenging illumination conditions such as night, sunrise, and sunset remains an open problem. In these cases, standard cameras are being pushed to their limits in terms of low-light and high-dynamic-range performance. To address these challenges, we propose DSEC, a new dataset that contains such demanding illumination conditions and provides a rich set of sensory data. DSEC offers data from a wide-baseline stereo setup of two color frame cameras and two high-resolution monochrome event cameras. In addition, we collect lidar data and RTK GPS measurements, both hardware-synchronized with all camera data. One of the distinctive features of this dataset is the inclusion of high-resolution event cameras. Event cameras have received increasing attention for their high temporal resolution and high dynamic range performance. However, due to their novelty, event camera datasets in driving scenarios are rare. This work presents the first high-resolution, large-scale stereo dataset with event cameras. The dataset contains 53 sequences collected by driving in a variety of illumination conditions and provides ground-truth disparity for the development and evaluation of event-based stereo algorithms. Comment: IEEE Robotics and Automation Letters.
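    Because the dataset ships ground-truth disparity from a calibrated stereo rig, a common first step when using it is converting disparity to metric depth via Z = f * B / d. The sketch below shows only that conversion; the focal length and baseline values are placeholders, not DSEC's actual calibration.

    import numpy as np

    def disparity_to_depth(disparity, focal_px, baseline_m):
        """Convert a disparity map (in pixels) to metric depth: Z = f * B / d.

        Pixels with zero (invalid) disparity are mapped to infinity.
        """
        depth = np.full(disparity.shape, np.inf, dtype=np.float64)
        valid = disparity > 0
        depth[valid] = focal_px * baseline_m / disparity[valid]
        return depth

    # placeholder calibration values, not taken from DSEC
    d = np.array([[0.0, 8.0], [16.0, 32.0]])
    print(disparity_to_depth(d, focal_px=560.0, baseline_m=0.6))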

    Motion and Depth Estimation for Event-Frame Cameras

    Doctoral dissertation -- Seoul National University, College of Engineering, Department of Mechanical and Aerospace Engineering, February 2023. Advisor: H. Jin Kim. Event cameras can stably measure visual information in high-dynamic-range and high-speed environments that are challenging for conventional cameras. However, conventional vision algorithms cannot be directly applied to event data because of its frameless and asynchronous nature. Over the past several years, various applications of event cameras have been studied, such as motion and depth estimation, image reconstruction with high temporal resolution, and object segmentation. Here, I propose a rotational motion estimation method based on contrast maximization for high-speed motion environments. The proposed method runs in real time and handles drift error accumulation, which existing contrast maximization methods have not addressed. However, it is still difficult for event cameras to replace frame cameras in non-challenging, normal scenarios. To leverage the advantages of both event and frame cameras, I study a heterogeneous stereo camera system that employs an event camera and a frame camera together. The proposed system estimates semi-dense disparity in real time by matching the heterogeneous data of the event and frame cameras in stereo. I propose an accurate, intuitive, and efficient way to align events with 6-DOF camera motion through the maximum shift distance method. The aligned event image shows high similarity to the edge image of the frame camera. The proposed depth estimation method runs in real time and can estimate the pose of the event camera and the depth of events within a few frames, which speeds up initialization of the event camera system. Additionally, I propose feature tracking and pose estimation methods that can operate in the hetero-stereo camera when the frame camera fails. The code is released publicly on my project page, and I hope it contributes to the event camera community: https://haram-kim.github.io
    Contents:
    Chapter 1 Introduction (1.1 Literature Survey, 1.2 Motivation, 1.3 Contribution and Outline)
    Chapter 2 Background (2.1 Rigid Body Motion, 2.2 Rectification, 2.3 Non-linear Optimization)
    Chapter 3 Real-time Rotational Motion Estimation with Contrast Maximization over Globally Aligned Events (3.1 Method, 3.2 Experimental Results, 3.3 Summary)
    Chapter 4 Real-time Hetero-Stereo Matching for Event and Frame Camera with Aligned Events Using Maximum Shift Distance (4.1 Hetero Stereo Matching, 4.2 Experimental Results, 4.3 Summary)
    Chapter 5 Feature Tracking and Pose Estimation for Hetero-Stereo Camera (5.1 Feature Tracking, 5.2 Pose Estimation, 5.3 Future Work)
    Chapter 6 Conclusion
    Appendix A Detailed Derivation of Contrast for Rotational Motion Estimation
    References
    Abstract (in Korean)
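    A minimal sketch of the contrast-maximization idea behind the rotational motion estimation chapter: a candidate motion warps events back to a common time, the warped events are accumulated into an image, and the motion yielding the sharpest (highest-variance) image is selected. For brevity the sketch assumes a single in-plane rotation about the image center and a grid search, whereas the dissertation estimates full 3-D rotational motion in real time and additionally handles drift.

    import numpy as np

    def contrast(xs, ys, ts, omega, height, width):
        """Variance of the image of events rotated back to the first timestamp
        by an in-plane angular velocity omega (rad/s) about the image center."""
        cx, cy = width / 2.0, height / 2.0
        ang = -omega * (ts - ts[0])                  # undo the rotation accumulated since t0
        c, s = np.cos(ang), np.sin(ang)
        xw = c * (xs - cx) - s * (ys - cy) + cx
        yw = s * (xs - cx) + c * (ys - cy) + cy
        img = np.zeros((height, width), dtype=np.float32)
        xi = np.clip(np.round(xw).astype(int), 0, width - 1)
        yi = np.clip(np.round(yw).astype(int), 0, height - 1)
        np.add.at(img, (yi, xi), 1.0)
        return float(img.var())                      # sharper image -> higher contrast

    def estimate_omega(xs, ys, ts, height, width, candidates=np.linspace(-5.0, 5.0, 101)):
        """Toy optimizer: pick the angular velocity that maximizes contrast."""
        scores = [contrast(xs, ys, ts, w, height, width) for w in candidates]
        return candidates[int(np.argmax(scores))]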

    Event-Based Algorithms For Geometric Computer Vision

    Event cameras are novel bio-inspired sensors which mimic the function of the human retina. Rather than directly capturing intensities to form synchronous images as in traditional cameras, event cameras asynchronously detect changes in log image intensity. When such a change is detected at a given pixel, it is immediately sent to the host computer; each event consists of the x, y pixel position of the change, a timestamp accurate to tens of microseconds, and a polarity indicating whether the pixel got brighter or darker. These cameras provide a number of useful benefits over traditional cameras, including the ability to track extremely fast motions, high dynamic range, and low power consumption. However, with a new sensing modality comes the need to develop novel algorithms. As these cameras do not capture photometric intensities, novel loss functions must be developed to replace the photoconsistency assumption which serves as the backbone of many classical computer vision algorithms. In addition, the relative novelty of these sensors means that there does not exist the wealth of data available for traditional images with which we can train learning-based methods such as deep neural networks. In this work, we address both of these issues with two foundational principles. First, we show that the motion blur induced when the events are projected into the 2D image plane can be used as a suitable substitute for the classical photometric loss function. Second, we develop self-supervised learning methods which allow us to train convolutional neural networks to estimate motion without any labeled training data. We apply these principles to solve classical perception problems such as feature tracking, visual inertial odometry, optical flow and stereo depth estimation, as well as recognition tasks such as object detection and human pose estimation. We show that these solutions are able to utilize the benefits of event cameras, allowing us to operate in fast-moving scenes with challenging lighting which would be incredibly difficult for traditional cameras.
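    One way to turn the "motion blur of the projected events" into a scalar objective, in the spirit of the self-supervised losses described above, is an average-timestamp penalty on the motion-compensated event image: events are warped by a candidate per-event flow, and the squared per-pixel average timestamps are summed, so the score drops when events generated by the same edge collapse onto fewer pixels. This is an illustrative simplification under those assumptions, not necessarily the exact loss used in the dissertation; the function name and constant-rate warp are hypothetical.

    import numpy as np

    def timestamp_blur_loss(xs, ys, ts, flow_u, flow_v, height, width):
        """Sum of squared per-pixel average timestamps of flow-warped events."""
        t = ts - ts.min()                            # time since the first event
        xw = np.clip(np.round(xs - flow_u * t).astype(int), 0, width - 1)
        yw = np.clip(np.round(ys - flow_v * t).astype(int), 0, height - 1)
        t_sum = np.zeros((height, width)); counts = np.zeros((height, width))
        np.add.at(t_sum, (yw, xw), t)
        np.add.at(counts, (yw, xw), 1.0)
        avg_t = np.where(counts > 0, t_sum / np.maximum(counts, 1.0), 0.0)
        return float((avg_t ** 2).sum())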

    Semantic Segmentation for Real-World Applications

    In computer vision, scene understanding aims at extracting useful information about a scene from raw sensor data. For instance, it can classify the whole image into a particular category (e.g., kitchen or living room) or identify important elements within it (e.g., bottles, or cups on a table). In this general context, semantic segmentation assigns a semantic label to every single element of the raw data, e.g., to all image pixels or to all points of a point cloud. This information is essential for many applications relying on computer vision, such as AR, driving, medical, or robotic applications. It provides computers with the understanding of the environment needed to make autonomous decisions, or detailed information to people interacting with intelligent systems. The current state of the art in semantic segmentation is led by supervised deep learning methods. However, real-world scenarios and conditions introduce several challenges and restrictions for the application of these models. This thesis tackles several of these challenges, namely: 1) the limited amount of labeled data available for training deep learning models, 2) the time and computation restrictions present in real-time applications and/or in systems with limited computational power, such as a mobile phone or an IoT node, and 3) the ability to perform semantic segmentation when dealing with sensors other than the standard RGB camera. The main contributions of this thesis are the following:

    1. A novel approach to the problem of limited annotated data: training semantic segmentation models from sparse annotations. Fully supervised deep learning models lead the state of the art, but we show how to train them using only a few sparsely labeled pixels in the training images. Our approach obtains performance similar to models trained with fully labeled images. We demonstrate the relevance of this technique in environmental monitoring scenarios, where it is very common to have sparse image labels provided by human experts, as well as in more general domains.

    2. Also dealing with limited training data, a novel method for semi-supervised semantic segmentation, i.e., when there is only a small number of fully labeled images and a large set of unlabeled data. We demonstrate how contrastive learning can be applied to the semantic segmentation task and show its advantages, especially when the availability of labeled data is limited. Our approach improves state-of-the-art results, showing the potential of contrastive learning in this task. Learning from unlabeled data opens great opportunities for real-world scenarios since it is an economical solution.

    3. Novel efficient image semantic segmentation models. We develop semantic segmentation models that are efficient in execution time, memory requirements, and computation. Some of our models are able to run on CPU at high speed with high accuracy. This is very important for real set-ups and applications, since high-end GPUs are not always available. Building models that consume fewer resources, memory, and time increases the range of applications that can benefit from them.

    4. Novel methods for semantic segmentation with non-RGB sensors. We propose a method for LiDAR point cloud segmentation that combines efficient learning operations in both 2D and 3D. It surpasses state-of-the-art segmentation performance at very fast rates. We also show how to improve the robustness of these models by tackling the overfitting and domain adaptation problem. Besides, we present the first work on semantic segmentation with event-based cameras, coping with the lack of labeled data.

    To increase the impact of these contributions and ease their application in real-world settings, we have made an open-source implementation of all proposed solutions available to the scientific community.
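    The sparse-annotation setting of the first contribution boils down to evaluating the training loss only on the handful of labeled pixels. Below is a minimal NumPy sketch of such a masked cross-entropy, assuming unlabeled pixels are marked with an ignore index (255 here); it illustrates the idea rather than the full method in the thesis.

    import numpy as np

    def sparse_pixel_cross_entropy(logits, labels, ignore_index=255):
        """Cross-entropy averaged over labeled pixels only.

        logits : (C, H, W) raw class scores
        labels : (H, W) integer class ids; `ignore_index` marks unlabeled pixels
        """
        # log-softmax over the class dimension
        z = logits - logits.max(axis=0, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
        mask = labels != ignore_index
        if not mask.any():
            return 0.0
        yy, xx = np.nonzero(mask)
        picked = log_probs[labels[yy, xx], yy, xx]   # log-prob of the true class per labeled pixel
        return float(-picked.mean())

    # toy usage: a 4x4 image with a single labeled pixel
    logits = np.random.randn(3, 4, 4)
    labels = np.full((4, 4), 255); labels[1, 2] = 0
    print(sparse_pixel_cross_entropy(logits, labels))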