
    PVO: Panoptic Visual Odometry

    We present PVO, a novel panoptic visual odometry framework that achieves a more comprehensive modeling of scene motion, geometry, and panoptic segmentation. PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which makes the two tasks mutually beneficial. Specifically, we introduce a panoptic update module into the VO module, guided by image panoptic segmentation. This Panoptic-Enhanced VO Module alleviates the impact of dynamic objects on camera pose estimation with a panoptic-aware dynamic mask. In turn, the VO-Enhanced VPS Module improves segmentation accuracy by fusing the panoptic segmentation result of the current frame into adjacent frames on the fly, using geometric information such as camera pose, depth, and optical flow obtained from the VO module. The two modules contribute to each other through recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks.
    Comment: CVPR 2023. Project page: https://zju3dv.github.io/pvo/ Code: https://github.com/zju3dv/PV
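    As a schematic illustration of the recurrent coupling the abstract describes, a minimal sketch follows. Every module interface here is hypothetical, not the authors' implementation (which is at the repository linked above):

```python
def pvo_inference(frames, vo_module, vps_module, num_iters=2):
    """Hypothetical sketch of PVO-style recurrent iterative optimization:
    the VO and VPS modules refine each other over a few rounds."""
    # Initial per-frame panoptic segmentation.
    seg = [vps_module.segment(f) for f in frames]
    for _ in range(num_iters):
        # Panoptic-Enhanced VO: a panoptic-aware dynamic mask
        # down-weights dynamic objects during pose estimation.
        masks = [vps_module.dynamic_mask(s) for s in seg]
        poses, depths, flows = vo_module.estimate(frames, masks)
        # VO-Enhanced VPS: fuse each frame's segmentation into its
        # neighbours using pose, depth, and optical flow from the VO module.
        seg = [vps_module.fuse(seg, i, poses, depths, flows)
               for i in range(len(frames))]
    return poses, seg
```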

    Bio-Inspired Motion Perception: From Ganglion Cells to Autonomous Vehicles

    Animals are remarkable at navigation, even in extreme situations. Through motion perception, animals compute their own movements (egomotion) and find other objects (prey, predators, obstacles) and their motions in the environment. Analogous to animals, artificial systems such as robots also need to know where they are relative to structure and to segment obstacles to avoid collisions. Even though substantial progress has been made in the development of artificial visual systems, they still struggle to achieve robust and generalizable solutions. To this end, I propose a bio-inspired framework that narrows the gap between natural and artificial systems. The standard approaches in robot motion perception seek to reconstruct a three-dimensional model of the scene and then use this model to estimate egomotion and object segmentation. However, the scene-reconstruction process is data-heavy and computationally expensive, and it fails in high-speed and dynamic scenarios. By contrast, biological visual systems excel in these difficult situations by extracting only the minimal information sufficient for motion perception tasks. Throughout this thesis, I derive minimalist/purposive ideas from biological processes and develop mathematical solutions for robot motion perception problems. I develop a full range of solutions that utilize bio-inspired motion representations and learning approaches for motion perception tasks, focusing in particular on egomotion estimation and motion segmentation. I have four main contributions:
    1. I introduce NFlowNet, a neural network to estimate normal flow (bio-inspired motion filters). Normal flow estimation presents a new avenue for solving egomotion in a robust and qualitative framework.
    2. Utilizing normal flow, I propose the DiffPoseNet framework to estimate egomotion by formulating the qualitative constraint in a differentiable optimization layer, which allows for end-to-end learning.
    3. Utilizing a neuromorphic event camera, a retina-inspired vision sensor, I develop 0-MMS, a model-based optimization approach that employs event spikes to segment the scene into multiple moving parts in high-speed, dynamic-lighting scenarios.
    4. To improve the precision of event-based motion perception across time, I develop SpikeMS, a novel bio-inspired learning approach that fully capitalizes on the rich temporal information in event spikes.
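    Normal flow, the quantity NFlowNet is trained to estimate, is a classical construct: under brightness constancy, only the flow component along the image gradient is observable. A minimal NumPy sketch of that definition (not the learned network itself):

```python
import numpy as np

def normal_flow(prev_frame, curr_frame, eps=1e-6):
    """Per-pixel normal flow between two grayscale frames (H x W arrays).

    Brightness constancy I_x*u + I_y*v + I_t = 0 constrains only the
    flow component along the image gradient; that component is the
    normal flow n = -I_t * grad(I) / |grad(I)|^2.
    """
    curr = curr_frame.astype(np.float64)
    prev = prev_frame.astype(np.float64)
    Iy, Ix = np.gradient(curr)              # spatial derivatives
    It = curr - prev                        # temporal derivative
    scale = -It / (Ix**2 + Iy**2 + eps)     # signed magnitude over |grad(I)|
    return scale * Ix, scale * Iy           # (u_n, v_n) components
```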

    Real-Time Structure and Object Aware Semantic SLAM

    Simultaneous Localization And Mapping (SLAM) is one of the fundamental problems in mobile robotics and addresses the reconstruction of a previously unseen environment while simultaneously localising a mobile robot with respect to it. For visual SLAM, the simplest representation of the map is a collection of 3D points, which is sparse and efficient to compute and update, particularly for large-scale environments; however, it lacks semantic information and is not useful for high-level tasks such as robotic grasping and manipulation. Although methods to compute denser representations have been proposed, these reconstructions remain equivalent to a collection of points and therefore carry no additional semantic information or relationships. Man-made environments contain many structures and objects that carry high-level semantics and can potentially act as landmarks of a SLAM map while encapsulating semantic information, as opposed to a set of points. For instance, planes are good representations for feature-deprived regions, where they provide information complementary to points, and they can also model dominant planar layouts of the environment with very few parameters. Furthermore, a generic representation for previously unseen objects can be used as a general landmark that carries semantics in the reconstructed map. Integrating visual semantic understanding and geometric reconstruction has been studied before; however, for various reasons, including high-level geometric entities in the SLAM framework has been restricted to slow, offline structure-from-motion contexts, or the high-level entities merely act as regularisers for points in the map instead of independent landmarks. One critical reason is the lack of proper mathematical representations for high-level landmarks; the other main reasons are the challenge of detecting and tracking these landmarks and of formulating an observation model, a mapping between corresponding image-observable quantities and the estimated parameters of the representations. In this work, we address these challenges to achieve an online, real-time SLAM framework with scalable maps consisting of both sparse points and high-level structural and semantic landmarks such as planes and objects. We explicitly target real-time performance and keep it as a beacon that critically influences the choice of representation and all the modules of our SLAM system. In the context of factor graphs, we propose novel representations for structural entities as planes and for general, previously unseen and not predefined objects as bounded dual quadrics; these decompose to permit a clean, fast, and effective real-time implementation that is amenable to the nonlinear least-squares formulation and respects the sparsity pattern of the SLAM problem. In this representation we are not concerned with high-fidelity reconstruction of individual objects, but rather with representing the general layout and orientation of objects in the environment. We also explore minimal representations of planes, leading to a representation that can be constructed and updated online in a least-squares framework. Another challenge that we address in this work is marrying high-level landmark detections from deep-learned frameworks with geometric SLAM systems.
    Due to the recent success of CNN-based object detection, as well as depth and surface-normal estimation from a single image, it is now feasible to detect and estimate these semantic landmarks from single RGB images, leading us seamlessly from an RGB-D SLAM system to a purely monocular one, thanks to the real-time predictions of the trained CNNs and the appropriate representations. Furthermore, to benefit from deep-learned priors, we incorporate high-fidelity single-image reconstructions and hallucinations of objects on top of the coarse quadrics to enrich the sparse map semantically, while constraining the shape of the coarse quadrics even further. In keeping with the real-time goal, the proposed landmark representations also provide the potential for imposing additional constraints and priors that carry crucial semantic information about the scene, without incurring great extra computational cost. In this work, we have explored and proposed constraints such as priors on the extent and shape of objects, a point-plane regulariser, plane-plane (Manhattan assumption) constraints, and plane-object (supporting affordance) constraints. We evaluate our proposed SLAM system extensively, using input sensor modalities from RGB-D to monocular, on almost all publicly available benchmarks, both indoors and outdoors, to show its applicability as a general-purpose SLAM solution. The extensive experiments show the efficacy of our SLAM system through comparisons and ablation studies that include high-level structures and objects with constraints imposed among them in various scenarios. In particular, the estimated camera trajectories are improved significantly on varied sequences from visual SLAM datasets, as well as on our own sequences captured with a UR5 robotic arm equipped with a depth camera. In addition to more accurate camera trajectories, our system yields enriched sparse maps with semantically meaningful planar structures and generic objects in the scene, along with their mutual relationships.
    Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 201
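    As a hedged sketch of the bounded dual-quadric landmark described above: an ellipsoid with semi-axes r and SE(3) pose Z has the dual quadric Q* = Z diag(r1^2, r2^2, r3^2, -1) Z^T, and projecting with a camera matrix P yields the dual conic C* = P Q* P^T of the ellipse bounding the object in the image. The helper names below are ours, not the thesis's:

```python
import numpy as np

def constrained_dual_quadric(pose, radii):
    """Dual quadric Q* of an ellipsoid with semi-axes `radii`,
    placed in the world by the 4x4 SE(3) matrix `pose`."""
    shape = np.diag([radii[0]**2, radii[1]**2, radii[2]**2, -1.0])
    return pose @ shape @ pose.T

def project_dual_quadric(Q_star, K, T_cam_from_world):
    """Project Q* into the image: C* = P Q* P^T with P = K [R | t].
    C* is the dual conic of the ellipse that bounds the object."""
    P = K @ T_cam_from_world[:3, :]   # 3x4 projection matrix
    return P @ Q_star @ P.T
```

    This nine-parameter form (pose plus three axis lengths) is what makes the landmark cheap enough for a nonlinear least-squares factor-graph back end.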

    Geometric Data Analysis: Advancements of the Statistical Methodology and Applications

    Data analysis has become fundamental to our society and comes in multiple facets and approaches. Nevertheless, in research and applications, the focus has primarily been on data from Euclidean vector spaces. Consequently, the majority of methods applied today are not suited to more general data types. Driven by needs from fields like image processing, (medical) shape analysis, and network analysis, more and more attention has recently been given to data from non-Euclidean spaces, particularly (curved) manifolds. This has led to the field of geometric data analysis, whose methods explicitly take the structure (for example, the topology and geometry) of the underlying space into account. This thesis contributes to the methodology of geometric data analysis by generalizing several fundamental notions from multivariate statistics to manifolds. We focus on two different viewpoints. First, we use Riemannian structures to derive a novel regression scheme for general manifolds that relies on splines of generalized Bézier curves. It can accurately model non-geodesic relationships, for example, time-dependent trends with saturation effects or cyclic trends. Since Bézier curves can be evaluated with the constructive de Casteljau algorithm, working with data from manifolds of high dimension (for example, a hundred thousand or more) is feasible. Building on the regression scheme, we further develop a hierarchical statistical model for an adequate analysis of longitudinal data in manifolds, and a method to control for confounding variables. Second, we focus on data that is not only manifold- but even Lie-group-valued, which is frequently the case in applications. We achieve this by endowing the group with an affine connection structure that is generally not Riemannian. Utilizing it, we derive generalizations of several well-known dissimilarity measures between data distributions that can be used for various tasks, including hypothesis testing. Invariance under data translations is proven, and a connection to continuous distributions is given for one measure. A further central contribution of this thesis is that it demonstrates use cases for all notions in real-world applications, particularly in problems from shape analysis in medical imaging and archaeology. We replicate or further quantify several known findings on shape changes of the femur and the right hippocampus under osteoarthritis and Alzheimer's disease, respectively. Furthermore, in an archaeological application, we obtain new insights into the construction principles of ancient sundials. Last but not least, we use the geometric structure underlying human brain connectomes to predict cognitive scores. Utilizing a sample selection procedure, we obtain state-of-the-art results.
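    The constructive de Casteljau algorithm mentioned above generalizes to manifolds by replacing straight-line interpolation with geodesics. A minimal sketch on the unit sphere (the thesis targets general manifolds; slerp is simply the sphere's geodesic):

```python
import numpy as np

def slerp(p, q, t):
    """Geodesic interpolation between unit vectors p and q on the sphere."""
    omega = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    if omega < 1e-12:
        return p
    return (np.sin((1 - t) * omega) * p + np.sin(t * omega) * q) / np.sin(omega)

def de_casteljau(control_points, t, geodesic=slerp):
    """Evaluate a generalized Bezier curve at parameter t by recursively
    interpolating along geodesics between successive control points."""
    pts = list(control_points)
    while len(pts) > 1:
        pts = [geodesic(p, q, t) for p, q in zip(pts[:-1], pts[1:])]
    return pts[0]
```

    For example, de_casteljau([p0, p1, p2], 0.5) evaluates a quadratic generalized Bézier curve on the sphere at its midpoint; since each step only calls the geodesic, the cost scales gracefully with the ambient dimension.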

    Motion estimation from RGBD images using graph homomorphism

    Depth-sensing devices have recently emerged that can capture the texture and geometry of a scene in real time. As a result, many computer vision techniques that were previously applied only to texture can now be reformulated to also exploit geometry. While such algorithms can be accelerated or made more robust by taking advantage of this new technology, many new and interesting challenges and problems arise as well. Examples of such devices include the one from the 4D Video Project at IMPA and Microsoft's Kinect (TM). These devices provide what have come to be called RGBD images, referring to the three color channels and the additional depth channel (the 'D' standing for depth). The research described in this thesis presents a new unsupervised approach to motion estimation from videos composed of RGBD images. This is a necessary intermediate step toward identifying the rigid components of an articulated object. Our method uses inexact graph matching (homomorphism) to find groups of pixels (patches) that move in the same direction in consecutive video frames. To choose the best match for each patch, we minimize a cost function that accounts for distances in both the RGB color space and XYZ (three-dimensional world) space. The methodological contribution lies precisely in the handling of the depth data provided by the new capture devices, so that these data become part of the feature vector representing each patch in the graphs to be matched. Our method requires no reference frames for initialization and is applicable to any video containing piecewise parametric motion. For patch sizes that effectively reduce the image resolution, our application runs in real time. To validate the proposed methodology, we present results involving several object classes with different kinds of motion, such as videos of people walking, the movements of an arm, and a couple dancing samba de gafieira. We also present our progress in modeling an object-oriented 4D video system, which guides several applications to be developed as future work.
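    The abstract specifies that the matching cost combines a distance in RGB color space with a distance in XYZ world space, but not its exact form. A minimal, hedged sketch of such a cost (the weights alpha and beta are illustrative, not the thesis's values):

```python
import numpy as np

def patch_cost(rgb_a, xyz_a, rgb_b, xyz_b, alpha=1.0, beta=1.0):
    """Cost of matching patch a to patch b: a weighted sum of the
    Euclidean distance between mean RGB colors and the Euclidean
    distance between mean XYZ world positions."""
    color_term = np.linalg.norm(np.asarray(rgb_a, float) - np.asarray(rgb_b, float))
    geom_term = np.linalg.norm(np.asarray(xyz_a, float) - np.asarray(xyz_b, float))
    return alpha * color_term + beta * geom_term
```

    Minimizing this cost over candidate correspondences selects, for each patch in one frame, its best homomorphic match in the next.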