437 research outputs found

    Large-scale environment mapping and immersive human-robot interaction for agricultural mobile robot teleoperation

    Full text link
    Remote operation is a crucial solution to problems encountered in agricultural machinery operations. However, traditional video streaming control methods fall short in overcoming the challenges of single perspective views and the inability to obtain 3D information. In light of these issues, our research proposes a large-scale digital map reconstruction and immersive human-machine remote control framework for agricultural scenarios. In our methodology, a DJI unmanned aerial vehicle(UAV) was utilized for data collection, and a novel video segmentation approach based on feature points was introduced. To tackle texture richness variability, an enhanced Structure from Motion (SfM) using superpixel segmentation was implemented. This method integrates the open Multiple View Geometry (openMVG) framework along with Local Features from Transformers (LoFTR). The enhanced SfM results in a point cloud map, which is further processed through Multi-View Stereo (MVS) to generate a complete map model. For control, a closed-loop system utilizing TCP for VR control and positioning of agricultural machinery was introduced. Our system offers a fully visual-based immersive control method, where upon connection to the local area network, operators can utilize VR for immersive remote control. The proposed method enhances both the robustness and convenience of the reconstruction process, thereby significantly facilitating operators in acquiring more comprehensive on-site information and engaging in immersive remote control operations. The code is available at: https://github.com/LiuTao1126/Enhance-SF

    Depth-Assisted Semantic Segmentation, Image Enhancement and Parametric Modeling

    Get PDF
    This dissertation addresses the problem of employing 3D depth information on solving a number of traditional challenging computer vision/graphics problems. Humans have the abilities of perceiving the depth information in 3D world, which enable humans to reconstruct layouts, recognize objects and understand the geometric space and semantic meanings of the visual world. Therefore it is significant to explore how the 3D depth information can be utilized by computer vision systems to mimic such abilities of humans. This dissertation aims at employing 3D depth information to solve vision/graphics problems in the following aspects: scene understanding, image enhancements and 3D reconstruction and modeling. In addressing scene understanding problem, we present a framework for semantic segmentation and object recognition on urban video sequence only using dense depth maps recovered from the video. Five view-independent 3D features that vary with object class are extracted from dense depth maps and used for segmenting and recognizing different object classes in street scene images. We demonstrate a scene parsing algorithm that uses only dense 3D depth information to outperform using sparse 3D or 2D appearance features. In addressing image enhancement problem, we present a framework to overcome the imperfections of personal photographs of tourist sites using the rich information provided by large-scale internet photo collections (IPCs). By augmenting personal 2D images with 3D information reconstructed from IPCs, we address a number of traditionally challenging image enhancement techniques and achieve high-quality results using simple and robust algorithms. In addressing 3D reconstruction and modeling problem, we focus on parametric modeling of flower petals, the most distinctive part of a plant. The complex structure, severe occlusions and wide variations make the reconstruction of their 3D models a challenging task. We overcome these challenges by combining data driven modeling techniques with domain knowledge from botany. Taking a 3D point cloud of an input flower scanned from a single view, each segmented petal is fitted with a scale-invariant morphable petal shape model, which is constructed from individually scanned 3D exemplar petals. Novel constraints based on botany studies are incorporated into the fitting process for realistically reconstructing occluded regions and maintaining correct 3D spatial relations. The main contribution of the dissertation is in the intelligent usage of 3D depth information on solving traditional challenging vision/graphics problems. By developing some advanced algorithms either automatically or with minimum user interaction, the goal of this dissertation is to demonstrate that computed 3D depth behind the multiple images contains rich information of the visual world and therefore can be intelligently utilized to recognize/ understand semantic meanings of scenes, efficiently enhance and augment single 2D images, and reconstruct high-quality 3D models

    Semantic Segmentation for Real-World Applications

    Get PDF
    En visión por computador, la comprensión de escenas tiene como objetivo extraer información útil de una escena a partir de datos de sensores. Por ejemplo, puede clasificar toda la imagen en una categoría particular o identificar elementos importantes dentro de ella. En este contexto general, la segmentación semántica proporciona una etiqueta semántica a cada elemento de los datos sin procesar, por ejemplo, a todos los píxeles de la imagen o, a todos los puntos de la nube de puntos. Esta información es esencial para muchas aplicaciones de visión por computador, como conducción, aplicaciones médicas o robóticas. Proporciona a los ordenadores una comprensión sobre el entorno que es necesaria para tomar decisiones autónomas.El estado del arte actual de la segmentación semántica está liderado por métodos de aprendizaje profundo supervisados. Sin embargo, las condiciones del mundo real presentan varias restricciones para la aplicación de estos modelos de segmentación semántica. Esta tesis aborda varios de estos desafíos: 1) la cantidad limitada de datos etiquetados disponibles para entrenar modelos de aprendizaje profundo, 2) las restricciones de tiempo y computación presentes en aplicaciones en tiempo real y/o en sistemas con poder computacional limitado, y 3) la capacidad de realizar una segmentación semántica cuando se trata de sensores distintos de la cámara RGB estándar.Las aportaciones principales en esta tesis son las siguientes:1. Un método nuevo para abordar el problema de los datos anotados limitados para entrenar modelos de segmentación semántica a partir de anotaciones dispersas. Los modelos de aprendizaje profundo totalmente supervisados lideran el estado del arte, pero mostramos cómo entrenarlos usando solo unos pocos píxeles etiquetados. Nuestro enfoque obtiene un rendimiento similar al de los modelos entrenados con imágenescompletamente etiquetadas. Demostramos la relevancia de esta técnica en escenarios de monitorización ambiental y en dominios más generales.2. También tratando con datos de entrenamiento limitados, proponemos un método nuevo para segmentación semántica semi-supervisada, es decir, cuando solo hay una pequeña cantidad de imágenes completamente etiquetadas y un gran conjunto de datos sin etiquetar. La principal novedad de nuestro método se basa en el aprendizaje por contraste. Demostramos cómo el aprendizaje por contraste se puede aplicar a la tarea de segmentación semántica y mostramos sus ventajas, especialmente cuando la disponibilidad de datos etiquetados es limitada logrando un nuevo estado del arte.3. Nuevos modelos de segmentación semántica de imágenes eficientes. Desarrollamos modelos de segmentación semántica que son eficientes tanto en tiempo de ejecución, requisitos de memoria y requisitos de cálculo. Algunos de nuestros modelos pueden ejecutarse en CPU a altas velocidades con alta precisión. Esto es muy importante para configuraciones y aplicaciones reales, ya que las GPU de gama alta nosiempre están disponibles.4. Nuevos métodos de segmentación semántica con sensores no RGB. Proponemos un método para la segmentación de nubes de puntos LiDAR que combina operaciones de aprendizaje eficientes tanto en 2D como en 3D. Logra un rendimiento de segmentación excepcional a velocidades realmente rápidas. También mostramos cómo mejorar la robustez de estos modelos al abordar el problema de sobreajuste y adaptaciónde dominio. Además, mostramos el primer trabajo de segmentación semántica con cámaras de eventos, haciendo frente a la falta de datos etiquetados.Estas contribuciones aportan avances significativos en el campo de la segmentación semántica para aplicaciones del mundo real. Para una mayor contribución a la comunidad cientfíica, hemos liberado la implementación de todas las soluciones propuestas.----------------------------------------In computer vision, scene understanding aims at extracting useful information of a scene from raw sensor data. For instance, it can classify the whole image into a particular category (i.e. kitchen or living room) or identify important elements within it (i.e., bottles, cups on a table or surfaces). In this general context, semantic segmentation provides a semantic label to every single element of the raw data, e.g., to all image pixels or to all point cloud points.This information is essential for many applications relying on computer vision, such as AR, driving, medical or robotic applications. It provides computers with understanding about the environment needed to make autonomous decisions, or detailed information to people interacting with the intelligent systems. The current state of the art for semantic segmentation is led by supervised deep learning methods.However, real-world scenarios and conditions introduce several challenges and restrictions for the application of these semantic segmentation models. This thesis tackles several of these challenges, namely, 1) the limited amount of labeled data available for training deep learning models, 2) the time and computation restrictions present in real time applications and/or in systems with limited computational power, such as a mobile phone or an IoT node, and 3) the ability to perform semantic segmentation when dealing with sensors other than the standard RGB camera.The general contributions presented in this thesis are following:A novel approach to address the problem of limited annotated data to train semantic segmentation models from sparse annotations. Fully supervised deep learning models are leading the state-of-the-art, but we show how to train them by only using a few sparsely labeled pixels in the training images. Our approach obtains similar performance than models trained with fully-labeled images. We demonstrate the relevance of this technique in environmental monitoring scenarios, where it is very common to have sparse image labels provided by human experts, as well as in more general domains. Also dealing with limited training data, we propose a novel method for semi-supervised semantic segmentation, i.e., when there is only a small number of fully labeled images and a large set of unlabeled data. We demonstrate how contrastive learning can be applied to the semantic segmentation task and show its advantages, especially when the availability of labeled data is limited. Our approach improves state-of-the-art results, showing the potential of contrastive learning in this task. Learning from unlabeled data opens great opportunities for real-world scenarios since it is an economical solution. Novel efficient image semantic segmentation models. We develop semantic segmentation models that are efficient both in execution time, memory requirements, and computation requirements. Some of our models able to run in CPU at high speed rates with high accuracy. This is very important for real set-ups and applications since high-end GPUs are not always available. Building models that consume fewer resources, memory and time, would increase the range of applications that can benefit from them. Novel methods for semantic segmentation with non-RGB sensors.We propose a novel method for LiDAR point cloud segmentation that combines efficient learning operations both in 2D and 3D. It surpasses state-of-the-art segmentation performance at really fast rates. We also show how to improve the robustness of these models tackling the overfitting and domain adaptation problem. Besides, we show the first work for semantic segmentation with event-based cameras, coping with the lack of labeled data. To increase the impact of this contributions and ease their application in real-world settings, we have made available an open-source implementation of all proposed solutions to the scientific community.<br /

    Brain MR Image Segmentation: From Multi-Atlas Method To Deep Learning Models

    Get PDF
    Quantitative analysis of the brain structures on magnetic resonance (MR) images plays a crucial role in examining brain development and abnormality, as well as in aiding the treatment planning. Although manual delineation is commonly considered as the gold standard, it suffers from the shortcomings in terms of low efficiency and inter-rater variability. Therefore, developing automatic anatomical segmentation of human brain is of importance in providing a tool for quantitative analysis (e.g., volume measurement, shape analysis, cortical surface mapping). Despite a large number of existing techniques, the automatic segmentation of brain MR images remains a challenging task due to the complexity of the brain anatomical structures and the great inter- and intra-individual variability among these anatomical structures. To address the existing challenges, four methods are proposed in this thesis. The first work proposes a novel label fusion scheme for the multi-atlas segmentation. A two-stage majority voting scheme is developed to address the over-segmentation problem in the hippocampus segmentation of brain MR images. The second work of the thesis develops a supervoxel graphical model for the whole brain segmentation, in order to relieve the dependencies on complicated pairwise registration for the multi-atlas segmentation methods. Based on the assumption that pixels within a supervoxel are supposed to have the same label, the proposed method converts the voxel labeling problem to a supervoxel labeling problem which is solved by a maximum-a-posteriori (MAP) inference in Markov random field (MRF) defined on supervoxels. The third work incorporates attention mechanism into convolutional neural networks (CNN), aiming at learning the spatial dependencies between the shallow layers and the deep layers in CNN and producing an aggregation of the attended local feature and high-level features to obtain more precise segmentation results. The fourth method takes advantage of the success of CNN in computer vision, combines the strength of the graphical model with CNN, and integrates them into an end-to-end training network. The proposed methods are evaluated on public MR image datasets, such as MICCAI2012, LPBA40, and IBSR. Extensive experiments demonstrate the effectiveness and superior performance of the three proposed methods compared with the other state-of-the-art methods

    Multi-Modal Learning For Adaptive Scene Understanding

    Get PDF
    Modern robotics systems typically possess sensors of different modalities. Segmenting scenes observed by the robot into a discrete set of classes is a central requirement for autonomy. Equally, when a robot navigates through an unknown environment, it is often necessary to adjust the parameters of the scene segmentation model to maintain the same level of accuracy in changing situations. This thesis explores efficient means of adaptive semantic scene segmentation in an online setting with the use of multiple sensor modalities. First, we devise a novel conditional random field(CRF) inference method for scene segmentation that incorporates global constraints, enforcing particular sets of nodes to be assigned the same class label. To do this efficiently, the CRF is formulated as a relaxed quadratic program whose maximum a posteriori(MAP) solution is found using a gradient-based optimization approach. These global constraints are useful, since they can encode "a priori" information about the final labeling. This new formulation also reduces the dimensionality of the original image-labeling problem. The proposed model is employed in an urban street scene understanding task. Camera data is used for the CRF based semantic segmentation while global constraints are derived from 3D laser point clouds. Second, an approach to learn CRF parameters without the need for manually labeled training data is proposed. The model parameters are estimated by optimizing a novel loss function using self supervised reference labels, obtained based on the information from camera and laser with minimum amount of human supervision. Third, an approach that can conduct the parameter optimization while increasing the model robustness to non-stationary data distributions in the long trajectories is proposed. We adopted stochastic gradient descent to achieve this goal by using a learning rate that can appropriately grow or diminish to gain adaptability to changes in the data distribution

    Target classification in multimodal video

    Get PDF
    The presented thesis focuses on enhancing scene segmentation and target recognition methodologies via the mobilisation of contextual information. The algorithms developed to achieve this goal utilise multi-modal sensor information collected across varying scenarios, from controlled indoor sequences to challenging rural locations. Sensors are chiefly colour band and long wave infrared (LWIR), enabling persistent surveillance capabilities across all environments. In the drive to develop effectual algorithms towards the outlined goals, key obstacles are identified and examined: the recovery of background scene structure from foreground object ’clutter’, employing contextual foreground knowledge to circumvent training a classifier when labeled data is not readily available, creating a labeled LWIR dataset to train a convolutional neural network (CNN) based object classifier and the viability of spatial context to address long range target classification when big data solutions are not enough. For an environment displaying frequent foreground clutter, such as a busy train station, we propose an algorithm exploiting foreground object presence to segment underlying scene structure that is not often visible. If such a location is outdoors and surveyed by an infra-red (IR) and visible band camera set-up, scene context and contextual knowledge transfer allows reasonable class predictions for thermal signatures within the scene to be determined. Furthermore, a labeled LWIR image corpus is created to train an infrared object classifier, using a CNN approach. The trained network demonstrates effective classification accuracy of 95% over 6 object classes. However, performance is not sustainable for IR targets acquired at long range due to low signal quality and classification accuracy drops. This is addressed by mobilising spatial context to affect network class scores, restoring robust classification capability
    corecore