11 research outputs found

    Multi-object detection, segmentation and tracking using a semi-automatic pixel-level annotation procedure with the KITTI MOTS dataset

    Get PDF
    In this Master's Thesis (Trabajo Fin de Máster, TFM), recent multi-object detection, segmentation and tracking methods are studied that use the training and test sets of the Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (KITTI) database, taking as a starting point the technique proposed in the research paper "MOTS: Multi-Object Tracking and Segmentation", presented at the Conference on Computer Vision and Pattern Recognition (CVPR) in 2019. The studied methods participate in the Multi-Object Tracking and Segmentation (MOTS) Evaluation of "The KITTI Vision Benchmark Suite", and their code is publicly available for training. Therefore, the results obtained by each method on the KITTI MOTS validation set are reproduced.
    Máster Universitario en Ingeniería Industrial (M141

    CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation

    Full text link
    The advancement of computer vision has pushed visual analysis tasks from still images to the video domain. In recent years, video instance segmentation, which aims to track and segment multiple objects in video frames, has drawn much attention for its potential applications in various emerging areas such as autonomous driving, intelligent transportation, and smart retail. In this paper, we propose an effective framework for instance-level visual analysis on video frames, which can simultaneously conduct object detection, instance segmentation, and multi-object tracking. The core idea of our method is collaborative multi-task learning, achieved by a novel structure of associative connections among the detection, segmentation, and tracking task heads in an end-to-end learnable CNN. These additional connections allow information to propagate across multiple related tasks, so as to benefit these tasks simultaneously. We evaluate the proposed method extensively on the KITTI MOTS and MOTS Challenge datasets and obtain quite encouraging results.
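    The paper's exact head design is not reproduced here; the following is a minimal, hypothetical sketch of the idea of associative connections between task heads, where each head's features are concatenated into the next so that detection, segmentation, and tracking can share information (module names, dimensions, and output shapes are assumptions, not the authors' implementation).

```python
# Hypothetical sketch (not the authors' code): task heads linked by
# "associative" connections so later heads see earlier heads' features.
import torch
import torch.nn as nn

class AssociativeHeads(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2, embed_dim=128):
        super().__init__()
        self.det_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.det_out = nn.Linear(256, num_classes + 4)            # class logits + box
        # segmentation head also receives detection features (associative connection)
        self.seg_head = nn.Sequential(nn.Linear(feat_dim + 256, 256), nn.ReLU())
        self.seg_out = nn.Linear(256, 28 * 28)                    # coarse mask logits
        # tracking head receives both detection and segmentation features
        self.trk_head = nn.Sequential(nn.Linear(feat_dim + 256 + 256, 256), nn.ReLU())
        self.trk_out = nn.Linear(256, embed_dim)                  # association embedding

    def forward(self, roi_feats):                                 # (N, feat_dim) per-instance features
        d = self.det_head(roi_feats)
        s = self.seg_head(torch.cat([roi_feats, d], dim=1))
        t = self.trk_head(torch.cat([roi_feats, d, s], dim=1))
        return self.det_out(d), self.seg_out(s), self.trk_out(t)

# Usage: per-RoI features from a backbone + RoIAlign stage would be fed here.
heads = AssociativeHeads()
dets, masks, embeds = heads(torch.randn(8, 256))
```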

    Dynamic scene understanding using deep neural networks

    Get PDF

    Downstream Task Self-Supervised Learning for Object Recognition and Tracking

    Get PDF
    This dissertation addresses three limitations of deep learning methods in image- and video-understanding-based machine vision applications. Firstly, although deep convolutional neural networks (CNNs) are efficient for image recognition applications such as object detection and segmentation, they perform poorly under perspective distortions. In real-world applications, the camera perspective is a common problem that can only be addressed by annotating large amounts of data, which limits the applicability of deep learning models. Secondly, the typical approach for single-camera tracking problems is to use separate motion and appearance models, which are expensive in terms of computation and training data requirements. Finally, conventional multi-camera video understanding techniques use supervised learning algorithms to determine temporal relationships among objects. In large-scale applications, these methods are also limited by the requirement of extensive manually annotated data and computational resources.

    To address these limitations, we develop an uncertainty-aware self-supervised learning (SSL) technique that captures a model's instance or semantic segmentation uncertainty from overhead images and guides the model to learn the impact of the new perspective on object appearance. The test-time data augmentation-based pseudo-label refinement technique continuously trains a model until convergence on new-perspective images. The proposed method can be applied for both self-supervision and semi-supervision, thus increasing the effectiveness of a deep pre-trained model in new domains. Extensive experiments demonstrate the effectiveness of the SSL technique in both object detection and semantic segmentation problems. In video understanding applications, we introduce simultaneous segmentation and tracking as an unsupervised spatio-temporal latent feature clustering problem. The jointly learned multi-task features leverage the task-dependent uncertainty to generate discriminative features in multi-object videos. Experiments have shown that the proposed tracker outperforms several state-of-the-art supervised methods. Finally, we propose an unsupervised multi-camera tracklet association (MCTA) algorithm to track multiple objects in real time. MCTA leverages the self-supervised detector model for single-camera tracking and solves the multi-camera tracking problem using multiple pair-wise camera associations modeled as a connected graph. The graph optimization method generates a global solution for partially or fully overlapping camera networks.
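    As a rough illustration of the test-time augmentation-based pseudo-label refinement idea (the augmentations, threshold, and function names below are assumptions, not the dissertation's implementation), predictions over several augmented views of an unlabeled image can be averaged, with per-pixel agreement serving as a simple confidence signal for which pseudo-labels to keep:

```python
# Illustrative sketch: test-time augmentation yields several predictions per
# unlabeled image; their average gives a pseudo-label and a confidence mask.
import torch

def pseudo_labels_with_tta(model, image, flips=(False, True), threshold=0.8):
    """image: (C, H, W) tensor; model returns per-pixel class logits (1, C, H, W)."""
    probs = []
    with torch.no_grad():
        for flip in flips:
            x = torch.flip(image, dims=[-1]) if flip else image
            p = torch.softmax(model(x.unsqueeze(0)), dim=1)
            if flip:
                p = torch.flip(p, dims=[-1])        # undo the augmentation
            probs.append(p)
    mean_p = torch.stack(probs).mean(dim=0)         # averaged prediction
    conf, labels = mean_p.max(dim=1)                # per-pixel confidence and label
    keep = conf > threshold                         # discard uncertain pixels
    return labels, keep
```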

    Wire-cell 3D pattern recognition techniques for neutrino event reconstruction in large LArTPCs: algorithm description and quantitative evaluation with MicroBooNE simulation

    Get PDF
    Wire-Cell is a 3D event reconstruction package for liquid argon time projection chambers. Through geometry, time, and drifted charge from multiple readout wire planes, 3D space points with associated charge are reconstructed prior to the pattern recognition stage. Pattern recognition techniques, including track trajectory and dQ/dx (ionization charge per unit length) fitting, 3D neutrino vertex fitting, track and shower separation, particle-level clustering, and particle identification, are then applied to these 3D space points as well as the original 2D projection measurements. A deep neural network is developed to enhance the reconstruction of the neutrino interaction vertex. Compared to traditional algorithms, the deep neural network boosts the vertex efficiency by a relative 30% for charged-current νe interactions. This pattern recognition achieves 80–90% reconstruction efficiencies for primary leptons, after a 65.8% (72.9%) vertex efficiency for charged-current νe (νμ) interactions. Based on the resulting reconstructed particles and their kinematics, we also achieve 15–20% energy reconstruction resolutions for charged-current neutrino interactions.

    On the Data Efficiency and Model Complexity of Visual Learning

    Get PDF
    Computer vision is a research field that aims to automate the procedure of gaining abstract understanding from digital images or videos. The recent rapid developments of deep neural networks have demonstrated human-level performance or beyond on many vision tasks that require high-level understanding, such as image recognition, object detection, etc. However, training deep neural networks usually requires large-scale datasets annotated by humans, and the models typically have millions of parameters and consume a lot of computation resources. The issues of data efficiency and model complexity are commonly observed in many frameworks based on deep neural networks, limiting their deployment in real-world applications. In this dissertation, I will present our research addressing the issues of data efficiency and model complexity of deep neural networks. For data efficiency, (i) we study the problem of few-shot image recognition, where the training datasets are limited to only a few examples per category. (ii) We also investigate semi-supervised visual learning, which provides unlabeled samples in addition to the annotated dataset and aims to utilize them to learn better models. For model complexity, (iii) we seek alternatives to cascading layers or blocks for improving the representation capacities of convolutional neural networks without introducing additional computations. (iv) We improve the computational resource utilization of deep neural networks by finding, reallocating, and rejuvenating underutilized neurons. (v) We present two techniques for object detection that reuse computations to reduce the architecture complexity and improve the detection performance. (vi) Finally, we show our work on reusing visual features for multi-task learning to improve computation efficiency and share training information between different tasks.
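    To make item (iv) a little more concrete, here is a small hypothetical sketch (not the dissertation's actual criterion) of one common way to flag underutilized neurons: ranking a convolution's output channels by the norm of their incoming weights and treating the weakest channels as candidates for reallocation or re-initialization.

```python
# Hedged illustration: channels whose incoming weights have very small norms
# contribute little to the layer's output and may be considered "underutilized".
import torch
import torch.nn as nn

def underutilized_channels(conv: nn.Conv2d, fraction=0.1):
    # per-output-channel L2 norm of the incoming weights
    norms = conv.weight.detach().flatten(1).norm(dim=1)
    k = max(1, int(fraction * norms.numel()))
    return torch.argsort(norms)[:k]          # indices of the weakest channels

conv = nn.Conv2d(64, 128, kernel_size=3)
weak = underutilized_channels(conv)
```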

    Acquiring 3D scene information from 2D images

    Get PDF
    In recent years, people are becoming increasingly acquainted with 3D technologies such as 3DTV, 3D movies and 3D virtual navigation of city environments in their daily life. Commercial 3D movies are now commonly available for consumers. Virtual navigation of our living environment as used on a personal computer has become a reality due to well-known web-based geographic applications using advanced imaging technologies. To enable such 3D applications, many technological challenges such as 3D content creation, 3D displaying technology and 3D content transmission need to be tackled and deployed at low cost. This thesis concentrates on the reconstruction of 3D scene information from multiple 2D images, aiming for an automatic and low-cost production of 3D content. In this thesis, two multiple-view 3D reconstruction systems are proposed: a 3D modeling system for reconstructing the sparse 3D scene model from long video sequences captured with a hand-held consumer camcorder, and a depth reconstruction system for creating depth maps from multiple-view videos taken by multiple synchronized cameras. Both systems are designed to compute the 3D scene information in an automated way with minimum human intervention, in order to reduce the production cost of 3D content. Experimental results on real videos of hundreds and thousands of frames have shown that the two systems are able to accurately and automatically reconstruct the 3D scene information from 2D image data. The findings of this research are useful for emerging 3D applications such as 3D games, 3D visualization and 3D content production. Apart from designing and implementing the two proposed systems, we have made three key scientific contributions that enable the two proposed 3D reconstruction systems.

    The first contribution is a novel feature point matching algorithm that uses only a smoothness constraint for matching the points, which states that neighboring feature points in images tend to move with similar directions and magnitudes. The employed smoothness assumption is not only valid but also robust for most images with limited image motion, regardless of the camera motion and scene structure. Because of this, the algorithm obtains two major advantages. First, the algorithm is robust to illumination changes, as the employed smoothness constraint does not rely on any texture information. Second, the algorithm has a good capability to handle the drift of the feature points over time, as the drift can hardly lead to a violation of the smoothness constraint. This leads to a large number of feature points being matched and tracked by the proposed algorithm, which significantly helps the subsequent 3D modeling process. Our feature point matching algorithm is specifically designed for matching and tracking feature points in image/video sequences where the image motion is limited. Our extensive experimental results show that the proposed algorithm is able to track at least 2.5 times as many feature points as state-of-the-art algorithms, with a comparable or higher accuracy. This contributes significantly to the robustness of the 3D reconstruction process.

    The second contribution is a set of algorithms to detect critical configurations where factorization-based 3D reconstruction degenerates.
Based on the detection, we propose a sequence-dividing algorithm to divide a long sequence into subsequences, such that successful 3D reconstructions can be performed on individual subsequences with a high confidence. The partial reconstructions are merged later to obtain the 3D model of the complete scene. The critical configuration detection algorithm detects four critical configurations: (1) coplanar 3D scene points, (2) pure camera rotation, (3) rotation around two camera centers, and (4) presence of excessive noise and outliers in the measurements. The configurations in cases (1), (2) and (4) affect the rank of the Scaled Measurement Matrix (SMM). The number of camera centers in case (3) affects the number of independent rows of the SMM. By examining the rank and the row space of the SMM, the above critical configurations are detected. Based on the detection results, the proposed sequence-dividing algorithm divides a long sequence into subsequences, such that each subsequence is free of the four critical configurations, in order to obtain successful 3D reconstructions on individual subsequences. Experimental results on both synthetic and real sequences have demonstrated that the above four critical configurations are robustly detected, and a long sequence of thousands of frames is automatically divided into subsequences, yielding successful 3D reconstructions. The proposed critical configuration detection and sequence-dividing algorithms provide an essential processing block for automatic 3D reconstruction on long sequences.

    The third contribution is a coarse-to-fine multiple-view depth labeling algorithm to compute depth maps from multiple-view videos, where the accuracy of the resulting depth maps is gradually refined in multiple optimization passes. In the proposed algorithm, multiple-view depth reconstruction is formulated as an image-based labeling problem using the framework of Maximum A Posteriori (MAP) estimation on Markov Random Fields (MRF). The MAP-MRF framework allows the combination of various objective and heuristic depth cues to define the local penalty and the interaction energies, which provides a straightforward and computationally tractable formulation. Furthermore, the globally optimal MAP solution to depth labeling can be found by minimizing the local energies, using existing MRF optimization algorithms. The proposed algorithm contains the following three key elements. (1) A graph construction algorithm is proposed to construct triangular meshes on over-segmentation maps, in order to exploit the color and the texture information for depth labeling. (2) Multiple depth cues are combined to define the local energies. Furthermore, the local energies are adapted to the local image content, in order to account for the varying nature of the image content for an accurate depth labeling. (3) Both the density of the graph nodes and the intervals of the depth labels are gradually refined in multiple labeling passes. By doing so, both the computational efficiency and the robustness of the depth labeling process are improved. The experimental results on real multiple-view videos show that the depth maps of the selected reference views are accurately reconstructed and that depth discontinuities are very well preserved.
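    As a rough illustration of the rank-based degeneracy check described above (a simplified sketch under assumed conventions, not the thesis' actual implementation), the effective rank of the Scaled Measurement Matrix can be estimated from its singular values and compared against the rank expected for a non-degenerate configuration:

```python
# Hedged sketch: flag configurations where the factorization would degenerate
# because the Scaled Measurement Matrix (SMM) loses rank (e.g. coplanar points
# or pure camera rotation). The expected rank and threshold are assumptions.
import numpy as np

def is_degenerate(smm, expected_rank=4, ratio=1e-3):
    """smm: (3 * num_frames, num_points) scaled measurement matrix."""
    s = np.linalg.svd(smm, compute_uv=False)
    # effective rank: singular values that are significant relative to the largest
    effective_rank = int(np.sum(s > ratio * s[0]))
    return effective_rank < expected_rank

# Usage idea: divide the sequence so each subsequence passes this check before
# running the factorization on it.
```

    The thesis additionally examines the row space of the SMM to handle the rotation-around-two-camera-centers case; the snippet above only covers the rank test.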

    Towards Better Navigation: Optimizing Algorithms for Mapping, Localization and Planning

    Full text link
    Navigation is the problem of going from one place to another. It is an important problem in the areas of autonomous driving, mobile robotics and robotic manipulation. When navigation algorithms realize their potential, autonomous driving will lead to a new era of mobility, enabling greater independence for the disabled and greater convenience for all. It will reduce car ownership, increase fuel efficiency and reduce traffic jams. The navigation problem needs to be solved efficiently. It is well understood that slow reaction time in driving can be fatal. For self-driving cars to be safe, the navigation algorithms have to be optimized without compromising accuracy. In this thesis, we focus on optimizing algorithms in the navigation domain. Navigation is often addressed in three parts: mapping, localization and planning.

    Mapping is the problem of building a representation (a map) of the environment from noisy observations. Mapping is the slowest step in the navigation pipeline because the set of possible maps grows exponentially with the size of the map. This makes optimizing mapping crucial for optimizing navigation. We focus on grid-based mapping algorithms that divide the space into grid cells. The state-of-the-art (SOTA) grid-based mapping algorithms have either of two limitations: they either make the limiting assumption that each grid cell is independent of its neighboring cells, or they use slow sampling-based methods like Gibbs sampling and Metropolis-Hastings. Although the sampling-based methods are guaranteed to converge, they are slow in practice because they do not fully utilize the relationships among random variables. We avoid the independent-cell assumption and use modern inference methods like Belief Propagation and Dual Decomposition instead of sampling-based methods. These modern methods not only converge up to two times faster but also lead to more accurate maps.

    Localization, another part of navigation, is the problem of finding the robot's position with respect to the environment or the map. It is usually carried out under two restrictive assumptions: (1) there is only one robot in the environment, and (2) the map (or its estimate) is known. We relax the former assumption by recognizing the fact that robots can cooperatively map the environment faster. We propose a polynomial root-finding based mutual localization algorithm that uses observations from two robots. Our algorithm depends upon only a few fiducial markers instead of external landmarks, as used by methods like Bundler, which makes it faster.

    The final step of navigation, called planning, is the problem of taking actions based on the map, the robot's position and the desired destination. Reinforcement learning (RL) is a popular algorithm for planning when the effects of actions on the environment are not easily modeled. A special class of RL algorithms, called goal-conditioned RL, is applied to cases where the goal location can change for every trial. The SOTA algorithms in goal-conditioned RL specify the goal location in multiple, redundant ways. Due to this redundant information, algorithms like Hindsight Experience Replay re-sample rewards, which makes them slow. We propose a deep extension to Floyd-Warshall Reinforcement Learning which removes this redundant information, thus avoiding reward re-sampling. The resultant algorithm requires only half the reward samples required by the baselines.

    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/150011/1/dhiman_1.pd
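    For context on the last point, here is a tiny sketch of the classical tabular Floyd-Warshall recurrence that the proposed deep Floyd-Warshall Reinforcement Learning generalizes (this is the textbook algorithm, not the thesis' learned variant; the cost-matrix representation is an assumption):

```python
# Classical Floyd-Warshall: goal-conditioned shortest costs composed through
# intermediate states, the recurrence a deep variant would learn to approximate.
import numpy as np

def floyd_warshall(cost):
    """cost[s, g]: one-step cost from state s to goal g (np.inf if no edge)."""
    d = cost.copy()
    n = d.shape[0]
    for w in range(n):                      # intermediate state
        # going s -> w -> g can never be worse than the best known s -> g
        d = np.minimum(d, d[:, w:w + 1] + d[w:w + 1, :])
    return d
```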