33 research outputs found

    Optimization for Training Deep Models and Deep Learning Based Point Cloud Analysis and Image Classification

    Get PDF
    Deep learning (DL) has dramatically improved the state-of-the-art performances in broad applications of computer vision, such as image recognition, object detection, segmentation, and point cloud analysis. However, the reasons for such huge empirical success of DL still keep elusive theoretically. In this dissertation, to understand DL and improve its efficiency, robustness, and interpretability, we theoretically investigate optimization algorithms for training deep models and empirically explore deep learning based point cloud analysis and image classification. 1). Optimization for Training Deep Models: Neural network training is one of the most difficult optimization problems involved in DL. Recently, it has been attracting more and more attention to understand the global optimality in DL. However, we observe that conventional DL solvers have not been developed intentionally to seek for such global optimality. In this dissertation, we propose a novel approximation algorithm, BPGrad, towards optimizing deep models globally via branch and pruning. The proposed BPGrad algorithm is based on the assumption of Lipschitz continuity in DL, and as a result, it can adaptively determine the step size for the current gradient given the history of previous updates, wherein theoretically no smaller steps can achieve the global optimality. Empirically an efficient solver based on BPGrad for DL is proposed as well, and it outperforms conventional DL solvers such as Adagrad, Adadelta, RMSProp, and Adam in the tasks of object recognition, detection, and segmentation. 2). Deep Learning Based Point Cloud Analysis and Image Classification: The network architecture is of central importance for many visual recognition tasks. In this dissertation, we focus on the emerging field of point clouds analysis and image classification. 2.1) Point cloud analysis: We observe that traditional 6D pose estimation approaches are not sufficient to address the problem where neither a CAD model of the object nor the ground-truth 6D poses of its instances are available during training. We propose a novel unsupervised approach to jointly learn the 3D object model and estimate the 6D poses of multiple instances of the same object in a single end-to-end deep neural network framework, with applications to depth-based instance segmentation. The inputs are depth images, and the learned object model is represented by a 3D point cloud. Specifically, our network produces a 3D object model and a list of rigid transformations on this model to generate instances, which when rendered must match the observed point cloud to minimizing the Chamfer distance. To render the set of instance point clouds with occlusions, the network automatically removes the occluded points in a given camera view. Extensive experiments evaluate our technique on several object models and a varying number of instances in 3D point clouds. Compared with popular baselines for instance segmentation, our model not only demonstrates competitive performance, but also learns a 3D object model that is represented as a 3D point cloud. 2.2) Low quality image classification: We propose a simple while effective unsupervised deep feature transfer network to address the degrading problem of the state-of-the-art classification algorithms on low-quality images. No fine-tuning is required in our method. We use a pre-trained deep model to extract features for both high-resolution (HR) and low-resolution (LR) images, and feed them into a multilayer feature transfer network for knowledge transfer. An SVM classifier is learned directly using these transferred low-resolution features. Our network can be embedded into the state-of-the-art network models as a plug-in feature enhancement module. It preserves data structures in feature space for HR images, and transfers the distinguishing features from a well-structured source domain (HR features space) to a not well-organized target domain (LR features space). Extensive experiments show that the proposed transfer network achieves significant improvements over the baseline method

    Stable Topological Structures in Data

    Get PDF
    Topological data analysis is a relatively new and emerging field - it combines concepts from computational geometry and algebraic topology to analyze data. This thesis builds on the recent result on continuous optimization of point clouds via persistence diagrams by giving further theoretic results on the geometric realization of the gradient on the persistence diagrams in the underlying metric space of point clouds and giving experimental results on two distinct applications where the use of this method yields novel approaches to the problems presented. After introduction of the needed preliminary information about simplicial complexes, (persistent) homology, and neural networks, the key concepts from the three topics are used to describe and define the notions of continuous optimization of point clouds via persistence diagrams. The idea behind it is to use the information about the topology, which is encoded in the persistence diagram, to adjust the coordinates of the points in the point cloud so that the point cloud exhibits some desired topological structures or properties. The algorithm for optimization is described for two different cases: when the coordinates of the point cloud are directly optimized, and when we optimize the coordinates indirectly through some parameters. These two cases serve as basis for two applications we discuss - to optimize the point clouds and to optimize the embedding of a time series. Experimental work on optimization of point clouds revealed strong potential for generation of point clouds with specific requirements on the topological structure - may it be in terms of persistence diagrams or with regards to the distribution of points in the persistence diagram. When optimizing the delay-coordinate embeddings of a time series the method gave good results on finding an appropriate delay to construct informative embeddings with

    Differentiable world programs

    Full text link
    L'intelligence artificielle (IA) moderne a ouvert de nouvelles perspectives prometteuses pour la création de robots intelligents. En particulier, les architectures d'apprentissage basées sur le gradient (réseaux neuronaux profonds) ont considérablement amélioré la compréhension des scènes 3D en termes de perception, de raisonnement et d'action. Cependant, ces progrès ont affaibli l'attrait de nombreuses techniques ``classiques'' développées au cours des dernières décennies. Nous postulons qu'un mélange de méthodes ``classiques'' et ``apprises'' est la voie la plus prometteuse pour développer des modèles du monde flexibles, interprétables et exploitables : une nécessité pour les agents intelligents incorporés. La question centrale de cette thèse est : ``Quelle est la manière idéale de combiner les techniques classiques avec des architectures d'apprentissage basées sur le gradient pour une compréhension riche du monde 3D ?''. Cette vision ouvre la voie à une multitude d'applications qui ont un impact fondamental sur la façon dont les agents physiques perçoivent et interagissent avec leur environnement. Cette thèse, appelée ``programmes différentiables pour modèler l'environnement'', unifie les efforts de plusieurs domaines étroitement liés mais actuellement disjoints, notamment la robotique, la vision par ordinateur, l'infographie et l'IA. Ma première contribution---gradSLAM--- est un système de localisation et de cartographie simultanées (SLAM) dense et entièrement différentiable. En permettant le calcul du gradient à travers des composants autrement non différentiables tels que l'optimisation non linéaire par moindres carrés, le raycasting, l'odométrie visuelle et la cartographie dense, gradSLAM ouvre de nouvelles voies pour intégrer la reconstruction 3D classique et l'apprentissage profond. Ma deuxième contribution - taskography - propose une sparsification conditionnée par la tâche de grandes scènes 3D encodées sous forme de graphes de scènes 3D. Cela permet aux planificateurs classiques d'égaler (et de surpasser) les planificateurs de pointe basés sur l'apprentissage en concentrant le calcul sur les attributs de la scène pertinents pour la tâche. Ma troisième et dernière contribution---gradSim--- est un simulateur entièrement différentiable qui combine des moteurs physiques et graphiques différentiables pour permettre l'estimation des paramètres physiques et le contrôle visuomoteur, uniquement à partir de vidéos ou d'une image fixe.Modern artificial intelligence (AI) has created exciting new opportunities for building intelligent robots. In particular, gradient-based learning architectures (deep neural networks) have tremendously improved 3D scene understanding in terms of perception, reasoning, and action. However, these advancements have undermined many ``classical'' techniques developed over the last few decades. We postulate that a blend of ``classical'' and ``learned'' methods is the most promising path to developing flexible, interpretable, and actionable models of the world: a necessity for intelligent embodied agents. ``What is the ideal way to combine classical techniques with gradient-based learning architectures for a rich understanding of the 3D world?'' is the central question in this dissertation. This understanding enables a multitude of applications that fundamentally impact how embodied agents perceive and interact with their environment. This dissertation, dubbed ``differentiable world programs'', unifies efforts from multiple closely-related but currently-disjoint fields including robotics, computer vision, computer graphics, and AI. Our first contribution---gradSLAM---is a fully differentiable dense simultaneous localization and mapping (SLAM) system. By enabling gradient computation through otherwise non-differentiable components such as nonlinear least squares optimization, ray casting, visual odometry, and dense mapping, gradSLAM opens up new avenues for integrating classical 3D reconstruction and deep learning. Our second contribution---taskography---proposes a task-conditioned sparsification of large 3D scenes encoded as 3D scene graphs. This enables classical planners to match (and surpass) state-of-the-art learning-based planners by focusing computation on task-relevant scene attributes. Our third and final contribution---gradSim---is a fully differentiable simulator that composes differentiable physics and graphics engines to enable physical parameter estimation and visuomotor control, solely from videos or a still image

    LiDAR-based Semantic Labeling : Automotive 3D Scene Understanding

    Get PDF
    Mobile Roboter und autonome Fahrzeuge verwenden verschiedene Sensormodalitäten zur Erkennung und Interpretation ihrer Umgebung. Neben Kameras und RaDAR Sensoren repräsentieren LiDAR Sensoren eine zentrale Komponente für moderne Methoden der Umgebungswahrnehmung. Zusätzlich zu einer präzisen Distanzmessung dieser Sensoren, ist ein umfangreiches semantisches Szeneverständnis notwendig, um ein effizientes und sicheres Agieren autonomer Systeme zu ermöglichen. In dieser Arbeit wird das neu entwickelte LiLaNet, eine echtzeitfähige, neuronale Netzarchitektur zur semantischen, punktweisen Klassifikation von LiDAR Punktwolken, vorgestellt. Hierfür finden die Ansätze der 2D Bildverarbeitung Verwendung, indem die 3D LiDAR Punktwolke als 2D zylindrisches Bild dargestellt wird. Dadurch werden Ergebnisse moderner Ansätze zur LiDAR-basierten, punktweisen Klassifikation übertroffen, was an unterschiedlichen Datensätzen demonstriert wird. Zur Entwicklung von Ansätzen des maschinellen Lernens, wie sie in dieser Arbeit verwendet werden, spielen umfangreiche Datensätze eine elementare Rolle. Aus diesem Grund werden zwei Datensätze auf Basis von modernen LiDAR Sensoren erzeugt. Durch das in dieser Arbeit entwickelte automatische Verfahren zur Datensatzgenerierung auf Basis von mehreren Sensormodalitäten, speziell der Kamera und des LiDAR Sensors, werden Kosten und Zeit der typischerweise manuellen Datensatzgenerierung reduziert. Zusätzlich wird eine multimodale Datenkompression vorgestellt, welche ein Kompressionsverfahren der Stereokamera auf den LiDAR Sensor überträgt. Dies führt zu einer Reduktion der LiDAR Daten bei gleichzeitigem Erhalt der zugrundeliegenden semantischen und geometrischen Information. Daraus resultiert eine erhöhte Echtzeitfähigkeit nachgelagerter Algorithmen autonomer Systeme. Außerdem werden zwei Erweiterungen zum vorgestellten Verfahren der semantischen Klassifikation umrissen. Zum einen wird die Sensorabhängigkeit durch Einführung des PiLaNets, einer neuen 3D Netzarchitektur, reduziert indem die LiDAR Punktwolke im 3D kartesischen Raum belassen wird, um die eher sensorabhängige 2D zylindrische Projektion zu ersetzen. Zum anderen wird die Unsicherheit neuronaler Netze implizit modelliert, indem eine Klassenhierarchie in den Trainingsprozess integriert wird. Insgesamt stellt diese Arbeit neuartige, performante Ansätze des 3D LiDAR-basierten, semantischen Szeneverstehens vor, welche zu einer Verbesserung der Leistung, Zuverlässigkeit und Sicherheit zukünftiger mobile Roboter und autonomer Fahrzeuge beitragen

    Universal Mini-Batch Consistency for Set Encoding Functions

    Full text link
    Previous works have established solid foundations for neural set functions, complete with architectures which preserve the necessary properties for operating on sets, such as invariance to permutations of the set elements. Subsequent work has highlighted the utility of Mini-Batch Consistency (MBC), the ability to sequentially process any permutation of a set partition scheme (e.g. streaming chunks of data) while maintaining consistency guarantees on the output, although there are limited options for MBC architectures. We propose a framework which can convert an arbitrary non-MBC model to one which satisfies MBC. In doing so, we allow all set functions to universally be considered in an MBC setting (UMBC). Additionally, we explore a Monte Carlo dropout strategy made possible by our framework which allows performing Monte Carlo dropout on streaming sets while never seeing the entire set at once. We validate UMBC with theoretical proofs, unit tests, and also provide qualitative/quantitative experiments on Gaussian data, clean and corrupted point cloud classification, and amortized clustering on ImageNet. Additionally, we investigate the probabilistic calibration of set-functions under test-time distributional shifts. Our results demonstrate the utility of universal mini-batch consistency, and we further discover that our dropout strategy improves uncertainty calibration

    Bridging the gap between reconstruction and synthesis

    Get PDF
    Aplicat embargament des de la data de defensa fins el 15 de gener de 20223D reconstruction and image synthesis are two of the main pillars in computer vision. Early works focused on simple tasks such as multi-view reconstruction and texture synthesis. With the spur of Deep Learning, the field has rapidly progressed, making it possible to achieve more complex and high level tasks. For example, the 3D reconstruction results of traditional multi-view approaches are currently obtained with single view methods. Similarly, early pattern based texture synthesis works have resulted in techniques that allow generating novel high-resolution images. In this thesis we have developed a hierarchy of tools that cover all these range of problems, lying at the intersection of computer vision, graphics and machine learning. We tackle the problem of 3D reconstruction and synthesis in the wild. Importantly, we advocate for a paradigm in which not everything should be learned. Instead of applying Deep Learning naively we propose novel representations, layers and architectures that directly embed prior 3D geometric knowledge for the task of 3D reconstruction and synthesis. We apply these techniques to problems including scene/person reconstruction and photo-realistic rendering. We first address methods to reconstruct a scene and the clothed people in it while estimating the camera position. Then, we tackle image and video synthesis for clothed people in the wild. Finally, we bridge the gap between reconstruction and synthesis under the umbrella of a unique novel formulation. Extensive experiments conducted along this thesis show that the proposed techniques improve the performance of Deep Learning models in terms of the quality of the reconstructed 3D shapes / synthesised images, while reducing the amount of supervision and training data required to train them. In summary, we provide a variety of low, mid and high level algorithms that can be used to incorporate prior knowledge into different stages of the Deep Learning pipeline and improve performance in tasks of 3D reconstruction and image synthesis.La reconstrucció 3D i la síntesi d'imatges són dos dels pilars fonamentals en visió per computador. Els estudis previs es centren en tasques senzilles com la reconstrucció amb informació multi-càmera i la síntesi de textures. Amb l'aparició del "Deep Learning", aquest camp ha progressat ràpidament, fent possible assolir tasques molt més complexes. Per exemple, per obtenir una reconstrucció 3D, tradicionalment s'utilitzaven mètodes multi-càmera, en canvi ara, es poden obtenir a partir d'una sola imatge. De la mateixa manera, els primers treballs de síntesi de textures basats en patrons han donat lloc a tècniques que permeten generar noves imatges completes en alta resolució. En aquesta tesi, hem desenvolupat una sèrie d'eines que cobreixen tot aquest ventall de problemes, situats en la intersecció entre la visió per computador, els gràfics i l'aprenentatge automàtic. Abordem el problema de la reconstrucció i la síntesi 3D en el món real. És important destacar que defensem un paradigma on no tot s'ha d'aprendre. Enlloc d'aplicar el "Deep Learning" de forma naïve, proposem representacions novedoses i arquitectures que incorporen directament els coneixements geomètrics ja existents per a aconseguir la reconstrucció 3D i la síntesi d'imatges. Nosaltres apliquem aquestes tècniques a problemes com ara la reconstrucció d'escenes/persones i a la renderització d'imatges fotorealistes. Primer abordem els mètodes per reconstruir una escena, les persones vestides que hi ha i la posició de la càmera. A continuació, abordem la síntesi d'imatges i vídeos de persones vestides en situacions quotidianes. I finalment, aconseguim, a través d'una nova formulació única, connectar la reconstrucció amb la síntesi. Els experiments realitzats al llarg d'aquesta tesi demostren que les tècniques proposades milloren el rendiment dels models de "Deepp Learning" pel que fa a la qualitat de les reconstruccions i les imatges sintetitzades alhora que redueixen la quantitat de dades necessàries per entrenar-los. En resum, proporcionem una varietat d'algoritmes de baix, mitjà i alt nivell que es poden utilitzar per incorporar els coneixements previs a les diferents etapes del "Deep Learning" i millorar el rendiment en tasques de reconstrucció 3D i síntesi d'imatges.Postprint (published version

    Generalization Through the Lens of Learning Dynamics

    Full text link
    A machine learning (ML) system must learn not only to match the output of a target function on a training set, but also to generalize to novel situations in order to yield accurate predictions at deployment. In most practical applications, the user cannot exhaustively enumerate every possible input to the model; strong generalization performance is therefore crucial to the development of ML systems which are performant and reliable enough to be deployed in the real world. While generalization is well-understood theoretically in a number of hypothesis classes, the impressive generalization performance of deep neural networks has stymied theoreticians. In deep reinforcement learning (RL), our understanding of generalization is further complicated by the conflict between generalization and stability in widely-used RL algorithms. This thesis will provide insight into generalization by studying the learning dynamics of deep neural networks in both supervised and reinforcement learning tasks.Comment: PhD Thesi
    corecore