8 research outputs found

    An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains

    3D Object Detectors (3D-ODs) are crucial for understanding the environment in many robotic tasks, especially autonomous driving. Including 3D information via Lidar sensors improves accuracy greatly. However, such detectors perform poorly on domains they were not trained on, i.e., different locations, sensors, or weather, limiting their reliability in safety-critical applications. Methods exist to adapt 3D-ODs to these domains; however, they treat 3D-ODs as a black box, neglecting underlying architectural decisions and source-domain training strategies. Instead, we dive deep into the details of 3D-ODs, focusing our efforts on fundamental factors that influence robustness prior to domain adaptation. We systematically investigate four design choices (and the interplay between them) often overlooked in 3D-OD robustness and domain adaptation: architecture, voxel encoding, data augmentations, and anchor strategies. We assess their impact on the robustness of nine state-of-the-art 3D-ODs across six benchmarks encompassing three types of domain gaps: sensor type, weather, and location. Our main findings are: (1) transformer backbones with local point features are more robust than 3D CNNs, (2) test-time anchor size adjustment is crucial for adaptation across geographical locations, significantly boosting scores without retraining, (3) source-domain augmentations allow the model to generalize to low-resolution sensors, and (4) surprisingly, robustness to bad weather is improved by training on more clean-weather data rather than by training on bad-weather data. We outline our main conclusions and findings to provide practical guidance on developing more robust 3D-ODs.
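Finding (2), test-time anchor size adjustment, lends itself to a compact illustration. Below is a minimal sketch, assuming a detector whose anchors are stored as (length, width, height) triples and assuming access to mean object sizes for both domains; the numbers are illustrative placeholders, not values from the paper.

```python
import numpy as np

def adjust_anchors(anchors_lwh: np.ndarray,
                   source_mean_lwh: np.ndarray,
                   target_mean_lwh: np.ndarray) -> np.ndarray:
    """Rescale anchor boxes by the ratio of mean object sizes between domains."""
    return anchors_lwh * (target_mean_lwh / source_mean_lwh)

# Hypothetical mean car dimensions (meters) in a source and a target location.
source_mean = np.array([3.9, 1.6, 1.56])
target_mean = np.array([4.7, 1.9, 1.73])
car_anchor = np.array([[3.9, 1.6, 1.56]])   # the detector's original car anchor
print(adjust_anchors(car_anchor, source_mean, target_mean))
```

Because only the anchor parameters change, such an adjustment requires no retraining, which is what makes it attractive for adaptation across geographical locations.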

    Semantic Segmentation of Ambiguous Images

    Medical images can be difficult to interpret. Not only because recognizing structures and possible changes requires experience and years of training, but also because the depicted measurements are often inherently ambiguous. Fundamentally, this is a consequence of the fact that medical imaging modalities, such as MRI or CT, provide only indirect measurements of the underlying molecular identities. The semantic meaning of an image can therefore generally only be grasped given a larger image context, which, however, often does not suffice for an unambiguous interpretation in the form of a single hypothesis. Similar scenarios exist in natural images, where the contextual information needed to resolve ambiguities can be limited, for example due to occlusions or noise in the recording. In addition, overlapping or vague class definitions can lead to ill-posed or diverse solution spaces. The presence of such ambiguities can also impair the training and performance of machine learning methods. Moreover, current models are largely incapable of providing complexly structured and diverse predictions and are instead forced to settle for sub-optimal single solutions or indistinguishable mixtures. This can be particularly problematic when classification methods are scaled to pixel-wise predictions, as in semantic segmentation. Semantic segmentation is the task of assigning a class category to every pixel in an image. This kind of detailed image understanding also plays an important role in the diagnosis and treatment of diseases such as cancer: tumors are frequently detected in MRI or CT images, and their precise localization and segmentation is of great importance for their assessment, for the preparation of possible biopsies, and for the planning of focal therapies. These clinical image analyses, but also the visual perception of our surroundings during everyday tasks such as driving, are currently performed by humans. As machine learning methods become increasingly embedded in our decision-making processes, it is important to model these tasks adequately. This includes uncertainty estimates for model predictions, among them those uncertainties that can be attributed to image ambiguities. The present thesis proposes several ways of dealing with ambiguous image evidence. First, we examine the current clinical standard, which, in the case of prostate lesions, consists of subjectively rating MRI-visible lesions for their aggressiveness, a process that suffers from high inter-rater variability. According to our studies, even simple machine learning methods, and even simple quantitative MRI-based parameters, can outperform an individual, subjective expert, suggesting promising potential in quantifying the process. Furthermore, we put the currently most successful segmentation architecture to the test on a highly ambiguous dataset that was collected and annotated during clinical routine. Our experiments show that the standard segmentation loss function can be sub-optimal in scenarios with strong annotation noise. As an alternative, we explore learning a model of the loss function with the aim of allowing plausible solutions to coexist during training. We observe increased performance when using this training method for otherwise unchanged neural network architectures, and find further increased relative improvements in the low-data limit. Scarcity of data and annotations, high levels of image and annotation noise, and ambiguous image evidence are particularly common in medical image datasets. This part of the thesis therefore exposes some of the weaknesses that standard machine learning techniques can exhibit in light of these particularities. Current segmentation models, such as those employed above, are limited in that they can produce only a single prediction. This contrasts with the observation that a group of annotators, given ambiguous image data, typically produces a set of diverse but plausible annotations. To remedy this model limitation and enable an appropriately probabilistic treatment of the task, we develop two models that predict a distribution over plausible annotations instead of a single deterministic annotation. The first of the two models combines an encoder-decoder model with variational inference and uses a global latent vector that encodes the space of possible annotations for a given image. We show that this model performs considerably better than the reference methods and exhibits well-calibrated uncertainties. The second model improves on this approach by using a more flexible, hierarchical formulation that captures the variability of the segmentations at different scales. This increases the granularity of the segmentation details the model can produce and makes it possible to model independently varying image regions and scales. Both of these novel generative segmentation models can, where appropriate, produce diverse and coherent image segmentations, in contrast to previous work, which is either deterministic, models uncertainties only at the pixel level, or suffers from capturing inappropriately low diversity. In summary, this thesis addresses the application of machine learning to the interpretation of medical images: we demonstrate the possibility of improving on the clinical standard through the quantitative use of image parameters that currently enter diagnoses only subjectively, we show the potential benefit of a new training procedure for mitigating the apparent vulnerability of the standard segmentation loss function to strong annotation noise, and we propose two new probabilistic segmentation models that can accurately learn the distribution over appropriate annotations. These contributions can be seen as steps towards a more quantitative, more principled, and uncertainty-aware analysis of medical images, an important goal in view of the ongoing integration of learning-based systems into clinical workflows.
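The first generative model described above, a segmentation network conditioned on a global latent vector learned via variational inference, can be sketched as follows. This is a toy stand-in under assumed layer sizes and a simplified prior network, not the thesis implementation: sampling different latent vectors yields different plausible segmentations of the same image.

```python
import torch
import torch.nn as nn

class LatentSegmenter(nn.Module):
    def __init__(self, in_ch=1, classes=2, z_dim=6):
        super().__init__()
        self.prior = nn.Sequential(            # image -> latent Gaussian parameters
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2 * z_dim))
        self.dec = nn.Conv2d(in_ch + z_dim, classes, 3, padding=1)

    def forward(self, x):
        mu, logvar = self.prior(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        zmap = z[:, :, None, None].expand(-1, -1, *x.shape[2:])
        return self.dec(torch.cat([x, zmap], dim=1))           # per-sample logits

model = LatentSegmenter()
x = torch.randn(1, 1, 64, 64)
samples = [model(x).argmax(1) for _ in range(3)]  # diverse segmentation hypotheses
```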

    Panoramic Image-to-Image Translation

    In this paper, we tackle the challenging task of Panoramic Image-to-Image translation (Pano-I2I) for the first time. This task is difficult due to the geometric distortion of panoramic images and the lack of a panoramic image dataset with diverse conditions, such as weather or time of day. To address these challenges, we propose a panoramic distortion-aware I2I model that preserves the structure of the panoramic images while consistently translating their global style referenced from a pinhole image. To mitigate the distortion issue in naive 360° panorama translation, we add spherical positional embeddings to our transformer encoders, introduce a distortion-free discriminator, and apply sphere-based rotation for augmentation and ensembling. We also design the content and style encoders to be deformation-aware in order to handle the large domain gap between panoramas and pinhole images, enabling us to work with pinhole images under diverse conditions. In addition, considering the large discrepancy between panoramas and pinhole images, our framework decouples the learning procedure of the panoramic reconstruction stage from the translation stage. We show distinct improvements over existing I2I models in translating the daytime StreetLearn dataset into diverse conditions. The code will be made publicly available online for our community.
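The spherical positional embedding mentioned above can be approximated as follows: patch positions on the equirectangular grid are lifted to 3D points on the unit sphere before being embedded, so the encoder sees the panorama's true geometry. The exact formulation in the paper may differ; this is an illustrative variant with an assumed learned linear lifting.

```python
import math
import torch

def spherical_pos_embed(h_patches: int, w_patches: int, dim: int) -> torch.Tensor:
    """Return (h_patches * w_patches, dim) embeddings from unit-sphere coordinates."""
    theta = torch.linspace(0, math.pi, h_patches)      # latitude of patch rows
    phi = torch.linspace(0, 2 * math.pi, w_patches)    # longitude of patch columns
    t, p = torch.meshgrid(theta, phi, indexing="ij")
    xyz = torch.stack([t.sin() * p.cos(), t.sin() * p.sin(), t.cos()], dim=-1)
    proj = torch.nn.Linear(3, dim)  # learned lifting; a module parameter in a real model
    return proj(xyz.reshape(-1, 3))

emb = spherical_pos_embed(8, 16, 64)  # one embedding per equirectangular patch token
```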

    Deep Learning for Lung Cancer Detection: An Analysis of the Effects of Imperfect Data and Model Biases

    Lung cancer is the cancer with the highest mortality, as it is usually diagnosed in later stages when treatment options are limited. The most promising solution to reducing the burden associated with lung cancer is screening, so that signs of cancer may be detected while still in the early stages. The National Lung Screening Trial (NLST) has shown that the use of low-dose Computed Tomography (CT) for screening instead of chest radiography led to a reduction of 20% in lung cancer mortality in high-risk patients. The introduction of screening programmes will produce a large volume of thoracic CT scans that will need to be processed and assessed by expert radiologists. In this thesis, the aim is to leverage machine learning, and specifically deep learning techniques, for the detection of lung cancer. While the detection of pulmonary nodules can be considered a mostly solved problem, the characterisation of the nodules remains a challenging task. Initially, this thesis explores how Convolutional Neural Network (CNN) architectures perform on pulmonary nodule characterisation tasks, such as spiculation and malignancy classification, by leveraging the publicly available and widely used Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI). The analysis delves deeper into the learnt latent representations to provide valuable insights about CNNs. The findings indicate the presence of biases, with a strong inter-connection of size and malignancy. This correlation, however, while not spurious, is not the only cause of the malignant nature of a pulmonary nodule. To uncover the reasons behind the presence of such biases, the thesis then branches out in two directions in an attempt to understand whether the shortcomings stem from the data or the models. The first direction focuses on LIDC-IDRI and sheds new light on problematic design choices in prior works, which aggregate multiple annotations to extract nodule labels. The second direction introduces a synthetic dataset with fully controllable modes of variation to explore the features that CNN architectures learn under different loss functions and learning schemes, such as contrastive learning. Having identified that many issues relating to biases in computational methods for lung nodule analysis arise from the data and not the model, the last part of this thesis turns to the NLST dataset, which contains biopsy-confirmed ground truth labels. However, the lack of consistency in the design of lung cancer datasets, and primarily the absence of nodule-level annotations, hampers the direct transfer of methods developed for LIDC-IDRI. To mitigate these issues, multiple instance learning and weak annotations are explored in order to perform patient-level cancer classification. Overall, this thesis focuses on representation learning for pulmonary nodule characterisation and highlights its limitations, which stem from imperfect data and inconsistencies in the dataset generation process.
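As a concrete illustration of patient-level classification via multiple instance learning, the sketch below uses attention-based MIL pooling over a bag of nodule or patch features, in the spirit of the gated-attention MIL recipe; the architecture and feature dimensions are assumptions for illustration, not the thesis model.

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.clf = nn.Linear(feat_dim, 1)

    def forward(self, bag):                       # bag: (num_instances, feat_dim)
        w = torch.softmax(self.attn(bag), dim=0)  # per-instance attention weights
        pooled = (w * bag).sum(dim=0)             # weighted bag embedding
        return torch.sigmoid(self.clf(pooled))    # patient-level cancer probability

bag = torch.randn(12, 128)                        # e.g., 12 candidate-nodule features
print(AttentionMIL()(bag))
```

The attention weights also indicate which instances drove the patient-level decision, which is useful when only weak, scan-level labels are available.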

    Deep Graph Representation Learning and its Application on Graph Clustering

    Graphs like social networks, molecular graphs, and traffic networks are everywhere in the real world. Deep Graph Representation Learning (DGL) is essential for most graph applications, such as graph classification, link prediction, and community detection. DGL has made significant progress in recent years thanks to the development of Graph Neural Networks (GNNs). However, the field still faces several crucial challenges, in (semi-)supervised DGL, self-supervised DGL, and DGL-based graph clustering. In this thesis, I propose three models to address the problems in these three areas respectively. GNNs have been widely used in DGL problems. However, GNNs suffer from over-smoothing due to their repeated local aggregation, and from over-squashing due to the exponential growth in computation paths with increased model depth, which confines their expressive power. To solve this problem, a Hierarchical Structure Graph Transformer called HighFormer is proposed to leverage both local and relatively global structure information. I use GNNs to learn the initial graph node representation based on the local structure information, while a structural attention module learns the relatively global structural similarity. The improved attention matrix is then obtained by adding the relatively global structure similarity matrix to the traditional attention matrix, and the graph representation is learned from this improved attention matrix. Graph contrastive learning (GCL) has recently become the most powerful method in self-supervised graph representation learning (SGL), of which graph augmentation is a critical component for generating different views of input graphs. Most existing GCL methods perform stochastic data augmentation schemes, for example randomly dropping edges or masking node features. However, uniform transformations without carefully designed augmentation techniques may drastically change the underlying semantics of graphs or graph nodes. I argue that graph augmentation schemes should preserve the intrinsic semantics of graphs. Besides, existing GCL methods neglect semantic information, which may introduce false-negative samples. Therefore, a novel GCL method with semantic-invariance graph augmentation, termed SemiGCL, is proposed by designing a semantic-invariance graph augmentation (SemiAug) and a semantic-based graph contrastive (SGC) scheme. Deep graph clustering (DGC), which aims to divide graph nodes into different clusters, is a challenging graph analysis task. DGC usually consists of an encoding neural network and a clustering method. Although DGC has made remarkable progress with the development of deep learning, I observed two drawbacks in existing methods: 1) Existing methods usually overlook the global structural information in the node encoding process; consequently, the discriminative capability of the representations is limited. 2) Most existing methods leverage traditional clustering methods such as K-means and spectral clustering; however, these clustering methods cannot be trained simultaneously with the DGL methods, leading to sub-optimal clustering performance. To address these issues, I propose a novel self-supervised DGC method termed Structural Semantic Contrastive Deep Graph Clustering (SECRET). To get a more discriminative representation, I design a structure contrastive scheme (SCS) that contrasts the aggregation of first-order neighbors with a graph diffusion. A consistency loss is also proposed to keep the structure of different views consistent. To jointly optimize the DGL and clustering method, I propose a novel Self-supervised Deep-learning-based Clustering (SDC) model.
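The core HighFormer mechanism described above, adding a relatively global structure similarity matrix to the traditional attention matrix, can be written compactly. How the similarity matrix is computed in the thesis is not specified here; the random placeholder below stands in for any precomputed structural similarity.

```python
import torch

def structure_aware_attention(Q, K, V, S):
    """Q, K, V: (n, d) node projections; S: (n, n) structural similarity matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5 + S   # local dot-product attention + global structure
    return torch.softmax(scores, dim=-1) @ V

n, d = 5, 16
Q, K, V = (torch.randn(n, d) for _ in range(3))
S = torch.rand(n, n)                           # placeholder for precomputed similarity
out = structure_aware_attention(Q, K, V, S)    # (n, d) structure-aware node features
```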

    Contributions and applications around low resource deep learning modeling

    Deep learning is the state of the art for several machine learning tasks. Many of these tasks require large amounts of computational resources, which limits their adoption in embedded devices. The main goal of this dissertation is to study methods and algorithms that allow problems to be approached using deep learning with restricted computational resources. This work also aims at presenting applications of deep learning in industry. The first contribution is a new activation function for deep learning networks: the modulus function. The experiments show that the proposed activation function achieves superior results in computer vision tasks when compared with the alternatives found in the literature. The second contribution is a new strategy to combine pre-trained models using knowledge distillation. The results of this chapter show that it is possible to significantly increase the accuracy of the smallest pre-trained models, allowing high performance at a lower computational cost. The following contribution of this thesis tackles the problem of sales forecasting in the field of logistics. Two end-to-end systems with two different deep learning techniques (sequence-to-sequence models and transformers) are proposed. The results of this chapter conclude that it is possible to build end-to-end systems that predict the sales of multiple individual products, at multiple points of sale and at different times, with a single machine learning model; the proposed model outperforms the alternatives found in the literature. Finally, the last two contributions belong to the speech technology field. The former studies how to build a Keyword Spotting speech recognition system using an efficient version of a convolutional neural network; in this study, the proposed system is able to beat the performance of all the benchmarks found in the literature when tested against the most complex subtasks. The latter study proposes a standalone state-of-the-art text-to-speech model capable of synthesizing intelligible voice in thousands of voice profiles, while generating speech with meaningful and expressive prosody variations. The proposed approach removes the dependency of previous models on an additional voice system, which makes the proposed system more efficient at training and inference time, and enables offline and on-device operation.
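The first contribution, the modulus activation function, is simple enough to state directly: f(x) = |x|. A minimal PyTorch wrapper, with illustrative layer sizes, might look like this.

```python
import torch
import torch.nn as nn

class Modulus(nn.Module):
    """Modulus activation: passes magnitude, discards sign."""
    def forward(self, x):
        return torch.abs(x)

# Drop-in replacement for ReLU in a small vision block (sizes are illustrative):
block = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), Modulus(),
                      nn.Conv2d(16, 16, 3, padding=1), Modulus())
y = block(torch.randn(1, 3, 32, 32))
```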

    Scene and Action Understanding Using Context and Knowledge Sharing

    Complete scene understanding from video data involves spatio-temporal decision making over long sequences and the utilization of world knowledge. We propose a method that captures edge connections between these spatio-temporal components, or knowledge graphs, through a graph convolutional network (GCN). Our approach uses the GCN to fuse various information in the video, such as detected objects, human pose, and scene information, for action segmentation. For tasks like zero-shot and few-shot action recognition, we learn a classifier for unseen test classes through comparison with similar training classes. We provide information about the similarity between two classes through an explicit relationship map, i.e., the knowledge graph. We study different kinds of knowledge graphs, based on action phrases, verbs or nouns, and visual features, to demonstrate how they perform with respect to each other. We build an integrated approach for zero-shot and few-shot learning. We also show further improvements through adaptive learning of the input knowledge graphs and by using a triplet loss along with the task-specific loss during training. We add results for semi-supervised learning as well, to understand the improvements from our graph learning technique. For complete scene understanding, we also study depth completion using a deep depth prior based on the deep image prior (DIP) technique. DIP shows that the structure of convolutional neural networks (CNNs) induces a strong prior that favors natural images. Given color images and noisy or incomplete target depth maps, we optimize a randomly-initialized CNN model to reconstruct a depth map restored by virtue of using the CNN network structure as a prior, combined with a view-constrained photo-consistency loss. This loss is computed using images from a geometrically calibrated camera at nearby viewpoints. The approach is based on test-time optimization, so it is independent of training data distributions. We apply this deep depth prior to inpainting and refining incomplete and noisy depth maps within both binocular and multi-view stereo pipelines.
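The deep depth prior boils down to test-time optimization of a randomly initialized CNN against the observed depth. The sketch below shows only the masked depth-fitting term; the view-constrained photo-consistency loss described above is omitted for brevity, and the tiny network and random data are placeholders.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 1, 3, padding=1))   # color image -> depth map

color = torch.rand(1, 3, 64, 64)                # input color image (placeholder)
depth = torch.rand(1, 1, 64, 64)                # noisy / incomplete target depth
mask = (torch.rand_like(depth) > 0.5).float()   # 1 where depth was observed

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):                            # test-time optimization, no training set
    opt.zero_grad()
    loss = ((net(color) - depth) ** 2 * mask).mean()  # fit only observed pixels
    loss.backward()
    opt.step()
inpainted = net(color)                          # depth restored by the CNN structure prior
```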

    Effective offline training and efficient online adaptation

    Developing agents that behave intelligently in the world is an open challenge in machine learning. Desiderata for such agents are efficient exploration, maximizing long-term utility, and the ability to effectively leverage prior data to solve new tasks. Reinforcement learning (RL) is an approach predicated on learning by directly interacting with an environment through trial and error, and presents a way for us to train and deploy such agents. Moreover, combining RL with powerful neural network function approximators, a sub-field known as “deep RL”, has shown evidence towards achieving this goal. For instance, deep RL has yielded agents that can play Go at superhuman levels, improve the efficiency of microchip designs, and learn complex novel strategies for controlling nuclear fusion reactions. A key issue that stands in the way of deploying deep RL is poor sample efficiency. Concretely, while it is possible to train effective agents using deep RL, the key successes have largely been in environments where we have access to large amounts of online interaction, often through the use of simulators. However, in many real-world problems, we are confronted with scenarios where samples are expensive to obtain. As alluded to above, one way to alleviate this issue is through access to prior data, often termed “offline data”, which can accelerate how quickly we learn such agents, for example by leveraging exploratory data to prevent redundant deployments, or by using human-expert data to quickly guide agents towards promising behaviors. However, the best way to incorporate this data into existing deep RL algorithms is not straightforward; naïvely pre-training on this offline data with RL algorithms, a paradigm called “offline RL”, as a starting point for subsequent learning is often detrimental. Moreover, it is unclear how to explicitly derive useful behaviors online that are positively influenced by this offline pre-training. With these factors in mind, this thesis follows a three-pronged strategy towards improving sample efficiency in deep RL. First, we investigate effective pre-training on offline data. Then, we tackle the online problem, looking at efficient adaptation to environments when operating purely online. Finally, we conclude with hybrid strategies that use offline data to explicitly augment policies when acting online.
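The offline-to-online recipe can be sketched schematically: pre-train a policy on offline data, then keep updating it from online interaction. In the sketch below, behavior cloning stands in for an offline RL algorithm and a dummy environment replaces real interaction; this is a structural illustration, not any of the thesis's methods.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# 1) Offline pre-training on logged (state, action) pairs (behavior cloning).
states = torch.randn(256, 4)            # dummy logged states
actions = torch.randint(0, 2, (256,))   # dummy logged actions
for _ in range(50):
    opt.zero_grad()
    nn.functional.cross_entropy(policy(states), actions).backward()
    opt.step()

# 2) Online adaptation: act, observe a reward, update (REINFORCE-style).
for _ in range(100):
    s = torch.randn(1, 4)                        # dummy environment state
    dist = torch.distributions.Categorical(logits=policy(s))
    a = dist.sample()
    r = torch.randn(())                          # dummy reward signal
    opt.zero_grad()
    (-(dist.log_prob(a) * r).sum()).backward()   # policy-gradient step
    opt.step()
```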