Leveraging Supervoxels for Medical Image Volume Segmentation With Limited Supervision
The majority of existing methods for machine learning-based medical image segmentation are supervised models that require large amounts of fully annotated images. These types of datasets are typically not available in the medical domain and are difficult and expensive to generate. Widespread use of machine learning-based models for medical image segmentation therefore requires the development of data-efficient algorithms that only require limited supervision. To address these challenges, this thesis presents new machine learning methodology for unsupervised lung tumor segmentation and few-shot learning-based organ segmentation. When working in the limited supervision paradigm, exploiting the available information in the data is key. The methodology developed in this thesis leverages automatically generated supervoxels in various ways to exploit the structural information in the images. The work on unsupervised tumor segmentation explores the opportunity of performing clustering at the population level in order to provide the algorithm with as much information as possible. To facilitate this population-level across-patient clustering, supervoxel representations are exploited to reduce the number of samples, and thereby the computational cost. In the work on few-shot learning-based organ segmentation, supervoxels are used to generate pseudo-labels for self-supervised training. Further, to obtain a model that is robust to the typically large and inhomogeneous background class, a novel anomaly detection-inspired classifier is proposed to ease the modelling of the background. To encourage the resulting segmentation maps to respect edges defined in the input space, a supervoxel-informed feature refinement module is proposed to refine the embedded feature vectors during inference. Finally, to improve trustworthiness, an architecture-agnostic mechanism to estimate model uncertainty in few-shot segmentation is developed.
Results demonstrate that supervoxels are versatile tools for leveraging structural information in medical data when training segmentation models with limited supervision.
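The idea of using supervoxels as pseudo-labels can be illustrated with a minimal sketch. It assumes a supervoxel label volume has already been computed by some over-segmentation method; the function name `sample_pseudo_label`, the `min_voxels` parameter, and the toy volume are hypothetical and are not taken from the thesis. One supervoxel is drawn at random and its binary mask serves as the foreground label for a self-supervised training episode.

```python
import numpy as np

def sample_pseudo_label(supervoxels, rng, min_voxels=1):
    """Pick one supervoxel at random and return its binary mask.

    supervoxels: integer array (D, H, W), one id per voxel (assumed precomputed).
    min_voxels:  hypothetical filter that skips very small supervoxels.
    """
    ids, counts = np.unique(supervoxels, return_counts=True)
    candidates = ids[counts >= min_voxels]
    chosen = rng.choice(candidates)
    mask = (supervoxels == chosen).astype(np.uint8)
    return mask, int(chosen)

# Toy volume: four slab-shaped "supervoxels" of 16 voxels each.
rng = np.random.default_rng(0)
toy = np.repeat(np.arange(4), 16).reshape(4, 4, 4)
mask, chosen = sample_pseudo_label(toy, rng)
```

In an actual pipeline the mask would be paired with the image volume (and an augmented copy of it) to form a support/query pair; that step is omitted here.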
A Survey of the Impact of Self-Supervised Pretraining for Diagnostic Tasks with Radiological Images
Self-supervised pretraining has been observed to be effective at improving
feature representations for transfer learning, leveraging large amounts of
unlabelled data. This review summarizes recent research into its usage in
X-ray, computed tomography, magnetic resonance, and ultrasound imaging,
concentrating on studies that compare self-supervised pretraining to fully
supervised learning for diagnostic tasks such as classification and
segmentation. The most pertinent finding is that self-supervised pretraining
generally improves downstream task performance compared to full supervision,
most prominently when unlabelled examples greatly outnumber labelled examples.
Based on the aggregate evidence, recommendations are provided for practitioners
considering using self-supervised learning. Motivated by limitations identified
in current research, directions and practices for future study are suggested,
such as integrating clinical knowledge with theoretically justified
self-supervised learning methods, evaluating on public datasets, growing the
modest body of evidence for ultrasound, and characterizing the impact of
self-supervised pretraining on generalization.
Comment: 32 pages, 6 figures, a literature survey submitted to BMC Medical Imaging
Lucid Data Dreaming for Video Object Segmentation
Convolutional networks reach top quality in pixel-level video object
segmentation but require a large amount of training data (1k~100k) to deliver
such results. We propose a new training strategy which achieves
state-of-the-art results across three evaluation datasets while using 20x~1000x
less annotated data than competing methods. Our approach is suitable for both
single and multiple object segmentation. Instead of using large training sets
hoping to generalize across domains, we generate in-domain training data using
the provided annotation on the first frame of each video to synthesize ("lucid
dream") plausible future video frames. In-domain per-video training data allows
us to train high quality appearance- and motion-based models, as well as tune
the post-processing stage. This approach reaches competitive results
even when training from only a single annotated frame, without ImageNet
pre-training. Our results indicate that using a larger training set is not
automatically better, and that for the video object segmentation task a smaller
training set that is closer to the target domain is more effective. This
changes the mindset regarding how many training samples and general
"objectness" knowledge are required for the video object segmentation task.
Comment: Accepted in International Journal of Computer Vision (IJCV)
Machine Learning for Instance Segmentation
Volumetric Electron Microscopy images can be used for connectomics, the study of brain connectivity at the cellular level.
A prerequisite for this inquiry is the automatic identification of neural cells, which requires machine learning algorithms and in particular efficient image segmentation algorithms.
In this thesis, we develop new algorithms for this task.
In the first part we provide, for the first time in this
field, a method for training a neural network to predict optimal input data for a watershed algorithm.
We demonstrate its superior performance compared to other segmentation methods of its category.
In the second part, we develop an efficient watershed-based algorithm for weighted graph
partitioning, the Mutex Watershed, which uses negative edge weights for the first time.
We show that it is intimately related to the multicut and achieves cutting-edge performance on a connectomics challenge.
Our algorithm is currently used by the leaders of two connectomics challenges.
Finally, motivated by inpainting neural networks, we create a method to learn the graph weights without any supervision.
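A rough sketch of a watershed-style graph partitioning with negative edge weights is given below. It is an illustration under stated assumptions, not the thesis implementation: edges are visited in order of decreasing absolute weight, a positive (attractive) edge merges two clusters unless a mutual-exclusion constraint separates them, and a negative (repulsive) edge records such a constraint. The function name `mutex_watershed_sketch` and the toy graph are hypothetical.

```python
def mutex_watershed_sketch(num_nodes, edges):
    """Partition a signed graph; edges are (u, v, weight) tuples.

    weight > 0: attractive edge, weight < 0: repulsive edge.
    Returns one cluster representative per node.
    """
    parent = list(range(num_nodes))
    # mutex[r] holds the roots of clusters that must stay separate from r.
    mutex = [set() for _ in range(num_nodes)]

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, w in sorted(edges, key=lambda e: -abs(e[2])):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if w > 0:
            if rv in mutex[ru]:
                continue  # merge forbidden by an earlier repulsive edge
            parent[rv] = ru
            # Transfer the constraints of rv to the new root ru.
            for r in mutex[rv]:
                mutex[r].discard(rv)
                mutex[r].add(ru)
                mutex[ru].add(r)
            mutex[rv].clear()
        else:
            mutex[ru].add(rv)
            mutex[rv].add(ru)
    return [find(i) for i in range(num_nodes)]

# Toy graph: the weak attractive edge (0, 3) is blocked by the
# stronger repulsive edge (1, 2).
labels = mutex_watershed_sketch(
    4, [(0, 1, 0.9), (2, 3, 0.8), (1, 2, -0.7), (0, 3, 0.3)]
)
```

On the toy graph, nodes 0 and 1 end up in one cluster and nodes 2 and 3 in another, because the repulsive edge is processed before the weaker attractive one.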
Explainable artificial intelligence (XAI) in deep learning-based medical image analysis
With an increase in deep learning-based methods, the call for explainability
of such methods grows, especially in high-stakes decision making areas such as
medical image analysis. This survey presents an overview of eXplainable
Artificial Intelligence (XAI) used in deep learning-based medical image
analysis. A framework of XAI criteria is introduced to classify deep
learning-based medical image analysis methods. Papers on XAI techniques in
medical image analysis are then surveyed and categorized according to the
framework and according to anatomical location. The paper concludes with an
outlook on future opportunities for XAI in medical image analysis.
Comment: Submitted for publication. Comments welcome by email to first author
Three-Dimensional Object Search, Understanding, and Pose Estimation with Low-Cost Sensors
With the recent development of low-cost depth sensors, an entirely new type of 3D data is being generated rapidly by regular consumers. Traditionally, 3D data is produced by a small number of professional designers (i.e., the Computer Aided Design (CAD) model); however, 3D data from massive consumer-level sensors has the potential of introducing many new applications, such as user-captured 3D warehouse and search engines, robots with 3D sensing capability, and customized 3D printing. Nevertheless, the low-cost sensors used by general consumers also pose new technological challenges. First, they have relatively high levels of sensor noise. Second, the use of such consumer devices is often in uncontrolled settings, resulting in challenging conditions, such as poor lighting, cluttered scenes, and object occlusion. To address such emerging opportunities and associated challenges, this dissertation is dedicated to the development of novel algorithms and systems for 3D data understanding and processing, using input from a consumer-level 3D sensor.
In particular, the key problems of 3D shape retrieval, scene understanding, and pose recognition are explored in order to present a comprehensive coverage of the key aspects of content-based 3D shape analysis. To resolve the aforementioned challenges, we propose a flexible Markov Random Field (MRF) framework that uses local information to allow partial matching, and thus address the model incompleteness problem; the framework also uses higher-order correlation to provide additional robustness against sensor noise. With the MRF framework, these 3D analysis problems can be transformed into a unified potential energy minimization problem, while preserving the flexibility to adapt to different settings and resolve the unique challenges of each problem. The contributions of the dissertation include:
a. Cross-Domain 3D Retrieval: First we tackle the problem of searching 3D noise-free models using noisy data captured by low-cost 3D sensors, a unique cross-domain setting. To manage the challenges of sensor noise and model incompleteness from consumer-level sensors, we propose a novel MRF formulation for the retrieval problem. The potential function of the random field is designed to capture both the local shape and global spatial consistency in order to preserve the local matching capability, while offering robustness against the sensor noise. The specific form of the potential functions is determined efficiently by a series of weak classifiers, thus forming a variant of the Regression Tree Field (RTF). We achieve better retrieval precision and recall in the cross-domain settings with a consumer-level depth sensor compared with state-of-the-art approaches.
b. 3D Scene Understanding: We develop a scene understanding system based on input from consumer-level depth sensors. To resolve the key challenge of the lack of annotated 3D training data, we construct an MRF that connects the input 3D point cloud and the associated 2D reference images, based on which the 3D point cloud is stitched. A series of weak classifiers are trained to obtain an approximate semantic segmentation result from the reference images. The potential function of the field is designed to integrate the results from the classifiers, while taking advantage of the 3D spatial consistency in order to output a comprehensive scene understanding result. We achieve comparable accuracy and much faster speed compared with state-of-the-art 3D scene understanding systems, with the difference that we do not require annotated 3D training data.
c. Pose Recognition of Deformable Objects: We develop a method for supporting a robotics system to recognize pose and manipulate deformable objects. More specifically, garment pose is recognized with the help of an offline simulated database and the proposed retrieval approach. We use a novel binary feature representation extracted from the reconstructed 3D surfaces in order to allow efficient matching, thus achieving real-time performance. A spatial weight is further learned in order to integrate the local matching result. The system shows superior recognition accuracy and faster speed than the state-of-the-art approaches.
d. Application with 2D Data: In addition to the traditional 3D applications, we explore the possibility of extending the MRF formulation to 2D data, especially those used in classical low-level 2D vision problems, such as image deblurring and denoising. One well-known technique that uses an image prior, the probabilistic patch-based prior, is known to have a bottleneck in finding the most similar model from a model set, which can be posed as a retrieval problem. Therefore, we apply the MRF formulation originally developed for 3D shape retrieval, and extend it to this 2D problem by introducing a grid-like random field structure. We achieve a 40x acceleration compared with the state-of-the-art algorithm, while preserving quality.
We organize the dissertation as follows. First, the core problems of 3D shape retrieval, scene understanding, and pose recognition, together with the proposed solutions that use MRF and RTF, are explored in Part I. In Part II, the extension to 2D data is discussed. Extensive evaluation is performed for each specific task in order to compare the proposed approaches with state-of-the-art algorithms and systems, and also to justify the components of the proposed methods. Finally, in Part III, we include concluding remarks and a discussion of open issues and future work.
Learning to segment in images and videos with different forms of supervision
Much progress has been made in image and video segmentation over the last years. To a large extent, the success can be attributed to strong appearance models learned completely from data, in particular using deep learning methods. However, to perform best these methods require large representative datasets for training with expensive pixel-level annotations, which in the case of videos are prohibitively expensive to obtain. Therefore, there is a need to relax this constraint and to consider alternative forms of supervision, which are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision. First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate approximate pixel-level ground truth from these weaker forms of annotation to train a network, which achieves high-quality results comparable to full supervision without any modifications of the network architecture or the training procedure. Second, we address the problem of the excessive computational and memory costs inherent to solving video segmentation via graphs. We propose approaches to improve the runtime and memory efficiency as well as the output segmentation quality by learning the best representation of the graph from the available training data. In particular, we contribute methods for learning must-link constraints, the topology and edge weights of the graph, and for enhancing the graph nodes (superpixels) themselves. Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data for training convolutional networks.
We introduce an architecture which allows training with static images only and propose an elaborate data synthesis scheme which creates a large number of training examples close to the target domain from the given first frame mask. With the proposed techniques we show that densely annotated consecutive video data is not necessary to achieve high-quality temporally coherent video segmentation results. In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking, and contributes new ways of training convolutional networks with a limited amount of pixel-level annotated training data.