11 research outputs found
Weakly Supervised Medical Image Segmentation With Soft Labels and Noise Robust Loss
Recent advances in deep learning algorithms have led to significant benefits
for solving many medical image analysis problems. Training deep learning models
commonly requires large datasets with expert-labeled annotations. However,
acquiring expert-labeled annotations is not only expensive but also subjective
and error-prone, and inter-/intra-observer variability introduces noise into
the labels. This is a particular problem when using deep learning models for
segmenting medical images, where anatomical boundaries are often ambiguous.
Image-based medical diagnosis tools using deep learning models trained with
incorrect segmentation labels can lead to false diagnoses and treatment
suggestions. Multi-rater annotations might be better suited to train deep
learning models with small training sets compared to single-rater annotations.
The aim of this paper was to develop and evaluate a method to generate
probabilistic labels based on multi-rater annotations and anatomical knowledge
of the lesion features in MRI, and a method to train segmentation models on
probabilistic labels with a normalized active-passive loss as a "noise-tolerant
loss" function. The model was evaluated by comparing it to binary ground truth
for 17 knee MRI scans for clinical segmentation and detection of bone marrow
lesions (BML). The proposed method improved precision by 14%, recall by 22%,
and Dice score by 8% compared to a binary cross-entropy loss function.
Overall, the results of this work suggest that the proposed normalized
active-passive loss using soft labels successfully mitigated the effects of
noisy labels.
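As a rough illustration of the loss family named above (not necessarily the paper's exact formulation), an active-passive pair could combine a normalized cross-entropy term with a mean-absolute-error term computed against soft labels; the function name and the NCE+MAE pairing here are assumptions:

```python
import numpy as np

def active_passive_loss(p, q, alpha=1.0, beta=1.0, eps=1e-8):
    """Sketch of a normalized active-passive loss for one pixel.

    p : predicted class probabilities, shape (K,)
    q : soft (probabilistic) label, shape (K,)
    Active term: normalized cross-entropy, which stays bounded even for
    very wrong predictions. Passive term: MAE against the soft label.
    """
    log_p = np.log(p + eps)
    nce = np.sum(q * log_p) / np.sum(log_p)  # normalized cross-entropy
    mae = np.sum(np.abs(p - q))              # passive MAE term
    return alpha * nce + beta * mae

soft_label = np.array([0.9, 0.1])            # probabilistic, not hard 0/1
good = active_passive_loss(np.array([0.85, 0.15]), soft_label)
bad = active_passive_loss(np.array([0.2, 0.8]), soft_label)
```

Normalizing the active term bounds its contribution regardless of how confidently wrong a prediction is, which is the usual source of noise tolerance in this family.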
SoftSeg: Advantages of soft versus binary training for image segmentation
Most image segmentation algorithms are trained on binary masks formulated as
a classification task per pixel. However, in applications such as medical
imaging, this "black-and-white" approach is too constraining because the
contrast between two tissues is often ill-defined, i.e., the voxels located on
objects' edges contain a mixture of tissues. Consequently, assigning a single
"hard" label can result in a detrimental approximation. Instead, a soft
prediction containing non-binary values would overcome that limitation. We
introduce SoftSeg, a deep learning training approach that takes advantage of
soft ground truth labels, and is not bound to binary predictions. SoftSeg aims
at solving a regression instead of a classification problem. This is achieved
by using (i) no binarization after preprocessing and data augmentation, (ii) a
normalized ReLU final activation layer (instead of sigmoid), and (iii) a
regression loss function (instead of the traditional Dice loss). We assess the
impact of these three features on three open-source MRI segmentation datasets
from the spinal cord gray matter, the multiple sclerosis brain lesion, and the
multimodal brain tumor segmentation challenges. Across multiple
cross-validation iterations, SoftSeg outperformed the conventional approach,
leading to an increase in Dice score of 2.0% on the gray matter dataset
(p=0.001), 3.3% for the MS lesions, and 6.5% for the brain tumors. SoftSeg
produces consistent soft predictions at tissues' interfaces and shows an
increased sensitivity for small objects. The richness of soft labels could
represent the inter-expert variability, the partial volume effect, and
complement the model uncertainty estimation. The developed training pipeline
can easily be incorporated into most of the existing deep learning
architectures. It is already implemented in the freely-available deep learning
toolbox ivadomed (https://ivadomed.org).
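A minimal sketch of the three ingredients above: soft targets kept as-is, a normalized-ReLU final activation, and a regression loss. Normalizing by the per-image maximum and using MSE as the regression loss are assumptions here, not necessarily the paper's exact choices:

```python
import numpy as np

def normalized_relu(logits, eps=1e-8):
    """Final activation: ReLU followed by normalization (here by the
    per-image maximum, an assumption), so outputs land in [0, 1] without
    being squashed toward hard 0/1 decisions the way a sigmoid would."""
    act = np.maximum(logits, 0.0)
    return act / (act.max() + eps)

def regression_loss(pred, soft_target):
    """Voxel-wise MSE standing in for the paper's regression loss;
    the soft ground truth is never binarized."""
    return float(np.mean((pred - soft_target) ** 2))

logits = np.array([[-1.0, 0.5], [2.0, 4.0]])
soft_pred = normalized_relu(logits)          # soft values in [0, 1]
soft_gt = np.array([[0.0, 0.2], [0.6, 1.0]])
loss = regression_loss(soft_pred, soft_gt)
```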
Marginal Thresholding in Noisy Image Segmentation
This work presents a study on label noise in medical image segmentation by
considering a noise model based on Gaussian field deformations. Such noise is
of interest because it yields realistic looking segmentations and because it is
unbiased in the sense that the expected deformation is the identity mapping.
Efficient methods for sampling and closed form solutions for the marginal
probabilities are provided. Moreover, theoretically optimal solutions to the
loss functions cross-entropy and soft-Dice are studied and it is shown how they
diverge as the level of noise increases. Based on recent work on loss function
characterization, it is shown that optimal solutions to soft-Dice can be
recovered by thresholding solutions to cross-entropy with a particular a priori
unknown threshold that efficiently can be computed. This raises the question
whether the decrease in performance seen when using cross-entropy as compared
to soft-Dice is caused by using the wrong threshold. The hypothesis is
validated in 5-fold studies on three organ segmentation problems from the
TotalSegmentator dataset, using 4 different strengths of noise. The results show
that changing the threshold leads the performance of cross-entropy to go from
systematically worse than soft-Dice to similar or better results than
soft-Dice.
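The thresholding idea can be illustrated with a simple sweep: instead of cutting cross-entropy-trained probabilities at the default 0.5, pick the threshold that maximizes Dice. The paper computes the optimal threshold efficiently from the noise model; the exhaustive sweep below is only a sketch, and the names are hypothetical:

```python
import numpy as np

def dice(mask, gt, eps=1e-8):
    """Dice overlap between a binary mask and binary ground truth."""
    inter = np.logical_and(mask, gt).sum()
    return (2.0 * inter + eps) / (mask.sum() + gt.sum() + eps)

def best_threshold(probs, gt, candidates=np.linspace(0.05, 0.95, 19)):
    """Sweep candidate thresholds over predicted foreground probabilities
    and keep the one that maximizes Dice against the reference mask."""
    scores = [dice(probs >= t, gt) for t in candidates]
    i = int(np.argmax(scores))
    return float(candidates[i]), float(scores[i])

probs = np.array([0.15, 0.35, 0.45, 0.55, 0.8, 0.9])
gt = np.array([0, 1, 1, 1, 1, 1], dtype=bool)
t, score = best_threshold(probs, gt)  # a threshold below 0.5 recovers gt
```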
FedA3I: Annotation Quality-Aware Aggregation for Federated Medical Image Segmentation against Heterogeneous Annotation Noise
Federated learning (FL) has emerged as a promising paradigm for training
segmentation models on decentralized medical data, owing to its
privacy-preserving property. However, existing research overlooks the prevalent
annotation noise encountered in real-world medical datasets, which limits the
performance ceilings of FL. In this paper, we, for the first time, identify and
tackle this problem. For problem formulation, we propose a contour evolution
model for non-independent and identically distributed (Non-IID) noise across
pixels within each client and then extend it to the case of multi-source data
to form a heterogeneous noise model (i.e., Non-IID annotation noise across
clients). For robust learning from annotations with such two-level Non-IID
noise, we emphasize the importance of data quality in model aggregation,
allowing high-quality clients to have a greater impact on FL. To achieve this,
we propose Federated learning with Annotation quAlity-aware AggregatIon, named
FedA3I, by introducing a quality factor based on client-wise noise estimation.
Specifically, noise estimation at each client is accomplished through the
Gaussian mixture model and then incorporated into model aggregation in a
layer-wise manner to up-weight high-quality clients. Extensive experiments on
two real-world medical image segmentation datasets demonstrate the superior
performance of FedA3I against the state-of-the-art approaches in dealing
with cross-client annotation noise. The code is available at
https://github.com/wnn2000/FedAAAI
Comment: Accepted at AAAI'2
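A simplified sketch of quality-aware aggregation: per-client noise estimates are turned into normalized weights so that clients with cleaner annotations dominate the average. The exponential weighting and the helper names are assumptions; FedA3I derives its quality factor from a Gaussian mixture model fit per client and applies the weights layer-wise:

```python
import numpy as np

def quality_weights(noise_estimates, temperature=1.0):
    """Map per-client noise estimates (higher = noisier annotations) to
    normalized aggregation weights that up-weight low-noise clients."""
    scores = np.exp(-np.asarray(noise_estimates, dtype=float) / temperature)
    return scores / scores.sum()

def aggregate(client_params, weights):
    """Weighted average of one parameter tensor across clients;
    the full algorithm applies this layer by layer."""
    return sum(w * p for w, p in zip(weights, client_params))

noise = [0.1, 0.5, 2.0]                      # client 0 has the cleanest labels
w = quality_weights(noise)
layer = aggregate([np.full(2, 1.0), np.full(2, 2.0), np.full(2, 3.0)], w)
```

The aggregated parameters are pulled toward the low-noise client rather than the plain mean, which is the intended effect of the quality factor.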
3D hand tracking.
The hand is often considered one of the most natural and intuitive interaction modalities for human-to-human interaction. In human-computer interaction (HCI), proper 3D hand tracking is the first step in developing a more intuitive HCI system which can be used in applications such as gesture recognition, virtual object manipulation and gaming. However, accurate 3D hand tracking remains a challenging problem due to the hand's deformation, appearance similarity, high inter-finger occlusion and complex articulated motion. Further, 3D hand tracking is also interesting from a theoretical point of view as it deals with three major areas of computer vision: segmentation (of the hand), detection (of hand parts), and tracking (of the hand). This thesis proposes a region-based skin color detection technique, a model-based and an appearance-based 3D hand tracking technique to bring human-computer interaction applications one step closer. All techniques are briefly described below.

Skin color provides a powerful cue for complex computer vision applications. Although skin color detection has been an active research area for decades, the mainstream technology is based on individual pixels. This thesis presents a new region-based technique for skin color detection which outperforms the current state-of-the-art pixel-based skin color detection technique on the popular Compaq dataset (Jones & Rehg 2002). The proposed technique achieves a 91.17% true positive rate with a 13.12% false negative rate on the Compaq dataset, tested over approximately 14,000 web images.

Hand tracking is not a trivial task as it requires tracking the 27 degrees of freedom of the hand. Hand deformation, self-occlusion, appearance similarity and irregular motion are major problems that make 3D hand tracking a very challenging task. This thesis proposes a model-based 3D hand tracking technique, which is improved by using the proposed depth-foreground-background feature, palm deformation module and context cue.
However, the major problem with model-based techniques is that they are computationally expensive. This can be overcome by discriminative techniques as described below.

Discriminative techniques (for example, random forests) are good for hand part detection; however, they fail due to sensor noise and high inter-finger occlusion. Additionally, these techniques have difficulties in modelling kinematic or temporal constraints. Although model-based descriptive (for example, Markov Random Field) or generative (for example, Hidden Markov Model) techniques utilize kinematic and temporal constraints well, they are computationally expensive and hardly recover from tracking failure. This thesis presents a unified framework for 3D hand tracking, using the best of both methodologies, which outperforms the current state-of-the-art 3D hand tracking techniques. The proposed 3D hand tracking techniques in this thesis can be used to extract accurate hand movement features and enable complex human-machine interaction such as gaming and virtual object manipulation.
Matching and Segmentation for Multimedia Data
With the development of society, both industry and academia draw increasing attention to multimedia systems, which handle image/video data, audio data, and text data comprehensively and simultaneously. In this thesis, we mainly focus on multi-modality data understanding, combining the two subjects of Computer Vision (CV) and Natural Language Processing (NLP). Such a task is widely used in many real-world scenarios, including criminal search with language descriptions by the witness, robotic navigation with language instruction in the smart industry, terrorist tracking, missing person identification, and so on. However, such a multi-modality system still faces many challenges, limiting its performance and ability in real-life situations, including the domain gap between the modalities of vision and language, the request for high-quality datasets, and so on. Therefore, to better analyze and handle these challenges, this thesis focuses on the two fundamental tasks, including matching and segmentation.
Image-Text Matching (ITM) aims to retrieve the texts (images) that describe the most relevant contents for a given image (text) query. Due to the semantic gap between the linguistic and visual domains, aligning and comparing feature representations for languages and images are still challenging. To overcome this limitation, we propose a new framework for the image-text matching task, which uses an auxiliary captioning step to enhance the image feature, where the image feature is fused with the text feature of the captioning output. As the downstream application of ITM, the language-person search is one of the specific cases where language descriptions are provided to retrieve person images, which also suffers the domain gap between linguistic and visual data. To handle this problem, we propose a transformer-based language-person search matching framework with matching conducted between words and image regions for better image-text interaction. However, collecting a large amount of training data is neither cheap nor reliable using human annotations. We further study the one-shot person Re-ID (re-identification) task aiming to match people by offering one labeled reference image for each person, where previous methods request a large number of ground-truth labels. We propose progressive sample mining and representation learning to fit the limited labels for the one-shot Re-ID task better.
Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression. Existing methods jointly consider the localization and segmentation steps, which rely on the fused visual and linguistic features for both steps. We argue that the conflict between the purpose of finding the object and generating the mask limits the RES performance. To solve this problem, we propose a parallel position-kernel-segmentation pipeline that first isolates and then coordinates the localization and segmentation steps. In our pipeline, linguistic information will not directly contaminate the visual feature for segmentation. Specifically, the localization step localizes the target object in the image based on the referring expression, then the visual kernel obtained from the localization step guides the segmentation step. This pipeline also enables us to train RES in a weakly-supervised way, where the pixel-level segmentation labels are replaced by click annotations on center and corner points. The position head is trained with full supervision from the click annotations, and the segmentation head is trained with weakly-supervised segmentation losses.
This thesis focuses on the key limitations of the multimedia system, where the experiments show that the proposed frameworks are effective for the specific tasks. The experiments are easy to reproduce with clear details, and source code is provided for future work on these tasks.
On the Synergies between Machine Learning and Binocular Stereo for Depth Estimation from Images: a Survey
Stereo matching is one of the longest-standing problems in computer vision
with close to 40 years of studies and research. Throughout the years the
paradigm has shifted from local, pixel-level decision to various forms of
discrete and continuous optimization to data-driven, learning-based methods.
Recently, the rise of machine learning and the rapid proliferation of deep
learning enhanced stereo matching with new exciting trends and applications
unthinkable until a few years ago. Interestingly, the relationship between
these two worlds is two-way. While machine, and especially deep, learning
advanced the state-of-the-art in stereo matching, stereo itself enabled new
ground-breaking methodologies such as self-supervised monocular depth
estimation based on deep networks. In this paper, we review recent research in
the field of learning-based depth estimation from single and binocular images
highlighting the synergies, the successes achieved so far and the open
challenges the community is going to face in the immediate future.
Comment: Accepted to TPAMI. Paper version of our CVPR 2019 tutorial:
"Learning-based depth estimation from stereo and monocular images: successes,
limitations and future challenges"
(https://sites.google.com/view/cvpr-2019-depth-from-image/home)