146 research outputs found

    Adversarially Tuned Scene Generation

    Trained computer vision systems that use computer graphics (CG) generated data do not yet generalize well, because of the 'domain shift' between virtual and real data. Although simulated data augmented with a few real-world samples has been shown to mitigate domain shift and improve transferability of trained models, guiding or bootstrapping the virtual data generation with the distributions learnt from the target real-world domain is desired, especially in fields where annotating even a few real images is laborious (such as semantic labeling and intrinsic images). In order to address this problem in an unsupervised manner, our work combines recent advances in CG (which aims to generate stochastic scene layouts coupled with large collections of 3D object models) and generative adversarial training (which aims to train generative models by measuring the discrepancy between generated and real data in terms of their separability in the space of a deep discriminatively-trained classifier). Our method uses iterative estimation of the posterior density of prior distributions for a generative graphical model. This is done within a rejection sampling framework. Initially, we assume uniform distributions as priors on the parameters of a scene described by a generative graphical model. As iterations proceed, the prior distributions get updated to distributions that are closer to the (unknown) distributions of the target data. We demonstrate the utility of adversarially tuned scene generation on two real-world benchmark datasets (CityScapes and CamVid) for traffic scene semantic labeling with a deep convolutional net (DeepLab). We realized performance improvements of 2.28 and 3.14 points (using the IoU metric) between the DeepLab models trained on simulated sets prepared from the scene generation models before and after tuning to CityScapes and CamVid, respectively. Comment: 9 pages, accepted at CVPR 201
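
The iterative prior update described in this abstract can be illustrated with a toy sketch. It assumes a single discrete scene parameter and a hand-written `toy_score` function standing in for the discriminator; both are hypothetical and are not the paper's scene model or classifier.

```python
import random

def tune_prior(prior, score, n_samples=2000, n_iters=5, seed=0):
    """Iteratively re-estimate a discrete prior from accepted samples.

    prior: list of bin probabilities over one scene parameter (hypothetical).
    score: function bin_index -> acceptance probability in [0, 1]; a stand-in
           for a discriminator's "looks real" score (an assumption).
    """
    rng = random.Random(seed)
    bins = list(range(len(prior)))
    for _ in range(n_iters):
        # Draw candidate scene parameters from the current prior.
        candidates = rng.choices(bins, weights=prior, k=n_samples)
        # Rejection step: keep a candidate with probability score(candidate).
        accepted = [b for b in candidates if rng.random() < score(b)]
        if not accepted:
            continue  # nothing accepted; keep the current prior
        # Update: the new prior is the smoothed histogram of accepted samples.
        counts = [1.0] * len(prior)  # add-one smoothing keeps every bin reachable
        for b in accepted:
            counts[b] += 1.0
        total = sum(counts)
        prior = [c / total for c in counts]
    return prior

def toy_score(b):
    # Hypothetical discriminator: bin 7 looks most "real".
    return 0.9 if b == 7 else 0.2

uniform = [0.1] * 10          # start from a uniform prior, as in the abstract
tuned = tune_prior(uniform, toy_score)
```

Starting from the uniform prior, each round shifts probability mass toward the parameter values that the stand-in discriminator accepts most often.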

    Deep Learning Approaches for Seagrass Detection in Multispectral Imagery

    Seagrass forms the basis for critically important marine ecosystems. Seagrass is an important factor to balance marine ecological systems, and it is of great interest to monitor its distribution in different parts of the world. Remote sensing imagery is considered as an effective data modality based on which seagrass monitoring and quantification can be performed remotely. Traditionally, researchers utilized multispectral satellite images to map seagrass manually. Automatic machine learning techniques, especially deep learning algorithms, recently achieved state-of-the-art performances in many computer vision applications. This dissertation presents a set of deep learning models for seagrass detection in multispectral satellite images. It also introduces novel domain adaptation approaches to adapt the models for new locations and for temporal image series. In Chapter 3, I compare a deep capsule network (DCN) with a deep convolutional neural network (DCNN) for seagrass detection in high-resolution multispectral satellite images. These methods are tested on three satellite images in Florida coastal areas and obtain comparable performances. In addition, I also propose a few-shot deep learning strategy to transfer knowledge learned by DCN from one location to the others for seagrass detection. In Chapter 4, I develop a semi-supervised domain adaptation method to generalize a trained DCNN model to multiple locations for seagrass detection. First, the model utilizes a generative adversarial network (GAN) to align marginal distribution of data in the source domain to that in the target domain using unlabeled data from both domains. Second, it uses a few labeled samples from the target domain to align class-specific data distributions between the two. The model achieves the best results in 28 out of 36 scenarios as compared to other state-of-the-art domain adaptation methods. 
In Chapter 5, I develop a semantic segmentation method for seagrass detection in multispectral time-series images. First, I train a state-of-the-art image segmentation method using an active learning approach where I use the DCNN classifier in the loop. Then, I develop an unsupervised domain adaptation (UDA) algorithm to detect seagrass across temporal images. I also extend this unsupervised domain adaptation work to seagrass detection across locations. In Chapter 6, I present an automated bathymetry estimation model based on multispectral satellite images. Bathymetry refers to the depth of the ocean floor and plays a predominant role in identifying marine species in seawater. Accurate bathymetry information for coastal areas will facilitate seagrass detection by reducing false positives, because seagrass usually does not grow beyond a certain depth. However, bathymetry information for most parts of the world is obsolete or missing, and traditional bathymetry measurement systems require extensive labor. I utilize an ensemble machine learning-based approach to estimate bathymetry from a few in-situ sonar measurements and evaluate the proposed model in three coastal locations in Florida.

    RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

    3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images. However, image-based scene perception encounters significant challenges in achieving accurate prediction due to the absence of geometric priors. In this paper, we address this issue by exploring cross-modal knowledge distillation in this task, i.e., we leverage a stronger multi-modal model to guide the visual model during training. In practice, we observe that directly applying feature or logit alignment, proposed and widely used in bird's-eye-view (BEV) perception, does not yield satisfactory results. To overcome this problem, we introduce RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction. By employing differentiable volume rendering, we generate depth and semantic maps in perspective views and propose two novel consistency criteria between the rendered outputs of teacher and student models. Specifically, the depth consistency loss aligns the termination distributions of the rendered rays, while the semantic consistency loss mimics the intra-segment similarity guided by vision foundation models (VFMs). Experimental results on the nuScenes dataset demonstrate the effectiveness of our proposed method in improving various 3D occupancy prediction approaches, e.g., our proposed methodology enhances our baseline by 2.2% in the metric of mIoU and achieves 50% on the Occ3D benchmark. Comment: Accepted by AAAI 202
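
To make the idea of aligning ray termination distributions concrete, here is a minimal sketch using the standard volume-rendering quadrature. Using a KL divergence as the alignment measure, and the example density values, are assumptions made for illustration; the abstract does not specify the exact form of the loss.

```python
import math

def termination_weights(sigmas, delta=1.0):
    """Probability that a ray terminates in each sample interval.

    Standard volume-rendering quadrature: w_i = T_i * (1 - exp(-sigma_i * delta)),
    where T_i = exp(-sum_{j<i} sigma_j * delta) is the transmittance so far.
    """
    weights, transmittance = [], 1.0
    for sigma in sigmas:
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights

def normalize(ws, eps=1e-8):
    # Turn weights into a proper distribution; eps avoids log(0) later.
    total = sum(ws) + eps * len(ws)
    return [(w + eps) / total for w in ws]

def depth_consistency(teacher_sigmas, student_sigmas):
    """KL(teacher || student) between normalized termination distributions."""
    p = normalize(termination_weights(teacher_sigmas))
    q = normalize(termination_weights(student_sigmas))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [0.0, 0.1, 3.0, 0.5]   # hypothetical densities along one ray
student = [0.0, 2.0, 0.3, 0.5]
loss = depth_consistency(teacher, student)
```

The loss is zero when both models place their ray terminations identically and grows as the student's predicted surface drifts away from the teacher's.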

    From Fully-Supervised Single-Task to Semi-Supervised Multi-Task Deep Learning Architectures for Segmentation in Medical Imaging Applications

    Medical imaging is routinely performed in clinics worldwide for the diagnosis and treatment of numerous medical conditions in children and adults. With the advent of these medical imaging modalities, radiologists can visualize both the structure of the body as well as the tissues within the body. However, analyzing these high-dimensional (2D/3D/4D) images demands a significant amount of time and effort from radiologists. Hence, there is an ever-growing need for medical image computing tools to extract relevant information from the image data to help radiologists perform efficiently. Image analysis based on machine learning has pivotal potential to improve the entire medical imaging pipeline, providing support for clinical decision-making and computer-aided diagnosis. To be effective in addressing challenging image analysis tasks such as classification, detection, registration, and segmentation, specifically for medical imaging applications, deep learning approaches have shown significant improvement in performance. While deep learning has shown its potential in a variety of medical image analysis problems including segmentation, motion estimation, etc., generalizability is still an unsolved problem and many of these successes are achieved at the cost of a large pool of datasets. For most practical applications, getting access to a copious dataset can be very difficult, often impossible. Annotation is tedious and time-consuming. This cost is further amplified when annotation must be done by a clinical expert in medical imaging applications. Additionally, the applications of deep learning in the real-world clinical setting are still limited due to the lack of reliability caused by the limited prediction capabilities of some deep learning models. Moreover, while using a CNN in an automated image analysis pipeline, it’s critical to understand which segmentation results are problematic and require further manual examination. 
To this extent, the estimation of uncertainty calibration in a semi-supervised setting for medical image segmentation is still rarely reported. This thesis focuses on developing and evaluating optimized machine learning models for a variety of medical imaging applications, ranging from fully-supervised, single-task learning to semi-supervised, multi-task learning that makes efficient use of annotated training data. The contributions of this dissertation are as follows: (1) developing fully-supervised, single-task transfer learning for surgical instrument segmentation from laparoscopic images; (2) utilizing supervised, single-task transfer learning for segmenting and digitally removing the surgical instruments from endoscopic/laparoscopic videos to allow the visualization of the anatomy obscured by the tool, where the tool removal algorithms use a tool segmentation mask and either instrument-free reference frames or previous instrument-containing frames to fill in (inpaint) the instrument segmentation mask; (3) developing fully-supervised, single-task learning via efficient weight pruning and learned group convolution for accurate left ventricle (LV) and right ventricle (RV) blood pool and myocardium localization and segmentation from 4D cine cardiac MR images; (4) demonstrating the use of our fully-supervised, memory-efficient model to generate dynamic patient-specific RV models from a cine cardiac MRI dataset via an unsupervised learning-based deformable registration field; (5) integrating Monte Carlo dropout into our fully-supervised, memory-efficient model with inherent uncertainty estimation, with the overall goal of estimating the uncertainty associated with the obtained segmentation and error, as a means to flag regions that feature less than optimal segmentation results; (6) developing semi-supervised, single-task learning via self-training (through meta pseudo-labeling) in concert with a Teacher network that instructs the Student network by generating pseudo-labels given unlabeled input data; (7) proposing largely-unsupervised, multi-task learning to demonstrate the power of a simple combination of a disentanglement block, variational autoencoder (VAE), generative adversarial network (GAN), and a conditioning layer-based reconstructor for performing two of the most critical tasks in medical imaging: segmentation of cardiac structures and reconstruction of the cine cardiac MR images; and (8) demonstrating the use of 3D semi-supervised, multi-task learning for jointly learning multiple tasks in a single backbone module: uncertainty estimation, geometric shape generation, and cardiac anatomical structure segmentation of the left atrial cavity from 3D Gadolinium-enhanced magnetic resonance (GE-MR) images. This dissertation summarizes the impact of the contributions of our work in terms of demonstrating the adaptation and use of deep learning architectures featuring different levels of supervision to build a variety of image segmentation tools and techniques that can be used across a wide spectrum of medical image computing applications centered on facilitating and promoting widespread computer-integrated diagnosis and therapy data science.
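
Contribution (5) above relies on Monte Carlo dropout for uncertainty estimation. The sketch below shows the general idea on a toy linear model: dropout stays active at inference, and the spread over repeated stochastic passes serves as the uncertainty estimate. The model, inputs, and parameter values are hypothetical and unrelated to the dissertation's segmentation network.

```python
import random
import statistics

def stochastic_forward(x, weights, drop_rate, rng):
    """One forward pass of a toy linear model with dropout left on at inference."""
    keep = 1.0 - drop_rate
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += xi * wi / keep   # inverted-dropout scaling
    return total

def mc_dropout(x, weights, drop_rate=0.5, n_passes=200, seed=0):
    """Return (predictive mean, predictive variance) over stochastic passes."""
    rng = random.Random(seed)
    outs = [stochastic_forward(x, weights, drop_rate, rng) for _ in range(n_passes)]
    return statistics.mean(outs), statistics.pvariance(outs)

x = [1.0, 2.0, 3.0]           # hypothetical input features
w = [0.5, -1.0, 2.0]          # hypothetical model weights
mean, variance = mc_dropout(x, w)
```

In a segmentation setting the same procedure would be applied per pixel or voxel, and regions whose variance exceeds a chosen threshold could be flagged for manual review.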

    Unsupervised Contrastive Representation Learning for Knowledge Distillation and Clustering

    Unsupervised contrastive learning has emerged as an important training strategy that learns representations by pulling positive samples closer and pushing negative samples apart in a low-dimensional latent space. Usually, positive samples are augmented versions of the same input and negative samples come from different inputs. Once the low-dimensional representations are learned, further analysis, such as clustering and classification, can be performed using the representations. Currently, there are two challenges in this framework. First, empirical studies reveal that even though contrastive learning methods show great progress in representation learning for large models, they do not work well for small models. Second, this framework has achieved excellent clustering results on small datasets but has limitations on datasets with a large number of clusters, such as ImageNet. In this dissertation, our research goal is to develop new unsupervised contrastive representation learning methods and apply them to knowledge distillation and clustering. Knowledge distillation transfers knowledge from high-capacity teachers to small student models, improving the students' performance. Representational knowledge distillation methods, in turn, try to distill the knowledge of representations from teachers to students. Current representational knowledge distillation methods undesirably push apart representations of samples from the same class in their correlation objectives, leading to inferior distillation results. Here, we introduce Dual-level Knowledge Distillation (DLKD) by explicitly combining knowledge alignment and knowledge correlation instead of using one single contrastive objective. We show that both knowledge alignment and knowledge correlation are necessary to improve distillation performance. 
The proposed DLKD is task-agnostic and model-agnostic and enables effective knowledge transfer from supervised or self-supervised trained teachers to students. Experiments demonstrate that DLKD outperforms other state-of-the-art methods in a large number of experimental settings, including different (a) pretraining strategies, (b) network architectures, (c) datasets, and (d) tasks. Currently, a two-stage framework is widely used in deep learning-based clustering: representations are learned first, and a clustering algorithm such as K-means is then run on them to obtain cluster assignments. However, the learned representation may not be optimized for clustering in this two-stage framework. Here, we propose Contrastive Learning-based Clustering (CLC), which uses contrastive learning to directly learn cluster assignments. We decompose the representation into two parts: one encodes the categorical information under an equipartition constraint, and the other captures the instance-wise factors. We theoretically analyze the proposed contrastive loss and reveal that CLC sets different weights for the negative samples while learning cluster assignments. Therefore, the proposed loss has high expressiveness that enables us to efficiently learn cluster assignments. Experimental evaluation shows that CLC achieves overall state-of-the-art or highly competitive clustering performance on multiple benchmark datasets. In particular, we achieve 53.4% accuracy on the full ImageNet dataset and outperform existing methods by large margins (+10.2%).
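
The "pull positives closer, push negatives apart" objective that this abstract builds on is commonly written as an InfoNCE-style loss. The sketch below is that generic loss, named here as a stand-in; it is not the DLKD or CLC objective, and the vectors and temperature are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos/t) / (exp(s_pos/t) + sum_k exp(s_neg_k/t)) )."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_close = info_nce(anchor, [0.9, 0.1], negatives)   # positive near the anchor
loss_far = info_nce(anchor, [0.0, 1.0], negatives)     # positive far from the anchor
```

The loss shrinks as the positive moves toward the anchor relative to the negatives, which is the behavior the abstract describes.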

    DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

    Transformers have shown superior performance on various vision tasks. Their large receptive field endows Transformer models with higher representation power than their CNN counterparts. Nevertheless, simply enlarging the receptive field also raises several concerns. On the one hand, using dense attention in ViT leads to excessive memory and computational cost, and features can be influenced by irrelevant parts that are beyond the regions of interest. On the other hand, the handcrafted attention adopted in PVT or Swin Transformer is data agnostic and may limit the ability to model long-range relations. To solve this dilemma, we propose a novel deformable multi-head attention module, where the positions of key and value pairs in self-attention are adaptively allocated in a data-dependent way. This flexible scheme enables the proposed deformable attention to dynamically focus on relevant regions while maintaining the representation power of global attention. On this basis, we present Deformable Attention Transformer (DAT), a general vision backbone that is efficient and effective for visual recognition. We further build an enhanced version, DAT++. Extensive experiments show that our DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU. Comment: 17 pages, 6 figures, 11 table
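
The core mechanism, reading keys and values at shifted rather than fixed positions, can be shown with a scalar 1D toy. In this sketch the offsets are passed in by hand, whereas the abstract describes them as allocated in a data-dependent way; the function names and values are hypothetical and this is not the DAT architecture.

```python
import math

def sample_linear(feature, pos):
    """Linearly interpolate a 1D feature list at a fractional position (clamped)."""
    pos = min(max(pos, 0.0), len(feature) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(feature) - 1)
    frac = pos - lo
    return (1.0 - frac) * feature[lo] + frac * feature[hi]

def deformable_attention_1d(query, feature, ref_points, offsets):
    """Attend over features sampled at reference points shifted by offsets."""
    # Keys and values are read at shifted positions rather than on a fixed grid.
    sampled = [sample_linear(feature, r + o) for r, o in zip(ref_points, offsets)]
    scores = [query * s for s in sampled]          # scalar "dot product"
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # numerically stable softmax
    z = sum(exps)
    attn = [e / z for e in exps]
    return sum(a * s for a, s in zip(attn, sampled)), sampled

feature = [0.0, 10.0, 20.0]
refs = [0.0, 1.0, 2.0]
out, sampled = deformable_attention_1d(0.1, feature, refs, [0.5, 0.5, -0.5])
```

With all offsets set to zero the function reduces to ordinary attention over the grid positions, which is a convenient sanity check.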