484 research outputs found

    Distilled Hierarchical Neural Ensembles with Adaptive Inference Cost

    Get PDF
    Deep neural networks form the basis of state-of-the-art models across a variety of application domains. Moreover, networks that are able to dynamically adapt the computational cost of inference are important in scenarios where the amount of compute or input data varies over time. In this paper, we propose Hierarchical Neural Ensembles (HNE), a novel framework to embed an ensemble of multiple networks by sharing intermediate layers using a hierarchical structure. In HNE we control the inference cost by evaluating only a subset of models, which are organized in a nested manner. Our second contribution is a novel co-distillation method to boost the performance of ensemble predictions with low inference cost. This approach leverages the nested structure of our ensembles to optimally allocate accuracy and diversity across the ensemble members. Comprehensive experiments over the CIFAR and ImageNet datasets confirm the effectiveness of HNE in building deep networks with adaptive inference cost for image classification.
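    For a concrete picture of the idea, below is a minimal PyTorch sketch of a tree-structured ensemble that shares a common trunk and evaluates only a nested subset of its members at inference time. The class name, layer widths, and the simple averaging rule are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch of a hierarchical ensemble with shared intermediate layers.
# All names, layer widths, and the averaging rule are illustrative assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn


class HierarchicalEnsemble(nn.Module):
    """Ensemble members share a trunk; inference cost is set by how many
    members (leaves) are evaluated."""

    def __init__(self, in_dim=3 * 32 * 32, hidden=128, num_classes=10, depth=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, hidden), nn.ReLU())
        # One branch per leaf; in a full tree-structured model the inner nodes
        # would also be shared, which is elided here for brevity.
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes))
            for _ in range(2 ** depth)
        ])

    def forward(self, x, num_members=None):
        """Evaluate only the first `num_members` leaves to bound inference cost."""
        k = num_members or len(self.branches)
        h = self.trunk(x)                       # shared computation, done once
        logits = [branch(h) for branch in self.branches[:k]]
        return torch.stack(logits).mean(dim=0)  # (sub-)ensemble prediction


if __name__ == "__main__":
    model = HierarchicalEnsemble()
    x = torch.randn(4, 3, 32, 32)
    cheap = model(x, num_members=1)  # low-cost prediction
    full = model(x)                  # full-ensemble prediction
    print(cheap.shape, full.shape)
```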

    Anytime Inference with Distilled Hierarchical Neural Ensembles

    Get PDF
    Inference in deep neural networks can be computationally expensive, and networks capable of anytime inference are important in scenarios where the amount of compute or quantity of input data varies over time. In such networks the inference process can be interrupted to provide a result faster, or continued to obtain a more accurate result. We propose Hierarchical Neural Ensembles (HNE), a novel framework to embed an ensemble of multiple networks in a hierarchical tree structure, sharing intermediate layers. In HNE we control the complexity of inference on the fly by evaluating more or fewer models in the ensemble. Our second contribution is a novel hierarchical distillation method to boost the prediction accuracy of small ensembles. This approach leverages the nested structure of our ensembles to optimally allocate accuracy and diversity across the individual models. Our experiments show that, compared to previous anytime inference models, HNE provides state-of-the-art accuracy-compute trade-offs on the CIFAR-10/100 and ImageNet datasets.
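    One way to read the hierarchical distillation idea is as a training objective in which the full ensemble's averaged prediction serves as a teacher for every nested sub-ensemble. The PyTorch snippet below is a hedged sketch under that reading; the function name, temperature, and loss weighting are placeholders, not the paper's exact formulation.

```python
# Hedged sketch of a distillation objective for nested sub-ensembles: the full
# ensemble's averaged prediction teaches each smaller, nested subset.
# Temperature and alpha are placeholder choices, not values from the paper.
import torch
import torch.nn.functional as F


def co_distillation_loss(member_logits, targets, temperature=2.0, alpha=0.5):
    """member_logits: list of [batch, classes] tensors, one per ensemble member,
    ordered so that the first k entries form the k-member nested sub-ensemble."""
    full_teacher = torch.stack(member_logits).mean(dim=0).detach()
    loss = 0.0
    for k in range(1, len(member_logits) + 1):
        sub_logits = torch.stack(member_logits[:k]).mean(dim=0)
        # Supervised term: the sub-ensemble should fit the labels.
        ce = F.cross_entropy(sub_logits, targets)
        # Distillation term: the sub-ensemble should match the full ensemble.
        kd = F.kl_div(
            F.log_softmax(sub_logits / temperature, dim=-1),
            F.softmax(full_teacher / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        loss = loss + alpha * ce + (1 - alpha) * kd
    return loss / len(member_logits)


if __name__ == "__main__":
    logits = [torch.randn(8, 10, requires_grad=True) for _ in range(4)]
    labels = torch.randint(0, 10, (8,))
    print(co_distillation_loss(logits, labels).item())
```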

    ERM++: An Improved Baseline for Domain Generalization

    Full text link
    Multi-source Domain Generalization (DG) measures a classifier's ability to generalize to new distributions of data it was not trained on, given several training domains. While several multi-source DG methods have been proposed, they incur additional complexity during training by using domain labels. Recent work has shown that a well-tuned Empirical Risk Minimization (ERM) training procedure, that is, simply minimizing the empirical risk on the source domains, can outperform most existing DG methods. We identify several key candidate techniques to further improve ERM performance, such as better utilization of training data, model parameter selection, and weight-space regularization. We call the resulting method ERM++, and show that it significantly improves DG performance on five multi-source datasets by over 5% compared to standard ERM, and beats the state of the art despite being less computationally expensive. Additionally, we demonstrate the efficacy of ERM++ on the WILDS-FMOW dataset, a challenging DG benchmark. We hope that ERM++ becomes a strong baseline for future DG research. Code is released at https://github.com/piotr-teterwak/erm_plusplus.
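    As context for what the ERM baseline actually does, the snippet below sketches plain empirical risk minimization over pooled source domains in PyTorch: domain labels are ignored and a single classifier is trained on the union of the sources. The toy datasets, model, and optimizer settings are placeholders, and none of the ERM++-specific ingredients (training-data utilization, checkpoint selection, weight-space regularization) are reproduced here.

```python
# Sketch of the ERM baseline for multi-source domain generalization:
# pool all source domains and minimize the empirical risk with one classifier.
# The toy data, model, and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Three toy "source domains" standing in for real DG datasets.
source_domains = [
    TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
    for _ in range(3)
]
loader = DataLoader(ConcatDataset(source_domains), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)  # empirical risk on pooled sources
        loss.backward()
        optimizer.step()
print("final batch loss:", loss.item())
```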

    Compressing Deep Neural Networks via Knowledge Distillation

    Get PDF
    There has been a continuous evolution in deep neural network architectures since Alex Krizhevsky proposed AlexNet in 2012. Part of this has been due to the increased complexity of the data and easier availability of datasets, and part of it has been due to the increased complexity of applications. These two factors form a self-sustaining cycle and have thereby pushed the boundaries of deep learning to new domains in recent years. Many datasets have been proposed for different tasks. In computer vision, notable datasets like ImageNet, CIFAR-10/100, and MS-COCO provide large training data for tasks like classification, segmentation, and object localization. Interdisciplinary datasets like the Visual Genome Dataset connect computer vision to tasks like natural language processing. All of these have fuelled the advent of architectures like AlexNet, VGG-Net, and ResNet that achieve better predictive performance on these datasets. In object detection, networks like YOLO, SSD, and Faster-RCNN have made great strides in achieving state-of-the-art performance. However, amidst the growth of neural networks, one aspect that has been neglected is the problem of deploying them on devices that lack the computational and memory resources required by Deep Neural Networks (DNNs). Modern technology is only as good as the number of platforms it can support. Many applications like face detection, person classification, and pedestrian detection require real-time execution on devices mounted on cameras. These devices are low powered and do not have the computational resources to run the data through a DNN and get instantaneous results. A natural solution to this problem is to make the DNN smaller through compression. However, unlike file compression, DNN compression has the goal of not significantly impacting the overall accuracy of the network. In this thesis we consider the problem of model compression and present our end-to-end training algorithm for training a smaller model under the influence of a collection of expert models. The smaller model can then be deployed on resource-constrained hardware independently from the expert models. We call this approach a form of compression since, by deploying a smaller model, we save the memory which would have been consumed by one or more expert models. We additionally introduce memory-efficient architectures, built on key ideas from the literature, that occupy very little memory, and show the results of training them using our approach.
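    To make the training setup concrete, here is a hedged PyTorch sketch of distilling a collection of frozen expert models into a much smaller student that can be deployed on its own. The architectures, temperature, and loss weights are illustrative assumptions, not the thesis's actual configuration.

```python
# Hedged sketch: compress a model by distilling several frozen expert (teacher)
# networks into one small student; only the student is deployed.
# Architectures, temperature, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

teachers = [nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10)) for _ in range(3)]
for t in teachers:
    t.eval()  # experts are frozen during student training

student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))  # much smaller model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T, alpha = 4.0, 0.7  # temperature and distillation weight (placeholders)

x = torch.randn(64, 32)          # toy inputs standing in for real training data
y = torch.randint(0, 10, (64,))  # toy labels

for step in range(100):
    with torch.no_grad():
        # Average the experts' softened predictions into one teacher signal.
        teacher_probs = torch.stack([F.softmax(t(x) / T, dim=-1) for t in teachers]).mean(0)
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1), teacher_probs,
                  reduction="batchmean") * T ** 2
    ce = F.cross_entropy(s_logits, y)
    loss = alpha * kd + (1 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("student parameters:", sum(p.numel() for p in student.parameters()))
```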

    Improving Deep Neural Network Training with Knowledge Distillation

    Get PDF
    Knowledge distillation, as a popular compression technique, has been widely used to reduce deep neural network (DNN) size for a variety of applications. However, in recent years, some research has found its potential for improving deep neural network performance. This dissertation focuses on further exploring its power to facilitate accurate and reliable DNN training. First, I explored a data-efficient method for black-box knowledge distillation, where the internals of the DNN being distilled are inaccessible. I integrated active learning and mixup to obtain significant distillation performance gains with limited data. This work reveals the competence of knowledge distillation to facilitate large foundation model applications. Next, I extended this work to solve a more challenging practical problem, i.e., COVID-19 infection prediction. Due to extremely limited data at the outbreak, it is very difficult to calibrate any existing epidemic model for practical prediction. I applied black-box knowledge distillation with sequence mixup to distill a comprehensive physics-based simulation system. With the obtained distilled model, epidemic models are better calibrated to fit limited observation data and provide more accurate and reliable projections. This work validates that knowledge distillation can enhance DNN training for complex time-series prediction with limited observation data. Next, I applied knowledge distillation to improve DNN reliability, which reflects accurate model prediction confidence. Ensemble modeling and data augmentation were blended into the distillation process to obtain a reliable DNN. This work justifies that knowledge distillation can equip training for a more reliable DNN. Furthermore, this dissertation extended my knowledge distillation study to semantic segmentation tasks. The study started with an investigation of semantic segmentation models and then proposed an adaptive convolution approach to improve the heterogeneity of local convolution fields. Experiments carried out across segmentation benchmarks of different scales show that this approach outperforms existing state-of-the-art schemes and successfully boosts the performance of various backbone models. After this investigation, semantic segmentation models were calibrated with the ensemble knowledge distillation previously applied to image classification calibration, with stronger augmentation incorporated into the distillation process. The experiments justify its effectiveness for semantic segmentation calibration.
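    As one concrete illustration of the black-box setting described above, the sketch below trains a student only from a teacher's output probabilities, using mixup to stretch a small query pool. The stand-in teacher, mixing rule, and hyperparameters are assumptions made for the example and do not reproduce the dissertation's method.

```python
# Hedged sketch of black-box distillation with mixup: the teacher is queried
# only for output probabilities (no gradients or internals), and mixup enlarges
# the effective query set when data is limited. All choices here are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
_teacher = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 5)).eval()


@torch.no_grad()
def query_teacher(x):
    """Black-box access: returns probabilities only."""
    return F.softmax(_teacher(x), dim=-1)


student = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

pool = torch.randn(128, 20)  # small pool of unlabeled query inputs

for step in range(200):
    idx = torch.randint(0, pool.size(0), (32,))
    x = pool[idx]
    # Mixup: convex combinations of query pairs act as additional queries.
    lam = torch.distributions.Beta(0.4, 0.4).sample().item()
    x_mix = lam * x + (1 - lam) * x[torch.randperm(x.size(0))]
    target = query_teacher(x_mix)               # soft labels from the black-box teacher
    pred = F.log_softmax(student(x_mix), dim=-1)
    loss = F.kl_div(pred, target, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print("final distillation loss:", loss.item())
```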