
    A Note on Knowledge Distillation Loss Function for Object Classification

    This research note provides a quick introduction to the knowledge distillation loss function used in object classification. In particular, we discuss its connection to a previously proposed logits matching loss function. We further treat knowledge distillation as a specific form of output regularization and demonstrate its connection to label smoothing and entropy-based regularization. Comment: Research Note, 4 pages.
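
    For reference, a minimal PyTorch sketch of the loss the note discusses: a temperature-softened KL term between teacher and student logits mixed with the usual cross-entropy. The temperature T and weight alpha below are illustrative choices, not values taken from the note.

        import torch
        import torch.nn.functional as F

        def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
            # Softened teacher/student distributions; the KL term is scaled by T^2
            # so its gradient magnitude matches the hard-label cross-entropy term.
            soft = F.kl_div(
                F.log_softmax(student_logits / T, dim=1),
                F.softmax(teacher_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            hard = F.cross_entropy(student_logits, labels)
            return alpha * soft + (1.0 - alpha) * hard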

    Knowledge Translation: A New Pathway for Model Compression

    Deep learning has witnessed significant advancements in recent years, at the cost of increasing training, inference, and model storage overhead. While existing model compression methods strive to reduce the number of model parameters while maintaining high accuracy, they inevitably require re-training the compressed model or impose architectural constraints. To overcome these limitations, this paper presents a novel framework, termed Knowledge Translation (KT), wherein a "translation" model is trained to receive the parameters of a larger model and generate compressed parameters. The concept of KT draws inspiration from language translation, which effectively employs neural networks to convert between languages while preserving meaning. Accordingly, we explore the potential of neural networks to convert between models of disparate sizes while preserving their functionality. We propose a comprehensive framework for KT, introduce data augmentation strategies to enhance model performance despite restricted training data, and successfully demonstrate the feasibility of KT on the MNIST dataset. Code is available at https://github.com/zju-SWJ/KT.
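
    A rough sketch of the translation idea under simple assumptions (an MLP mapping flattened parameter vectors, trained with an MSE objective); the authors' actual architecture is in the linked repository, and the layer sizes here are placeholders.

        import torch
        import torch.nn as nn

        class ParamTranslator(nn.Module):
            # Maps the flattened parameters of a large model to parameters
            # for a smaller one, analogous to translating between languages.
            def __init__(self, large_dim, small_dim, hidden=1024):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(large_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, small_dim),
                )

            def forward(self, large_params):       # (batch, large_dim)
                return self.net(large_params)      # (batch, small_dim)

        # Training pairs would be (large-model params, compressed-model params);
        # once trained, the translator compresses new models without re-training them.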

    Online Knowledge Distillation with Diverse Peers

    Distillation is an effective knowledge-transfer technique that uses the predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained high-capacity teacher, however, is not always available. Recently proposed online variants use the aggregated intermediate predictions of multiple student models as targets to train each student model. Although group-derived targets give a good recipe for teacher-free distillation, group members are quickly homogenized by simple aggregation functions, leading to early saturated solutions. In this work, we propose Online Knowledge Distillation with Diverse peers (OKDDip), which performs two-level distillation during training with multiple auxiliary peers and one group leader. In the first-level distillation, each auxiliary peer holds an individual set of aggregation weights, generated with an attention-based mechanism, to derive its own targets from the predictions of the other auxiliary peers. Learning from distinct target distributions helps to boost peer diversity for effective group-based distillation. The second-level distillation then transfers the knowledge in the ensemble of auxiliary peers to the group leader, i.e., the model used for inference. Experimental results show that the proposed framework consistently gives better performance than state-of-the-art approaches without sacrificing training or inference complexity, demonstrating the effectiveness of the proposed two-level distillation framework. Comment: Accepted to AAAI-2020.
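
    A hedged sketch of the first-level, attention-weighted aggregation described above: each peer derives its own soft target as an attention-weighted mix of the peers' predictions. The linear query/key projections and the einsum formulation are one plausible instantiation, not the paper's exact design.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PeerAttention(nn.Module):
            def __init__(self, num_classes, dim=64):
                super().__init__()
                self.query = nn.Linear(num_classes, dim)
                self.key = nn.Linear(num_classes, dim)

            def forward(self, peer_logits):               # (num_peers P, batch B, classes C)
                q = self.query(peer_logits)                # (P, B, d)
                k = self.key(peer_logits)                  # (P, B, d)
                # per-sample attention of each peer over all peers
                scores = torch.einsum("pbd,qbd->bpq", q, k)
                weights = F.softmax(scores, dim=-1)        # (B, P, P)
                # peer-specific soft targets: weighted mixes of peer predictions
                targets = torch.einsum("bpq,qbc->pbc", weights, peer_logits)
                return targets                             # (P, B, C)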

    Accelerating Diffusion Sampling with Classifier-based Feature Distillation

    Although diffusion models have shown great potential for generating higher-quality images than GANs, slow sampling speed hinders their wide application in practice. Progressive distillation has thus been proposed for fast sampling by progressively aligning the output images of an N-step teacher sampler with an N/2-step student sampler. In this paper, we argue that this distillation-based acceleration method can be further improved, especially for few-step samplers, with our proposed Classifier-based Feature Distillation (CFD). Instead of aligning output images, we distill the teacher's sharpened feature distribution into the student with a dataset-independent classifier, making the student focus on the important features to improve performance. We also introduce a dataset-oriented loss to further optimize the model. Experiments on CIFAR-10 show the superiority of our method in achieving high quality and fast sampling. Code will be released soon.
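
    One way to read the feature-level objective, sketched under the assumption of a shared classifier head and a single temperature; the paper's dataset-independent classifier and dataset-oriented loss are not reproduced here, and the temperature value is an assumption.

        import torch.nn.functional as F

        def cfd_style_loss(student_feat, teacher_feat, classifier_head, T=0.5):
            # Pass both features through the same classifier head and align the
            # resulting distributions; T < 1 sharpens them, echoing the abstract's
            # "sharpened feature distribution" (exact scheme is an assumption).
            s_logits = classifier_head(student_feat)
            t_logits = classifier_head(teacher_feat)
            return F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)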

    Domain-Specific Bias Filtering for Single Labeled Domain Generalization

    Conventional Domain Generalization (CDG) utilizes multiple labeled source datasets to train a generalizable model for unseen target domains. However, due to expensive annotation costs, the requirement of labeling all the source data is hard to meet in real-world applications. In this paper, we investigate a Single Labeled Domain Generalization (SLDG) task with only one source domain being labeled, which is more practical and challenging than the CDG task. A major obstacle in the SLDG task is the discriminability-generalization bias: the discriminative information in the labeled source dataset may contain domain-specific bias, constraining the generalization of the trained model. To tackle this challenging task, we propose a novel framework called Domain-Specific Bias Filtering (DSBF), which initializes a discriminative model with the labeled source data and then filters out its domain-specific bias with the unlabeled source data to improve generalization. We divide the filtering process into (1) feature extractor debiasing via k-means clustering-based semantic feature re-extraction and (2) classifier rectification through attention-guided semantic feature projection. DSBF unifies the exploration of the labeled and the unlabeled source data to enhance the discriminability and generalization of the trained model, resulting in a highly generalizable model. We further provide theoretical analysis to verify the proposed domain-specific bias filtering process. Extensive experiments on multiple datasets show the superior performance of DSBF in tackling both the challenging SLDG task and the CDG task. Comment: Accepted by the International Journal of Computer Vision (IJCV).
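
    An illustrative sketch of the k-means step in the filtering process, assuming scikit-learn and clustering into as many groups as classes; how the resulting anchors drive the debiasing and classifier rectification is specific to the paper and not shown.

        from sklearn.cluster import KMeans

        def semantic_anchors(features, num_classes):
            # features: (num_samples, feat_dim) array extracted from unlabeled
            # source data by the current model; clusters act as rough semantic
            # groups that are less tied to any single domain's bias.
            km = KMeans(n_clusters=num_classes, n_init=10).fit(features)
            centroids = km.cluster_centers_        # (num_classes, feat_dim)
            assignments = km.labels_               # pseudo semantic group per sample
            return centroids, assignments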

    Cross-Layer Distillation with Semantic Calibration

    Recently proposed knowledge distillation approaches based on feature-map transfer validate that intermediate layers of a teacher model can serve as effective targets for training a student model to obtain better generalization ability. Existing studies mainly focus on particular representation forms for knowledge transfer between manually specified pairs of teacher-student intermediate layers. However, the semantics of intermediate layers may vary across networks, and manual association of layers might lead to negative regularization caused by semantic mismatch between certain teacher-student layer pairs. To address this problem, we propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model to each student layer with an attention mechanism. With a learned attention distribution, each student layer distills knowledge contained in multiple teacher layers rather than a single fixed intermediate layer, providing appropriate cross-layer supervision during training. Consistent improvements over state-of-the-art approaches are observed in extensive experiments with various network architectures for teacher and student models, demonstrating the effectiveness and flexibility of the proposed attention-based soft layer association mechanism for cross-layer distillation. Comment: AAAI-2021.
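
    A simplified sketch of attention-based layer association in this spirit, assuming pooled feature vectors that already share a common dimension and using dot-product similarity as a stand-in for the learned attention mechanism; teacher features are assumed to be detached.

        import torch
        import torch.nn.functional as F

        def cross_layer_loss(student_feats, teacher_feats):
            # student_feats / teacher_feats: lists of pooled feature vectors, each (B, d).
            t_stack = torch.stack(teacher_feats, dim=1)            # (B, L_t, d)
            loss = 0.0
            for s in student_feats:                                # each (B, d)
                scores = torch.einsum("bd,bld->bl", s, t_stack)    # similarity to each teacher layer
                alpha = F.softmax(scores, dim=1)                   # soft layer-association weights
                target = torch.einsum("bl,bld->bd", alpha, t_stack)
                loss = loss + F.mse_loss(s, target)                # match the attended combination
            return loss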

    Knowledge Distillation with Refined Logits

    Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions, creating a conflict between the standard distillation loss and the cross-entropy loss. This conflict can undermine the consistency of the student model's learning objectives. Previous attempts to use labels to empirically correct teacher predictions may undermine class correlations. In contrast, our RLD employs labeling information to dynamically refine teacher logits. In this way, our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations, thus enhancing the value and efficiency of distilled knowledge. Experimental results on CIFAR-100 and ImageNet demonstrate its superiority over existing methods. The code is provided at https://github.com/zju-SWJ/RLD. Comment: 11 pages, 7 figures.
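
    One deliberately simplified reading of label-guided logit refinement, given only what the abstract states: when the teacher misranks the true class, swap its logit with the current top logit so the target class is ranked first while the remaining class correlations are left untouched. This swap rule is an assumption for illustration, not the paper's refinement procedure (see the repository for that).

        import torch

        def refine_teacher_logits(teacher_logits, labels):
            # Hypothetical refinement: ensure the true class has the largest logit
            # by swapping it with the misplaced maximum; other logits are unchanged.
            refined = teacher_logits.clone()
            top_vals, top_idx = refined.max(dim=1)
            true_vals = refined.gather(1, labels.unsqueeze(1)).squeeze(1)
            wrong = top_idx != labels
            refined[wrong, top_idx[wrong]] = true_vals[wrong]
            refined[wrong, labels[wrong]] = top_vals[wrong]
            return refined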

    Super-Necking Crystal Growth and Structural and Magnetic Properties of SrTb2_2O4_4 Single Crystals

    We report on single-crystal growth of the SrTb2O4 compound by a super-necking technique with a laser floating-zone furnace and study its stoichiometry, growth mode, and structural and magnetic properties by scanning electron microscopy, neutron Laue diffraction, X-ray powder diffraction, and a physical property measurement system. We optimized the growth parameters, mainly the growth speed, the atmosphere, and the addition of Tb4O7 raw material. Neutron Laue diffraction displays the characteristic pattern of a single crystal. Our study reveals an atomic ratio of Sr:Tb = 0.97(2):2.00(1) and a possible layer-by-layer crystal growth mode. Our X-ray powder diffraction study determines the crystal structure, lattice constants, and atomic positions. The paramagnetic (PM) Curie-Weiss (CW) temperature is θ_CW = 5.00(4) K, and the measured effective PM moment is 10.97(1) μB per Tb3+ ion. The magnetization-versus-temperature data can be divided into three regimes, showing a coexistence of antiferromagnetic and ferromagnetic interactions, which probably leads to the magnetic frustration in the SrTb2O4 compound. The magnetization at 2 K and 14 T originates from both the Tb1 and Tb2 sites and is strongly frustrated, with an expected saturation field of ~41.5 T, displaying an intricate phase diagram with three ranges. Comment: 19 pages, 13 figures.
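
    The quoted θ_CW and effective moment presumably come from a standard Curie-Weiss analysis of the paramagnetic susceptibility; the generic relations (not necessarily the authors' exact fitting window or units convention) are:

        \[
        \chi(T) = \frac{C}{T - \theta_{\mathrm{CW}}}, \qquad
        \mu_{\mathrm{eff}} = \sqrt{\frac{3 k_{\mathrm{B}} C}{N_{\mathrm{A}}\,\mu_{\mathrm{B}}^{2}}}\;\mu_{\mathrm{B}}
        \approx \sqrt{8C}\;\mu_{\mathrm{B}},
        \]

    where C is the molar Curie constant (in emu K mol^-1 Oe^-1 for the ~sqrt(8C) shorthand) and a positive θ_CW, as reported here, indicates net ferromagnetic interactions in the PM regime.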