Improved Knowledge Distillation via Teacher Assistant
Despite the fact that deep neural networks are powerful models and achieve
appealing results on many tasks, they are too large to be deployed on edge
devices like smartphones or embedded sensor nodes. There have been efforts to
compress these networks, and a popular method is knowledge distillation, where
a large (teacher) pre-trained network is used to train a smaller (student)
network. However, in this paper, we show that the student network performance
degrades when the gap between student and teacher is large. Given a fixed
student network, one cannot employ an arbitrarily large teacher; in other
words, a teacher can effectively transfer its knowledge only to students down to a
certain size, not smaller. To alleviate this shortcoming, we introduce
multi-step knowledge distillation, which employs an intermediate-sized network
(teacher assistant) to bridge the gap between the student and the teacher.
Moreover, we study the effect of teacher assistant size and extend the
framework to multi-step distillation. Theoretical analysis and extensive
experiments on CIFAR-10, CIFAR-100, and ImageNet and on CNN and ResNet
architectures substantiate the effectiveness of our proposed approach.
Comment: AAAI 2020
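For context, a minimal sketch of the soft-target distillation loss this line of work builds on (Hinton-style knowledge distillation) is shown below; the temperature T and weighting alpha are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a standard knowledge-distillation loss (soft teacher targets
# plus the usual cross-entropy). Hyperparameters here are illustrative only.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Blend the softened-teacher KL term with the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the unsoftened objective
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# toy usage with random logits and labels
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(kd_loss(s, t, y))
```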
Wide Neural Networks Forget Less Catastrophically
A primary focus area in continual learning research is alleviating the
"catastrophic forgetting" problem in neural networks by designing new
algorithms that are more robust to distribution shifts. While the recent
progress in continual learning literature is encouraging, our understanding of
what properties of neural networks contribute to catastrophic forgetting is
still limited. To address this, rather than focusing on continual learning
algorithms, in this work we study the model itself and examine the impact of the
"width" of the neural network architecture on catastrophic forgetting, showing
that width has a surprisingly significant effect on forgetting. To explain this
effect, we study the learning dynamics of the network from various perspectives
such as gradient orthogonality, sparsity, and lazy training regime. We provide
potential explanations that are consistent with the empirical results across
different architectures and continual learning benchmarks.
Comment: ICML 2022
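One of the diagnostics mentioned above, gradient orthogonality across tasks, can be probed with a short sketch like the one below; the model, loss, and data batches are placeholders, not the architectures or benchmarks studied in the paper.

```python
# Hypothetical probe of task-gradient orthogonality: cosine similarity near zero
# means the two tasks' gradients interfere little with each other.
import torch
import torch.nn as nn

def flat_grad(model, loss):
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def task_gradient_cosine(model, loss_fn, batch_a, batch_b):
    xa, ya = batch_a
    xb, yb = batch_b
    ga = flat_grad(model, loss_fn(model(xa), ya))
    gb = flat_grad(model, loss_fn(model(xb), yb))
    return torch.nn.functional.cosine_similarity(ga, gb, dim=0).item()

# toy usage with a small MLP and two synthetic "task" batches
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
batch_a = (torch.randn(16, 10), torch.randint(0, 2, (16,)))
batch_b = (torch.randn(16, 10) + 1.0, torch.randint(0, 2, (16,)))
print(task_gradient_cosine(model, loss_fn, batch_a, batch_b))
```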
Alleviating Catastrophic Forgetting in Continual Learning
Machine learning has enjoyed rapid and substantial advances in the past few years. However, machine learning models cannot learn continually the way humans do. Humans are continual learners: they accumulate knowledge, use prior knowledge to learn better from new experiences, and retain what they have learned. In contrast, current machine learning models learn in an isolated manner, meaning there is no notion of time (e.g., past or present) in their closed-world learning. The goal of continual learning is to equip machines with a similar learning mechanism, which would have a significant impact on the machine learning community. However, this is a challenging problem, since current machine learning systems suffer from catastrophic forgetting, meaning they cannot preserve their learned knowledge. Catastrophic forgetting happens mainly because the model is trained sequentially over evolving data distributions. Consequently, the representations the model has learned for previous data change to adapt to the new data, and the new representations are no longer adequate for the past data. While recent progress in continual learning is encouraging, our understanding of the catastrophic forgetting problem is still limited. This dissertation aims to better understand the continual learning problem and fill this knowledge gap by studying the theoretical and practical implications of catastrophic forgetting for deep learning models. We study the catastrophic forgetting problem from various perspectives and show that the optimization, training regime, loss landscape, and architecture of neural networks all play a significant role in alleviating forgetting. We then use the gained insights to develop continual agents that are more robust to catastrophic forgetting.
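To make the forgetting mechanism described above concrete, the toy sketch below trains a single model sequentially on two synthetic tasks and reports how accuracy on the first task degrades after the second one is learned; the data, model, and hyperparameters are illustrative assumptions, not part of the dissertation.

```python
# Toy illustration of catastrophic forgetting: train on task 0, then task 1,
# and watch accuracy on task 0 drop. Everything here is synthetic.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def make_task(shift):
    x = torch.randn(512, 20) + shift               # shifted input distribution per task
    y = (x.sum(dim=1) > shift * 20).long()         # simple synthetic labels
    return x, y

def accuracy(x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

tasks = [make_task(0.0), make_task(2.0)]
for t, (x, y) in enumerate(tasks):
    for _ in range(200):                           # train on the current task only
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # re-evaluate the first task after each stage; the drop is the "forgetting"
    print(f"after task {t}: accuracy on task 0 = {accuracy(*tasks[0]):.3f}")
```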
Improved knowledge distillation for deep neural networks
Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this thesis, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge only to students down to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on CIFAR-10, CIFAR-100, and ImageNet and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
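A hedged sketch of the multi-step (teacher, teacher assistant, student) distillation chain is given below; the stand-in MLP models, toy data, and single training pass are assumptions for illustration, not the CNN/ResNet setups evaluated in the thesis.

```python
# Sketch of multi-step distillation: distill the teacher into an intermediate
# assistant, then distill the assistant into the small student.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(width):  # stand-in for the CNN/ResNet models used in practice
    return nn.Sequential(nn.Flatten(), nn.Linear(784, width), nn.ReLU(), nn.Linear(width, 10))

def distill(teacher, student, loader, T=4.0, lr=1e-2):
    """One distillation step: fit `student` to the teacher's softened outputs."""
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    teacher.eval()
    for x, y in loader:
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                        F.softmax(t_logits / T, dim=1),
                        reduction="batchmean") * T * T + F.cross_entropy(s_logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Chain: large teacher -> intermediate assistant -> small student.
teacher, assistant, student = mlp(1024), mlp(256), mlp(64)
loader = [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))]  # toy batch
assistant = distill(teacher, assistant, loader)
student = distill(assistant, student, loader)
```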
Continual Learning Beyond a Single Model
A growing body of research in continual learning focuses on the catastrophic
forgetting problem. While many attempts have been made to alleviate this
problem, the majority of the methods assume a single model in the continual
learning setup. In this work, we question this assumption and show that
employing ensemble models can be a simple yet effective method to improve
continual performance. However, ensembles' training and inference costs can
increase significantly as the number of models grows. Motivated by this
limitation, we study different ensemble models to understand their benefits and
drawbacks in continual learning scenarios. Finally, to overcome the high
compute cost of ensembles, we leverage recent advances in neural network
subspaces to propose a computationally cheap algorithm with similar runtime to a
single model yet enjoying the performance benefits of ensembles.
Comment: Keywords: continual learning, neural network subspaces, efficient training
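The subspace idea referenced in the abstract can be illustrated roughly as follows: keep two endpoint weight vectors and average predictions from models sampled along the line segment between them, approximating an ensemble at a modest extra cost. The interpolation helper, toy network, and sampling scheme below are assumptions, not the paper's algorithm.

```python
# Toy rendition of a one-dimensional weight subspace: evaluate a few convex
# combinations of two endpoint models and average their predictions.
import copy
import torch
import torch.nn as nn

def interpolate(model_a, model_b, alpha):
    """Return a model whose weights are alpha * A + (1 - alpha) * B."""
    mixed = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed.load_state_dict({k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a})
    return mixed

def subspace_predict(model_a, model_b, x, n_samples=3):
    """Average softmax predictions from points along the weight line segment."""
    probs = []
    for alpha in torch.linspace(0.0, 1.0, n_samples):
        m = interpolate(model_a, model_b, alpha.item())
        with torch.no_grad():
            probs.append(torch.softmax(m(x), dim=1))
    return torch.stack(probs).mean(dim=0)

# toy usage with two endpoints (in practice the endpoints are trained jointly)
net_a = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
net_b = copy.deepcopy(net_a)
print(subspace_predict(net_a, net_b, torch.randn(4, 20)).shape)  # torch.Size([4, 5])
```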
Use of machine learning to predict medication adherence in individuals at risk for atherosclerotic cardiovascular disease
Background: Medication nonadherence is a critical problem with severe implications in individuals at risk for atherosclerotic cardiovascular disease. Many studies have attempted to predict medication adherence in this population, but few, if any, have been effective in prediction, suggesting that essential risk factors remain unidentified.
Objective: This study's objective was to (1) establish an accurate prediction model of medication adherence in individuals at risk for atherosclerotic cardiovascular disease and (2) identify significant contributing factors to the predictive accuracy of medication adherence. In particular, we aimed to use only the baseline questionnaire data to assess the feasibility of predicting medication adherence.
Methods: A sample of 40 individuals at risk for atherosclerotic cardiovascular disease was recruited for an eight-week feasibility study. After collecting baseline data, we recorded data from a pillbox that sent events to a cloud-based server. Health measures and medication use events were analyzed using machine learning algorithms to identify variables that best predict medication adherence.
Results: Our adherence prediction model, based on only the ten most relevant variables, achieved an average error rate of 12.9%. Medication adherence was closely correlated with being encouraged to play an active role in their treatment, having confidence about what to do in an emergency, knowledge about their medications, and having a special person in their life.
Conclusions: Our results showed the significance of clinical and psychosocial factors for predicting medication adherence in people at risk for atherosclerotic cardiovascular disease. Clinicians and researchers can use these factors to stratify individuals to make evidence-based decisions to reduce the risks.
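As a rough illustration of the kind of pipeline described in the abstract, the sketch below selects the ten most informative baseline variables and estimates prediction error by cross-validation; the synthetic data, feature count, and random-forest classifier are assumptions, not the study's actual variables or method.

```python
# Hypothetical adherence-prediction pipeline: keep the ten most informative
# baseline variables, fit a classifier, and report cross-validated error.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 60))        # 40 participants, 60 synthetic questionnaire items
y = rng.integers(0, 2, size=40)      # 1 = adherent, 0 = nonadherent (synthetic labels)

pipeline = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),          # ten most relevant variables
    RandomForestClassifier(n_estimators=200, random_state=0),
)
acc = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"estimated error rate: {100 * (1 - acc.mean()):.1f}%")
```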