Distilling Word Embeddings: An Encoding Approach
Distilling knowledge from a well-trained, cumbersome network to a small one
has recently become a new research topic, as lightweight neural networks with
high performance are in particularly high demand in various resource-restricted
systems. This paper addresses the problem of distilling word embeddings for NLP
tasks. We propose an encoding approach to distill task-specific knowledge from
a set of high-dimensional embeddings, which can reduce model complexity by a
large margin as well as retain high accuracy, showing a good compromise between
efficiency and performance. Experiments on two tasks show that distilling
knowledge from cumbersome embeddings outperforms directly training neural
networks with small embeddings.
Comment: Accepted by CIKM-16 as a short paper, and by the Representation
Learning for Natural Language Processing (RL4NLP) Workshop @ACL-16 for
presentation.
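The abstract does not spell out the encoding architecture, so the sketch below is only a minimal, hypothetical PyTorch reading of it: a learned linear encoding compresses frozen high-dimensional teacher embeddings into small student embeddings, and the student is trained to match the teacher's softened task predictions. The vocabulary size, dimensions, bag-of-embeddings classifier, and temperature are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, BIG_DIM, SMALL_DIM, NUM_CLASSES, TAU = 10_000, 300, 50, 2, 2.0

class BagClassifier(nn.Module):
    """Average-of-embeddings classifier; the teacher uses large embeddings."""
    def __init__(self, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, emb_dim)
        self.out = nn.Linear(emb_dim, NUM_CLASSES)
    def forward(self, token_ids):                        # (batch, seq_len)
        return self.out(self.emb(token_ids).mean(dim=1))

teacher = BagClassifier(BIG_DIM)           # assumed already trained on the task

class EncodedStudent(nn.Module):
    """Student whose small embeddings are a learned encoding of the teacher's."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(BIG_DIM, SMALL_DIM)      # encoding layer
        self.out = nn.Linear(SMALL_DIM, NUM_CLASSES)
    def forward(self, token_ids):
        big = teacher.emb(token_ids).detach()            # frozen teacher embeddings
        return self.out(self.encode(big).mean(dim=1))

student = EncodedStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(token_ids):
    """Match the teacher's softened predictions; no hard labels are required."""
    with torch.no_grad():
        soft = F.softmax(teacher(token_ids) / TAU, dim=-1)
    log_p = F.log_softmax(student(token_ids) / TAU, dim=-1)
    loss = F.kl_div(log_p, soft, reduction="batchmean") * TAU ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

batch = torch.randint(0, VOCAB, (32, 20))                # dummy token ids
print(distill_step(batch))
```

After distillation, a standalone small embedding table could be materialized once as student.encode(teacher.emb.weight), so the large embeddings are no longer needed at inference time (again an assumption about deployment, not a detail given in the abstract).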
Progressive Label Distillation: Learning Input-Efficient Deep Neural Networks
Much of the focus in the area of knowledge distillation has been on
distilling knowledge from a larger teacher network to a smaller student
network. However, there has been little research on how the concept of
distillation can be leveraged to distill the knowledge encapsulated in the
training data itself into a reduced form. In this study, we explore the concept
of progressive label distillation, where we leverage a series of
teacher-student network pairs to progressively generate distilled training data
for learning deep neural networks with greatly reduced input dimensions. To
investigate the efficacy of the proposed progressive label distillation
approach, we experimented with learning a deep limited-vocabulary speech
recognition network based on generated 500 ms input utterances distilled
progressively from 1000 ms source training data, and demonstrated a significant
increase in test accuracy of almost 78% compared to direct learning.
Comment: 9 pages
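The sketch below is a minimal PyTorch reading of the progressive scheme under stated assumptions: a teacher trained on full-length inputs produces soft labels that supervise a student on truncated inputs, and that student then serves as the teacher for the next, shorter stage. The 1000 -> 750 -> 500 sample schedule, the tiny MLPs, and the plain KL loss are illustrative stand-ins for the paper's 1000 ms -> 500 ms speech setup, not a reproduction of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 10  # hypothetical limited-vocabulary keyword classes

def make_net(input_len):
    """Tiny classifier over a raw 1-D input of the given length."""
    return nn.Sequential(nn.Linear(input_len, 128), nn.ReLU(),
                         nn.Linear(128, NUM_CLASSES))

def distill_stage(teacher, teacher_len, student_len, inputs, steps=200):
    """Train a student on truncated inputs against the teacher's soft labels."""
    student = make_net(student_len)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for _ in range(steps):
        with torch.no_grad():
            soft = F.softmax(teacher(inputs[:, :teacher_len]), dim=-1)
        log_p = F.log_softmax(student(inputs[:, :student_len]), dim=-1)
        loss = F.kl_div(log_p, soft, reduction="batchmean")
        opt.zero_grad(); loss.backward(); opt.step()
    return student

# Dummy data: 1000-sample signals standing in for 1000 ms utterances.
x = torch.randn(256, 1000)
teacher = make_net(1000)                    # assumed trained on full-length inputs
mid = distill_stage(teacher, 1000, 750, x)  # stage 1: distill 1000 -> 750 samples
small = distill_stage(mid, 750, 500, x)     # stage 2: distill 750 -> 500 samples
```

Each stage needs only the previous stage's network to relabel the truncated data, which is what lets the input dimension shrink progressively rather than in one jump.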
On Correlated Knowledge Distillation for Monitoring Human Pose with Radios
In this work, we propose and develop a simple experimental testbed to study
the feasibility of a novel idea: coupling radio frequency (RF) sensing
technology with Correlated Knowledge Distillation (CKD) theory to design
lightweight, near-real-time, and precise human pose monitoring
systems. The proposed CKD framework transfers and fuses pose knowledge from a
robust "Teacher" model to a parameterized "Student" model, which can be a
promising technique for obtaining accurate yet lightweight pose estimates. To
assess its efficacy, we implemented CKD for distilling logits in our integrated
Software Defined Radio (SDR)-based experimental setup and investigated the
RF-visual signal correlation. Our CKD-RF sensing technique is characterized by
two modes -- a camera-fed Teacher Class Network (e.g., images, videos) and an
SDR-fed Student Class Network (e.g., RF signals). Specifically, our CKD model
trains a dual multi-branch teacher and student network by distilling and fusing
knowledge bases. The resulting CKD models are then used to identify the
multimodal correlation and teach the student branch in reverse. Rather than
simply aggregating what the two networks learn, CKD training comprises multiple
parallel transformations across the two domains, i.e., visual images and RF
signals. Once trained, our CKD model preserves privacy and utilizes the
correlated multimodal logits from the two neural networks to estimate poses
without visual signals/video frames, using only the RF signals.
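Below is a minimal cross-modal logit-distillation sketch in PyTorch in the spirit of this setup, assuming paired camera frames and RF feature vectors at training time: the camera-fed teacher's softened pose logits supervise an RF-fed student so that inference can later run on RF signals alone. The single-branch networks, feature shapes, temperature, and loss weighting are assumptions; the paper's dual multi-branch teacher/student design and knowledge fusion are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_POSES, TAU, ALPHA = 17, 4.0, 0.7

image_teacher = nn.Sequential(             # assumed pre-trained on camera frames
    nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(),
    nn.Linear(256, NUM_POSES))
rf_student = nn.Sequential(                # consumes SDR/RF feature vectors only
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, NUM_POSES))

def ckd_style_step(frames, rf_feats, labels, opt):
    """Blend hard-label loss with KL to the teacher's softened pose logits."""
    with torch.no_grad():
        t_soft = F.softmax(image_teacher(frames) / TAU, dim=-1)
    s_logits = rf_student(rf_feats)
    kd = F.kl_div(F.log_softmax(s_logits / TAU, dim=-1), t_soft,
                  reduction="batchmean") * TAU ** 2
    ce = F.cross_entropy(s_logits, labels)
    loss = ALPHA * kd + (1 - ALPHA) * ce
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

opt = torch.optim.Adam(rf_student.parameters(), lr=1e-3)
frames = torch.randn(8, 3, 64, 64)         # paired camera frames (training only)
rf = torch.randn(8, 128)                   # paired RF feature vectors
y = torch.randint(0, NUM_POSES, (8,))      # pose class labels
print(ckd_style_step(frames, rf, y, opt))
```

The camera branch is needed only while training; deployment would call rf_student on RF features alone, which is what gives the privacy-preserving, vision-free inference described above.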
Adversarially Robust Distillation
Knowledge distillation is effective for producing small, high-performance
neural networks for classification, but these small networks are vulnerable to
adversarial attacks. This paper studies how adversarial robustness transfers
from teacher to student during knowledge distillation. We find that a large
amount of robustness may be inherited by the student even when distilled on
only clean images. We then introduce Adversarially Robust Distillation (ARD)
for distilling robustness onto student networks. In addition to producing small
models with high test accuracy like conventional distillation, ARD also passes
the superior robustness of large networks onto the student. In our experiments,
we find that ARD student models decisively outperform adversarially trained
networks of identical architecture in terms of robust accuracy, surpassing
state-of-the-art methods on standard robustness benchmarks. Finally, we adapt
recent fast adversarial training methods to ARD for accelerated robust
distillation.
Comment: Accepted to AAAI Conference on Artificial Intelligence, 202
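A minimal PyTorch sketch of the ARD recipe as described above: adversarial examples are crafted against the student to maximize its divergence from the teacher's predictions on clean inputs, and the student is then trained to match those clean-input teacher predictions from the adversarial points, plus a standard cross-entropy term on clean data. Network sizes, PGD settings, temperature, and the loss weighting are illustrative assumptions, not the paper's hyper-parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES, EPS, STEP, PGD_STEPS, TAU, ALPHA = 10, 8 / 255, 2 / 255, 7, 1.0, 0.9

def make_net(width):
    return nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, width),
                         nn.ReLU(), nn.Linear(width, NUM_CLASSES))

teacher = make_net(512)   # assumed adversarially trained and frozen
student = make_net(64)    # small network receiving the distilled robustness

def pgd_against_teacher(x, teacher_probs):
    """Inner maximization: perturb x to pull the student away from the teacher."""
    x_adv = (x + torch.empty_like(x).uniform_(-EPS, EPS)).clamp(0, 1)
    for _ in range(PGD_STEPS):
        x_adv = x_adv.detach().requires_grad_(True)
        kl = F.kl_div(F.log_softmax(student(x_adv) / TAU, dim=-1),
                      teacher_probs, reduction="batchmean")
        grad, = torch.autograd.grad(kl, x_adv)
        x_adv = x_adv.detach() + STEP * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - EPS), x + EPS).clamp(0, 1)
    return x_adv.detach()

def ard_step(x, y, opt):
    """Outer minimization: match the teacher's clean predictions from x_adv."""
    with torch.no_grad():
        t_probs = F.softmax(teacher(x) / TAU, dim=-1)
    x_adv = pgd_against_teacher(x, t_probs)
    kd = F.kl_div(F.log_softmax(student(x_adv) / TAU, dim=-1),
                  t_probs, reduction="batchmean") * TAU ** 2
    ce = F.cross_entropy(student(x), y)
    loss = ALPHA * kd + (1 - ALPHA) * ce
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
x = torch.rand(16, 3, 32, 32)                 # dummy images in [0, 1]
y = torch.randint(0, NUM_CLASSES, (16,))
print(ard_step(x, y, opt))
```

The inner loop mirrors standard PGD adversarial training, except the attack maximizes disagreement with the teacher's soft predictions rather than loss against one-hot labels, which is how the teacher's robustness is transferred to the student.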