Information Theoretic Representation Distillation
Despite the empirical success of knowledge distillation, current
state-of-the-art methods are computationally expensive to train, which makes
them difficult to adopt in practice. To address this problem, we introduce two
distinct complementary losses inspired by a cheap entropy-like estimator. These
losses aim to maximise the correlation and mutual information between the
student and teacher representations. Our method incurs significantly lower
training overhead than other approaches and achieves performance competitive
with the state-of-the-art on knowledge distillation and cross-model transfer
tasks. We further demonstrate the effectiveness of our method on a binary
distillation task, whereby it leads to a new state-of-the-art for binary
quantisation and approaches the performance of a full precision model. Code:
www.github.com/roymiles/ITRD
Comment: BMVC 202
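As a rough illustration of the correlation objective described above, the sketch below computes a batch cross-correlation loss between student and teacher features; the normalisation, weighting and feature choice are assumptions for illustration only, not the authors' released implementation (see the linked repository).

```python
# Illustrative correlation-style distillation loss; the exact form of the
# authors' losses and their entropy-like estimator are given in the paper.
import torch

def correlation_loss(f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
    # Standardise each feature dimension across the batch.
    zs = (f_student - f_student.mean(0)) / (f_student.std(0) + 1e-6)
    zt = (f_teacher - f_teacher.mean(0)) / (f_teacher.std(0) + 1e-6)
    c = (zs.T @ zt) / zs.shape[0]                    # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()   # align matching dimensions
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + 0.005 * off_diag                # the off-diagonal weight is an assumption

# Example usage with random features standing in for pooled penultimate-layer
# activations projected to a common dimension.
fs = torch.randn(128, 256, requires_grad=True)  # student features (batch, dim)
ft = torch.randn(128, 256)                      # teacher features, kept fixed
loss = correlation_loss(fs, ft.detach())
loss.backward()
```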
DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies
We introduce an information-theoretic quantity with similar properties to
mutual information that can be estimated from data without making explicit
assumptions on the underlying distribution. This quantity is based on a
recently proposed matrix-based entropy that uses the eigenvalues of a
normalized Gram matrix to compute an estimate of the eigenvalues of an
uncentered covariance operator in a reproducing kernel Hilbert space. We show
that a difference of matrix-based entropies (DiME) is well suited for problems
involving the maximization of mutual information between random variables.
While many methods for such tasks can lead to trivial solutions, DiME naturally
penalizes such outcomes. We compare DiME to several baseline estimators of
mutual information on a toy Gaussian dataset. We provide examples of use cases
for DiME, such as latent factor disentanglement and a multiview representation
learning problem where DiME is used to learn a shared representation among
views with high mutual information.
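For concreteness, a minimal sketch of the matrix-based entropy referred to above follows: the α-order entropy is computed from the eigenvalues of a trace-normalised Gram matrix, and a mutual-information-like quantity follows from a difference of such entropies. The Gaussian kernel, bandwidth and α value are assumptions here; the precise DiME objective is defined in the paper.

```python
# Minimal sketch of matrix-based entropy from a normalised Gram matrix.
import numpy as np

def gram_matrix(x: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian Gram matrix normalised to unit trace."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq_dists / (2 * sigma ** 2))
    return k / np.trace(k)

def matrix_entropy(a: np.ndarray, alpha: float = 1.01) -> float:
    """alpha-order entropy from the eigenvalues of a unit-trace Gram matrix."""
    eigvals = np.linalg.eigvalsh(a)
    eigvals = eigvals[eigvals > 1e-12]   # discard numerically zero eigenvalues
    return float(np.log2(np.sum(eigvals ** alpha)) / (1 - alpha))

# A mutual-information-like difference of entropies, using the normalised
# Hadamard product of the two Gram matrices for the joint term.
x, y = np.random.randn(200, 5), np.random.randn(200, 3)
a, b = gram_matrix(x), gram_matrix(y)
ab = a * b / np.trace(a * b)
mi_like = matrix_entropy(a) + matrix_entropy(b) - matrix_entropy(ab)
```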
Simple stopping criteria for information theoretic feature selection
Feature selection aims to select the smallest feature subset that yields the
minimum generalization error. In the rich literature in feature selection,
information theory-based approaches seek a subset of features such that the
mutual information between the selected features and the class labels is
maximized. Despite the simplicity of this objective, several open problems
remain in its optimization. These include, for example, automatically
determining the optimal subset size (i.e., the number of features) and
choosing a stopping criterion when a greedy search strategy is adopted. In this
paper, we suggest two stopping criteria based simply on monitoring the conditional mutual
information (CMI) among groups of variables. Using the recently developed
multivariate matrix-based Rényi's α-entropy functional, which can be directly
estimated from data samples, we show that the CMI among groups of variables can
be easily computed without any decomposition or approximation, making our
criteria easy to implement and simple to integrate into any existing
information-theoretic feature selection method that uses a greedy search
strategy.
Comment: Paper published in the journal Entropy
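To illustrate how such a stopping rule plugs into a greedy search, the sketch below adds features one at a time and stops once the best remaining candidate's conditional mutual information with the labels falls below a threshold; the `cmi` callable stands in for the multivariate matrix-based estimator, and the threshold value is hypothetical.

```python
# Greedy forward feature selection with a CMI-based stopping criterion (sketch).
import numpy as np

def greedy_select(X: np.ndarray, y: np.ndarray, cmi, threshold: float = 1e-3):
    """Select features until the best candidate's CMI with the labels,
    conditioned on the already selected features, drops below `threshold`."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Score each remaining feature by its conditional mutual information.
        scores = [cmi(X[:, [j]], y, X[:, selected]) for j in remaining]
        best = int(np.argmax(scores))
        if scores[best] < threshold:   # stopping criterion: no informative feature left
            break
        selected.append(remaining.pop(best))
    return selected
```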