33,212 research outputs found
Caveats for information bottleneck in deterministic scenarios
Information bottleneck (IB) is a method for extracting information from one
random variable $X$ that is relevant for predicting another random variable
$Y$. To do so, IB identifies an intermediate "bottleneck" variable $T$ that
has low mutual information $I(X;T)$ and high mutual information $I(Y;T)$. The
"IB curve" characterizes the set of bottleneck variables that achieve maximal
$I(Y;T)$ for a given $I(X;T)$, and is typically explored by maximizing the "IB
Lagrangian", $L_{IB} = I(Y;T) - \beta I(X;T)$. In some cases, $Y$ is a
deterministic function of $X$, including many classification problems in
supervised learning where the output class $Y$ is a deterministic function of
the input $X$. We demonstrate three caveats when using IB in any situation
where $Y$ is a deterministic function of $X$: (1) the IB curve cannot be
recovered by maximizing the IB Lagrangian for different values of $\beta$;
(2) there are
"uninteresting" trivial solutions at all points of the IB curve; and (3) for
multi-layer classifiers that achieve low prediction error, different layers
cannot exhibit a strict trade-off between compression and prediction, contrary
to a recent proposal. We also show that when $Y$ is a small perturbation away
from being a deterministic function of $X$, these three caveats arise in an
approximate way. To address problem (1), we propose a functional that, unlike
the IB Lagrangian, can recover the IB curve in all cases. We demonstrate the
three caveats on the MNIST dataset.
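
To make these quantities concrete, the following minimal sketch (not the
authors' code; the encoder and distribution are illustrative) evaluates the IB
Lagrangian $I(Y;T) - \beta I(X;T)$ for hard encoders $t = f(x)$ on a toy joint
distribution in which $Y$ is a deterministic function of $X$:

    # Sketch only: IB Lagrangian for a hard encoder on a toy discrete p(x, y).
    import numpy as np

    def mutual_information(pab):
        """I(A;B) in nats for a joint distribution given as a 2D array."""
        pab = pab / pab.sum()
        pa = pab.sum(axis=1, keepdims=True)
        pb = pab.sum(axis=0, keepdims=True)
        mask = pab > 0
        return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())

    def ib_lagrangian(pxy, f, n_t, beta):
        """I(Y;T) - beta*I(X;T) for the deterministic encoder t = f(x)."""
        n_x, n_y = pxy.shape
        pty = np.zeros((n_t, n_y))  # joint p(t, y)
        ptx = np.zeros((n_t, n_x))  # joint p(t, x)
        for x in range(n_x):
            pty[f(x)] += pxy[x]
            ptx[f(x), x] = pxy[x].sum()
        return mutual_information(pty) - beta * mutual_information(ptx)

    # Toy case where Y is a deterministic function of X: y = x mod 2.
    pxy = np.zeros((4, 2))
    for x in range(4):
        pxy[x, x % 2] = 0.25

    for beta in (0.1, 0.5, 2.0):
        full = ib_lagrangian(pxy, lambda x: x % 2, n_t=2, beta=beta)  # T = Y
        triv = ib_lagrangian(pxy, lambda x: 0, n_t=1, beta=beta)  # constant T
        print(f"beta={beta}: T=Y -> {full:.3f}, constant T -> {triv:.3f}")

Because $Y$ is a deterministic function of $X$ here, $I(Y;T) \le I(X;T)$ for
every hard encoder, so sweeping $\beta$ only ever prefers one of these two
corner encoders (the constant one whenever $\beta > 1$): a miniature of
caveats (1) and (2).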
Pareto-optimal clustering with the primal deterministic information bottleneck
At the heart of both lossy compression and clustering is a trade-off between
the fidelity and size of the learned representation. Our goal is to map out and
study the Pareto frontier that quantifies this trade-off. We focus on the
Deterministic Information Bottleneck (DIB) formulation of lossy compression,
which can be interpreted as a clustering problem. To this end, we introduce the
{\it primal} DIB problem, which we show results in a much richer frontier than
its previously studied dual counterpart. We present an algorithm for mapping
out the Pareto frontier of the primal DIB trade-off that is also applicable to
most other two-objective clustering problems. We study general properties of
the Pareto frontier, and give both analytic and numerical evidence for
logarithmic sparsity of the frontier in general. We provide evidence that our
algorithm has polynomial scaling despite the super-exponential search space;
and additionally propose a modification to the algorithm that can be used where
sampling noise is expected to be significant. Finally, we use our algorithm to
map the DIB frontier of three different tasks: compressing the English
alphabet, extracting informative color classes from natural images, and
compressing a group-theory-inspired dataset, revealing interesting features of
the frontier, and demonstrating how the structure of the frontier can be used
for model selection, with a focus on points previously hidden by the cloak of
the convex hull.
Comment: 26 pages, 11 figures
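
As a complement, here is a brute-force sketch of the frontier object itself
(not the paper's algorithm, which is designed to avoid exactly this
super-exponential enumeration), assuming the usual DIB objectives: minimize
the representation size $H(T)$ while maximizing the relevant information
$I(Y;T)$ over hard clusterings $t = f(x)$:

    # Sketch only: exhaustive Pareto frontier for a toy two-objective
    # clustering problem (minimize H(T), maximize I(Y;T)).
    from itertools import product
    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())

    def mutual_information(pab):
        pab = pab / pab.sum()
        pa = pab.sum(axis=1, keepdims=True)
        pb = pab.sum(axis=0, keepdims=True)
        mask = pab > 0
        return float((pab[mask] * np.log(pab[mask] / (pa @ pb)[mask])).sum())

    rng = np.random.default_rng(0)
    pxy = rng.random((5, 3))
    pxy /= pxy.sum()  # toy joint p(x, y) over 5 inputs and 3 labels

    points = set()
    for assign in product(range(5), repeat=5):  # every hard clustering f
        pty = np.zeros((5, 3))                  # joint p(t, y)
        for x, t in enumerate(assign):
            pty[t] += pxy[x]
        points.add((round(entropy(pty.sum(axis=1)), 9),
                    round(mutual_information(pty), 9)))

    # Keep points not weakly dominated by any other point (lower H(T) and
    # higher I(Y;T) are both preferred).
    frontier = sorted(p for p in points
                      if not any(q[0] <= p[0] and q[1] >= p[1] and q != p
                                 for q in points))
    for h, i in frontier:
        print(f"H(T)={h:.3f}  I(Y;T)={i:.3f}")

Even on this toy instance, the surviving points trace the kind of sparse,
monotone staircase whose general properties, including logarithmic sparsity,
the abstract describes.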