The information bottleneck (IB) is a method for extracting information from one
random variable X that is relevant for predicting another random variable
Y. To do so, IB identifies an intermediate "bottleneck" variable T that has
low mutual information I(X;T) and high mutual information I(Y;T). The "IB
curve" characterizes the set of bottleneck variables that achieve maximal
I(Y;T) for a given I(X;T), and is typically explored by maximizing the "IB
Lagrangian", I(Y;T)−βI(X;T). In some cases, Y is a deterministic
function of X, including many classification problems in supervised learning
where the output class Y is a deterministic function of the input X. We
demonstrate three caveats when using IB in any situation where Y is a
deterministic function of X: (1) the IB curve cannot be recovered by
maximizing the IB Lagrangian for different values of β (a short derivation is
sketched below); (2) there are
"uninteresting" trivial solutions at all points of the IB curve; and (3) for
multi-layer classifiers that achieve low prediction error, different layers
cannot exhibit a strict trade-off between compression and prediction, contrary
to a recent proposal. We also show that when Y is a small perturbation away
from being a deterministic function of X, these three caveats arise in an
approximate way. To address caveat (1), we propose a functional that, unlike
the IB Lagrangian, can recover the IB curve in all cases. We demonstrate the
three caveats on the MNIST dataset.
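As the numerical sketch referred to above: the Python snippet below is our own illustration, not code from the paper; the function names and the toy distribution are ours. It computes I(X;T), I(Y;T), and the IB Lagrangian I(Y;T)−βI(X;T) for a discrete joint distribution p(x,y) and a fixed stochastic encoder p(t|x), in a toy problem where Y is a deterministic function of X.

import numpy as np

def mutual_information(p_ab):
    # I(A;B) in nats for a joint distribution p_ab[a, b].
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal p(a), shape (A, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal p(b), shape (1, B)
    mask = p_ab > 0                         # skip zero-mass cells to avoid log(0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a * p_b)[mask])))

def ib_lagrangian(p_xy, p_t_given_x, beta):
    # IB Lagrangian I(Y;T) - beta * I(X;T), assuming the Markov chain T - X - Y.
    p_x = p_xy.sum(axis=1)                  # marginal p(x)
    p_xt = p_t_given_x * p_x[:, None]       # joint p(x, t) = p(t|x) p(x)
    p_yt = p_xy.T @ p_t_given_x             # joint p(y, t) = sum_x p(x, y) p(t|x)
    return mutual_information(p_yt) - beta * mutual_information(p_xt)

# Toy problem where Y is a deterministic function of X: y = x mod 2.
p_xy = np.zeros((4, 2))
for x in range(4):
    p_xy[x, x % 2] = 0.25

full_encoder = np.eye(4)                    # T = X: keeps everything
merged_encoder = np.array([[1., 0.],        # T = x mod 2: discards all
                           [0., 1.],        # label-irrelevant detail
                           [1., 0.],
                           [0., 1.]])
print(ib_lagrangian(p_xy, full_encoder, beta=0.5))    # log 2 - 0.5*log 4 = 0
print(ib_lagrangian(p_xy, merged_encoder, beta=0.5))  # log 2 - 0.5*log 2 ~ 0.35

For β between 0 and 1, the merged encoder (a corner point of the IB curve) scores strictly higher than the uncompressed one, consistent with the piecewise-linear picture sketched next.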
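The derivation behind caveat (1) can be sketched as follows; this is our reconstruction of the argument, stated under the standard IB assumption that T − X − Y form a Markov chain. When Y = f(X), the data-processing inequality gives I(Y;T) ≤ I(X;T), and always I(Y;T) ≤ H(Y); the paper shows these bounds are achieved in the deterministic setting, so writing F(r) for the maximal I(Y;T) subject to I(X;T) = r:

\[
  F(r) = \min\{r,\, H(Y)\},
  \qquad
  F(r) - \beta r =
  \begin{cases}
    (1-\beta)\,r, & 0 \le r \le H(Y),\\
    H(Y) - \beta r, & r > H(Y).
  \end{cases}
\]

Restricted to the curve, the Lagrangian is therefore maximized at r = H(Y) for every β ∈ (0,1), at r = 0 for β > 1, and at every r ∈ [0, H(Y)] simultaneously when β = 1, so no choice of β recovers the intermediate points of the curve.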