Caveats for information bottleneck in deterministic scenarios
Information bottleneck (IB) is a method for extracting information from one
random variable X that is relevant for predicting another random variable Y.
To do so, IB identifies an intermediate "bottleneck" variable T that has low
mutual information I(X;T) and high mutual information I(Y;T). The "IB curve"
characterizes the set of bottleneck variables that achieve maximal I(Y;T) for
a given I(X;T), and is typically explored by maximizing the "IB Lagrangian",
I(Y;T) - β I(X;T). In some cases, Y is a deterministic function of X,
including many classification problems in supervised learning where the
output class Y is a deterministic function of the input X. We demonstrate
three caveats when using IB in any situation where Y is a deterministic
function of X: (1) the IB curve cannot be recovered by maximizing the IB
Lagrangian for different values of β; (2) there are "uninteresting" trivial
solutions at all points of the IB curve; and (3) for multi-layer classifiers
that achieve low prediction error, different layers cannot exhibit a strict
trade-off between compression and prediction, contrary to a recent proposal.
We also show that when Y is a small perturbation away from being a
deterministic function of X, these three caveats arise in an approximate way.
To address problem (1), we propose a functional that, unlike the IB
Lagrangian, can recover the IB curve in all cases. We demonstrate the three
caveats on the MNIST dataset.
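As a concrete illustration (not from the paper), the sketch below evaluates
the IB Lagrangian I(Y;T) - β I(X;T) for discrete variables, given a joint
distribution p(x,y) and a stochastic encoder q(t|x); the toy distribution,
function names, and β value are all assumptions for illustration.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits for a joint distribution given as a 2-D array."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])).sum())

def ib_lagrangian(p_xy, q_t_given_x, beta):
    """Evaluate I(Y;T) - beta * I(X;T) for a stochastic encoder q(t|x).

    p_xy:        joint distribution over (X, Y), shape (|X|, |Y|)
    q_t_given_x: encoder, shape (|X|, |T|), rows summing to 1
    """
    p_x = p_xy.sum(axis=1)                  # marginal p(x)
    p_xt = p_x[:, None] * q_t_given_x       # joint p(x, t)
    # T depends on (X, Y) only through X, so p(y, t) = sum_x p(x, y) q(t|x).
    p_yt = p_xy.T @ q_t_given_x             # joint p(y, t)
    return mutual_information(p_yt) - beta * mutual_information(p_xt)

# Toy case where Y is a deterministic function of X (Y = X mod 2).
p_xy = np.zeros((4, 2))
for x in range(4):
    p_xy[x, x % 2] = 0.25
identity_encoder = np.eye(4)                # T = X
# Here I(Y;T) = 1 bit and I(X;T) = 2 bits, so the value is 1 - 0.5 * 2 = 0.
print(ib_lagrangian(p_xy, identity_encoder, beta=0.5))
```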
Information-based clustering
In an age of increasingly large data sets, investigators in many different
disciplines have turned to clustering as a tool for data analysis and
exploration. Existing clustering methods, however, typically depend on several
nontrivial assumptions about the structure of data. Here we reformulate the
clustering problem from an information theoretic perspective which avoids many
of these assumptions. In particular, our formulation obviates the need for
defining a cluster "prototype", does not require an a priori similarity metric,
is invariant to changes in the representation of the data, and naturally
captures non-linear relations. We apply this approach to different domains and
find that it consistently produces clusters that are more coherent than those
extracted by existing algorithms. Finally, our approach provides a way of
clustering based on collective notions of similarity rather than the
traditional pairwise measures.
Comment: To appear in Proceedings of the National Academy of Sciences USA,
11 pages, 9 figures
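The abstract leaves the trade-off functional implicit; as one hedged way to
make the "collective similarity" idea concrete, the sketch below greedily
maximizes the average pairwise in-cluster similarity of a precomputed
similarity matrix S. The objective, function names, and greedy scheme are
assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def avg_in_cluster_similarity(S, labels, k):
    """<s> = sum over clusters c of p(c) * mean pairwise similarity in c."""
    n = len(labels)
    total = 0.0
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        total += (len(members) / n) * S[np.ix_(members, members)].mean()
    return total

def greedy_cluster(S, k, max_sweeps=20, seed=0):
    """Coordinate ascent: move one point at a time to the best cluster."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    labels = rng.integers(0, k, size=n)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            old = labels[i]
            scores = []
            for c in range(k):
                labels[i] = c
                scores.append(avg_in_cluster_similarity(S, labels, k))
            labels[i] = int(np.argmax(scores))
            changed |= (labels[i] != old)
        if not changed:
            break
    return labels
```

Because the objective depends only on pairwise similarities (which could
themselves be information measures between data points), no cluster prototype
or explicit metric on the raw data is required, which is the property the
abstract emphasizes.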
Probabilistic Clustering Using Maximal Matrix Norm Couplings
In this paper, we present a local information theoretic approach to
explicitly learn probabilistic clustering of a discrete random variable. Our
formulation yields a convex maximization problem for which it is NP-hard to
find the global optimum. In order to algorithmically solve this optimization
problem, we propose two relaxations that are solved via gradient ascent and
alternating maximization. Experiments on the MSR Sentence Completion
Challenge, MovieLens 100K, and Reuters-21578 datasets demonstrate that our
approach is competitive with existing techniques and worthy of further
investigation.
Comment: Presented at the 56th Annual Allerton Conference on Communication,
Control, and Computing, 2018
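The abstract does not spell out the matrix-norm objective, so the sketch
below only illustrates the general "relax, then run gradient ascent"
strategy: projected gradient ascent on a quadratic surrogate trace(Q^T S Q)
over row-stochastic soft assignments Q. The surrogate objective and all
names are assumptions; maximizing a convex function over a polytope is
NP-hard in general, which mirrors the hardness noted in the abstract.

```python
import numpy as np

def project_rows_to_simplex(Q):
    """Euclidean projection of each row of Q onto the probability simplex."""
    n, k = Q.shape
    U = np.sort(Q, axis=1)[:, ::-1]             # rows sorted descending
    css = np.cumsum(U, axis=1)
    idx = np.arange(1, k + 1)
    rho = (U + (1.0 - css) / idx > 0).sum(axis=1)
    theta = (css[np.arange(n), rho - 1] - 1.0) / rho
    return np.maximum(Q - theta[:, None], 0.0)

def soft_cluster(S, k, steps=200, lr=0.1, seed=0):
    """Projected gradient ascent on f(Q) = trace(Q^T S Q).

    S is a symmetric similarity/co-occurrence matrix; each row of Q is a
    soft cluster assignment. f is convex in Q, so this is only a local
    heuristic, not a global solver.
    """
    rng = np.random.default_rng(seed)
    Q = project_rows_to_simplex(rng.random((S.shape[0], k)))
    for _ in range(steps):
        Q = project_rows_to_simplex(Q + lr * 2.0 * S @ Q)  # grad = 2 S Q
    return Q
```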
Clustering Boolean Tensors
Tensor factorizations are computationally hard problems, and in particular,
are often significantly harder than their matrix counterparts. In case of
Boolean tensor factorizations -- where the input tensor and all the factors are
required to be binary and we use Boolean algebra -- much of that hardness comes
from the possibility of overlapping components. Yet, in many applications we
are perfectly happy to partition at least one of the modes. In this paper we
investigate what consequences this partitioning has on the computational
complexity of Boolean tensor factorizations and present a new algorithm for
the resulting clustering problem. This algorithm can alternatively be seen as a
particularly regularized clustering algorithm that can handle extremely
high-dimensional observations. We analyse our algorithm with the goal of
maximizing the similarity and argue that this is more meaningful than
minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient
0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm
for Boolean tensor clustering achieves high scalability, high similarity, and
good generalization to unseen data on both synthetic and real-world data
sets.
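To illustrate the similarity-maximization view in the simplest case the
abstract touches on, rank-1 binary factorization, here is a hedged
local-search sketch: alternately fix one binary factor and choose the other
entrywise to maximize the number of agreeing entries. This is illustrative
only; it is not the paper's PTAS or 0.828-approximation algorithm, and it
can get stuck in the trivial all-zeros solution.

```python
import numpy as np

def rank1_boolean_similarity(A, iters=20, seed=0):
    """Local search for binary u, v maximizing agreement of A with u v^T.

    Similarity = number of entries where A and the rank-1 Boolean product
    u v^T agree. With v fixed, row i agrees on (A[i] == v).sum() entries
    if u[i] = 1 and on (A[i] == 0).sum() entries if u[i] = 0, so each
    u[i] can be set independently; columns are handled symmetrically.
    """
    rng = np.random.default_rng(seed)
    v = rng.integers(0, 2, size=A.shape[1])
    for _ in range(iters):
        u = ((A == v).sum(axis=1) > (A == 0).sum(axis=1)).astype(int)
        v = ((A.T == u).sum(axis=1) > (A.T == 0).sum(axis=1)).astype(int)
    return u, v
```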