6 research outputs found

    Universal Approximation Depth and Errors of Narrow Belief Networks with Discrete Units

    Full text link
    We generalize recent theoretical work on the minimal number of layers of narrow deep belief networks that can approximate any probability distribution on the states of their visible units arbitrarily well. We relax the setting of binary units (Sutskever and Hinton, 2008; Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011) to units with arbitrary finite state spaces, and the vanishing approximation error to an arbitrary approximation error tolerance. For example, we show that a $q$-ary deep belief network with $L \geq 2 + \frac{q^{\lceil m-\delta \rceil}-1}{q-1}$ layers of width $n \leq m + \log_q(m) + 1$ for some $m \in \mathbb{N}$ can approximate any probability distribution on $\{0,1,\ldots,q-1\}^n$ without exceeding a Kullback-Leibler divergence of $\delta$. Our analysis covers discrete restricted Boltzmann machines and naïve Bayes models as special cases. Comment: 19 pages, 5 figures, 1 table
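    To make the quoted bound concrete, the following minimal Python sketch (my own; the function names and the sample values q = 2, m = 4, delta = 0.5 are illustrative, not taken from the paper) evaluates the layer count 2 + (q^ceil(m - delta) - 1)/(q - 1) and the width expression m + log_q(m) + 1.

        import math

        def dbn_layer_bound(q: int, m: int, delta: float) -> int:
            """Layers sufficient according to the stated bound:
            L >= 2 + (q^ceil(m - delta) - 1) / (q - 1)."""
            layers = 2 + (q ** math.ceil(m - delta) - 1) / (q - 1)
            return math.ceil(layers)

        def dbn_width_bound(q: int, m: int) -> float:
            """Widths covered by the bound: n <= m + log_q(m) + 1."""
            return m + math.log(m, q) + 1

        # Illustrative values: binary units (q = 2), m = 4, KL tolerance delta = 0.5.
        q, m, delta = 2, 4, 0.5
        print("layers sufficient:", dbn_layer_bound(q, m, delta))   # 2 + (2**4 - 1)/1 = 17
        print("widths covered, n <=", dbn_width_bound(q, m))        # 4 + log2(4) + 1 = 7.0

    As the tolerance delta shrinks below 1, ceil(m - delta) reaches m, so the sufficient depth scales roughly like q^m / (q - 1).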

    Scaling of Model Approximation Errors and Expected Entropy Distances

    Get PDF
    We compute the expected value of the Kullback-Leibler divergence to various fundamental statistical models with respect to canonical priors on the probability simplex. We obtain closed formulas for the expected model approximation errors, depending on the dimension of the models and the cardinalities of their sample spaces. For the uniform prior, the expected divergence from any model containing the uniform distribution is bounded by a constant $1-\gamma$, and for the models that we consider, this bound is approached if the state space is very large and the models' dimension does not grow too fast. For Dirichlet priors the expected divergence is bounded in a similar way, if the concentration parameters take reasonable values. These results serve as reference values for more complicated statistical models. Comment: 13 pages, 3 figures, WUPES'1
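    As an illustrative cross-check of the 1 - gamma bound (gamma denoting the Euler-Mascheroni constant, so 1 - gamma is roughly 0.4228), the following sketch, my own and not code from the paper, estimates E[D(p || u)] for p drawn uniformly from the simplex, i.e. the divergence from the model consisting of the uniform distribution u alone, and compares it with the closed form log N - (H_N - 1), which follows from the standard expected-entropy formula E[H(p)] = psi(N + 1) - psi(2) under the uniform Dirichlet prior. The estimates approach 1 - gamma from below as the number of states N grows.

        import numpy as np
        from scipy.special import digamma, xlogy

        rng = np.random.default_rng(0)

        def mc_expected_divergence_to_uniform(N: int, samples: int = 10_000) -> float:
            """Monte Carlo estimate of E[D(p || uniform)] in nats,
            with p ~ Dirichlet(1, ..., 1), i.e. uniform on the simplex."""
            p = rng.dirichlet(np.ones(N), size=samples)
            entropy = -np.sum(xlogy(p, p), axis=1)           # H(p), with 0 log 0 = 0
            return float(np.mean(np.log(N) - entropy))       # D(p || u) = log N - H(p)

        euler_gamma = -digamma(1.0)                          # Euler-Mascheroni constant
        for N in (2, 10, 100, 1000):
            closed_form = np.log(N) - (digamma(N + 1) - digamma(2))   # log N - (H_N - 1)
            print(N, mc_expected_divergence_to_uniform(N), closed_form, 1 - euler_gamma)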

    Maximum information divergence from linear and toric models

    Full text link
    We study the problem of maximizing information divergence from a new perspective using logarithmic Voronoi polytopes. We show that for linear models, the maximum is always achieved at the boundary of the probability simplex. For toric models, we present an algorithm that combines the combinatorics of the chamber complex with numerical algebraic geometry. We pay special attention to reducible models and models of maximum likelihood degree one. Comment: 33 pages, 6 figures
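    The boundary statement for linear models can be illustrated numerically. The sketch below is my own toy experiment, not the paper's algorithm (which combines logarithmic Voronoi polytopes with the chamber complex and numerical algebraic geometry): it takes a one-dimensional mixture family M on three states with arbitrarily chosen full-support endpoints a and b, then grid-searches the simplex for the maximizer of D(p || M); the grid maximizer should come out on the boundary of the simplex, consistent with the stated result.

        import numpy as np
        from scipy.optimize import minimize_scalar

        # One-dimensional linear (mixture) model on three states:
        # M = { (1 - t) * a + t * b : t in [0, 1] }, with full-support endpoints a, b.
        a = np.array([0.7, 0.2, 0.1])
        b = np.array([0.1, 0.3, 0.6])

        def kl(p, q):
            mask = p > 0
            return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

        def divergence_from_model(p):
            # D(p || M) = min_{q in M} D(p || q), minimized over the mixture weight t.
            res = minimize_scalar(lambda t: kl(p, (1 - t) * a + t * b),
                                  bounds=(0.0, 1.0), method="bounded")
            return res.fun

        # Grid search over the probability simplex on three states.
        n = 60
        best_p, best_val = None, -np.inf
        for i in range(n + 1):
            for j in range(n + 1 - i):
                p = np.array([i, j, n - i - j]) / n
                val = divergence_from_model(p)
                if val > best_val:
                    best_p, best_val = p, val

        print("grid maximizer:", best_p, " max divergence:", best_val)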

    Finding the Maximizers of the Information Divergence from an Exponential Family

    No full text
    The subject of this thesis is the maximization of the information divergence from an exponential family on a finite set, a problem first formulated by Nihat Ay. A special case is the maximization of the mutual information or the multi-information between different parts of a composite system (see the sketch following the contents below). My thesis contributes mainly to the mathematical aspects of the optimization problem. A reformulation is found that relates the maximization of the information divergence to the maximization of an entropic quantity defined on the normal space of the exponential family. This reformulation simplifies calculations in concrete cases and gives theoretical insight into the general problem. A second emphasis of the thesis is on examples that demonstrate how the theoretical results can be applied in particular cases. Third, my thesis contains first results on the characterization of exponential families with a small maximum value of the information divergence.
    Contents:
    1. Introduction
    2. Exponential families
       2.1. Exponential families, the convex support and the moment map
       2.2. The closure of an exponential family
       2.3. Algebraic exponential families
       2.4. Hierarchical models
    3. Maximizing the information divergence from an exponential family
       3.1. The directional derivatives of D(·|E)
       3.2. Projection points and kernel distributions
       3.3. The function D_E
       3.4. The first order optimality conditions of D_E
       3.5. The relation between D(·|E) and D_E
       3.6. Computing the critical points
       3.7. Computing the projection points
    4. Examples
       4.1. Low-dimensional exponential families
            4.1.1. Zero-dimensional exponential families
            4.1.2. One-dimensional exponential families
            4.1.3. One-dimensional exponential families on four states
            4.1.4. Other low-dimensional exponential families
       4.2. Partition models
       4.3. Exponential families with max D(·|E) = log(2)
       4.4. Binary i.i.d. models and binomial models
    5. Applications and Outlook
       5.1. Principles of learning, complexity measures and constraints
       5.2. Optimally approximating exponential families
       5.3. Asymptotic behaviour of the empirical information divergence
    A. Polytopes and oriented matroids
       A.1. Polytopes
       A.2. Oriented matroids
    Bibliography
    Index
    Glossary of notation
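    As a concrete instance of the mutual-information special case mentioned in the abstract, here is a small sketch (my own illustration, not code from the thesis). For two binary variables, the rI-projection onto the independence model is the product of the marginals, so the divergence from that family equals the mutual information I(X; Y); its maximum value log 2 is attained at a perfectly correlated joint distribution, matching the families with maximum divergence log(2) discussed in Section 4.3. A random search over joint distributions approaches, but does not exceed, this value.

        import numpy as np

        def mutual_information(p):
            """Mutual information (nats) of a 2x2 joint distribution;
            equals D(p || E) for the independence model E."""
            px = p.sum(axis=1, keepdims=True)
            py = p.sum(axis=0, keepdims=True)
            mask = p > 0
            return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

        rng = np.random.default_rng(0)
        best = max(mutual_information(rng.dirichlet(np.ones(4)).reshape(2, 2))
                   for _ in range(50_000))

        p_corr = np.array([[0.5, 0.0], [0.0, 0.5]])   # perfectly correlated distribution
        print("random-search maximum:", best)
        print("perfectly correlated:  ", mutual_information(p_corr))
        print("log 2:                 ", np.log(2))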