2,065 research outputs found

    A Neural Network model with Bidirectional Whitening

    We present a new model and algorithm that performs efficient natural gradient descent for multilayer perceptrons. Natural gradient descent was originally proposed from the viewpoint of information geometry, and it performs steepest-descent updates on manifolds in a Riemannian space. In particular, we extend the approach taken by the "Whitened Neural Networks" model: we carry out the whitening process not only in the feed-forward direction, as in the original model, but also in the back-propagation phase. The efficacy of this "bidirectional whitened neural networks" model is demonstrated on handwritten character recognition data (the MNIST data). Comment: 16 pages
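
    As a concrete illustration of the idea, the sketch below whitens a layer's inputs before a toy linear map (the feed-forward whitening of the original model) and applies the same decorrelation to a placeholder back-propagated error signal before forming the weight gradient, which is the "bidirectional" extension described above. This is a minimal sketch, not the authors' implementation; the ZCA-style whitening, layer sizes, and gradient formula are illustrative assumptions.

    import numpy as np

    def whitening_matrix(x, eps=1e-5):
        """Return (W, mu) such that (x - mu) @ W has approximately identity covariance."""
        mu = x.mean(axis=0)
        cov = np.cov(x - mu, rowvar=False)
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T, mu

    rng = np.random.default_rng(0)
    x = rng.normal(size=(256, 20)) @ rng.normal(size=(20, 20))  # correlated toy inputs

    # Feed-forward whitening, as in the original whitened-networks model.
    W_fwd, mu_fwd = whitening_matrix(x)
    x_white = (x - mu_fwd) @ W_fwd
    weights = rng.normal(size=(20, 10)) * 0.1                   # toy linear layer
    h = x_white @ weights

    # Back-propagation-phase whitening: decorrelate the error signal before it
    # is used to form the weight gradient (the bidirectional extension).
    delta = rng.normal(size=(256, 10))                          # placeholder errors
    W_bwd, mu_bwd = whitening_matrix(delta)
    delta_white = (delta - mu_bwd) @ W_bwd
    grad = x_white.T @ delta_white / x.shape[0]                 # whitened gradient estimate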

    On-Line Learning Theory of Soft Committee Machines with Correlated Hidden Units - Steepest Gradient Descent and Natural Gradient Descent -

    The permutation symmetry of the hidden units in multilayer perceptrons causes saddle structures and plateaus in the learning dynamics of gradient learning methods. The correlation between the weight vectors of hidden units in a teacher network is thought to affect this saddle structure, resulting in a prolonged learning time, but the mechanism is still unclear. In this paper, we discuss it for soft committee machines and on-line learning using statistical mechanics. Conventional gradient descent needs more time to break the symmetry as the correlation of the teacher weight vectors rises. On the other hand, no plateaus occur with natural gradient descent, regardless of the correlation, in the limit of a low learning rate. Analytical results around the saddle point support these dynamics. Comment: 7 pages, 6 figures
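
    To make the on-line setting concrete, the following toy simulation trains a two-hidden-unit soft committee machine on examples generated by a teacher whose two weight vectors have a chosen correlation, using conventional stochastic gradient descent (not natural gradient). The network size, learning rate, correlation level, and erf activation are illustrative assumptions rather than the parameters analyzed in the paper.

    import numpy as np
    from scipy.special import erf

    N, K, eta, steps = 100, 2, 0.1, 20000
    rng = np.random.default_rng(1)

    # Teacher weight vectors with an assumed correlation of 0.5, normalized to |B_k|^2 = N.
    B1 = rng.normal(size=N)
    B2 = 0.5 * B1 + np.sqrt(1 - 0.5**2) * rng.normal(size=N)
    B = np.vstack([B1, B2])
    B *= np.sqrt(N) / np.linalg.norm(B, axis=1, keepdims=True)

    g = lambda h: erf(h / np.sqrt(2))            # soft-committee activation
    J = rng.normal(size=(K, N)) * 0.01           # student starts near the origin

    for t in range(steps):
        x = rng.normal(size=N)                   # on-line learning: a fresh example each step
        h_s, h_t = J @ x / np.sqrt(N), B @ x / np.sqrt(N)
        err = g(h_s).sum() - g(h_t).sum()        # student output minus teacher output
        gprime = np.sqrt(2 / np.pi) * np.exp(-h_s**2 / 2)
        J -= (eta / N) * err * gprime[:, None] * x   # conventional gradient step (1/N rate scaling)

    print(np.round(J @ B.T / N, 2))              # student-teacher overlaps R_ik after training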

    State Concentration Exponent as a Measure of Quickness in Kauffman-type Networks

    We study the dynamics of randomly connected networks composed of binary Boolean elements and those composed of binary majority vote elements. We elucidate their differences in both the sparsely and the densely connected cases. The quickness of large network dynamics is usually quantified by the length of transient paths, an analytically intractable measure. For discrete-time dynamics of networks of binary elements, we address this dilemma with an alternative unified framework based on a concept termed state concentration, defined as the exponent of the average number of t-step ancestors in state transition graphs. The state transition graph is defined by nodes corresponding to network states and directed links corresponding to transitions. Using this exponent, we interrogate the dynamics of random Boolean and majority vote networks. We find that extremely sparse Boolean networks and majority vote networks with arbitrary density achieve quickness, owing in part to long-tailed in-degree distributions. As a corollary, only relatively dense majority vote networks can achieve both quickness and robustness. Comment: 6 figures
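
    The ancestor count behind the state concentration exponent can be computed directly for a small network. The sketch below builds a random Boolean network with N = 8 elements, enumerates the full state transition graph, and reports the average number of t-step ancestors of the state reached from a uniformly random initial state; the exponent discussed above is defined from how this count scales with system size, which the sketch does not extract. The network size, in-degree, and random seed are assumptions.

    import numpy as np
    from itertools import product
    from collections import Counter

    N, K = 8, 2                                      # small enough to enumerate all 2^N states
    rng = np.random.default_rng(2)
    wires = [rng.choice(N, size=K, replace=False) for _ in range(N)]
    tables = [rng.integers(0, 2, size=2**K) for _ in range(N)]   # random Boolean functions

    def step(state):
        """One synchronous update of the random Boolean network."""
        return tuple(int(tables[i][int("".join(str(state[j]) for j in wires[i]), 2)])
                     for i in range(N))

    states = list(product((0, 1), repeat=N))
    current = {s: s for s in states}                 # maps each initial state to F^t(state)

    for t in range(1, 6):
        current = {s: step(v) for s, v in current.items()}
        sizes = Counter(current.values())            # |F^{-t}(target)| for each reached target
        # Average number of t-step ancestors of the state reached from a random start.
        avg_ancestors = sum(c * c for c in sizes.values()) / len(states)
        print(t, avg_ancestors)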

    Laplace's rule of succession in information geometry

    Laplace's "add-one" rule of succession modifies the observed frequencies in a sequence of heads and tails by adding one to the observed counts. This improves prediction by avoiding zero probabilities and corresponds to a uniform Bayesian prior on the parameter. The canonical Jeffreys prior corresponds to the "add-one-half" rule. We prove that, for exponential families of distributions, such Bayesian predictors can be approximated by taking the average of the maximum likelihood predictor and the \emph{sequential normalized maximum likelihood} predictor from information theory. Thus in this case it is possible to approximate Bayesian predictors without the cost of integrating or sampling in parameter space

    Fluctuation Theorems on Nishimori Line

    The distribution of the performed work for spin glasses with gauge symmetry is considered. With the aid of the gauge symmetry, which leads to exact and rigorous results in spin glasses, we find a remarkable relation for the performed work in the form of a fluctuation theorem. The integral form of the resulting relation reproduces the Jarzynski-type equation for spin glasses that we obtained previously. We show that similar relations can be established not only for the distribution of the performed work but also for that of the free energy of spin glasses with gauge symmetry, which provides another interpretation of the phase transition in spin glasses. Comment: 10 pages, 1 figure
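
    The integral form referred to here is the generic Jarzynski relation <exp(-beta*W)> = exp(-beta*dF). As a plain numerical reminder of that form (not the gauge-symmetric spin-glass derivation above), the sketch below checks it for an instantaneous quench of a toy discrete system, where the work distribution can be sampled exactly from the initial equilibrium state; the energies and temperature are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    beta = 1.0

    # Toy system: 10 discrete states whose energies are switched instantaneously from E0 to E1.
    E0 = rng.uniform(0, 2, size=10)
    E1 = rng.uniform(0, 2, size=10)
    Z0, Z1 = np.exp(-beta * E0).sum(), np.exp(-beta * E1).sum()
    dF = -np.log(Z1 / Z0) / beta                     # exact free-energy difference

    # Sample initial states from equilibrium and record the work W = E1(s) - E0(s).
    p0 = np.exp(-beta * E0) / Z0
    samples = rng.choice(10, size=200000, p=p0)
    W = E1[samples] - E0[samples]

    print("exp(-beta*dF):        ", np.exp(-beta * dF))
    print("<exp(-beta*W)> sample:", np.exp(-beta * W).mean())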

    Parametric Fokker-Planck equation

    We derive the Fokker-Planck equation on parameter space. It is the Wasserstein gradient flow of relative entropy on the statistical manifold. We pull the PDE back to a finite-dimensional ODE on parameter space. Analytical and numerical examples are presented.
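
    For a Gaussian family and a quadratic potential the pull-back is explicit, which gives a small self-contained example of the parametric reduction (an assumed special case; the paper treats general parametric families). The Fokker-Planck equation for V(x) = x^2/2 restricted to the family N(m, s^2) reduces to two ODEs for m and s^2, which the sketch integrates and compares against the exact Ornstein-Uhlenbeck moments.

    import numpy as np

    # Fokker-Planck for V(x) = x^2/2:  d rho/dt = d/dx(rho * x) + (1/beta) d^2 rho/dx^2.
    # Restricted to the Gaussian family N(m, s^2), the PDE pulls back to the ODEs
    #   dm/dt = -m,      d(s^2)/dt = -2 s^2 + 2/beta.
    beta, dt, T = 2.0, 1e-3, 3.0
    m, s2 = 2.0, 0.25                                # initial Gaussian parameters (assumed)

    for _ in range(int(T / dt)):                     # simple forward-Euler integration
        m += dt * (-m)
        s2 += dt * (-2.0 * s2 + 2.0 / beta)

    # Exact Ornstein-Uhlenbeck moments for comparison.
    m_exact = 2.0 * np.exp(-T)
    s2_exact = 0.25 * np.exp(-2 * T) + (1 / beta) * (1 - np.exp(-2 * T))
    print(round(m, 4), round(m_exact, 4))
    print(round(s2, 4), round(s2_exact, 4))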

    Information Geometry, Inference Methods and Chaotic Energy Levels Statistics

    In this Letter, we propose a novel information-geometric characterization of the chaotic (integrable) energy-level statistics of a quantum antiferromagnetic Ising spin chain in a tilted (transverse) external magnetic field. Finally, we conjecture that our results might find physical applications in quantum energy-level statistics. Comment: 9 pages, added correct journal reference

    Pushing Stochastic Gradient towards Second-Order Methods -- Backpropagation Learning with Transformations in Nonlinearities

    Recently, we proposed transforming the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and using separate shortcut connections to model the linear dependencies instead. We continue that work, first by introducing a third transformation that normalizes the scale of the outputs of each hidden neuron, and second by analyzing the connections to second-order optimization methods. We show that the transformations make simple stochastic gradient descent behave more like second-order optimization methods and thus speed up learning. This is shown both in theory and in experiments. The experiments on the third transformation show that while it further increases the speed of learning, it can also hurt performance by converging to a worse local optimum, where both the inputs and outputs of many hidden neurons are close to zero. Comment: 10 pages, 5 figures, ICLR201
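
    The first two transformations are simple enough to sketch directly: shift and tilt a tanh nonlinearity so that, on a batch, its outputs and its slopes average to zero, and route the removed linear part through separate shortcut weights. The sketch below illustrates that idea in isolation; the layer sizes and shortcut weights are assumptions, the third (scale-normalizing) transformation is omitted, and this is not the authors' exact parameterization.

    import numpy as np

    def transformed_tanh(x, a, b):
        """Nonlinearity with the linear part and offset removed: f(x) = tanh(x) - a*x - b."""
        return np.tanh(x) - a * x - b

    rng = np.random.default_rng(4)
    x = rng.normal(size=(512, 30))                   # pre-activations of one hidden layer (toy)

    # Choose a and b so that, on this batch, the outputs and the slopes of each
    # hidden neuron average to zero, as proposed above.
    a = (1.0 - np.tanh(x)**2).mean(axis=0)           # average slope of tanh per neuron
    b = (np.tanh(x) - a * x).mean(axis=0)            # remove the residual mean output
    h = transformed_tanh(x, a, b)

    # The removed linear dependencies are carried by separate shortcut (skip) weights.
    W_shortcut = rng.normal(size=(30, 10)) * 0.1     # hypothetical shortcut weights
    W_hidden = rng.normal(size=(30, 10)) * 0.1
    y = h @ W_hidden + x @ W_shortcut                # nonlinear path plus linear shortcut path

    print(np.abs(h.mean(axis=0)).max())              # per-neuron output means are ~ 0
    print(np.abs(((1 - np.tanh(x)**2) - a).mean(axis=0)).max())   # per-neuron slope means are ~ 0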

    Bifurcation analysis in an associative memory model

    We previously reported on the chaos induced by frustration of the interactions in a non-monotonic sequential associative memory model, and showed its chaotic behavior at absolute zero. We have now analyzed bifurcations in a stochastic system, namely a finite-temperature version of the non-monotonic sequential associative memory model. We derived order-parameter equations from the stochastic microscopic equations. Two-parameter bifurcation diagrams obtained from those equations show the coexistence of attractors, which do not appear at absolute zero, and the disappearance of chaos due to the temperature effect. Comment: 19 pages
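
    For reference, the sketch below simulates the kind of microscopic model whose macroscopic order-parameter equations are derived above: a sequential associative memory with a non-monotonic transfer function, iterated deterministically (i.e., at absolute zero, for simplicity) while tracking the pattern overlaps that serve as order parameters. The network size, threshold, and number of patterns are assumptions, and the finite-temperature (stochastic) dynamics analyzed in the paper are not included.

    import numpy as np

    rng = np.random.default_rng(5)
    N, P, theta = 2000, 5, 1.5                       # assumed network size, patterns, threshold
    xi = rng.choice([-1, 1], size=(P, N))            # a cyclic sequence of random patterns

    # Couplings that store the pattern sequence xi^0 -> xi^1 -> ... -> xi^0.
    J = sum(np.outer(xi[(mu + 1) % P], xi[mu]) for mu in range(P)) / N

    def f(h):
        """Non-monotonic transfer: sign(h) below the threshold, reversed beyond it."""
        return np.where(np.abs(h) < theta, np.sign(h), -np.sign(h))

    s = xi[0].copy()                                 # start on the first pattern
    for t in range(10):
        s = f(J @ s)                                 # synchronous deterministic update
        overlaps = xi @ s / N                        # order parameters m^mu(t)
        print(t, np.round(overlaps, 2))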