62 research outputs found

    Blockwise perturbation theory for nearly uncoupled Markov chains and its application

    Abstract: Let P be the transition matrix of a nearly uncoupled Markov chain. The states can be grouped into aggregates such that P has the block form P = (P_{ij})_{i,j=1}^{k}, where each P_{ii} is square and P_{ij} is small for i ≠ j. Let π^T be the stationary distribution, partitioned conformally as π^T = (π_1^T, …, π_k^T). In this paper we bound the relative error in each aggregate distribution π_i^T caused by small relative perturbations in the blocks P_{ij}. The error bounds demonstrate that nearly uncoupled Markov chains usually lead to well-conditioned problems in the sense of blockwise relative error. As an application, we show that with appropriate stopping criteria, iterative aggregation/disaggregation algorithms achieve such structured backward errors and compute each aggregate distribution with high relative accuracy.
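    As a concrete illustration of the block structure described above (not taken from the paper), the following Python sketch builds a small nearly uncoupled chain with two aggregates coupled through entries of size eps, computes the stationary distribution, and partitions it conformally with the blocks. The matrix and the value of eps are illustrative choices.

```python
# A minimal sketch: a nearly uncoupled Markov chain with two aggregates and
# coupling strength eps. We compute the stationary distribution pi^T
# (pi^T P = pi^T) and partition it conformally with the block structure.
import numpy as np

eps = 1e-3  # off-diagonal blocks P_ij (i != j) are O(eps)

# Two 2-state aggregates; rows are adjusted so P stays row-stochastic.
P = np.array([
    [0.7, 0.3 - eps, eps, 0.0],
    [0.4, 0.6 - eps, 0.0, eps],
    [eps, 0.0,       0.5, 0.5 - eps],
    [0.0, eps,       0.2, 0.8 - eps],
])
assert np.allclose(P.sum(axis=1), 1.0)

# Stationary distribution: left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmax(np.real(w))])
pi /= pi.sum()

# Partition conformally: pi^T = (pi_1^T, pi_2^T).
pi_1, pi_2 = pi[:2], pi[2:]
print("aggregate 1:", pi_1, " aggregate 2:", pi_2)
```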

    An Improved Convergence Analysis of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization

    The cyclic block coordinate descent-type (CBCD-type) methods have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that the CBCD-type methods attain an iteration complexity of O(p · log(1/ε)), where ε is a pre-specified accuracy of the objective value and p is the number of blocks. However, such iteration complexity explicitly depends on p, and therefore is at least p times worse than that of gradient descent (GD) methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity of the CBCD-type methods matches that of the GD methods in terms of dependency on p (up to a log^2 p factor). Thus our complexity bounds are sharper than the existing bounds by at least a factor of p / log^2 p. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a log^2 p factor) if the largest and smallest eigenvalues of the Hessian matrix do not scale with p. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones.
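    For readers unfamiliar with the method being analysed, the following sketch shows CBCD with exact block minimisation on a strongly convex quadratic f(x) = 0.5 x^T A x − b^T x. The block size, the number of blocks p, the stopping tolerance, and the test matrix are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of cyclic block coordinate descent (CBCD) with exact block
# minimisation on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x,
# where A is symmetric positive definite.
import numpy as np

rng = np.random.default_rng(0)
p, block = 10, 5                      # p blocks of size `block` (illustrative)
n = p * block
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # well-conditioned SPD matrix
b = rng.standard_normal(n)
x = np.zeros(n)

blocks = [slice(i * block, (i + 1) * block) for i in range(p)]
for sweep in range(100):              # one sweep = one cyclic pass over all blocks
    for s in blocks:
        # Exact minimisation over block s with the other blocks held fixed:
        #   A_ss x_s = b_s - sum_{j != s} A_sj x_j
        r = b[s] - A[s, :] @ x + A[s, s] @ x[s]
        x[s] = np.linalg.solve(A[s, s], r)
    if np.linalg.norm(A @ x - b) < 1e-10:
        break

print("sweeps:", sweep + 1, "residual:", np.linalg.norm(A @ x - b))
```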

    Norm Convergence Rate for Multivariate Quadratic Polynomials of Wigner Matrices

    We study Hermitian non-commutative quadratic polynomials of multiple independent Wigner matrices. We prove that, with the exception of some specific reducible cases, the limiting spectral density of the polynomials always has a square root growth at its edges, and we prove an optimal local law around these edges. Combining these two results, we establish that, as the dimension N of the matrices grows to infinity, the operator norm of such polynomials q converges to a deterministic limit with a rate of convergence of N^{-2/3+o(1)}. Here, the exponent in the rate of convergence is optimal. For the specific reducible cases, we also provide a classification of all possible edge behaviours.
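    The following small numerical illustration (not part of the paper) constructs two independent Wigner matrices with the usual 1/sqrt(N) normalisation and evaluates the operator norm of one Hermitian quadratic polynomial, q = X1 X2 + X2 X1, at a few dimensions N. The specific polynomial and dimensions are arbitrary choices used only to show the objects involved; the paper's result is that such norms approach a deterministic limit at rate N^(-2/3+o(1)).

```python
# A small numerical sketch: two independent GUE-type Wigner matrices X1, X2
# and the Hermitian quadratic polynomial q = X1 X2 + X2 X1; we print the
# operator (spectral) norm of q for a few dimensions N.
import numpy as np

def wigner(n, rng):
    """Complex Hermitian Wigner matrix, normalised so the spectrum stays bounded."""
    a = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2)
    h = (a + a.conj().T) / np.sqrt(2)
    return h / np.sqrt(n)

rng = np.random.default_rng(1)
for n in (200, 400, 800):
    x1, x2 = wigner(n, rng), wigner(n, rng)
    q = x1 @ x2 + x2 @ x1              # Hermitian quadratic polynomial
    norm = np.linalg.norm(q, 2)        # operator norm = largest singular value
    print(f"N = {n:4d}   ||q|| = {norm:.4f}")
```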

    Author index to volumes 301–400


    Parallel Asynchronous Matrix Multiplication for a Distributed Pipelined Neural Network

    Machine learning is an approach to devising algorithms that compute an output without a given rule set, based instead on a self-learning concept. This approach is of great importance for several fields of application in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, previous experience is commonly used to train the network and produce good outputs for newly introduced inputs. By increasing the size of the network, more complex problems can be solved, which in turn requires a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements and to the need for parallelization. Several parallelization approaches for neural networks have already been considered. Most approaches use special-purpose hardware, whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing over the training data.

    In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, large-scale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously, without interrupting ongoing computations of subsequent batches; for this task a one-sided communication scheme is used.

    A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur in a special distributed setup, where each matrix is held by a different process group. GASPI, a parallel programming model from the field of Partitioned Global Address Space (PGAS) models, is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication, it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based on GASPI. In order to efficiently handle the synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied to the matrices. Based on this distribution, the multiplication algorithm iterates diagonally over the sub-blocks of the resulting matrix and computes the sub-blocks in subgroups of the processes. The sub-blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The pairwise communication is set up to be overlapped by other ongoing computations. The implementation poses a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working with which data at what point in time, in order to prevent an unintentional dual use of data.

    The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub-blocks of the matrices exceeds 382. The performance achieved in the test runs fell short of what the theoretical analysis predicted. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072. The implementation using the GASPI API was found not to be straightforward, but to offer good potential for overlapping communication with computations whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come. The poadSGD method seems very promising, especially as nowadays, with larger amounts of data and increasingly complex applications, approaches to the parallelization of neural networks are of growing interest.
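    To make the data layout mentioned in the abstract concrete, the following Python sketch (hypothetical helper functions, not the thesis implementation) shows how a 2-D block-cyclic distribution assigns global block (I, J) to process (I mod Pr, J mod Pc) in a Pr x Pc process grid, and how the blocks of the result matrix can be visited diagonal by diagonal.

```python
# A minimal sketch of a 2-D block-cyclic distribution and a diagonal visiting
# order over the result blocks; helper names and grid sizes are illustrative.
def block_owner(I, J, Pr, Pc):
    """Process-grid coordinates owning global block (I, J) in a Pr x Pc grid."""
    return (I % Pr, J % Pc)

def diagonal_order(nI, nJ):
    """Visit the nI x nJ result blocks diagonal by diagonal (constant I + J)."""
    for d in range(nI + nJ - 1):
        for I in range(max(0, d - nJ + 1), min(nI, d + 1)):
            yield I, d - I

if __name__ == "__main__":
    Pr, Pc = 2, 3                      # illustrative 2 x 3 process grid
    for I, J in diagonal_order(4, 4):  # 4 x 4 grid of result blocks
        print(f"block ({I},{J}) -> process {block_owner(I, J, Pr, Pc)}")
```

    Visiting the result blocks by diagonals keeps consecutively processed blocks on different rows and columns of the process grid, which is consistent with the abstract's aim of balancing work and pairwise communication across the process groups.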