62 research outputs found
Blockwise perturbation theory for nearly uncoupled Markov chains and its application
AbstractLet P be the transition matrix of a nearly uncoupled Markov chain. The states can be grouped into aggregates such that P has the block form P=(Pij)i,j=1k, where Pii is square and Pij is small for i≠j. Let πT be the stationary distribution partitioned conformally as πT=(π1T,…,πkT). In this paper we bound the relative error in each aggregate distribution πiT caused by small relative perturbations in Pij. The error bounds demonstrate that nearly uncoupled Markov chains usually lead to well-conditioned problems in the sense of blockwise relative error. As an application, we show that with appropriate stopping criteria, iterative aggregation/disaggregation algorithms will achieve such structured backward errors and compute each aggregate distribution with high relative accuracy
An Improved Convergence Analysis of Cyclic Block Coordinate Descent-type Methods for Strongly Convex Minimization
The cyclic block coordinate descent-type (CBCD-type) methods have shown remarkable computational performance for solving strongly convex minimization problems. Typical applications include many popular statistical machine learning methods such as elastic-net regression, ridge penalized logistic regression, and sparse additive regression. Existing optimization literature has shown that the CBCD-type methods attain iteration complexity of O(p · log(1/ )), where is a pre-specified accuracy of the objective value, and p is the number of blocks. However, such iteration complexity explicitly depends on p, and therefore is at least p times worse than those of gradient descent methods. To bridge this theoretical gap, we propose an improved convergence analysis for the CBCD-type methods. In particular, we first show that for a family of quadratic minimization problems, the iteration complexity of the CBCD-type methods matches that of the GD methods in term of dependency on p (up to a log 2 p factor). Thus our complexity bounds are sharper than the existing bounds by at least a factor of p/ log 2 p. We also provide a lower bound to confirm that our improved complexity bounds are tight (up to a log 2 p factor) if the largest and smallest eigenvalues of the Hessian matrix do not scale with p. Finally, we generalize our analysis to other strongly convex minimization problems beyond quadratic ones
Norm Convergence Rate for Multivariate Quadratic Polynomials of Wigner Matrices
We study Hermitian non-commutative quadratic polynomials of multiple
independent Wigner matrices. We prove that, with the exception of some specific
reducible cases, the limiting spectral density of the polynomials always has a
square root growth at its edges and prove an optimal local law around these
edges. Combining these two results, we establish that, as the dimension of
the matrices grows to infinity, the operator norm of such polynomials
converges to a deterministic limit with a rate of convergence of
. Here, the exponent in the rate of convergence is optimal. For
the specific reducible cases, we also provide a classification of all possible
edge behaviours.Comment: 38 page
Recommended from our members
Coding mechanisms for communication and compression : analysis of wireless channels and DNA sequencing
textThis thesis comprises of two related but distinct components: Coding arguments for communication channels and information-theoretic analysis for haplotype assembly. The common thread for both problems is utilizing information and coding theoretic principles in understanding their underlying mechanisms. For the first class of problems, I study two practical challenges that prevent optimal discrete codes utilizing in real communication and compression systems, namely, coding over analog alphabet and fading. In particular, I use an expansion coding scheme to convert the original analog channel coding and source coding problems into a set of independent discrete subproblems. By adopting optimal discrete codes over the expanded levels, this low-complexity coding scheme can approach Shannon limit perfectly or in ratio. Meanwhile, I design a polar coding scheme to deal with the unstable state of fading channels. This novel coding mechanism of hierarchically utilizing different types of polar codes has been proved to be ergodic capacity achievable for several fading systems, without channel state information known at the transmitter. For the second class of problems, I build an information-theoretic view for haplotype assembly. More precisely, the recovery of the target pair of haplotype sequences using short reads is rephrased as the joint source-channel coding problem. Two binary messages, representing haplotypes and chromosome memberships of reads, are encoded and transmitted over a channel with erasures and errors, where the channel model reflects salient features of highthroughput sequencing. The focus is on determining the required number of reads for reliable haplotype reconstruction.Electrical and Computer Engineerin
Parallel Asynchronous Matrix Multiplication for a Distributed Pipelined Neural Network
Machine learning is an approach to devise algorithms that compute an output without a given rule set but based on a self-learning concept. This approach is of great importance for several fields of applications in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, commonly previous experience is used to train the network and produce good outputs for newly introduced inputs. By increasing the size of the network more complex problems can be solved which again rely on a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements and to the need for parallelization.
Several parallelization approaches of neural networks have already been considered. Most approaches use special purpose hardware whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing the training data. In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, largescale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur for a special distributed setup, where each matrix is held by a different process group.
GASPI, a parallel programming model from the field of "Partitioned Global Address Spaces" (PGAS) models is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied for the matrices. Based on this distribution, the multiplication algorithm is computed by diagonally iterating over the sub blocks of the resulting matrix and computing the sub blocks in subgroups of the processes. The sub blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The communication in pairs is set up to be overlapped by other ongoing computations. The implementations provide a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working at what point in time with which data in order to prevent an unintentional dual use of data.
The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub blocks of the matrices exceeds 382. The performance achieved in the test runs did not withstand the expectations the theoretical analysis predicted. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072.
The implementation using the GASPI API was found not be straightforward but to provide a good potential for overlapping communication with computations whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come. The poadSGD method seems to be very promising, especially as nowadays, with the larger amount of data and the increased complexity of the applications, the approaches to parallelization of neural networks are increasingly of interest
- …