    Distributing the Kalman Filter for Large-Scale Systems

    This paper derives a \emph{distributed} Kalman filter to estimate a sparsely connected, large-scale, $n$-dimensional dynamical system monitored by a network of $N$ sensors. Local Kalman filters are implemented on the $n_l$-dimensional sub-systems (where $n_l \ll n$) that are obtained after spatially decomposing the large-scale system. The resulting sub-systems overlap, which, along with an assimilation procedure on the local Kalman filters, preserves an $L$th order Gauss-Markovian structure of the centralized error processes. The information loss due to the $L$th order Gauss-Markovian approximation is controllable, as it can be characterized by a divergence that decreases as $L \uparrow$. The order of the approximation, $L$, leads to a lower bound on the dimension of the sub-systems, hence providing a criterion for sub-system selection. The assimilation procedure is carried out on the local error covariances with a distributed iterate collapse inversion (DICI) algorithm that we introduce. The DICI algorithm computes the (approximated) centralized Riccati and Lyapunov equations iteratively with only local communication and low-order computation. We fuse the observations that are common among the local Kalman filters using bipartite fusion graphs and consensus averaging algorithms. The proposed algorithm achieves full distribution of the Kalman filter that is coherent with the centralized Kalman filter with an $L$th order Gauss-Markovian structure on the centralized error processes. Nowhere is storage, communication, or computation of $n$-dimensional vectors and matrices needed; only $n_l \ll n$ dimensional vectors and matrices are communicated or used in the computation at the sensors.
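
The consensus averaging step mentioned in this abstract can be illustrated in isolation. The following is a minimal sketch, not the paper's fusion procedure: it assumes a small path graph of four sensors and uses Metropolis weights, which are a common choice but are an assumption here. The names `metropolis_weights` and `consensus_average` are hypothetical.

```python
def metropolis_weights(neighbors):
    """Build a symmetric, doubly stochastic weight matrix from an adjacency list.

    Metropolis weights are an assumption of this sketch, not taken from the paper.
    """
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        # Off-diagonal entries are set, the diagonal is still zero here.
        W[i][i] = 1.0 - sum(W[i])
    return W


def consensus_average(values, neighbors, iterations=200):
    """Each node repeatedly replaces its value by a weighted mean of itself and its neighbors."""
    W = metropolis_weights(neighbors)
    x = list(values)
    for _ in range(iterations):
        x = [sum(W[i][j] * x[j] for j in [i] + neighbors[i]) for i in range(len(x))]
    return x


# Hypothetical example: four sensors on a path graph 0 - 1 - 2 - 3.
neighbors = [[1], [0, 2], [1, 3], [2]]
readings = [1.0, 2.0, 3.0, 6.0]
estimates = consensus_average(readings, neighbors)
```

Every node only ever reads values from its direct neighbors, yet all entries of `estimates` approach the global mean of `readings`, which is the property a distributed fusion step relies on.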

    Exact Sparse Matrix-Vector Multiplication on GPU's and Multicore Architectures

    We propose different implementations of the sparse matrix-dense vector multiplication (SpMV) for finite fields and rings $\mathbb{Z}/m\mathbb{Z}$. We take advantage of graphics card processors (GPUs) and multi-core architectures. Our aim is to improve the speed of SpMV in the LinBox library, and hence the speed of its black box algorithms. Besides, we use this and a new parallelization of the sigma-basis algorithm in a parallel block Wiedemann rank implementation over finite fields.
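
To make the operation concrete, here is a minimal sequential sketch of an exact sparse matrix-vector product over $\mathbb{Z}/m\mathbb{Z}$, assuming the matrix is stored in compressed sparse row (CSR) form. It is not the LinBox implementation and says nothing about the GPU or multi-core variants; the function name `spmv_mod` and the sample data are hypothetical.

```python
def spmv_mod(indptr, indices, data, x, m):
    """Compute y = A x over Z/mZ for a CSR matrix A.

    indptr[r]:indptr[r + 1] delimits the nonzeros of row r,
    indices holds their column positions and data their values.
    """
    y = []
    for row in range(len(indptr) - 1):
        acc = 0
        for k in range(indptr[row], indptr[row + 1]):
            # Reduce after every product; Python integers cannot overflow,
            # so this is the simplest (not the fastest) reduction strategy.
            acc = (acc + data[k] * x[indices[k]]) % m
        y.append(acc)
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]  with entries taken modulo 7
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [1, 2, 3, 4, 5]
x = [3, 4, 6]
y = spmv_mod(indptr, indices, data, x, 7)
```

In a fixed-width setting one would typically delay the modular reduction as long as the accumulator cannot overflow; that optimization is omitted here for clarity.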

    Covariance Estimation in High Dimensions via Kronecker Product Expansions

    This paper presents a new method for estimating high dimensional covariance matrices. The method, permuted rank-penalized least-squares (PRLS), is based on a Kronecker product series expansion of the true covariance matrix. Assuming an i.i.d. Gaussian random sample, we establish high dimensional rates of convergence to the true covariance as both the number of samples and the number of variables go to infinity. For covariance matrices of low separation rank, our results establish that PRLS has significantly faster convergence than the standard sample covariance matrix (SCM) estimator. The convergence rate captures a fundamental tradeoff between estimation error and approximation error, thus providing a scalable covariance estimation framework in terms of separation rank, similar to low rank approximation of covariance matrices. The MSE convergence rates generalize the high dimensional rates recently obtained for the ML Flip-flop algorithm for Kronecker product covariance estimation. We show that a class of block Toeplitz covariance matrices is approximatable by low separation rank and give bounds on the minimal separation rank $r$ that ensures a given level of bias. Simulations are presented to validate the theoretical bounds. As a real world application, we illustrate the utility of the proposed Kronecker covariance estimator for spatio-temporal linear least squares prediction of multivariate wind speed measurements. Comment: 47 pages, accepted to IEEE Transactions on Signal Processin
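
The notion of separation rank can be illustrated with the standard block rearrangement of a matrix, under which a sum of $r$ Kronecker products becomes a matrix of ordinary rank at most $r$. The sketch below only demonstrates that fact on small integer matrices; it is not the PRLS estimator, and the helper names (`kron`, `rearrange`, `rank`, `add`) and the row-major ordering are assumptions of this example.

```python
from fractions import Fraction


def kron(A, B):
    """Kronecker product of a p x p matrix A and a q x q matrix B."""
    p, q = len(A), len(B)
    return [[A[i][j] * B[k][l] for j in range(p) for l in range(q)]
            for i in range(p) for k in range(q)]


def rearrange(S, p, q):
    """Map the (i, j) block of S to row i * p + j, flattened row by row.

    Under this permutation kron(A, B) becomes the outer product vec(A) vec(B)^T.
    """
    return [[S[i * q + k][j * q + l] for k in range(q) for l in range(q)]
            for i in range(p) for j in range(p)]


def rank(M):
    """Exact rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in r] for r in M]
    rk = 0
    for c in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            f = rows[r][c] / rows[rk][c]
            if f != 0:
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
    return rk


def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical 2 x 2 factors.
A, B = [[2, 1], [1, 2]], [[3, 1], [1, 3]]
C, D = [[1, 0], [0, 1]], [[1, 1], [1, 2]]
S1 = kron(A, B)                   # one Kronecker term
S2 = add(kron(A, B), kron(C, D))  # two Kronecker terms
r1 = rank(rearrange(S1, 2, 2))
r2 = rank(rearrange(S2, 2, 2))
```

Here `r1` is 1 and `r2` is 2, matching the number of Kronecker terms, which is why a rank penalty on the rearranged matrix is a natural way to favor low separation rank.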

    Algebraic approaches for coded caching and distributed computing

    This dissertation examines the power of algebraic methods in two areas of modern interest: caching for large scale content distribution and straggler mitigation within distributed computation. Caching is a popular technique for facilitating large scale content delivery over the Internet. Traditionally, caching operates by storing popular content closer to the end users. Recent work within the domain of information theory demonstrates that allowing coding in the cache and coded transmission from the server (referred to as coded caching) to the end users can allow for significant reductions in the number of bits transmitted from the server to the end users. The first part of this dissertation examines problems within coded caching. The original formulation of the coded caching problem assumes that the server and the end users are connected via a single shared link. In Chapter 2, we consider a more general topology where there is a layer of relay nodes between the server and the users. We propose novel schemes for a class of such networks that satisfy a so-called resolvability property and demonstrate that the performance of our scheme is strictly better than previously proposed schemes. Moreover, the original coded caching scheme requires that each file hosted in the server be partitioned into a large number (i.e., the subpacketization level) of non-overlapping subfiles. From a practical perspective, this is problematic as it means that prior schemes are only applicable when the size of the files is extremely large. In Chapter 3, we propose a novel coded caching scheme that enjoys a significantly lower subpacketization level than prior schemes, while only suffering a marginal increase in the transmission rate. We demonstrate that several schemes with subpacketization levels that are exponentially smaller than the basic scheme can be obtained. The second half of this dissertation deals with large scale distributed matrix computations. 
Distributed matrix multiplication is an important problem, especially in domains such as deep learning of neural networks. It is well recognized that the computation times on distributed clusters are often dominated by the slowest workers (called stragglers). Recently, techniques from coding theory have found applications in straggler mitigation in the specific context of matrix-matrix and matrix-vector multiplication. The computation can be completed as long as a certain number of workers (called the recovery threshold) complete their assigned tasks. In Chapter 4, we consider matrix multiplication under the assumption that the absolute values of the matrix entries are sufficiently small. Under this condition, we present a method with a significantly smaller recovery threshold than prior work. Besides, the prior work suffers from serious numerical issues owing to the condition number of the corresponding real Vandermonde-structured recovery matrices; this condition number grows exponentially in the number of workers. In Chapter 5, we present a novel approach that leverages the properties of circulant permutation matrices and rotation matrices for coded matrix computation. In addition to having an optimal recovery threshold, we demonstrate that an upper bound on the worst case condition number of our recovery matrices grows polynomially in the number of workers.
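
The idea of a recovery threshold can be sketched with a basic Vandermonde-coded matrix-vector product, the kind of scheme the abstract refers to as prior work (not the circulant or rotation-matrix approach of Chapter 5). The example assumes the matrix is split into two row blocks and handed to four hypothetical workers; exact rational arithmetic is used so that the conditioning problem mentioned above does not arise in this toy setting. All function names are illustrative.

```python
from fractions import Fraction


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def encode(blocks, point):
    """Return sum_j point**j * blocks[j], an encoded block of the same shape."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(point ** j * blocks[j][r][c] for j in range(len(blocks)))
             for c in range(cols)] for r in range(rows)]


def solve(V, rhs):
    """Gauss-Jordan elimination over the rationals; rhs holds one vector per row.

    Assumes V is nonsingular (true for a Vandermonde matrix with distinct points).
    """
    k = len(V)
    aug = [[Fraction(x) for x in V[i]] + [Fraction(x) for x in rhs[i]] for i in range(k)]
    for c in range(k):
        pivot = next(r for r in range(c, k) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [x / scale for x in aug[c]]
        for r in range(k):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def decode(points, results, finished):
    """Recover the uncoded block products from the workers that finished."""
    k = len(finished)
    V = [[points[i] ** j for j in range(k)] for i in finished]
    parts = solve(V, [results[i] for i in finished])
    return [y for part in parts for y in part]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
v = [1, 2]
blocks = [A[0:2], A[2:4]]   # k = 2 row blocks
points = [1, 2, 3, 4]       # n = 4 workers, one evaluation point each
results = [matvec(encode(blocks, x), v) for x in points]
# Suppose only workers 1 and 3 respond; the other two are stragglers.
recovered = decode(points, results, finished=[1, 3])
```

Any two of the four workers suffice here, so the recovery threshold of this toy scheme is 2. With floating-point arithmetic and many workers, the Vandermonde system in `decode` is where the numerical difficulties described in the abstract would appear.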

    Polynomial Chaos Expansion of random coefficients and the solution of stochastic partial differential equations in the Tensor Train format

    We apply the Tensor Train (TT) decomposition to construct the tensor product Polynomial Chaos Expansion (PCE) of a random field, to solve the stochastic elliptic diffusion PDE with the stochastic Galerkin discretization, and to compute some quantities of interest (mean, variance, exceedance probabilities). We assume that the random diffusion coefficient is given as a smooth transformation of a Gaussian random field. In this case, the PCE is delivered by a complicated formula, which lacks an analytic TT representation. To construct its TT approximation numerically, we develop the new block TT cross algorithm, a method that computes the whole TT decomposition from a few evaluations of the PCE formula. The new method is conceptually similar to the adaptive cross approximation in the TT format, but is more efficient when several tensors must be stored in the same TT representation, which is the case for the PCE. Besides, we demonstrate how to assemble the stochastic Galerkin matrix and to compute the solution of the elliptic equation and its post-processing, staying in the TT format. We compare our technique with the traditional sparse polynomial chaos and the Monte Carlo approaches. In the tensor product polynomial chaos, the polynomial degree is bounded for each random variable independently. This provides higher accuracy than the sparse polynomial set or the Monte Carlo method, but the cardinality of the tensor product set grows exponentially with the number of random variables. However, when the PCE coefficients are implicitly approximated in the TT format, the computations with the full tensor product polynomial set become possible. In the numerical experiments, we confirm that the new methodology is competitive in a wide range of parameters, especially where high accuracy and high polynomial degrees are required. Comment: This is a major revision of the manuscript arXiv:1406.2816 with significantly extended numerical experiments. Some unused material is remove

    Precoded FIR and Redundant V-BLAST Systems for Frequency-Selective MIMO Channels

    The vertical Bell Labs layered space-time (V-BLAST) system is a multi-input multi-output (MIMO) system designed to achieve good multiplexing gain. In recent literature, a precoder, which exploits channel information, has been added in the V-BLAST transmitter. This precoder forces each symbol stream to have an identical mean square error (MSE). It can be viewed as an alternative to the bit-loading method. In this paper, this precoded V-BLAST system is extended to the case of frequency-selective MIMO channels. Both the FIR and redundant types of transceivers, which use cyclic-prefixing and zero-padding, are considered. A fast algorithm for computing a cyclic-prefixing-based precoded V-BLAST transceiver is developed. Experiments show that the proposed methods with redundancy have better performance than the SVD-based system with optimal power loading and bit loading for frequency-selective MIMO channels. The gain comes from the fact that the MSE-equalizing precoder has better bit-error rate performance than the optimal bit-loading method.

    A Tight Lower Bound for Counting Hamiltonian Cycles via Matrix Rank

    For even $k$, the matchings connectivity matrix $\mathbf{M}_k$ encodes which pairs of perfect matchings on $k$ vertices form a single cycle. Cygan et al. (STOC 2013) showed that the rank of $\mathbf{M}_k$ over $\mathbb{Z}_2$ is $\Theta(\sqrt{2}^k)$ and used this to give an $O^*((2+\sqrt{2})^{\mathsf{pw}})$ time algorithm for counting Hamiltonian cycles modulo $2$ on graphs of pathwidth $\mathsf{pw}$. The same authors complemented their algorithm by an essentially tight lower bound under the Strong Exponential Time Hypothesis (SETH). This bound crucially relied on a large permutation submatrix within $\mathbf{M}_k$, which enabled a "pattern propagation" commonly used in previous related lower bounds, as initiated by Lokshtanov et al. (SODA 2011). We present a new technique for a similar pattern propagation when only a black-box lower bound on the asymptotic rank of $\mathbf{M}_k$ is given; no stronger structural insights such as the existence of large permutation submatrices in $\mathbf{M}_k$ are needed. Given appropriate rank bounds, our technique yields lower bounds for counting Hamiltonian cycles (also modulo fixed primes $p$) parameterized by pathwidth. To apply this technique, we prove that the rank of $\mathbf{M}_k$ over the rationals is $4^k / \mathrm{poly}(k)$. We also show that the rank of $\mathbf{M}_k$ over $\mathbb{Z}_p$ is $\Omega(1.97^k)$ for any prime $p \neq 2$ and even $\Omega(2.15^k)$ for some primes. As a consequence, we obtain that Hamiltonian cycles cannot be counted in time $O^*((6-\epsilon)^{\mathsf{pw}})$ for any $\epsilon > 0$ unless SETH fails. This bound is tight due to an $O^*(6^{\mathsf{pw}})$ time algorithm by Bodlaender et al. (ICALP 2013).
Under SETH, we also obtain that Hamiltonian cycles cannot be counted modulo primes $p \neq 2$ in time $O^*(3.97^{\mathsf{pw}})$, indicating that the modulus can affect the complexity in intricate ways. Comment: improved lower bounds modulo primes, improved figures, to appear in SODA 201
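
For small $k$ the matchings connectivity matrix can be built by brute force, which makes the definition in this abstract tangible. The sketch below enumerates perfect matchings on $k$ labeled vertices, marks a pair with 1 when the union of the two matchings is a single cycle through all $k$ vertices, and computes the rank over $\mathbb{Z}_2$. The function names are hypothetical, and the code is only an illustration of the definition, not of the paper's proof technique.

```python
def perfect_matchings(vertices):
    """Yield all perfect matchings of the given (even-sized) vertex list."""
    if not vertices:
        yield []
        return
    first, rest = vertices[0], vertices[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for m in perfect_matchings(remaining):
            yield [(first, partner)] + m


def forms_single_cycle(m1, m2, k):
    """True if the union of m1 and m2 is one cycle visiting all k vertices."""
    p1, p2 = {}, {}
    for a, b in m1:
        p1[a], p1[b] = b, a
    for a, b in m2:
        p2[a], p2[b] = b, a
    v, steps = 0, 0
    while True:
        v = p2[p1[v]]  # follow one edge of m1, then one edge of m2
        steps += 2
        if v == 0:
            break
    return steps == k


def connectivity_matrix(k):
    ms = list(perfect_matchings(list(range(k))))
    return [[int(forms_single_cycle(a, b, k)) for b in ms] for a in ms]


def rank_gf2(M):
    """Rank over Z_2 using rows packed into integers and XOR elimination."""
    rows = [sum(bit << j for j, bit in enumerate(r)) for r in M]
    rank = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(rank, len(rows)) if (rows[i] >> col) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and (rows[i] >> col) & 1:
                rows[i] ^= rows[rank]
        rank += 1
    return rank


M4 = connectivity_matrix(4)
rank4 = rank_gf2(M4)
```

For $k = 4$ there are three perfect matchings, and any two distinct ones form a 4-cycle, so `M4` is the all-ones matrix minus the identity and `rank4` is 2. The number of matchings grows as $(k-1)!!$, so this brute-force construction is only practical for very small $k$.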

    Kronecker Product Factorization of de Boor's Mixed-Radix FFT
