
    A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices

    We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices, which are required in different scientific applications. We follow the idea of Approximate Computing, allowing imprecision in the final result in order to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n × n matrix, the proposed algorithm allows the calculations to be distributed over n nodes with little communication overhead. The approximate result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that results are still valuable for error-resilient applications like preconditioning even for ill-conditioned matrices. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation of the algorithm using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster comprising 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution.
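
The idea in the abstract above (one independent calculation per column, with a result that keeps the input's sparsity pattern) can be illustrated with a small dense NumPy sketch. This is an assumption-laden reading of the abstract, not the authors' MPI/OpenMP implementation: the function names `inv_pth_root_dense` and `submatrix_inv_pth_root` are hypothetical, the matrix is assumed symmetric positive definite with a nonzero diagonal, and the dense inverse root is computed by eigendecomposition purely for illustration.

```python
import numpy as np


def inv_pth_root_dense(m, p):
    """Inverse p-th root of a symmetric positive definite matrix.

    Uses an eigendecomposition m = V diag(w) V^T and returns
    V diag(w^(-1/p)) V^T. Illustrative only; not taken from the paper.
    """
    w, v = np.linalg.eigh(m)
    return (v * w ** (-1.0 / p)) @ v.T


def submatrix_inv_pth_root(a, p):
    """Approximate a^(-1/p) column by column (hypothetical sketch).

    For column i, the rows holding nonzeros select a small principal
    submatrix; its dense inverse p-th root is computed, and the column
    belonging to i is copied back into the result. Every column is
    independent of the others, so the loop could be spread over n workers.
    """
    n = a.shape[0]
    result = np.zeros_like(a, dtype=float)
    for i in range(n):
        idx = np.flatnonzero(a[:, i])            # nonzero rows of column i
        sub = a[np.ix_(idx, idx)]                # dense principal submatrix
        sub_root = inv_pth_root_dense(sub, p)
        local = int(np.flatnonzero(idx == i)[0])  # position of i inside idx
        result[idx, i] = sub_root[:, local]      # same sparsity pattern as a
    return result


# Well-conditioned tridiagonal test matrix (an arbitrary choice).
a = 4.0 * np.eye(6) + 0.1 * (np.eye(6, k=1) + np.eye(6, k=-1))
approx = submatrix_inv_pth_root(a, p=2)
exact = inv_pth_root_dense(a, p=2)
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

In this sketch the loop body touches only the entries of one column's submatrix, which is why the work could in principle be split across up to n workers.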

    Feasibility and performances of compressed-sensing and sparse map-making with Herschel/PACS data

    The Herschel Space Observatory of ESA was launched in May 2009 and has been in operation since. From its distant orbit around L2, it needs to transmit a huge quantity of information through a very limited bandwidth. This is especially true for the PACS imaging camera, which needs to compress its data far more than can be achieved with lossless compression. This is currently solved by including lossy averaging and rounding steps on board. Recently, a new theory called compressed-sensing emerged from the statistics community. This theory makes use of the sparsity of natural (or astrophysical) images to optimize the acquisition scheme of the data needed to estimate those images. Thus, it can lead to high compression factors. A previous article by Bobin et al. (2008) showed how the new theory could be applied to simulated Herschel/PACS data to solve the compression requirement of the instrument. In this article, we show that compressed-sensing theory can indeed be successfully applied to actual Herschel/PACS data and give significant improvements over the standard pipeline. In order to fully use the redundancy present in the data, we perform full sky map estimation and decompression at the same time, which cannot be done in most other compression methods. We also demonstrate that the various artifacts affecting the data (pink noise, glitches, whose behavior is a priori not well compatible with compressed-sensing) can be handled as well in this new framework. Finally, we make a comparison between the methods from the compressed-sensing scheme and data acquired with the standard compression scheme. We discuss improvements that can be made on ground for the creation of sky maps from the data. Comment: 11 pages, 6 figures, 5 tables, peer-reviewed article

    Distributing the Kalman Filter for Large-Scale Systems

    This paper derives a distributed Kalman filter to estimate a sparsely connected, large-scale, $n$-dimensional, dynamical system monitored by a network of $N$ sensors. Local Kalman filters are implemented on the ($n_l$-dimensional, where $n_l \ll n$) sub-systems that are obtained after spatially decomposing the large-scale system. The resulting sub-systems overlap, which, along with an assimilation procedure on the local Kalman filters, preserves an $L$th order Gauss-Markovian structure of the centralized error processes. The information loss due to the $L$th order Gauss-Markovian approximation is controllable as it can be characterized by a divergence that decreases as $L \uparrow$. The order of the approximation, $L$, leads to a lower bound on the dimension of the sub-systems, hence providing a criterion for sub-system selection. The assimilation procedure is carried out on the local error covariances with a distributed iterate collapse inversion (DICI) algorithm that we introduce. The DICI algorithm computes the (approximated) centralized Riccati and Lyapunov equations iteratively with only local communication and low-order computation. We fuse the observations that are common among the local Kalman filters using bipartite fusion graphs and consensus averaging algorithms. The proposed algorithm achieves full distribution of the Kalman filter that is coherent with the centralized Kalman filter with an $L$th order Gauss-Markovian structure on the centralized error processes. Nowhere is storage, communication, or computation of $n$-dimensional vectors and matrices needed; only $n_l \ll n$ dimensional vectors and matrices are communicated or used in the computation at the sensors.

    Parallel matrix inversion techniques

    In this paper, we present techniques for inverting sparse, symmetric and positive definite matrices on parallel and distributed computers. We propose two algorithms, one for SIMD implementation and the other for MIMD implementation. These algorithms are modified versions of Gaussian elimination and they take into account the sparseness of the matrix. Our algorithms perform better than the general parallel Gaussian elimination algorithm. In order to demonstrate the usefulness of our technique, we implemented the snake problem using our sparse matrix algorithm. Our studies reveal that the proposed sparse matrix inversion algorithm significantly reduces the time taken for obtaining the solution of the snake problem. In this paper, we present the results of our experimental work.

    GPU-Accelerated Algorithms for Compressed Signals Recovery with Application to Astronomical Imagery Deblurring

    Compressive sensing promises to enable bandwidth-efficient on-board compression of astronomical data by lifting the encoding complexity from the source to the receiver. The signal is recovered off-line, exploiting the parallel computation capabilities of GPUs to speed up the reconstruction process. However, inherent GPU hardware constraints limit the size of the recoverable signal and the speedup practically achievable. In this work, we design parallel algorithms that exploit the properties of circulant matrices for efficient GPU-accelerated sparse signal recovery. Our approach reduces the memory requirements, allowing us to recover very large signals with limited memory. In addition, it achieves a tenfold signal recovery speedup thanks to ad-hoc parallelization of matrix-vector multiplications and matrix inversions. Finally, we practically demonstrate our algorithms in a typical application of circulant matrices: deblurring a sparse astronomical image in the compressed domain.
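
One well-known property of circulant matrices that fits the memory and speed claims above is that they are diagonalized by the discrete Fourier transform, so only the first column needs to be stored and both products and solves reduce to FFTs. The following CPU-only NumPy sketch illustrates that property; it is not the paper's GPU algorithm, and the function names `circulant_matvec` and `circulant_solve` are hypothetical.

```python
import numpy as np


def circulant_matvec(c, x):
    """Multiply the circulant matrix with first column c by the vector x.

    Uses C x = ifft(fft(c) * fft(x)), so the n x n matrix is never formed
    and only n values are stored. Assumes real-valued c and x.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))


def circulant_solve(c, b):
    """Solve C x = b for the circulant matrix with first column c.

    Assumes C is invertible, i.e. no entry of fft(c) is zero.
    """
    return np.real(np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)))


# Small check against the explicitly formed matrix (arbitrary example data).
c = np.array([4.0, 1.0, 0.0, 1.0])
x = np.array([1.0, 2.0, 3.0, 4.0])
dense = np.array([[c[(i - j) % 4] for j in range(4)] for i in range(4)])
print(np.allclose(circulant_matvec(c, x), dense @ x))
print(np.allclose(dense @ circulant_solve(c, x), x))
```

Both routines cost O(n log n) time and O(n) memory, compared with O(n^2) storage for the explicit matrix.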

    Parallel computation of optimized arrays for 2-D electrical imaging surveys

    Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman–Morrison Rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage that it required several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate the change in the model resolution rather than the entire resolution matrix. This reduces the computer time and memory required as well as the computational round-off errors. The matrix–vector multiplications for a single add-on array were replaced with matrix–matrix multiplications for 28 add-on arrays to further reduce the computer time. The temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize computer memory access. A further reduction in the computer time is achieved by using the computer graphics card Graphics Processor Unit (GPU) as a highly parallel mathematical coprocessor. This makes it possible to carry out the calculations for 512 add-on arrays in parallel using the GPU. The changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations after each iteration to the existing set. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on commonly available microcomputers.
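
The Sherman–Morrison rank-1 update mentioned above refreshes a known inverse after a rank-1 change without re-inverting the whole matrix: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below shows only this generic identity in NumPy; how the paper maps it onto the model resolution matrix is not stated in the abstract, and the function name `sherman_morrison_update` is hypothetical.

```python
import numpy as np


def sherman_morrison_update(a_inv, u, v):
    """Return the inverse of (A + u v^T) given a_inv = A^{-1}.

    Costs O(n^2) instead of the O(n^3) of a fresh inversion. Raises
    ValueError when 1 + v^T A^{-1} u is numerically zero, in which case
    the updated matrix is singular (or nearly so).
    """
    au = a_inv @ u          # A^{-1} u
    va = v @ a_inv          # v^T A^{-1}
    denom = 1.0 + v @ au    # scalar 1 + v^T A^{-1} u
    if abs(denom) < 1e-12:
        raise ValueError("rank-1 update makes the matrix singular")
    return a_inv - np.outer(au, va) / denom


# Arbitrary example data, compared against a direct inversion.
a = np.array([[4.0, 1.0], [1.0, 3.0]])
u = np.array([1.0, 0.0])
v = np.array([0.0, 2.0])
updated = sherman_morrison_update(np.linalg.inv(a), u, v)
print(np.allclose(updated, np.linalg.inv(a + np.outer(u, v))))
```

Adding one measurement row g to a least-squares system changes the normal matrix by g g^T, which corresponds to calling the function with u = v = g.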

    Enabling Factor Analysis on Thousand-Subject Neuroimaging Datasets

    The scale of functional magnetic resonance image data is rapidly increasing as large multi-subject datasets are becoming widely available and high-resolution scanners are adopted. The inherent low-dimensionality of the information in this data has led neuroscientists to consider factor analysis methods to extract and analyze the underlying brain activity. In this work, we consider two recent multi-subject factor analysis methods: the Shared Response Model and Hierarchical Topographic Factor Analysis. We perform analytical, algorithmic, and code optimization to enable multi-node parallel implementations to scale. Single-node improvements result in 99x and 1812x speedups on these two methods, and enable the processing of larger datasets. Our distributed implementations show strong scaling of 3.3x and 5.5x respectively with 20 nodes on real datasets. We also demonstrate weak scaling on a synthetic dataset with 1024 subjects, on up to 1024 nodes and 32,768 cores.