
    Exploiting Parallelization in Spatial Statistics: an Applied Survey using R.

    Computing tasks may be parallelized top-down by splitting them into per-node chunks when the tasks permit this kind of division, and particularly when there is little or no need for communication between the nodes. Another approach is to parallelize bottom-up, by substituting multi-threaded low-level functions for single-threaded ones in otherwise unchanged user-level functions. This survey examines the timings of typical spatial data analysis tasks across a range of data sizes and hardware under different combinations of these two approaches. Conclusions are drawn concerning choices of alternatives for parallelization, and attention is drawn to factors conditioning those choices.
    Keywords: Statistical software; Parallelization; Optimized linear algebra subroutines; Multicore processors; Spatial statistics.
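
    To make the two strategies concrete, here is a minimal Python sketch (the survey itself works in R): the top-down route splits independent chunks across worker processes, while the bottom-up route keeps user code single-threaded and relies on a multi-threaded BLAS beneath the linear-algebra calls. The task, sizes, and worker count are illustrative, not taken from the paper.

        import numpy as np
        from multiprocessing import Pool

        def spatial_task(chunk):
            # Stand-in for per-chunk spatial-statistics work: a dense solve.
            a, b = chunk
            return np.linalg.solve(a, b)

        def top_down(problems, workers=4):
            # Top-down: split independent chunks across worker processes;
            # little or no communication between them is needed.
            with Pool(workers) as pool:
                return pool.map(spatial_task, problems)

        def bottom_up(a, b):
            # Bottom-up: user code stays single-threaded; a multi-threaded
            # BLAS/LAPACK beneath numpy parallelizes the solve itself.
            return np.linalg.solve(a, b)

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            probs = [(rng.standard_normal((500, 500)), rng.standard_normal(500))
                     for _ in range(8)]
            top_down(probs)
            bottom_up(*probs[0])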

    libNMF -- A Library for Nonnegative Matrix Factorization

    We present libNMF -- a computationally efficient, high-performance library for computing nonnegative matrix factorizations (NMF), written in C. Various algorithms and algorithmic variants for computing NMF are supported. libNMF is based on external routines from BLAS (Basic Linear Algebra Subprograms), LAPACK (Linear Algebra PACKage) and ARPACK, which provide efficient building blocks for performing central vector and matrix operations. Since modern BLAS implementations support multi-threading, libNMF can exploit the potential of multi-core architectures. In this paper, the basic NMF algorithms contained in libNMF and existing implementations found in the literature are briefly reviewed. Then, libNMF is evaluated in terms of computational efficiency and numerical accuracy and compared with the best existing codes available. libNMF is publicly available at http://rlcta.univie.ac.at/software
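
    As an illustration of the kind of algorithm such a library implements, below is a hedged Python sketch of the classic Lee-Seung multiplicative-update NMF; libNMF itself is written in C and supports further algorithmic variants, so this is a conceptual stand-in rather than its actual code. Note how every expensive step is a dense matrix product, which is exactly where BLAS multi-threading pays off.

        import numpy as np

        def nmf_multiplicative(V, k, iters=200, eps=1e-9, seed=0):
            # Lee-Seung multiplicative updates for V ~ W @ H with W, H >= 0.
            # Every expensive step is a dense matrix product, so a
            # multi-threaded BLAS accelerates this code without changes.
            rng = np.random.default_rng(seed)
            m, n = V.shape
            W = rng.random((m, k)) + eps
            H = rng.random((k, n)) + eps
            for _ in range(iters):
                H *= (W.T @ V) / (W.T @ W @ H + eps)
                W *= (V @ H.T) / (W @ (H @ H.T) + eps)
            return W, H

        # Usage: factor a random nonnegative matrix and check the fit.
        V = np.random.default_rng(1).random((100, 80))
        W, H = nmf_multiplicative(V, k=10)
        print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))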

    Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

    We present a method for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. By using a quadtree matrix representation, data locality is exploited without prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement due to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments. Matrices are represented by sparse quadtrees of chunk objects. The leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. Where graphics processing units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, thus making use of the full computing capacity of each node. The performance is evaluated for matrices with different sparsity structures, including examples from electronic structure calculations. Compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests.
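
    The quadtree idea can be sketched compactly: a node is either an all-zero block, a dense leaf submatrix, or four children, and multiplication recurses on the children while skipping zero blocks. The serial Python below (assuming square matrices whose size is the leaf size times a power of two) only illustrates the representation; the paper's implementation distributes chunks and tasks across cluster nodes and offloads leaf multiplications to GPUs.

        import numpy as np

        LEAF = 64  # leaf block size; matrices assumed square, size LEAF * 2**k

        class QTree:
            # A node is None (all-zero block), a dense leaf, or four children
            # in the order top-left, top-right, bottom-left, bottom-right.
            def __init__(self, kids=None, leaf=None):
                self.kids, self.leaf = kids, leaf

        def build(A):
            if not A.any():
                return None                    # sparsity detected at any level
            n = A.shape[0]
            if n <= LEAF:
                return QTree(leaf=A.copy())    # leaf submatrix
            h = n // 2
            return QTree(kids=[build(A[:h, :h]), build(A[:h, h:]),
                               build(A[h:, :h]), build(A[h:, h:])])

        def add(X, Y):
            if X is None: return Y
            if Y is None: return X
            if X.leaf is not None:
                return QTree(leaf=X.leaf + Y.leaf)
            return QTree(kids=[add(x, y) for x, y in zip(X.kids, Y.kids)])

        def mul(X, Y):
            if X is None or Y is None:
                return None                    # zero blocks cost no work at all
            if X.leaf is not None:
                return QTree(leaf=X.leaf @ Y.leaf)  # leaf-level work (CPU/GPU)
            a, b, c, d = X.kids
            e, f, g, h = Y.kids
            return QTree(kids=[add(mul(a, e), mul(b, g)), add(mul(a, f), mul(b, h)),
                               add(mul(c, e), mul(d, g)), add(mul(c, f), mul(d, h))])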

    Computation- and Space-Efficient Implementation of SSA

    The computational complexity of the different steps of basic SSA is discussed. It is shown that the use of general-purpose "blackbox" routines (e.g. those found in packages like LAPACK) leads to a huge waste of time resources, since the special Hankel structure of the trajectory matrix is not taken into account. We outline several state-of-the-art algorithms (for example, Lanczos-based truncated SVD) which can be modified to exploit the structure of the trajectory matrix. The key components here are the Hankel matrix-vector multiplication and the hankelization operator. We show that both can be computed efficiently by means of the Fast Fourier Transform. The use of these methods yields a reduction of the worst-case computational complexity from O(N^3) to O(k N log(N)), where N is the series length and k is the number of desired eigentriples.
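
    The Hankel matrix-vector product is the piece most easily shown in code. With H[i, j] = x[i + j], the product (Hv)[i] = sum_j x[i+j] v[j] is a correlation, so one FFT-based convolution of the series with the reversed vector yields the whole product in O(N log N). A minimal Python sketch, checked against the explicit matrix:

        import numpy as np

        def hankel_matvec(x, v, L):
            # Compute H @ v without forming H, where H[i, j] = x[i + j] is the
            # L x K trajectory (Hankel) matrix and K = len(x) - L + 1.  Since
            # (Hv)[i] = sum_j x[i+j] v[j] is a correlation, one FFT convolution
            # of x with the reversed v gives the product in O(N log N).
            N, K = len(x), len(x) - L + 1
            assert len(v) == K
            n = N + K - 1                      # linear-convolution length
            conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(v[::-1], n), n)
            return conv[K - 1 : K - 1 + L]

        # Sanity check against the explicit Hankel matrix.
        x = np.arange(10.0); L = 4; K = len(x) - L + 1
        H = np.array([[x[i + j] for j in range(K)] for i in range(L)])
        v = np.random.default_rng(1).random(K)
        assert np.allclose(H @ v, hankel_matvec(x, v, L))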

    Controlling the level of sparsity in MPC

    In optimization routines used for on-line Model Predictive Control (MPC), linear systems of equations are usually solved in each iteration. This is true for Active Set (AS) methods as well as for Interior Point (IP) methods, and for linear MPC as well as for nonlinear and hybrid MPC. The main computational effort is spent solving these linear systems of equations, and hence it is of great interest to solve them efficiently. Classically, the optimization problem has been formulated in either of two ways: one leading to a sparse linear system of equations involving relatively many variables to solve in each iteration, and the other leading to a dense linear system of equations involving relatively few variables. In this work, it is shown that these two formulations are not the only possible choices: an entire family of formulations with different levels of sparsity and numbers of variables can be created, and this extra degree of freedom can be exploited to obtain even better performance with the software and hardware at hand. This result also gives a better answer to an often discussed question in MPC: should the sparse or the dense formulation be used? It is shown that often neither of these classical choices is the best one, and that a better choice with a different level of sparsity can in fact be found.
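
    The dense end of that spectrum is the classical condensing step, sketched below in Python for dynamics x_{k+1} = A x_k + B u_k: eliminating all states gives X = Phi x0 + Gamma U with few variables but dense blocks, whereas keeping every state as a variable gives a large banded system. The paper's family of formulations sits between these endpoints (condensing in blocks of steps); this sketch shows only the familiar dense endpoint, not the paper's method.

        import numpy as np

        def condense(A, B, N):
            # Dense ("condensed") MPC formulation: eliminate the states from
            # x_{k+1} = A x_k + B u_k over an N-step horizon, giving
            # X = Phi @ x0 + Gamma @ U with few variables but dense blocks.
            nx, nu = B.shape
            Phi = np.vstack([np.linalg.matrix_power(A, k + 1) for k in range(N)])
            Gamma = np.zeros((N * nx, N * nu))
            for i in range(N):
                for j in range(i + 1):
                    Gamma[i*nx:(i+1)*nx, j*nu:(j+1)*nu] = \
                        np.linalg.matrix_power(A, i - j) @ B
            return Phi, Gamma

        # Check against direct simulation of the dynamics.
        rng = np.random.default_rng(0)
        A = rng.random((3, 3)) * 0.5; B = rng.random((3, 1)); N = 5
        x0 = rng.random(3); U = rng.random(N)
        Phi, Gamma = condense(A, B, N)
        x, xs = x0, []
        for k in range(N):
            x = A @ x + B @ U[k:k+1]
            xs.append(x)
        assert np.allclose(np.concatenate(xs), Phi @ x0 + Gamma @ U)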

    Enumeration of 2-Polymatroids on up to Seven Elements

    A theory of single-element extensions of integer polymatroids, analogous to that of matroids, is developed. We present an algorithm to generate a catalog of 2-polymatroids, up to isomorphism. When we implemented this algorithm on a computer, obtaining all 2-polymatroids on at most seven elements, we discovered the surprising fact that the number of 2-polymatroids on seven elements fails to be unimodal in rank.
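
    For very small ground sets the definition can be checked by brute force, which makes the objects concrete: a 2-polymatroid is an integer-valued rank function that is zero on the empty set, monotone, submodular, and at most 2 on every singleton. The Python below enumerates labeled rank functions this way; it is only illustrative and is emphatically not the paper's single-element-extension algorithm, which works up to isomorphism and scales to seven elements.

        from itertools import combinations, product

        def two_polymatroids(n):
            # Brute-force the rank functions of labeled 2-polymatroids on
            # {0,...,n-1}: f(empty) = 0, f monotone, f submodular, and
            # f({e}) <= 2 for every element e.  Feasible only for tiny n.
            E = range(n)
            subsets = [frozenset(s) for k in range(n + 1)
                       for s in combinations(E, k)]
            found = []
            for vals in product(range(2 * n + 1), repeat=len(subsets) - 1):
                f = {subsets[0]: 0, **dict(zip(subsets[1:], vals))}
                if any(f[frozenset({e})] > 2 for e in E):
                    continue
                if all(f[X] <= f[Y] for X in subsets for Y in subsets if X <= Y) \
                   and all(f[X | Y] + f[X & Y] <= f[X] + f[Y]
                           for X in subsets for Y in subsets):
                    found.append(f)
            return found

        print(len(two_polymatroids(1)))  # 3: the lone element has rank 0, 1 or 2
        print(len(two_polymatroids(2)))  # labeled count on two elements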

    Autotuning multigrid with PetaBricks

    Algorithmic choice is essential in any problem domain to realizing optimal computational performance. Multigrid is a prime example: not only is it possible to make choices at the highest grid resolution, but a program can switch techniques as the problem is recursively attacked on coarser grid levels to take advantage of algorithms with different scaling behaviors. Additionally, users with different convergence criteria must experiment with parameters to yield a tuned algorithm that meets their accuracy requirements. Even after a tuned algorithm has been found, users often have to start all over when migrating from one machine to another.

    We present an algorithm and autotuning methodology that address these issues in a near-optimal and efficient manner. The freedom to independently tune both the algorithm and the number of iterations at each recursion level results in an exponential search space of tuned algorithms with different accuracies and performances. To search this space efficiently, our autotuner utilizes a novel dynamic programming method to build efficient tuned algorithms from the bottom up. The results are customized multigrid algorithms that invest targeted computational power to yield the accuracy required by the user.

    The techniques we describe allow the user to automatically generate tuned multigrid cycles of different shapes targeted to the user's specific combination of problem, hardware, and accuracy requirements. These cycle shapes dictate the order in which grid coarsening and grid refinement are interleaved with both iterative methods, such as Jacobi or Successive Over-Relaxation, and direct methods, which tend to have superior performance for small problem sizes. The need to choose among all of these methods brings the issue of variable accuracy to the forefront: not only must the autotuning framework compare different possible multigrid cycle shapes against each other, but it also needs the ability to compare tuned cycles against both direct and (non-multigrid) iterative methods. We address this problem by using an accuracy metric to measure the effectiveness of tuned cycle shapes and making comparisons across all algorithmic types based on this common yardstick.

    In our results, we find that the flexibility to trade performance against accuracy at all levels of recursive computation enables us to achieve excellent performance on a variety of platforms compared to algorithmically static implementations of multigrid. Our implementation uses PetaBricks, an implicitly parallel programming language where algorithmic choices are exposed in the language. The PetaBricks compiler uses these choices to analyze, autotune, and verify the PetaBricks program. These language features, most notably the autotuner, were key in enabling our implementation to be clear, correct, and fast.
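
    The per-level algorithmic choice at the heart of this work can be illustrated with a toy 1-D Poisson multigrid solver in Python, where a "plan" lists the smoother and sweep count for each recursion level; an autotuner would search over such plans, whereas PetaBricks expresses and tunes the choices in the language itself. Everything here (grid, smoothers, transfer operators) is a deliberately minimal sketch, not the paper's implementation.

        import numpy as np

        def smooth(u, f, h, method, sweeps, omega=2/3):
            # One choice point: weighted-Jacobi or SOR smoothing for the
            # 1-D Poisson problem -u'' = f on a grid with spacing h.
            for _ in range(sweeps):
                if method == "jacobi":
                    u[1:-1] = (1 - omega) * u[1:-1] + omega * 0.5 * (
                        u[:-2] + u[2:] + h * h * f[1:-1])
                else:  # "sor"
                    for i in range(1, len(u) - 1):
                        gs = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
                        u[i] += omega * (gs - u[i])
            return u

        def v_cycle(u, f, h, plan):
            # `plan` lists a (smoother, sweeps) choice per level -- the kind
            # of per-level decision an autotuner would search over.
            if len(u) <= 3:                    # coarsest grid: direct solve
                u[1] = 0.5 * (u[0] + u[2] + h * h * f[1])
                return u
            method, sweeps = plan[0] if plan else ("jacobi", 3)
            u = smooth(u, f, h, method, sweeps)
            r = np.zeros_like(u)               # residual r = f - A u
            r[1:-1] = f[1:-1] + (u[:-2] - 2 * u[1:-1] + u[2:]) / (h * h)
            rc = r[::2].copy()                 # full-weighting restriction
            rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
            ec = v_cycle(np.zeros_like(rc), rc, 2 * h, plan[1:])
            u += np.interp(np.arange(len(u)), np.arange(0, len(u), 2), ec)
            return smooth(u, f, h, method, sweeps)

        # Solve -u'' = pi^2 sin(pi x), u(0) = u(1) = 0; exact u = sin(pi x).
        n = 65; h = 1 / (n - 1); x = np.linspace(0, 1, n)
        f = np.pi ** 2 * np.sin(np.pi * x)
        u = np.zeros(n)
        for _ in range(10):
            u = v_cycle(u, f, h, [("jacobi", 3), ("sor", 2)] * 3)
        print(np.max(np.abs(u - np.sin(np.pi * x))))  # ~ O(h^2) error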

    Maintaining High Performance Across All Problem Sizes and Parallel Scales Using Microkernel-based Linear Algebra

    Linear algebra underlies a large proportion of computational problems. With the continuous increase of scale on modern hardware, the performance of small-sized linear algebra has become increasingly important. To overcome the shortcomings of conventional approaches, we employ a new approach using a microkernel framework provided by ATLAS to improve the performance of a few linear algebra routines for all problem sizes.

    Our initial research consists of improving the performance of parallel LU factorization in ATLAS, for which we achieved speedups of up to 2.07x and 2.66x for small problems, and up to 91% and 87% of theoretical peak performance for asymptotic problems, on a 12-core Intel Xeon and a 32-core AMD Opteron machine, respectively, outperforming all the state-of-the-art libraries at the time. Such performance was achieved via an exhaustive search of all the tuning parameters, which could take days. This motivated us to develop a computational model for our LU factorization that could predict those parameters by combining basic empirical timings with a theoretical model based on the amount of required computation. While our model provided good predictions for mid-to-asymptotic-sized problems, there were some unknown factors for small problems that could possibly be answered by extending the ATLAS tuning framework. While this extension is underway, we decided to pursue the model research using a simpler serial BLAS-based approach.

    We investigated and implemented two Level-3 BLAS routines, TRSM and TRMM, which are widely used, primarily by LAPACK operations such as the aforementioned LU factorization. With the microkernel-based approach, we improved the performance of both routines by up to 15% and 73% for square and fat problems, respectively, over prior ATLAS implementations on modern hardware. Finally, in collaborative research with ARM Inc., we improved the performance of the most important Level-3 BLAS operation, GEMM, in ATLAS by up to 53% by implementing microkernels for two 64-bit ARM architectures. This automatically improves other BLAS and LAPACK routines that rely on GEMM for high performance.
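
    The reason a fast GEMM lifts the rest of the stack is easy to see in a blocked TRSM: all but a small triangular solve per diagonal block turns into GEMM updates. A hedged Python/NumPy sketch (block size and matrix sizes are illustrative; a real library would call tuned microkernels rather than NumPy):

        import numpy as np
        from scipy.linalg import solve_triangular

        NB = 64  # block size; a tuned library matches this to its microkernel

        def trsm_blocked(L, B):
            # Solve L @ X = B with L lower triangular.  Only the small
            # diagonal blocks need a true triangular solve; the bulk of the
            # flops become GEMM updates, which is why a fast GEMM microkernel
            # lifts TRSM (and the LAPACK routines built on it, such as LU).
            m = L.shape[0]
            X = B.copy()
            for i in range(0, m, NB):
                j = min(i + NB, m)
                X[i:j] = solve_triangular(L[i:j, i:j], X[i:j], lower=True)
                if j < m:
                    X[j:] -= L[j:, i:j] @ X[i:j]   # the GEMM update
            return X

        # Sanity check against the library solve.
        rng = np.random.default_rng(0)
        L = np.tril(rng.random((200, 200))) + 200 * np.eye(200)
        B = rng.random((200, 5))
        assert np.allclose(trsm_blocked(L, B), solve_triangular(L, B, lower=True))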