45 research outputs found

    Parallel algorithms for two processors precedence constraint scheduling

    Get PDF
    The final publication is available at link.springer.com. Peer reviewed. Postprint (author's final draft).

    A Block Minorization--Maximization Algorithm for Heteroscedastic Regression

    Full text link
    The computation of the maximum likelihood (ML) estimator for heteroscedastic regression models is considered. The traditional Newton algorithms for the problem require matrix multiplications and inversions, which are bottlenecks in modern Big Data contexts. A new Big Data-appropriate minorization--maximization (MM) algorithm is proposed for the computation of the ML estimator. The MM algorithm is proved to generate monotonically increasing sequences of likelihood values and to converge to a stationary point of the log-likelihood function. A distributed and parallel implementation of the MM algorithm is presented, and the MM algorithm is shown to have a different time complexity from the Newton algorithm. Simulation studies demonstrate that the MM algorithm improves upon the computation time of the Newton algorithm in some practical scenarios where the number of observations is large.
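
    The core of any MM scheme is its ascent guarantee: each iteration maximizes a surrogate that minorizes the log-likelihood and touches it at the current iterate, so the likelihood sequence can never decrease. Below is a minimal Python sketch of a generic MM loop, demonstrated on least-absolute-deviations regression rather than the paper's heteroscedastic model, and retaining a small linear solve that the paper's algorithm is specifically designed to avoid; all names are illustrative.

        import numpy as np

        def mm_maximize(loglik, surrogate_max, theta, tol=1e-10, max_iter=500):
            # Generic MM ascent: surrogate_max(theta) maximizes a surrogate
            # that minorizes loglik and touches it at theta, so the
            # log-likelihood sequence is monotonically non-decreasing.
            ll = loglik(theta)
            for _ in range(max_iter):
                theta = surrogate_max(theta)
                ll_new = loglik(theta)
                assert ll_new >= ll - 1e-8   # the MM monotonicity guarantee
                if ll_new - ll < tol:
                    break
                ll = ll_new
            return theta

        # Toy demonstration: LAD regression, where |r| is minorized by a
        # quadratic, so each MM step is a weighted least-squares update.
        rng = np.random.default_rng(0)
        X = np.column_stack([np.ones(500), rng.normal(size=500)])
        y = X @ np.array([1.0, 2.0]) + rng.laplace(size=500)

        def loglik(b):
            return -np.abs(y - X @ b).sum()   # Laplace log-likelihood, up to constants

        def surrogate_max(b):
            w = 1.0 / np.maximum(np.abs(y - X @ b), 1e-8)   # minorizer weights
            Xw = X * w[:, None]
            return np.linalg.solve(X.T @ Xw, Xw.T @ y)

        print(mm_maximize(loglik, surrogate_max, np.zeros(2)))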

    FooPar: A Functional Object Oriented Parallel Framework in Scala

    Full text link
    We present FooPar, an extension for highly efficient parallel computing in the multi-paradigm programming language Scala. Scala offers concise and clean syntax and integrates functional programming features. Our framework FooPar combines these features with parallel computing techniques. FooPar has a modular design and supports easy access to different communication backends for distributed-memory architectures as well as to high-performance math libraries. In this article we use it to parallelize matrix-matrix multiplication and show its scalability by an isoefficiency analysis. In addition, results based on an empirical analysis on two supercomputers are given. We achieve close-to-optimal performance with respect to theoretical peak performance. Based on this result we conclude that FooPar allows full access to Scala's design features without suffering performance drops compared to implementations purely based on C and MPI.
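
    FooPar itself targets distributed-memory backends; as a language-neutral illustration of why matrix-matrix multiplication parallelizes so cleanly, here is a hedged Python sketch of block-decomposed multiplication in which every block of the result is an independent task (thread-based purely for brevity; the grid size and function names are assumptions, not FooPar's API).

        import numpy as np
        from concurrent.futures import ThreadPoolExecutor

        def blocked_matmul(A, B, grid=4):
            # Partition C into a grid x grid block layout; block (i, j) of C
            # depends only on row-block i of A and column-block j of B, so
            # all grid*grid tasks are independent and can run in parallel.
            # This sketch assumes the matrix size is divisible by grid.
            n = A.shape[0]
            s = n // grid
            C = np.empty((n, B.shape[1]))

            def task(i, j):   # each task writes a disjoint block of C
                C[i*s:(i+1)*s, j*s:(j+1)*s] = A[i*s:(i+1)*s, :] @ B[:, j*s:(j+1)*s]

            with ThreadPoolExecutor() as pool:
                futures = [pool.submit(task, i, j)
                           for i in range(grid) for j in range(grid)]
                for f in futures:
                    f.result()
            return C

        A = np.random.rand(512, 512)
        B = np.random.rand(512, 512)
        assert np.allclose(blocked_matmul(A, B), A @ B)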

    Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI

    Full text link
    Matrix-matrix multiplication is a basic operation in linear algebra and an essential building block for a wide range of algorithms in various scientific fields. Theory and implementation for the dense, square matrix case are well-developed. If matrices are sparse, with application-specific sparsity patterns, the optimal implementation remains an open question. Here, we explore the performance of communication-reducing 2.5D algorithms and one-sided MPI communication in the context of linear scaling electronic structure theory. In particular, we extend the DBCSR sparse matrix library, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K. The library is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation. Here, we compare the performance of the original implementation based on Cannon's algorithm and MPI point-to-point communication with an implementation based on MPI one-sided communication (RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and auxiliary operations for reduced communication, which can lead to a speedup if communication is dominant. The 2.5D algorithm is somewhat easier to implement with one-sided communication. A detailed description of the implementation is provided, also for non-ideal processor topologies, since this is important for actual applications. Given the importance of the precise sparsity pattern, and even of the actual matrix data, which decides the effective fill-in upon multiplication, the tests are performed within the CP2K package with application benchmarks. Results show a substantial boost in performance for the RMA-based 2.5D algorithm, up to 1.80x, which is observed to increase with the number of involved processes in the parallelization. Comment: In Proceedings of PASC '17, Lugano, Switzerland, June 26-28, 2017; 10 pages, 4 figures.
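
    For context, the original DBCSR scheme follows Cannon's algorithm: blocks of A and B are skewed across a q x q process grid and then circularly shifted, with one local multiply per step; the 2.5D variant additionally replicates the grid along a third dimension to reduce communication. A serial Python simulation of the 2D pattern (dense blocks for simplicity; not code from the library) might look like this:

        import numpy as np

        def cannon_matmul(A, B, q):
            # Serial simulation of Cannon's algorithm on a q x q grid.
            # This sketch assumes the matrix size is divisible by q.
            n = A.shape[0]
            s = n // q
            blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s].copy()
            # Initial skew: shift row i of A left by i, column j of B up by j.
            Ab = [[blk(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
            Bb = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
            Cb = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
            for _ in range(q):                     # q multiply-shift steps
                for i in range(q):
                    for j in range(q):
                        Cb[i][j] += Ab[i][j] @ Bb[i][j]
                # Circular shifts: A one step left, B one step up.
                Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
                Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
            return np.block(Cb)

        A = np.random.rand(6, 6)
        B = np.random.rand(6, 6)
        assert np.allclose(cannon_matmul(A, B, 3), A @ B)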

    Parallel algorithms for boundary value problems

    Get PDF
    A general approach to solving boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step, where all P available processors work in parallel, and the global step, where one processor solves a tridiagonal linear system of order P. The main advantages of this approach are twofold. First, the approach is very flexible, especially in the local step, so the algorithm can be used with any number of processors and with any SIMD or MIMD machine. Second, the communication complexity is very small, so the approach can be used just as easily with shared-memory machines. Several examples of using this strategy are discussed.
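
    The global step amounts to solving a single tridiagonal system of order P, which is cheap: the standard Thomas algorithm does it in O(P) operations rather than the O(P^3) of a general dense solve. A Python sketch (not taken from the paper; names are illustrative) follows:

        import numpy as np

        def thomas(a, b, c, d):
            # Solve a tridiagonal system: a = sub-diagonal (length n-1),
            # b = diagonal (length n), c = super-diagonal (length n-1),
            # d = right-hand side. Forward sweep, then back substitution.
            n = len(b)
            cp, dp = np.empty(n - 1), np.empty(n)
            cp[0] = c[0] / b[0]
            dp[0] = d[0] / b[0]
            for i in range(1, n):
                m = b[i] - a[i - 1] * cp[i - 1]
                if i < n - 1:
                    cp[i] = c[i] / m
                dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
            x = np.empty(n)
            x[-1] = dp[-1]
            for i in range(n - 2, -1, -1):
                x[i] = dp[i] - cp[i] * x[i + 1]
            return x

        # Tiny check against a dense solve.
        n = 6
        a, c = np.random.rand(n - 1), np.random.rand(n - 1)
        b = np.random.rand(n) + 2.0   # keep the system well-conditioned
        d = np.random.rand(n)
        T = np.diag(b) + np.diag(a, -1) + np.diag(c, 1)
        assert np.allclose(thomas(a, b, c, d), np.linalg.solve(T, d))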

    Distributed Evaluation of an Iterative Function for All Object Pairs on a SIMD Hypercube

    Get PDF
    An efficient distributed algorithm for evaluating an iterative function on all pairwise combinations of C objects on a SIMD hypercube is presented. The algorithm achieves uniform load distribution and minimal, completely local interprocessor communication.
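
    The load-distribution idea can be illustrated with the classic round-robin ("circle") schedule, which partitions all C(C-1)/2 pairs into C-1 rounds of disjoint pairs, so the pairs in each round can be evaluated simultaneously. The Python sketch below shows the schedule only; the paper's hypercube embedding and communication scheme are not reproduced here.

        def round_robin_pairs(n):
            # Partition all n*(n-1)/2 pairs of n objects (n even) into
            # n-1 rounds of n/2 disjoint pairs: object 0 stays fixed while
            # the remaining objects rotate one position per round.
            objs = list(range(n))
            rounds = []
            for _ in range(n - 1):
                rounds.append([(objs[i], objs[n - 1 - i]) for i in range(n // 2)])
                objs = [objs[0]] + [objs[-1]] + objs[1:-1]
            return rounds

        rounds = round_robin_pairs(8)
        # Every unordered pair appears exactly once across the rounds.
        assert len({frozenset(p) for r in rounds for p in r}) == 8 * 7 // 2
        # Each round holds n/2 independent pairs, giving uniform load per step.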

    Parallelizing algorithms in Ada on Clementina II: Face recognition system

    Get PDF
    In the Laboratory of Research and Development on Computer Science of the National University of La Plata, a face recognition system has been developed. This article describes a series of tests based on parallel processing, aimed at optimizing the response times of that system, which was developed in the Ada programming language on the SGI Origin 2000 parallel architecture known as Clementina II. The results obtained are then analyzed. Track: Concurrent Programming. Red de Universidades con Carreras en Informática (RedUNCI).

    Singular value decomposition on SIMD hypercube and shuffle-exchange computers

    Get PDF
    This paper reports several parallel singular value decomposition (SVD) algorithms on hypercube and shuffle-exchange SIMD computers. Unlike previously published hypercube SVD algorithms, which map a column pair of a matrix onto a processor, the algorithms presented in this paper map a matrix column pair onto a column of processors. In this way, a further reduction in time complexity is achieved. The paper also introduces the concept of two-dimensional shuffle-exchange networks, and corresponding SVD algorithms for one-dimensional and two-dimensional shuffle-exchange computers are developed.
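
    Algorithms in this family are typically built on one-sided Jacobi rotations, which orthogonalize one column pair at a time; that is what makes the column-pair-to-processor mapping natural, since the pairs in a sweep can be processed concurrently. A serial Python sketch of one-sided Jacobi SVD (an assumption about the underlying kernel, not code from the paper) follows:

        import numpy as np

        def one_sided_jacobi_svd(A, tol=1e-12, sweeps=30):
            # Rotate column pairs of A until all pairs are orthogonal;
            # the singular values are then the column norms.
            A = A.astype(float).copy()
            n = A.shape[1]
            V = np.eye(n)
            for _ in range(sweeps):
                converged = True
                for i in range(n - 1):
                    for j in range(i + 1, n):   # each pair is independent work
                        ai, aj = A[:, i], A[:, j]
                        alpha, beta, gamma = ai @ ai, aj @ aj, ai @ aj
                        if abs(gamma) <= tol * np.sqrt(alpha * beta):
                            continue
                        converged = False
                        zeta = (beta - alpha) / (2.0 * gamma)
                        t = (1.0 if zeta >= 0 else -1.0) / (abs(zeta) + np.hypot(1.0, zeta))
                        c = 1.0 / np.hypot(1.0, t)
                        s = c * t
                        G = np.array([[c, s], [-s, c]])
                        A[:, [i, j]] = A[:, [i, j]] @ G   # orthogonalize the pair
                        V[:, [i, j]] = V[:, [i, j]] @ G   # accumulate rotations
                if converged:
                    break
            sigma = np.linalg.norm(A, axis=0)
            return A / sigma, sigma, V   # U, singular values, V

        M = np.random.rand(5, 4)
        U, s, V = one_sided_jacobi_svd(M)
        assert np.allclose((U * s) @ V.T, M)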

    Multinode broadcast in hypercubes and rings with randomly distributed length of packets

    Get PDF
    By Emmanouel A. Varvarigos and Dimitri P. Bertsekas. Includes bibliographical references (p. 19-20). Cover title. Research supported by the NSF (NSF-DDM-8903385) and the ARO (DAAL03-86-K-0171).