45 research outputs found
Parallel algorithms for two processors precedence constraint scheduling
The final publication is available at link.springer.com. Peer reviewed. Postprint (author's final draft).
A Block Minorization--Maximization Algorithm for Heteroscedastic Regression
The computation of the maximum likelihood (ML) estimator for heteroscedastic
regression models is considered. Traditional Newton algorithms for this
problem require matrix multiplications and inversions, which are bottlenecks in
modern Big Data contexts. A new Big Data-appropriate minorization--maximization
(MM) algorithm is proposed for computing the ML estimator. The MM
algorithm is proved to generate monotonically increasing sequences of
likelihood values and to converge to a stationary point of the
log-likelihood function. A distributed and parallel implementation of the MM
algorithm is presented, and the MM algorithm is shown to have a different time
complexity from the Newton algorithm. Simulation studies demonstrate that the MM
algorithm improves upon the computation time of the Newton algorithm in some
practical scenarios where the number of observations is large.
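The monotone-ascent property described above can be illustrated with a toy block scheme. This is not the paper's actual MM minorizer: for a simple two-group heteroscedastic model, alternately maximizing the log-likelihood over the regression coefficients (weighted least squares) and over the group variances yields a nondecreasing likelihood sequence, the same guarantee the abstract states for the MM algorithm. All names and the simulated data are illustrative.

```python
import numpy as np

# simulated data: linear model with two noise-variance groups (illustrative)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
groups = rng.integers(0, 2, size=n)          # which variance group each row is in
sig_true = np.where(groups == 0, 0.5, 2.0)   # heteroscedastic noise scales
y = X @ beta_true + rng.normal(scale=sig_true)

def loglik(beta, sig2):
    """Gaussian log-likelihood with per-group variances sig2[0], sig2[1]."""
    r = y - X @ beta
    v = sig2[groups]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + r**2 / v)

beta = np.zeros(p)
sig2 = np.ones(2)
lls = []
for _ in range(30):
    # beta block: weighted least squares, the exact maximizer given sig2
    w = 1.0 / sig2[groups]
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    # variance block: group-wise mean squared residual, exact maximizer given beta
    r2 = (y - X @ beta) ** 2
    sig2 = np.array([r2[groups == g].mean() for g in (0, 1)])
    lls.append(loglik(beta, sig2))

# each block is maximized exactly, so the likelihood never decreases
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

Because each block update is an exact maximizer (rather than a minorizer), the scheme avoids the large matrix inversions only in the variance block; the paper's MM construction goes further than this sketch.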
FooPar: A Functional Object Oriented Parallel Framework in Scala
We present FooPar, an extension for highly efficient parallel computing in
the multi-paradigm programming language Scala. Scala offers concise and clean
syntax and integrates functional programming features. Our framework FooPar
combines these features with parallel computing techniques. FooPar is designed
to be modular and supports easy access to different communication backends for
distributed-memory architectures as well as to high-performance math libraries. In
this article we use it to parallelize matrix-matrix multiplication and show its
scalability through an isoefficiency analysis. In addition, results of an
empirical analysis on two supercomputers are given. We achieve close-to-optimal
performance with respect to theoretical peak performance. Based on this result we
conclude that FooPar gives full access to Scala's design features without
suffering performance drops compared to implementations purely based on C and
MPI.
Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI
Matrix-matrix multiplication is a basic operation in linear algebra and an
essential building block for a wide range of algorithms in various scientific
fields. Theory and implementation for the dense, square matrix case are
well-developed. If matrices are sparse, with application-specific sparsity
patterns, the optimal implementation remains an open question. Here, we explore
the performance of communication-reducing 2.5D algorithms and one-sided MPI
communication in the context of linear scaling electronic structure theory. In
particular, we extend the DBCSR sparse matrix library, which is the basic
building block for linear scaling electronic structure theory and low scaling
correlated methods in CP2K. The library is specifically designed to efficiently
perform block-sparse matrix-matrix multiplication of matrices with a relatively
large occupation. Here, we compare the performance of the original
implementation based on Cannon's algorithm and MPI point-to-point
communication, with an implementation based on MPI one-sided communications
(RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and
auxiliary operations for reduced communication, which can lead to a speedup if
communication is dominant. The 2.5D algorithm is somewhat easier to implement
with one-sided communications. A detailed description of the implementation is
provided, also for non-ideal processor topologies, since this is important for
actual applications. Given the importance of the precise sparsity pattern, and
even of the actual matrix data, which determines the effective fill-in upon
multiplication, the tests are performed within the CP2K package with
application benchmarks. Results show a substantial boost in performance for the
RMA-based 2.5D algorithm, up to 1.80x, which is observed to increase with the
number of processes involved in the parallelization.
In Proceedings of PASC '17, Lugano, Switzerland, June 26-28, 2017, 10 pages, 4 figures.
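The Cannon baseline that the RMA/2.5D variants are compared against can be simulated serially: blocks are skewed once and then circulate by nearest-neighbour shifts, with one local multiply-accumulate per step. This numpy sketch is our own illustration of that communication pattern, not DBCSR code.

```python
import numpy as np

def cannon_matmul(A, B, p=2):
    """Serially simulate Cannon's algorithm on a p x p process grid."""
    # split both operands into a p x p grid of blocks
    Ab = [np.hsplit(r, p) for r in np.vsplit(A, p)]
    Bb = [np.hsplit(r, p) for r in np.vsplit(B, p)]
    # initial skew: row i of A shifted left by i, column j of B shifted up by j
    Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]
    Cb = [[np.zeros((A.shape[0] // p, B.shape[1] // p)) for _ in range(p)]
          for _ in range(p)]
    for _ in range(p):
        # local multiply-accumulate on every simulated "process"
        for i in range(p):
            for j in range(p):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]
        # shift A blocks left by one, B blocks up by one (nearest-neighbour comm)
        Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return np.block(Cb)

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 6))
B = rng.normal(size=(6, 6))
assert np.allclose(cannon_matmul(A, B, p=3), A @ B)
```

The 2.5D variant discussed in the abstract replicates this layout across c stacked grids, trading the extra memory for fewer shift rounds per grid.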
Parallel algorithms for boundary value problems
A general approach to solving boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: a local step, in which all P available processors work in parallel, and a global step, in which one processor solves a tridiagonal linear system of order P. The main advantages of this approach are twofold. First, it is very flexible, especially in the local step, so the algorithm can be used with any number of processors and with any SIMD or MIMD machine. Second, the communication complexity is very small, so the approach can be used just as easily on shared-memory machines. Several examples of using this strategy are discussed.
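The global step above solves a tridiagonal system of order P; a minimal sketch of such a solve is the Thomas algorithm (illustrative, not the paper's code):

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal."""
    n = len(b)
    cp, dp = np.empty(n), np.empty(n)
    # forward elimination
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # back substitution
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# example: a 5x5 diagonally dominant tridiagonal system (P = 5)
n = 5
a = np.full(n, -1.0); a[0] = 0.0      # sub-diagonal (a[0] unused)
b = np.full(n, 4.0)                   # main diagonal
c = np.full(n, -1.0); c[-1] = 0.0     # super-diagonal (c[-1] unused)
d = np.arange(1.0, n + 1)
x = thomas(a, b, c, d)
```

The solve costs O(P) serial work, which is why a single processor handling the global step is not a bottleneck when P is modest.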
Distributed Evaluation of an Iterative Function for All Object Pairs on a SIMD Hypercube
An efficient distributed algorithm for evaluating an iterative function on all pairwise combinations of C objects on a SIMD hypercube is presented. The algorithm achieves uniform load distribution and minimal, completely local interprocessor communication.
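The pairing pattern can be sketched as a ring of shifts, where round k hands each node a visiting copy of object (i + k) mod C via one neighbour-to-neighbour transfer. This is our simplification of the paper's hypercube embedding, and the load balance here is only approximate, not the uniform distribution the paper achieves.

```python
from itertools import combinations

def all_pairs_rounds(objects):
    """Ring-shift schedule: in round k, node i holds its resident object and a
    visiting copy of object (i + k) mod C, obtained by one neighbour shift."""
    C = len(objects)
    evaluated = []
    visiting = list(range(C))                   # visiting[i]: index held at node i
    for _ in range(1, C):
        visiting = visiting[1:] + visiting[:1]  # shift visiting copies by one node
        for i in range(C):
            j = visiting[i]
            if i < j:                           # evaluate each unordered pair once
                evaluated.append((objects[i], objects[j]))
    return evaluated

objs = list("ABCDE")
pairs = all_pairs_rounds(objs)
# every unordered pair is produced exactly once
assert sorted(pairs) == sorted(combinations(objs, 2))
```

Each round involves only one transfer between neighbouring nodes, which mirrors the "completely local communication" property claimed in the abstract.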
Parallelizing algorithms in ada on clementina II : Face recognition system
In the Laboratory of Research and Development on Computer Science of the National University of La Plata, a face recognition system has been developed. This article describes a series of tests based on parallel processing, with the objective of optimizing the response times of that system, which was developed in the Ada programming language on the SGI Origin 2000 parallel architecture known as Clementina II. The results obtained are then analyzed.
Topic: Concurrent programming. Red de Universidades con Carreras en Informática (RedUNCI).
Singular value decomposition on SIMD hypercube and shuffle-exchange computers
This paper reports several parallel singular value decomposition (SVD) algorithms for hypercube and shuffle-exchange SIMD computers. Unlike previously published hypercube SVD algorithms, which map a column pair of a matrix onto a single processor, the algorithms presented in this paper map a matrix column pair onto a column of processors. In this way, a further reduction in time complexity is achieved. The paper also introduces the concept of two-dimensional shuffle-exchange networks, and corresponding SVD algorithms for one-dimensional and two-dimensional shuffle-exchange computers are developed.
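A serial sketch of the column-pair style of SVD algorithm described above is one-sided Jacobi, where the inner unit of work, orthogonalizing one column pair, is exactly what parallel variants distribute over processors. This is an illustrative reconstruction, not the paper's algorithm.

```python
import numpy as np

def jacobi_svd(A, tol=1e-12, sweeps=30):
    """One-sided Jacobi SVD: rotate column pairs until mutually orthogonal.
    Each sweep visits every column pair -- the unit of work that parallel
    schemes assign to processors (or columns of processors)."""
    U = A.astype(float).copy()
    n = U.shape[1]
    for _ in range(sweeps):
        off = 0.0
        for i in range(n - 1):
            for j in range(i + 1, n):
                a = U[:, i] @ U[:, i]
                b = U[:, j] @ U[:, j]
                g = U[:, i] @ U[:, j]
                off = max(off, abs(g))
                if abs(g) < tol:
                    continue
                # Jacobi rotation zeroing the (i, j) entry of U^T U
                zeta = (b - a) / (2.0 * g)
                t = (1.0 if zeta >= 0 else -1.0) / (abs(zeta) + np.hypot(1.0, zeta))
                c = 1.0 / np.hypot(1.0, t)
                s = c * t
                U[:, [i, j]] = U[:, [i, j]] @ np.array([[c, s], [-s, c]])
        if off < tol:
            break
    # once columns are orthogonal, singular values are the column norms
    sigma = np.linalg.norm(U, axis=0)
    return np.sort(sigma)[::-1]

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))
assert np.allclose(jacobi_svd(A), np.linalg.svd(A, compute_uv=False))
```

Because rotations on disjoint column pairs commute, up to n/2 pairs can be processed concurrently per step, which is the parallelism the hypercube and shuffle-exchange mappings exploit.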
Multinode broadcast in hypercubes and rings with randomly distributed length of packets
By Emmanouel A. Varvarigos and Dimitri P. Bertsekas. Includes bibliographical references (p. 19-20). Research supported by the NSF (NSF-DDM-8903385) and the ARO (DAAL03-86-K-0171).