
    A bibliography on parallel and vector numerical algorithms

    This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.

    A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units

    We present a hierarchically blocked one-sided Jacobi algorithm for the singular value decomposition (SVD), targeting both single and multiple graphics processing units (GPUs). The blocking structure reflects the levels of the GPU's memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining high relative accuracy. To this end, we developed a family of parallel pivot strategies for the GPU's shared address space that are also applicable to inter-GPU communication. Unlike common hybrid approaches, in a single-GPU setting our algorithm needs a CPU for control purposes only, while utilizing the GPU's resources to the fullest extent permitted by the hardware. When required by the problem size, the algorithm, in principle, scales to an arbitrary number of GPU nodes. The scalability is demonstrated by a more than twofold speedup for sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single Fermi card. Comment: Accepted for publication in SIAM Journal on Scientific Computing.
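
    As a point of reference, the following is a minimal serial one-sided Jacobi SVD sketch in NumPy. It illustrates only the pairwise column orthogonalization at the core of the method; the hierarchical blocking, parallel pivot strategies, and GPU execution described above are not reproduced, and the function name and tolerance are illustrative.

        import numpy as np

        def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
            """Minimal serial one-sided Jacobi SVD sketch (CPU/NumPy only).
            Columns of A are orthogonalized pairwise; at convergence the
            column norms are the singular values (returned unsorted)."""
            U = np.asarray(A, float).copy()
            n = U.shape[1]
            V = np.eye(n)
            for _ in range(max_sweeps):
                converged = True
                for p in range(n - 1):
                    for q in range(p + 1, n):
                        app = U[:, p] @ U[:, p]
                        aqq = U[:, q] @ U[:, q]
                        apq = U[:, p] @ U[:, q]
                        if abs(apq) > tol * np.sqrt(app * aqq):
                            converged = False
                            # Jacobi rotation that zeroes the (p, q) entry of U.T @ U
                            tau = (aqq - app) / (2.0 * apq)
                            sign = 1.0 if tau >= 0 else -1.0
                            t = sign / (abs(tau) + np.sqrt(1.0 + tau * tau))
                            c = 1.0 / np.sqrt(1.0 + t * t)
                            s = c * t
                            J = np.array([[c, s], [-s, c]])
                            U[:, [p, q]] = U[:, [p, q]] @ J
                            V[:, [p, q]] = V[:, [p, q]] @ J
                if converged:
                    break
            sigma = np.linalg.norm(U, axis=0)   # singular values
            return U / sigma, sigma, V          # left vectors, values, right vectors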

    LU factorization algorithms on distributed-memory multiprocessor architectures

    In this paper, we consider the effect that the data-storage scheme and pivoting scheme have on the efficiency of LU factorization on a distributed-memory multiprocessor. Our presentation focuses on the hypercube architecture, but most of our results are applicable to distributed-memory architectures in general. We restrict our attention to two commonly used storage schemes (storage by rows and by columns) and investigate partial pivoting both by rows and by columns, yielding four factorization algorithms. Our goal is to determine which of these four algorithms admits the most efficient parallel implementation. We analyze factors such as load distribution, pivoting cost, and potential for pipelining. We conclude that, in the absence of loop unrolling, LU factorization with partial pivoting is most efficient when pipelining is used to mask the cost of pivoting. The two schemes that can be pipelined are pivoting by interchanging rows when the coefficient matrix is distributed to the processors by columns, and pivoting by interchanging columns when the matrix is distributed to the processors by rows.
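
    For orientation, a serial sketch of right-looking LU with partial pivoting by row interchanges is given below (NumPy). The comments indicate where the column-wise data distribution and pipelining discussed above would enter; the routine itself is a plain single-process illustration, not the paper's hypercube implementation.

        import numpy as np

        def lu_partial_pivoting(A):
            """LU with partial pivoting by row interchanges (serial sketch).
            In a column-distributed scheme, the pivot search in column k is
            local to the processor owning that column, and the row swap plus
            the update of the remaining columns can be pipelined."""
            A = np.asarray(A, float).copy()
            n = A.shape[0]
            piv = np.arange(n)
            for k in range(n - 1):
                # Pivot search: largest-magnitude entry in column k on/below the diagonal
                p = k + np.argmax(np.abs(A[k:, k]))
                if p != k:
                    A[[k, p], :] = A[[p, k], :]
                    piv[[k, p]] = piv[[p, k]]
                # Multipliers and rank-1 update of the trailing submatrix
                A[k + 1:, k] /= A[k, k]
                A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
            return A, piv   # unit-lower L and U packed in A, plus the row permutation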

    Analysis of a parallelized nonlinear elliptic boundary value problem solver with application to reacting flows

    A parallelized finite difference code based on the Newton method for systems of nonlinear elliptic boundary value problems in two dimensions is analyzed in terms of computational complexity and parallel efficiency. An approximate cost function depending on 15 dimensionless parameters is derived for algorithms based on stripwise and boxwise decompositions of the domain and a one-to-one assignment of the strip or box subdomains to processors. The sensitivity of the cost functions to the parameters is explored in regions of parameter space corresponding to model small-order systems with inexpensive function evaluations, and also to a coupled system of nineteen equations with very expensive function evaluations. The algorithm was implemented on the Intel Hypercube, and some experimental results for the model problems with stripwise decompositions are presented and compared with the theory. In the context of computational combustion problems, multiprocessors of either message-passing or shared-memory type may be employed with stripwise decompositions to realize a speedup of O(n), where n is the mesh resolution in one direction, for reasonable n.
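
    The Newton/finite-difference structure of such a solver can be sketched on a model problem. The code below assumes an illustrative nonlinear Poisson-type equation, -Laplace(u) + u^3 = 1, with homogeneous Dirichlet boundary conditions (not the reacting-flow system of the paper) and assembles everything on a single process; the stripwise decomposition would partition the grid rows among processors.

        import numpy as np

        def newton_nonlinear_poisson(n=20, max_it=20, tol=1e-10):
            """Serial Newton/finite-difference sketch for -Laplace(u) + u**3 = 1
            on the unit square, discretized on an n x n interior grid.
            Dense matrices are used only to keep the illustration short."""
            h = 1.0 / (n + 1)
            N = n * n
            # Assemble the 2-D five-point Laplacian
            L = np.zeros((N, N))
            for i in range(n):
                for j in range(n):
                    k = i * n + j
                    L[k, k] = 4.0 / h**2
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < n and 0 <= jj < n:
                            L[k, ii * n + jj] = -1.0 / h**2
            u = np.zeros(N)
            f = np.ones(N)
            for _ in range(max_it):
                F = L @ u + u**3 - f              # nonlinear residual
                J = L + np.diag(3.0 * u**2)       # Jacobian of the residual
                du = np.linalg.solve(J, -F)       # Newton correction
                u += du
                if np.linalg.norm(du) < tol:
                    break
            return u.reshape(n, n)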

    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Previous studies have reported that common dense linear algebra operations do not achieve speedup by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's). Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA).
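
    The communication-avoiding idea can be illustrated with a minimal TSQR sketch in NumPy: each row block is factored independently, and only the small R factors are combined up a reduction tree, so a grid deployment exchanges only small triangular factors rather than whole panels. This is a generic TSQR illustration, not the QCG-OMPI-based implementation of the paper.

        import numpy as np

        def tsqr_r(A, n_blocks=4):
            """Communication-avoiding QR (TSQR) sketch for a tall and skinny
            matrix: local QR per row block, then a pairwise reduction tree on
            the stacked R factors.  Only R is returned; in a grid setting each
            block would live on a different site."""
            Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(A, n_blocks)]
            while len(Rs) > 1:
                nxt = []
                for i in range(0, len(Rs), 2):
                    if i + 1 < len(Rs):
                        nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode='r'))
                    else:
                        nxt.append(Rs[i])   # odd block passes through unchanged
                Rs = nxt
            return Rs[0]

        # R agrees with np.linalg.qr(A, mode='r') up to the signs of its rows
        R = tsqr_r(np.random.randn(10000, 50))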

    Review and analysis of dense linear system solver package for distributed memory machines

    A dense linear system solver package recently developed at the University of Texas at Austin for distributed memory machines (e.g., the Intel Paragon) has been reviewed and analyzed. The package contains about 45 software routines, some written in FORTRAN and some in C, and forms the basis for parallel/distributed solutions of systems of linear equations encountered in many problems of a scientific and engineering nature. The package, being studied by the Computer Applications Branch of the Analysis and Computation Division, may provide a significant computational resource for NASA scientists and engineers in parallel/distributed computing. Since the package is new and not well tested or documented, many of its underlying concepts and implementations were unclear; our task was to review, analyze, and critique the package as a step in the process that will enable scientists and engineers to apply it to the solution of their problems. All routines in the package were reviewed and analyzed. Underlying theory or concepts which exist in the form of published papers, technical reports, or memos were obtained either from the author or from the scientific literature, and general algorithms, explanations, examples, and critiques have been provided to explain the workings of these programs. Wherever things were still unclear, the developer (author) was contacted, either by telephone or by electronic mail, to clarify the workings of the routines. Whenever possible, tests were made to verify the concepts and logic employed in the implementations. A detailed report is being separately documented to explain the workings of these routines.

    Run-time parallelization and scheduling of loops

    The class of problems that can be effectively compiled by parallelizing compilers is discussed. This is accomplished with the doconsider construct, which would allow these compilers to parallelize many problems in which substantial loop-level parallelism is available but cannot be detected by standard compile-time analysis. We describe and experimentally analyze mechanisms used to parallelize the work required for these types of loops. In each of these methods, a new loop structure is produced by modifying the loop to be parallelized. We also present the rules by which these loop transformations may be automated so that they can be included in language compilers. The main application area of the research involves problems in scientific computation and engineering. The workload used in our experiments includes a mixture of real problems as well as synthetically generated inputs. From our extensive tests on the Encore Multimax/320, we have concluded that, for the types of workloads we have investigated, self-execution almost always performs better than pre-scheduling. Further, the improvement in performance that accrues from global topological sorting of indices, as opposed to the less expensive local sorting, is not very significant in the case of self-execution.
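
    The difference between the two scheduling policies can be made concrete with a small sketch: under self-execution (self-scheduling), workers claim the next iteration index from a shared counter at run time, whereas pre-scheduling fixes the assignment of iterations to processors in advance. The Python threading version below is only an illustration of the claiming mechanism (relying on CPython's thread-safe itertools.count), not the Encore Multimax implementation.

        import itertools
        import threading

        def run_self_scheduled(body, n_iterations, n_workers=4):
            """Self-execution sketch: each worker repeatedly claims the next
            loop index from a shared counter at run time, so the load balances
            automatically even when iteration costs vary."""
            counter = itertools.count()          # shared index source (atomic under the GIL)
            def worker():
                while True:
                    i = next(counter)
                    if i >= n_iterations:
                        return
                    body(i)                      # execute one loop iteration
            threads = [threading.Thread(target=worker) for _ in range(n_workers)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()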

    Stratified regression Monte-Carlo scheme for semilinear PDEs and BSDEs with large scale parallelization on GPUs

    In this paper, we design a novel algorithm based on Least-Squares Monte Carlo (LSMC) in order to approximate the solution of discrete-time Backward Stochastic Differential Equations (BSDEs). Our algorithm allows massive parallelization of the computations on multicore devices such as graphics processing units (GPUs). Our approach relies on a novel stratification method, which appears to be crucial for large-scale parallelization.
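
    The stratification idea can be sketched on a single regression step: the state samples are binned into equiprobable strata, and an independent local least-squares fit approximates the conditional expectation on each stratum, which is what makes the work embarrassingly parallel. The snippet below is a generic illustration under these assumptions, not the paper's BSDE scheme; the function name, one-dimensional state, and polynomial basis are all illustrative choices.

        import numpy as np

        def stratified_regression(x, y, n_strata=8, degree=1):
            """Approximate E[y | x] by independent polynomial least-squares
            fits on equiprobable strata of x.  Each stratum is independent,
            so the fits could be dispatched to separate GPU blocks."""
            x = np.asarray(x, float)
            y = np.asarray(y, float)
            edges = np.quantile(x, np.linspace(0.0, 1.0, n_strata + 1))
            edges[0], edges[-1] = -np.inf, np.inf     # make the bins cover the whole line
            cond_exp = np.empty_like(y)
            for k in range(n_strata):
                mask = (x >= edges[k]) & (x < edges[k + 1])
                coeff = np.polyfit(x[mask], y[mask], degree)   # local least-squares fit
                cond_exp[mask] = np.polyval(coeff, x[mask])
            return cond_exp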

    Design and analysis of numerical algorithms for the solution of linear systems on parallel and distributed architectures

    The increasing availability of parallel computers is having a very significant impact on all aspects of scientific computation, including algorithm research and software development in numerical linear algebra. In particular, the solution of linear systems, which lies at the heart of most calculations in scientific computing, is an important computation found in many engineering and scientific applications. In this thesis, well-known parallel algorithms for the solution of linear systems are compared with implicit parallel algorithms, the Quadrant Interlocking (QI) class of algorithms, for solving linear systems. These implicit algorithms are (2x2) block algorithms expressed in explicit point form notation. [Continues.]

    A fast and stable parallel QR algorithm for symmetric tridiagonal matrices

    We present a new, fast, and practical parallel algorithm for computing a few eigenvalues of a symmetric tridiagonal matrix by the explicit QR method. It is a divide-and-conquer parallel algorithm that is fast and numerically stable. The algorithm is work-efficient and has low communication overhead, and it can be used to solve very large problems that are infeasible by sequential methods.
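
    For context, a serial sketch of explicit shifted QR iteration on a symmetric tridiagonal matrix is shown below. It forms the dense matrix and uses a Wilkinson shift purely for illustration; the algorithm above is a parallel divide-and-conquer scheme and would operate on the tridiagonal representation directly.

        import numpy as np

        def tridiagonal_eigenvalues_qr(d, e, max_it=200, tol=1e-12):
            """Eigenvalues of the symmetric tridiagonal matrix with diagonal d
            and off-diagonal e, via explicit shifted QR steps with deflation
            (serial illustration only)."""
            T = np.diag(np.asarray(d, float)) + np.diag(e, 1) + np.diag(e, -1)
            n = T.shape[0]
            while n > 1:
                for _ in range(max_it):
                    if abs(T[n - 1, n - 2]) <= tol * (abs(T[n - 2, n - 2]) + abs(T[n - 1, n - 1])):
                        break                      # last off-diagonal negligible: deflate
                    delta = (T[n - 2, n - 2] - T[n - 1, n - 1]) / 2.0
                    b = T[n - 1, n - 2]
                    sign = 1.0 if delta >= 0 else -1.0
                    mu = T[n - 1, n - 1] - sign * b * b / (abs(delta) + np.hypot(delta, b))
                    Q, R = np.linalg.qr(T[:n, :n] - mu * np.eye(n))
                    T[:n, :n] = R @ Q + mu * np.eye(n)   # one explicit shifted QR step
                n -= 1
            return np.sort(np.diag(T))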