114 research outputs found

    Parallel implementation of QRD algorithms on the Fujitsu AP1000

    This report addresses several important aspects of the parallel implementation of QR decomposition of a matrix on a distributed-memory MIMD machine, the Fujitsu AP1000. These include: among the various QR decomposition algorithms, which is most suitable for implementation on the AP1000? Given a fixed total number of cells, what aspect ratio of the cell array achieves optimal performance? How efficient is the AP1000 at computing the QR decomposition of a matrix? To help answer these questions, we have implemented various orthogonal factorisation algorithms on a 128-cell AP1000 located at the Australian National University. After extensive experiments, some interesting results have been obtained; they are presented in the report.
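
    The report's parallel kernels are not reproduced in this listing; purely as a reference point for the computation being distributed, a minimal serial Householder QR is sketched below (a NumPy sketch for illustration only, not the AP1000 implementation).

```python
import numpy as np

def householder_qr(A):
    """Minimal serial Householder QR; an illustrative sketch, not the AP1000 kernels."""
    m, n = A.shape
    R = A.astype(float)   # work on a floating-point copy
    Q = np.eye(m)
    for k in range(min(m, n)):
        x = R[k:, k]
        v = x.copy()
        # Choose the sign that avoids cancellation when forming the reflector.
        v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
        norm_v = np.linalg.norm(v)
        if norm_v == 0.0:
            continue  # column is already zero below the diagonal
        v /= norm_v
        # Apply the reflector H = I - 2 v v^T to the trailing submatrix ...
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        # ... and accumulate it into Q from the right, so that Q @ R == A.
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R

A = np.random.rand(6, 4)
Q, R = householder_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(6))
```

    On a distributed-memory machine, the questions above amount to how the columns and reflector updates of such a factorisation are spread over the cell array; those details are in the report itself.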

    Coherent network interfaces for fine-grain communication

    Using coherence can improve performance by facilitating burst transfers of whole cache blocks and by reducing control overheads. This paper explores network interfaces that use coherence, i.e., coherent network interfaces (CNIs), to improve communication performance. First, it reports on the development and optimization of two mechanisms that CNIs use to communicate with processors. A taxonomy and a comparison of four CNIs with a more conventional NI are then presented.

    Memory sharing for interactive ray tracing on clusters

    We present recent results in the application of distributed shared memory to image-parallel ray tracing on clusters. Image-parallel rendering is traditionally limited to scenes that are small enough to be replicated in the memory of each node, because any processor may require access to any piece of the scene. We solve this problem by making all of a cluster's memory available through software distributed shared memory layers. With gigabit ethernet connections, this mechanism is sufficiently fast for interactive rendering of multi-gigabyte datasets. Object- and page-based distributed shared memories are compared, and optimizations for efficient memory use are discussed.
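
    As a rough illustration of the fetch-on-miss behaviour behind an object-based software DSM layer, the sketch below uses hypothetical names (ObjectCache, fetch_remote); it is a simplification under stated assumptions, not the paper's actual layers.

```python
# Illustrative fetch-on-miss cache for an object-based software DSM layer.
# The names and the eviction policy are hypothetical, not from the paper.

class ObjectCache:
    def __init__(self, fetch_remote, capacity=1024):
        self.fetch_remote = fetch_remote  # callable: object_id -> scene data, fetched over the network
        self.capacity = capacity          # how many remote objects this node keeps resident
        self.cache = {}                   # object_id -> locally cached scene data

    def get(self, object_id):
        """Return scene data, fetching it from its owning node on a miss."""
        obj = self.cache.get(object_id)
        if obj is None:
            obj = self.fetch_remote(object_id)    # network round trip only on a miss
            if len(self.cache) >= self.capacity:  # crude eviction when local memory fills up
                self.cache.pop(next(iter(self.cache)))
            self.cache[object_id] = obj
        return obj

# A renderer would call cache.get(...) for each BVH node or triangle group a
# ray touches, so only a node's working set of a multi-gigabyte scene needs
# to be resident locally; a page-based DSM does the same at page granularity.
```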

    Improving the performance of parallel scientific applications using cache injection

    Cache injection is a viable technique to improve the performance of data-intensive parallel applications. This dissertation characterizes cache injection of incoming network data in terms of parallel application performance. My results show that the benefit of this technique is dependent on: the ratio of processor speed to memory speed, the cache injection policy, and the application's communication characteristics. Cache injection addresses the memory wall for I/O by writing data into a processor's cache directly from the I/O bus. This technique, unlike data prefetching, reduces the number of reads served by the memory unit. This reduction is significant for data-intensive applications whose performance is dominated by compulsory cache misses and cannot be alleviated by traditional caching systems. Unlike previous work on cache injection, which focused on reducing host network stack overhead incurred by memory copies, I show that applications can directly benefit from this technique based on their temporal and spatial locality in accessing incoming network data. I also show that the performance of cache injection is directly proportional to the ratio of processor speed to memory speed. In other words, systems with a memory wall can provide significantly better performance with cache injection and an appropriate injection policy. This result implies that multi-core and many-core architectures would benefit from this technique. Finally, my results show that the application's communication characteristics are key to cache injection performance. For example, cache injection can improve the performance of certain collective communication operations by up to 20% as a function of message size.
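
    As a back-of-the-envelope illustration of why the benefit scales with the processor-to-memory speed ratio, the toy cost model below counts which reads of incoming network data are served from the cache versus from memory; the parameters and the linear model are assumptions for illustration, not the dissertation's methodology or results.

```python
# Toy model: total read latency for incoming network data, with and without
# cache injection. All numbers are illustrative assumptions.

def read_cost(num_reads, hit_fraction, cache_latency, mem_latency):
    """Total latency when hit_fraction of reads hit in the cache."""
    hits = num_reads * hit_fraction
    misses = num_reads - hits
    return hits * cache_latency + misses * mem_latency

num_reads = 1_000_000
cache_latency = 1  # cycles, assumed
for mem_latency in (50, 200, 800):  # a larger ratio means a taller "memory wall"
    baseline = read_cost(num_reads, 0.0, cache_latency, mem_latency)   # compulsory misses only
    injected = read_cost(num_reads, 0.9, cache_latency, mem_latency)   # injected data mostly hits
    print(f"memory/cache latency ratio {mem_latency}: speedup {baseline / injected:.1f}x")
```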

    Optimal load balancing techniques for block-cyclic decompositions for matrix factorization

    In this paper, we present a new load balancing technique, called panel scattering, which is generally applicable to parallel block-partitioned dense linear algebra algorithms such as matrix factorization. Here, the panels formed in such computations are divided across their length and evenly (re-)distributed among all processors. It is shown how this technique can be efficiently implemented for the general block-cyclic matrix distribution, requiring only the collective communication primitives that are required for block-cyclic parallel BLAS. In most situations, panel scattering yields optimal load balance and cell computation speed across all stages of the computation. It also has the advantage of naturally yielding good memory access patterns. Compared with traditional methods, which minimize communication costs at the expense of load balance, it incurs a small (in some situations negative) increase in communication volume. It does, however, incur extra communication startup costs, but only by a factor not exceeding 2. To maximize load balance and minimize the cost of panel re-distribution, storage block sizes should be kept small; furthermore, in many situations of interest, there is no significant communication startup penalty for doing so. Results are given for the Fujitsu AP+ parallel computer, comparing the performance of panel scattering with previously established methods for LU, LL^T, and QR factorization. These results are consistent with a detailed performance model, developed here, for LU factorization under each method.
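
    The redistribution idea can be sketched with simple index arithmetic: under a plain 2-D block-cyclic layout, a panel (one block column) is owned by a single process column, whereas panel scattering deals the panel's row blocks over all P processes. The mapping below is a simplifying illustration, not the paper's implementation.

```python
# Illustrative mapping of a panel's row blocks under (a) a plain 2-D
# block-cyclic layout and (b) panel scattering. The index arithmetic is an
# assumption for illustration, not the paper's redistribution code.

def block_cyclic_owner(i_block, j_block, p_rows, p_cols):
    """Process grid coordinates owning block (i_block, j_block)."""
    return (i_block % p_rows, j_block % p_cols)

def panel_owners_standard(num_row_blocks, j_panel, p_rows, p_cols):
    """A panel lives entirely in the single process column j_panel % p_cols."""
    return {i: block_cyclic_owner(i, j_panel, p_rows, p_cols)
            for i in range(num_row_blocks)}

def panel_owners_scattered(num_row_blocks, p_rows, p_cols):
    """Panel scattering deals the panel's row blocks cyclically over all
    p_rows * p_cols processes, so each gets roughly an equal share."""
    P = p_rows * p_cols
    return {i: divmod(i % P, p_cols) for i in range(num_row_blocks)}

print(panel_owners_standard(8, j_panel=0, p_rows=2, p_cols=4))  # 2 processes hold all 8 blocks
print(panel_owners_scattered(8, p_rows=2, p_cols=4))            # all 8 processes hold 1 block each
```

    The trade-off described above shows up directly in such a mapping: the scattered panel is balanced across every process, at the cost of an extra redistribution (scatter) step before the panel factorization.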

    A Multilevel in Space and Energy Solver for Multigroup Diffusion and Coarse Mesh Finite Difference Eigenvalue Problems

    In reactor physics, the efficient solution of the multigroup neutron diffusion eigenvalue problem is desired for various applications. The diffusion problem is a lower-order but reasonably accurate approximation to the higher-fidelity multigroup neutron transport eigenvalue problem. In cases where the full fidelity of the transport solution is needed, the solution of the diffusion problem can be used to accelerate the convergence of transport solvers via methods such as Coarse Mesh Finite Difference (CMFD). The diffusion problem can have O(10^8) unknowns and, despite being orders of magnitude smaller than a typical transport problem, obtaining its solution is still not a trivial task. In the Michigan Parallel Characteristics Transport (MPACT) code, the lack of an efficient CMFD solver has resulted in a computational bottleneck at the CMFD step. Solving the CMFD system can comprise 50% or more of the overall runtime in MPACT when the de facto default CMFD solver is used; addressing this bottleneck is the motivation for our work. The primary focus of this thesis is the theory, development, implementation, and testing of a new Multilevel-in-Space-and-Energy Diffusion (MSED) method for efficiently solving multigroup diffusion and CMFD eigenvalue problems. As its name suggests, MSED efficiently converges multigroup diffusion and CMFD problems by leveraging lower-order systems with coarsened energy and/or spatial grids. The efficiency of MSED is verified via various Fourier analyses of its components and via testing in a 1-D diffusion code. In the later chapters of this thesis, the MSED method is tested on a variety of reactor problems in MPACT. Compared to the default CMFD solver, our implementation of MSED in MPACT has resulted in an ~8-12x reduction in the CMFD runtime required by MPACT for single-statepoint calculations on 3-D, full-core, 51-group reactor models. The number of transport sweeps is also typically reduced by the use of MSED, which converges the CMFD system better than the default CMFD solver. This leads to further savings in overall runtime that are not captured by the differences in CMFD runtime.
    PhD thesis, Nuclear Engineering & Radiological Sciences, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/146075/1/bcyee_1.pd
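
    For orientation, the eigenvalue problem being accelerated has the generic form (loss operator) phi = (1/k) (fission source). The sketch below is a minimal 1-D, one-group finite-difference k-eigenvalue solver driven by power iteration; the cross sections, grid, and boundary conditions are illustrative assumptions, and it shows only the baseline problem, not the MSED algorithm or MPACT's CMFD system.

```python
import numpy as np

# Minimal 1-D, one-group diffusion k-eigenvalue solver via power iteration.
# All data (cross sections, grid, zero-flux boundaries) are illustrative
# assumptions; this is the generic problem that multilevel solvers such as
# MSED accelerate, not the MSED method itself.

def diffusion_k_eig(n=100, length=100.0, D=1.0, sigma_a=0.02, nu_sigma_f=0.025,
                    tol=1e-8, max_iters=20_000):
    h = length / n
    # Loss operator: -D d^2/dx^2 + sigma_a, discretised with zero-flux boundaries.
    main = np.full(n, 2.0 * D / h**2 + sigma_a)
    off = np.full(n - 1, -D / h**2)
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)

    phi, k = np.ones(n), 1.0
    for _ in range(max_iters):
        phi_new = np.linalg.solve(A, nu_sigma_f * phi / k)  # one diffusion solve per outer iteration
        k_new = k * (nu_sigma_f * phi_new).sum() / (nu_sigma_f * phi).sum()
        if abs(k_new - k) < tol:
            break
        k, phi = k_new, phi_new
    return k_new, phi_new / np.linalg.norm(phi_new)

k, phi = diffusion_k_eig()
print(f"k-effective ~ {k:.5f}")
```

    A realistic multigroup problem replaces the scalars above with coupled group equations on a 3-D mesh, which is where coarsening in energy and space pays off.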