298 research outputs found

    Memory Access Optimizations for High-Performance Computing

    This paper discusses the importance of memory access optimizations, which are shown to be highly effective on the MasPar architecture. The study is based on two MasPar machines, a 16K-processor MP-1 and a 4K-processor MP-2. A software pipelining technique overlaps memory accesses with computation and/or communication. Another optimization, called the register window technique, reduces the number of loads in a loop. These techniques are evaluated using three parallel matrix multiplication algorithms on both MasPar machines. The matrix multiplication study shows that for a highly computation-intensive problem, reducing interprocessor communication can become a secondary issue compared to memory access optimization. It is also shown that memory access optimizations can play a more important role than the choice of a superior parallel algorithm. Keywords: load/store architecture, memory accesses, matrix multiplication, parallel programming
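The abstract does not say how the register window technique works. As a rough illustration of the general idea of cutting loads in a loop, the sketch below keeps a sliding window of values in local variables (standing in for registers), so each array element is loaded once instead of three times. The function names, the 3-point sum, and the load counter are assumptions made for this illustration and are not taken from the paper.

```python
def stencil_naive(a):
    """3-point sum; every iteration loads three elements from memory."""
    loads = 0
    out = []
    for i in range(1, len(a) - 1):
        left, mid, right = a[i - 1], a[i], a[i + 1]
        loads += 3
        out.append(left + mid + right)
    return out, loads


def stencil_register_window(a):
    """Same result, but values stay in a sliding window between iterations."""
    loads = 0
    out = []
    if len(a) < 3:
        return out, loads
    left, mid = a[0], a[1]  # prime the window before the loop
    loads += 2
    for i in range(1, len(a) - 1):
        right = a[i + 1]  # the only new load per iteration
        loads += 1
        out.append(left + mid + right)
        left, mid = mid, right  # slide the window by one element
    return out, loads
```

For a 10-element input, the naive loop counts 24 loads and the windowed loop counts 10, with identical results.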

    Expanded delta networks for very large parallel computers

    In this paper we analyze a generalization of the traditional delta network, introduced by Patel [21], and dubbed the Expanded Delta Network (EDN). These networks generally provide multiple paths that can be exploited to reduce contention in the network, resulting in increased performance. The crossbar and traditional delta networks are limiting cases of this class of networks. However, the delta network does not provide the multiple paths that the more general expanded delta networks provide, and crossbars are too costly to use for large networks. The EDNs are analyzed with respect to their routing capabilities in the MIMD and SIMD models of computation. The concepts of capacity and clustering are also addressed. In massively parallel SIMD computers, the trend is to put a larger number of processors on a chip, but due to I/O constraints only a subset of the total number of processors may have access to the network. This is introduced as a Restricted Access Expanded Delta Network, of which the MasPar MP-1 router network is an example
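To make the single-path limitation of a traditional delta network concrete, here is a minimal sketch of destination-tag routing in an omega network, one well-known member of the delta family built from 2x2 switches. It does not model the EDN from the paper. The function names and the conflict-counting helper are assumptions made for this illustration.

```python
def omega_route(src, dst, n_bits):
    """Return the link a message occupies after each of the n_bits stages."""
    mask = (1 << n_bits) - 1
    pos, path = src, []
    for stage in range(n_bits):
        bit = (dst >> (n_bits - 1 - stage)) & 1  # destination bit, MSB first
        pos = ((pos << 1) & mask) | bit  # perfect shuffle, then switch sets low bit
        path.append(pos)
    return path


def conflicts(pairs, n_bits):
    """Count (stage, link) slots requested by more than one message."""
    seen = {}
    for src, dst in pairs:
        for stage, link in enumerate(omega_route(src, dst, n_bits)):
            seen[(stage, link)] = seen.get((stage, link), 0) + 1
    return sum(1 for count in seen.values() if count > 1)
```

Because each source/destination pair has exactly one path, two messages that need the same link at the same stage must contend. Networks that offer multiple paths, as the abstract describes for EDNs, give the router an alternative in such cases.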

    Image analysis by integration of disparate information

    Image analysis often starts with some preliminary segmentation, which provides a representation of the scene needed for further interpretation. Segmentation can be performed in several ways, which are categorized as pixel-based, edge-based, and region-based. Each of these approaches is affected differently by various factors, and the final result may be improved by integrating several or all of these methods, thus taking advantage of their complementary nature. In this paper, we propose an approach that integrates pixel-based and edge-based results by utilizing an iterative relaxation technique. This approach has been implemented on a massively parallel computer and tested on some remotely sensed imagery from the Landsat Thematic Mapper (TM) sensor

    A simple parallel prefix algorithm for compact finite-difference schemes

    A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the systems resulting from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experimental results, measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine, show that the simple parallel prefix algorithm is a good algorithm for the compact scheme on high-performance computers
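The abstract gives no pseudocode for SPP, so the following is not the paper's algorithm. It is a minimal sketch of the ingredients the abstract names (a prefix communication pattern, AXPY updates, and truncation), applied to the constant-coefficient recurrence x[i] = d[i] + r * x[i-1], the kind of recurrence that arises in forward substitution with a Toeplitz bidiagonal factor. The function names and the `tol` parameter are assumptions made for this illustration.

```python
def scan_recurrence(d, r, tol=0.0):
    """Solve x[i] = d[i] + r * x[i-1] (with x[-1] = 0) by recursive doubling."""
    x = list(d)
    n = len(x)
    stride, coef = 1, r
    # Stop early once the coefficient r**stride is negligible (truncation).
    while stride < n and abs(coef) > tol:
        # AXPY with a shifted copy of x; every element can update in parallel.
        x = [x[i] + coef * x[i - stride] if i >= stride else x[i]
             for i in range(n)]
        stride *= 2
        coef *= coef  # r**stride for the doubled stride
    return x


def sequential_recurrence(d, r):
    """Reference solution computed one element at a time."""
    x, prev = [], 0.0
    for value in d:
        prev = value + r * prev
        x.append(prev)
    return x
```

When |r| < 1, as in a diagonally dominant system, the coefficient shrinks quadratically with each doubling, so a small positive `tol` ends the loop after a few steps with little loss of accuracy.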

    Introduction to Multiprocessor I/O Architecture

    The computational performance of multiprocessors continues to improve by leaps and bounds, fueled in part by rapid improvements in processor and interconnection technology. I/O performance thus becomes ever more critical, to avoid becoming the bottleneck of system performance. In this paper we provide an introduction to I/O architectural issues in multiprocessors, with a focus on disk subsystems. While we discuss examples from actual architectures and provide pointers to interesting research in the literature, we do not attempt to provide a comprehensive survey. We concentrate on a study of the architectural design issues, and the effects of different design alternatives

    Parametric micro-level performance models for parallel computing and parallel implementation of hydrostatic MM5

    This dissertation presents parametric micro-level performance models and a parallel implementation of the hydrostatic version of MM5. Parametric micro-level (PM) performance models are introduced to address the important issue of how to realistically model parallel performance. These models can be used to predict execution times and identify performance bottlenecks. The accurate prediction and analysis of execution times is achieved by incorporating precise details of interprocessor communication, memory operations, auxiliary instructions, and effects of communication and computation schedules. The parameters provide the flexibility to study various algorithmic and architectural issues. The development and verification process, the parameters, and the scope of applicability of these models are discussed. A coherent view of performance is obtained from the execution profiles generated by PM models. The models are targeted at a large class of numerical algorithms commonly implemented on both SIMD and MIMD machines. Specific models are presented for matrix multiplication, LU decomposition, and FFT on a 2-D processor array with distributed memory. A case study includes comparisons of parallel machines and of parallel algorithms. In the comparison of parallel machines, PM models are used to analyze execution times so as to relate the performance to architectural attributes of a machine. In the comparison of parallel algorithms, PM models are used to study the performance of two LU decomposition algorithms, non-blocked and blocked. The two algorithms are compared to identify the tradeoffs between them. This analysis is useful for determining an optimum block size for the blocked algorithm. The case study is done on MasPar MP-1 and MP-2 machines. The dissertation also describes the parallel implementation of the hydrostatic version of MM5 (the fifth generation of the Mesoscale Model), which has been widely used for climate studies. The model was parallelized in a machine-independent manner using the Runtime System Library (RSL), a runtime library for handling message passing and index transformation. The dissertation discusses validation of the parallel implementation of MM5 using field data and presents performance results. The parallel model was tested on the IBM SP1, a distributed-memory parallel computer

    Compiling machine-independent parallel programs
