9 research outputs found

    Parallelization of sequential programs: distribution of arrays among processors and structurization of communications

    Data distribution functions are introduced and matched with scheduling functions. For an array element at a fixed position in a statement, the processors and iterations that use it are determined. This makes it possible to obtain the initial data distribution, as well as information on the data volume handled by every processor and on the structure of the required communication.
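
    To make the idea concrete, here is a minimal sketch (hypothetical names; it assumes a one-dimensional block distribution and an owner-computes schedule, neither of which the abstract specifies): matching a distribution function against a scheduling function exposes, per processor, both the volume of off-processor data and the shift structure of the communication.

```python
# A minimal sketch (hypothetical names): a block data-distribution function
# matched against an owner-computes scheduling function for the statement
# A[i] = B[i + 1], exposing the required communication and its structure.
N, P = 16, 4                  # array size, number of processors
BLK = N // P                  # block size of the distribution

def owner(index):             # data distribution function: element -> processor
    return index // BLK

def scheduled_on(i):          # scheduling function: iteration i -> processor
    return owner(i)           # owner-computes rule for the left-hand side A[i]

# Iterations whose right-hand-side element B[i + 1] is non-local give both
# the per-processor communication volume and its (nearest-neighbor) shape.
for i in range(N - 1):
    p, q = scheduled_on(i), owner(i + 1)
    if p != q:
        print(f"processor {q} sends B[{i + 1}] to processor {p}")
```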

    Performance modeling and optimization techniques for heterogeneous computing

    Since Graphics Processing Units (GPUs) have increasingly gained popularity among non-graphics and computational applications, known as General-Purpose computation on GPUs (GPGPU), GPUs have been deployed in many clusters, including the world's fastest supercomputer. However, to get the most efficiency from a GPU system, one should consider both the performance and the reliability of the system. This dissertation makes four major contributions. First, a two-level checkpoint/restart protocol is proposed that aims to reduce checkpoint and recovery costs with a latency-hiding strategy between a CPU (Central Processing Unit) and a GPU. The experimental results and analysis reveal some benefits, especially for long-running applications. Second, a performance model for estimating GPGPU execution time is proposed. This performance model improves operation cost estimation over existing ones by considering varied memory latencies. The proposed model also considers the effects of thread synchronization functions. In addition, the impacts of various issues in GPGPU programming, such as bank conflicts in shared memory and branch divergence, are also discussed. Third, the interplay between GPGPU application performance and the system reliability of a large GPU system is explored. This includes a checkpoint scheduling model for a given GPGPU application. The effects of a checkpoint/restart mechanism on application performance are also discussed. Finally, optimization techniques are proposed to remedy uncoalesced memory access in the GPU's global memory. These techniques rearrange memory using 2-dimensional matrix transpose and 3-dimensional matrix permutation. The analytical results show that the proposed techniques can reduce memory access time, especially when the transformed array/matrix is frequently accessed.
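
    As one concrete point of contact with the checkpoint-scheduling contribution, the classic first-order checkpoint-interval model (Young's approximation) is sketched below. This is a textbook formula shown only to illustrate the trade-off being modeled; it is not necessarily the model developed in the dissertation.

```python
# A sketch of Young's first-order approximation for the interval between
# checkpoints; illustrates the cost-vs-reliability trade-off that a
# checkpoint scheduling model addresses, not the dissertation's own model.
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Checkpoint interval (seconds) minimizing expected run-time overhead."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: a 60 s checkpoint on a system with a 24-hour mean time between
# failures suggests checkpointing roughly every 54 minutes.
print(young_interval(60.0, 24 * 3600.0))   # ~3220 s
```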

    Automatic Data and Computation Mapping for Distributed-Memory Machines.

    Distributed memory parallel computers offer enormous computation power, scalability and flexibility. However, these machines are difficult to program, and this limits their widespread use. An important characteristic of these machines is the difference in the access time for data in local versus non-local memory; non-local memory accesses are much slower than local memory accesses. This is also a characteristic of shared memory machines, but to a lesser degree. Therefore it is essential that, as far as possible, the data that needs to be accessed by a processor during the execution of the computation assigned to it reside in its local memory rather than in some other processor's memory. Several research projects have concluded that proper mapping of data is key to realizing the performance potential of distributed memory machines. Current language design efforts such as Fortran D and High Performance Fortran (HPF) are based on this. It is our thesis that for many practical codes, it is possible to derive good mappings through a combination of algorithms and systematic procedures. We view mapping as consisting of two phases: alignment followed by distribution. For the alignment phase we present three constraint-based methods: the first is based on a linear programming formulation of the problem; the second formulates the alignment problem as a constrained optimization problem using Lagrange multipliers; the third uses a heuristic to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so) in order to find a mapping. In addressing the distribution phase, we have developed two methods that integrate the placement of computation (loop nests in our case) with the mapping of data. For one distributed dimension, our approach finds the best combination of data and computation mapping that results in low communication overhead; this is done by choosing a loop order that allows message vectorization. In the second method, we introduce the distribution preference graph, and the operations on this graph allow us to integrate loop restructuring transformations and data mapping. These techniques produce mappings that have been used in efficient hand-coded implementations of several benchmark codes.
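
    A hedged illustration of the distribution-phase trade-off (a toy example, not the thesis's actual procedure): counting non-local accesses under owner-computes for each candidate distributed dimension makes the "best combination of data and computation mapping" criterion concrete.

```python
# A toy sketch (not the thesis's algorithm): compare candidate distributed
# dimensions for the stencil A[i][j] = B[i-1][j] + B[i+1][j] by counting
# the non-local B references each choice induces under owner-computes.
N, P = 8, 4                        # array extent and processor count

def remote_accesses(dist_dim):
    """Non-local references with both arrays block-distributed along
    dist_dim (0 = rows, 1 = columns) and iterations placed by ownership
    of the left-hand side A[i][j]."""
    blk = N // P
    owner = lambda idx: idx[dist_dim] // blk
    count = 0
    for i in range(1, N - 1):
        for j in range(N):
            here = owner((i, j))
            for nb in ((i - 1, j), (i + 1, j)):
                if owner(nb) != here:
                    count += 1
    return count

# Distributing along the stencil's dimension forces boundary exchanges;
# the orthogonal dimension is communication-free for this statement.
print(remote_accesses(0), remote_accesses(1))   # 48 0
```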

    Tools and methods for the parallel processing of computations on arrays

    HPF's parallelism model -- Architecture of the Pulse SIMD -- Partitioning of nested loops -- Partitioning of high-level code -- Address generation for arrays -- High-level languages for SIMD programming -- Iterative methods -- Direct methods -- Hardware context -- Implementation of a C++ array object -- Address generation -- Address patterns -- Algorithm -- Software implementation -- Hardware implementation -- Transformations -- Parameters -- A high-level language for SIMD computers -- Description of the language -- Semantics -- Circular buffers -- Example program -- Analysis of the performance obtained -- Automatic generation of HPF directives -- Conceptual framework and algorithms -- Implementation -- Generalization and formalization of the partitioning model -- Distribution class -- Algorithms

    Compiler Techniques for Optimizing Communication and Data Distribution for Distributed-Memory Computers

    Advanced Research Projects Agency (ARPA); National Aeronautics and Space Administration; Ope

    Solving Alignment using Elementary Linear Algebra

    Data and computation alignment is an important part of compiling sequential programs to architectures with non-uniform memory access times. In this paper, we show that elementary matrix methods can be used to determine communication-free alignment of code and data. We also solve the problem of replicating read-only data to eliminate communication. Our matrix-based approach leads to algorithms which are simpler and faster than existing algorithms for the alignment problem. A key problem in generating code for non-uniform memory access (NUMA) parallel machines is data and computation placement, that is, determining what work each processor must do and what data must reside in each local memory. The goal of placement is to exploit parallelism by spreading the work across the processors, and to exploit locality by spreading data so that memory accesses are local whenever possible. The problem of determining a good placement for a program is usually solved in two phases, called alignment and distribution.
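
    To make the method tangible, here is a minimal sketch of the linear-algebra core, in a standard formulation consistent with the abstract (the example loop nest and the SymPy encoding are assumptions, not the paper's own code): communication-free alignments correspond to the left null space of the stacked access matrices.

```python
# A minimal sketch of communication-free alignment via elementary linear
# algebra (illustrative formulation; the example loop nest is assumed).
from sympy import Matrix

# Loop nest:  for i, j:  A[i, j] = B[j, i]
F_A = Matrix([[1, 0], [0, 1]])   # access matrix of A (index = F_A @ (i, j))
F_B = Matrix([[0, 1], [1, 0]])   # access matrix of B

# Linear data mappings D_A, D_B (array index -> virtual processor) are
# communication-free iff D_A * F_A == D_B * F_B, i.e. the row vector
# [D_A | D_B] lies in the left null space of the stacked matrix [F_A; -F_B].
M = F_A.col_join(-F_B)                     # 4x2 stacked access matrix
for v in M.T.nullspace():                  # basis of the left null space
    row = v.T                              # 1x4 row vector [D_A | D_B]
    D_A, D_B = row[:, :2], row[:, 2:]
    print("D_A =", D_A.tolist(), " D_B =", D_B.tolist())
```

    For this transpose-like nest, the basis pairs D_A = [1, 0] with D_B = [0, 1] (and the symmetric pair), i.e. row i of A aligns with column i of B, and no communication remains.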

    Solving Alignment Using Elementary Linear Algebra

    In this paper, we show that elementary matrix methods can be used to determine communication-free alignment of code and data. We also solve the problem of replicating data to eliminate communication. Our matrix-based approach leads to algorithms which work well for a variety of applications, and which are simpler and faster than other matrix-based algorithms in the literature.