8 research outputs found

    PASSION: Parallel And Scalable Software for Input-Output

    We are developing a software system called PASSION: Parallel And Scalable Software for Input-Output, which provides software support for high-performance parallel I/O. PASSION provides support at the language, compiler, runtime, and file system levels. It provides runtime procedures for parallel access to files (read/write) as well as for out-of-core computations. These routines can be used either together with a compiler to translate out-of-core data parallel programs written in a language like HPF, or directly by application programmers. A number of optimizations such as Two-Phase Access, Data Sieving, Data Prefetching and Data Reuse have been incorporated into the PASSION Runtime Library for improved performance. PASSION also provides an initial framework for runtime support for out-of-core irregular problems. The goal of the PASSION compiler is to automatically translate out-of-core data parallel programs into node programs for distributed memory machines, with calls to the PASSION Runtime Library. At the language level, PASSION suggests extensions to HPF for out-of-core programs. At the file system level, PASSION provides support for buffering and prefetching data from disks. A portable parallel file system is also being developed as part of this project, which can be used across homogeneous or heterogeneous networks of workstations. PASSION also provides support for integrating data and task parallelism using parallel I/O techniques. We have used PASSION to implement a number of out-of-core applications such as a Laplace's equation solver, 2D FFT, Matrix Multiplication, LU Decomposition, image processing applications, and unstructured mesh kernels in molecular dynamics and computational fluid dynamics. We are currently applying PASSION to applications in CFD (3D turbulent flows), molecular structure calculations, seismic computations, and earth and space science applications such as Four-Dimensional Data Assimilation. PASSION is currently available on the Intel Paragon, Touchstone Delta and iPSC/860. Efforts are underway to port it to the IBM SP-1 and SP-2 using the Vesta Parallel File System.
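
    As a concrete illustration of the Data Sieving optimization mentioned above, the sketch below (plain Python with hypothetical function names, not the PASSION API) contrasts one small read per strided chunk with a single large contiguous read that is then filtered in memory.

        import io

        def strided_read_naive(f, offsets, nbytes):
            # One small I/O request per chunk: many seeks, many read calls.
            parts = []
            for off in offsets:
                f.seek(off)
                parts.append(f.read(nbytes))
            return b"".join(parts)

        def strided_read_sieved(f, offsets, nbytes):
            # Data sieving: a single large contiguous read covering every chunk,
            # then the wanted pieces are extracted in memory.
            start, end = min(offsets), max(offsets) + nbytes
            f.seek(start)
            block = f.read(end - start)
            return b"".join(block[o - start:o - start + nbytes] for o in offsets)

        # In-memory demo: 4-byte chunks at a stride of 16 bytes.
        f = io.BytesIO(bytes(range(256)))
        holes = list(range(0, 64, 16))
        assert strided_read_naive(f, holes, 4) == strided_read_sieved(f, holes, 4)

    The sieved version transfers some unneeded bytes but issues far fewer I/O requests, which is the trade-off such a runtime library exploits.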

    Compiler Optimization Techniques for Scheduling and Reducing Overhead

    Exploiting parallelism in loops is an important factor in realizing the potential performance of today's processors. This dissertation develops and evaluates several compiler optimizations aimed at improving the performance of loops. An important feature of a class of scientific computing problems is the regularity exhibited by their access patterns. Chapter 2 presents an approach to optimizing the address generation of these problems that results in: (i) elimination of redundant arithmetic computation by recognizing and exploiting common sub-expressions across different iterations in stencil codes; and (ii) conversion of as many array references as possible to scalar accesses, which leads to reduced execution time, lower address-arithmetic overhead, and access to data in registers rather than caches. With the advent of VLIW processors, the exploitation of fine-grain instruction-level parallelism has become a major challenge for optimizing compilers. While fine-grain scheduling of inner loops has received a lot of attention, little work has been done on applying it to nested loops. Chapter 3 presents an approach to fine-grain scheduling of nested loops by formulating the problem of finding the minimum iteration initiation interval as that of finding a rational affine schedule for each statement in the body of a perfectly nested loop, which is then solved using linear programming. Frequent synchronization on multiprocessors is expensive. Chapter 4 presents a method for eliminating redundant synchronization in nested loops. In nested loops, a dependence may be redundant in only a portion of the iteration space. A characterization of the non-uniformity of the redundancy of a dependence is developed in terms of the relation between the dependences and the shape and size of the iteration space. Exploiting locality is critical for achieving a high level of performance on a parallel machine. Chapter 5 presents an approach that uses the concept of affinity regions to find transformations such that a suitable iteration-to-processor mapping can be found for a sequence of loop nests accessing shared arrays. This not only improves data locality but also significantly reduces communication overhead.
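
    To make the Chapter 2 idea concrete, here is a minimal sketch (plain Python rather than the dissertation's compiler setting) of eliminating redundant loads across iterations of a 1-D stencil by rotating values through scalars.

        def stencil_naive(a):
            # Three array loads per iteration: a[i-1], a[i], a[i+1].
            out = [0.0] * len(a)
            for i in range(1, len(a) - 1):
                out[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
            return out

        def stencil_scalarized(a):
            # Rotating scalars: each element of a is loaded exactly once.
            out = [0.0] * len(a)
            left, mid = a[0], a[1]
            for i in range(1, len(a) - 1):
                right = a[i + 1]            # the only new load in this iteration
                out[i] = (left + mid + right) / 3.0
                left, mid = mid, right      # rotate values into the next iteration
            return out

        assert stencil_naive(list(range(8))) == stencil_scalarized(list(range(8)))

    Each array element is now loaded once instead of three times, and the working values can stay in registers.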

    Compile-Time Estimation of Communication Costs in Multicomputers

    Coordinated Science Laboratory (formerly the Control Systems Laboratory). Office of Naval Research / N00014-91-J-1096; National Science Foundation / NSF MIP 86-57563 PYI; National Aeronautics and Space Administration / NASA NAG 1-61.

    Pricing Python Parallelism: A Dynamic Language Cost Model for Heterogeneous Platforms

    Execution times may be reduced by offloading parallel loop nests to a GPU. Auto-parallelizing compilers are common for static languages, often using a cost model to determine when GPU execution speed will outweigh the offload overheads. Nowadays, scientific software is increasingly written in dynamic languages and would benefit from compute accelerators. The ALPyNA framework analyses moderately complex Python loop nests and automatically JIT compiles code for heterogeneous CPU and GPU architectures. We present the first analytical cost model for auto-parallelizing loop nests in a dynamic language on heterogeneous architectures. Predicting execution time in a language like Python is extremely challenging, since aspects like the element types, the size of the iteration space, and amenability to parallelization can only be determined at runtime. Hence the cost model must be both staged, to combine compile-time and run-time information, and lightweight, to minimize runtime overhead. GPU execution time prediction must account for factors like data transfer, block-structured execution, and starvation. We show that a comparatively simple, staged analytical model can accurately determine during execution when it is profitable to offload a loop nest. We evaluate our model on three heterogeneous platforms across 360 experiments with 12 loop-intensive Python benchmark programs. The results show small misprediction intervals and a mean slowdown of just 13.6% relative to the optimal (oracular) offload strategy.
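
    The flavour of such a staged decision can be sketched as follows; the constants and parameter names here are purely illustrative and are not ALPyNA's actual model.

        def should_offload(n_iters, bytes_moved, cpu_ns_per_iter,
                           gpu_ns_per_iter, pcie_gbps=12.0, launch_us=10.0):
            # Offload only if the predicted GPU time (transfer + launch + compute)
            # beats the predicted CPU time. All constants are illustrative.
            cpu_time = n_iters * cpu_ns_per_iter * 1e-9
            transfer = bytes_moved / (pcie_gbps * 1e9)
            gpu_time = launch_us * 1e-6 + transfer + n_iters * gpu_ns_per_iter * 1e-9
            return gpu_time < cpu_time

        # The iteration-space size and element types are only known at run time,
        # so the decision is taken just before the loop nest executes.
        print(should_offload(n_iters=10_000_000, bytes_moved=80_000_000,
                             cpu_ns_per_iter=25.0, gpu_ns_per_iter=0.5))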

    Automatic Data and Computation Mapping for Distributed-Memory Machines.

    Distributed memory parallel computers offer enormous computation power, scalability and flexibility. However, these machines are difficult to program, and this limits their widespread use. An important characteristic of these machines is the difference in access time between local and non-local memory; non-local memory accesses are much slower than local memory accesses. This is also a characteristic of shared memory machines, but to a lesser degree. Therefore it is essential that, as far as possible, the data a processor needs to access during the computation assigned to it resides in its local memory rather than in some other processor's memory. Several research projects have concluded that proper mapping of data is key to realizing the performance potential of distributed memory machines. Current language design efforts such as Fortran D and High Performance Fortran (HPF) are based on this. It is our thesis that for many practical codes, it is possible to derive good mappings through a combination of algorithms and systematic procedures. We view mapping as consisting of two phases, alignment followed by distribution. For the alignment phase we present three constraint-based methods: the first is based on a linear programming formulation of the problem; the second formulates the alignment problem as a constrained optimization problem using Lagrange multipliers; the third uses a heuristic to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so) in order to find a mapping. For the distribution phase, we have developed two methods that integrate the placement of computation (loop nests in our case) with the mapping of data. For one distributed dimension, our approach finds the best combination of data and computation mapping that results in low communication overhead; this is done by choosing a loop order that allows message vectorization. In the second method, we introduce the distribution preference graph, and the operations on this graph allow us to integrate loop restructuring transformations and data mapping. These techniques produce mappings that have been used in efficient hand-coded implementations of several benchmark codes.
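
    As a small illustration of the distribution phase (a generic sketch, not the dissertation's algorithm), the code below applies the owner-computes rule once an array dimension has been block-distributed: each processor executes only the iterations whose output elements it owns.

        def block_owner(i, n, p):
            # Owner of element i when n elements are block-distributed over p procs.
            block = (n + p - 1) // p
            return i // block

        def local_iterations(my_rank, n, p):
            # Owner-computes rule: a processor executes only the iterations
            # whose left-hand-side elements it owns.
            return [i for i in range(n) if block_owner(i, n, p) == my_rank]

        # With n = 10 elements block-distributed over p = 4 processors,
        # rank 1 owns (and therefore computes) indices 3..5.
        print(local_iterations(1, 10, 4))   # -> [3, 4, 5]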

    A specification-based design tool for artificial neural networks.

    Wong Wai. Thesis (M.Phil.)--Chinese University of Hong Kong, 1992. Includes bibliographical references (leaves 78-80).
    Contents: 1. Introduction (Specification Environment; Specification Analysis; Outline); 2. Survey (Concurrence Specification; Specification Analysis); 3. The Design Tool (Specification Environment; Specification Analysis); 4. BP-Net Specification (BP-Net Paradigm; Constant Declarations; Formal Neuron Specification; Configuration Specification; Control Neuron Specification); 5. Data Dependency Analysis (Graph Construction; Cycle Detection; Dependency Cycle Analysis; Symmetry in Graph Construction); 6. Attribute Analysis (Parameter Analysis; Constraint Checking; Complete Checking Procedure); 7. Conclusions (Limitations); Appendix: I. Form Syntax, II. Algorithms, III. Deadlock & Dependency Cycles, IV. Case Studies (BP-Net, Perceptron, Boltzmann Machine).
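
    The cycle-detection step at the heart of the data dependency analysis can be illustrated generically (this is a standard depth-first search, not the thesis's specific algorithm): parameters are vertices, dependencies are edges, and a back edge exposes a mutual dependency cycle.

        def find_cycle(deps):
            # deps maps each parameter to the parameters it depends on.
            # Returns one dependency cycle as a list of vertices, or None.
            WHITE, GREY, BLACK = 0, 1, 2
            color = {v: WHITE for v in deps}
            stack = []

            def dfs(v):
                color[v] = GREY
                stack.append(v)
                for w in deps.get(v, ()):
                    if color.get(w, WHITE) == GREY:     # back edge: cycle found
                        return stack[stack.index(w):] + [w]
                    if color.get(w, WHITE) == WHITE:
                        cycle = dfs(w)
                        if cycle:
                            return cycle
                stack.pop()
                color[v] = BLACK
                return None

            for v in deps:
                if color[v] == WHITE:
                    cycle = dfs(v)
                    if cycle:
                        return cycle
            return None

        # A parameter that depends, through intermediate terms, on its own value in
        # the same time step shows up as a mutual dependency cycle:
        print(find_cycle({"w": ["delta"], "delta": ["y"], "y": ["w"]}))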

    Combinatorial Design and Analysis of Optimal Multiple Bus Systems for Parallel Algorithms.

    This dissertation develops a formal and systematic methodology for designing optimal, synchronous multiple bus systems (MBSs) realizing given (classes of) parallel algorithms. Our approach utilizes graph and group theoretic concepts to develop the necessary model and procedural tools. By partitioning the vertex set of the graphical representation (CFG) of the algorithm, we extract a set of interconnection functions that represents the interprocessor communication requirement of the algorithm. We prove that the optimal partitioning problem is NP-hard. However, we show how to obtain polynomial time solutions by exploiting certain regularities present in many well-behaved parallel algorithms. The extracted set of interconnection functions is represented by an edge-colored, directed graph called the interconnection function graph (IFG). We show that the problem of constructing an optimal MBS to realize an IFG is NP-hard, and we identify important special cases where polynomial time solutions exist. In particular, we prove that polynomial time solutions exist when the IFG is vertex symmetric. This is the case of interest for the vast majority of important interconnection function sets, whether extracted from algorithms or corresponding to existing interconnection networks. We show that an IFG is vertex symmetric if and only if it is the Cayley color graph of a finite group Γ and its generating set Δ. Using this property, we present a particular scheme to construct a symmetric MBS M(Γ,Δ) with the minimum number of buses as well as the minimum number of interfaces realizing a vertex symmetric IFG. We demonstrate several advantages of the optimal MBS M(Γ,Δ) in terms of its symmetry, number of ports per processor, number of neighbors per processor, and diameter. We also investigate the fault tolerance and performance degradation of M(Γ,Δ) in the case of a single bus failure, single driver failure, single receiver failure, and single processor failure. Further, we address the problem of designing an optimal MBS realizing a class of algorithms when the number of buses and/or processors in the target MBS is specified. The optimality criteria are maximizing the speed and minimizing the number of interfaces.
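
    The Cayley color graph construction can be sketched in a few lines (illustrative only; the group, generating set, and function names are hypothetical examples, not taken from the dissertation).

        from collections import Counter
        from itertools import product

        def cayley_color_graph(elements, generators, op):
            # One directed edge (g, op(g, s)) of colour s for every group
            # element g and every generator s.
            return {(g, op(g, s), s) for g, s in product(elements, generators)}

        # Example: the cyclic group Z_8 under addition mod 8, generating set {1, 2}.
        n = 8
        edges = cayley_color_graph(range(n), (1, 2), lambda g, s: (g + s) % n)

        # Vertex symmetry: every vertex has exactly one outgoing edge per colour.
        out_degree = Counter(g for g, _, _ in edges)
        assert all(d == 2 for d in out_degree.values())

    Because every vertex has exactly one outgoing edge per generator color and the group acts transitively on itself, such a graph looks the same from every vertex, which is the vertex symmetry the optimal MBS construction relies on.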