
    Prefetching on the Cray-T3E: a model and its evaluation

    In many parallel applications, network latency causes a dramatic loss in processor utilization. This paper examines software controlled access pipelining (SCAP) as a technique for hiding network latency. An analytic model of SCAP describes its basic operation and the achievable performance improvements. Results are quantified with benchmarks on the Cray-T3E. The benchmarks used are Jacobi iteration, parts of the Livermore Loop kernels, and others representing six different parallel algorithm classes. These were parallelized and optimized by hand to show the performance tradeoffs of several pipelining techniques. Our results show that SCAP on the Cray-T3E improves performance over a blocking execution by a factor of 2.1 to 38. It also achieves a speed-up over HPF ranging from at least 12% to a factor of 3.1, depending on the algorithm class.
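
    The T3E-specific prefetch machinery behind SCAP is not shown in the abstract, but the pipelining pattern itself is easy to sketch. The C fragment below is a hedged illustration only: it uses portable MPI-3 one-sided gets (MPI_Rget) instead of the T3E hardware, and the block size, neighbor pattern, and reduction are assumptions chosen for the example, not details of the paper. The point is the software-controlled pipeline: the get for block b+1 is issued before waiting on block b, so the network transfer overlaps the computation.

        /* Hedged sketch of software-controlled access pipelining (SCAP-style),
         * written with MPI-3 one-sided gets as a portable stand-in for the
         * Cray-T3E prefetch hardware. BLOCK, NBLOCKS and the neighbor choice
         * are illustrative assumptions. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define BLOCK   256
        #define NBLOCKS 64

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* Remote data exposed through an RMA window. */
            double *data = malloc(NBLOCKS * BLOCK * sizeof(double));
            for (int i = 0; i < NBLOCKS * BLOCK; i++) data[i] = rank + i;

            MPI_Win win;
            MPI_Win_create(data, NBLOCKS * BLOCK * sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
            MPI_Win_lock_all(0, win);

            int partner = (rank + 1) % size;   /* assumed communication partner */
            double buf[2][BLOCK];              /* double buffer for the pipeline */
            MPI_Request req[2];
            double sum = 0.0;

            /* Prime the pipeline: start fetching block 0. */
            MPI_Rget(buf[0], BLOCK, MPI_DOUBLE, partner, 0, BLOCK, MPI_DOUBLE,
                     win, &req[0]);

            for (int b = 0; b < NBLOCKS; b++) {
                /* Issue the get for block b+1 before waiting on block b, so the
                 * transfer overlaps the computation below. */
                if (b + 1 < NBLOCKS)
                    MPI_Rget(buf[(b + 1) & 1], BLOCK, MPI_DOUBLE, partner,
                             (MPI_Aint)(b + 1) * BLOCK, BLOCK, MPI_DOUBLE,
                             win, &req[(b + 1) & 1]);
                MPI_Wait(&req[b & 1], MPI_STATUS_IGNORE);  /* complete block b only */
                for (int i = 0; i < BLOCK; i++) sum += buf[b & 1][i];
            }

            MPI_Win_unlock_all(win);
            MPI_Win_free(&win);
            free(data);
            if (rank == 0) printf("sum = %f\n", sum);
            MPI_Finalize();
            return 0;
        }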

    Hardware-only stream prediction + cache prefetching + dynamic access ordering

    The speed gap between processors and the memory system is becoming the performance bottleneck for many applications, and computations with strided access patterns are among those that suffer most. The vectors used in such applications lack temporal and often spatial locality, and are usually too large to cache. In spite of their poor cache behavior, these access patterns have the advantage of being predictable, which can be exploited to improve the efficiency of the memory subsystem. Prefetching has been studied in its various forms as a promising technique to relieve the memory system bottleneck, as has dynamic memory scheduling. This study builds on these results, combining a stride-based reference prediction table, a mechanism that prefetches L2 cache lines, and a memory controller that dynamically schedules accesses to a Direct Rambus memory subsystem. We find that such a system delivers impressive speedups for scientific applications with regular access patterns (reducing execution time by almost a factor of two) without negatively affecting the performance of non-streaming programs.
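
    The combination described here lives in hardware, but the core table is simple enough to model. The C fragment below is a hedged software model of a stride-based reference prediction table: the table size, the hash on the load PC, and the one-confirmation confidence rule are assumptions for illustration, and the GCC/Clang prefetch hint stands in for the paper's L2 prefetch engine.

        /* Hedged software model of a stride-based reference prediction table.
         * A real implementation sits in hardware next to the cache; here the
         * GCC/Clang __builtin_prefetch hint stands in for the prefetch engine. */
        #include <stdint.h>
        #include <stdio.h>

        #define RPT_ENTRIES 256

        typedef struct {
            uintptr_t pc;        /* tag: address of the load instruction */
            uintptr_t last_addr; /* last data address seen for this load */
            intptr_t  stride;    /* current stride guess */
            int       confident; /* set once the same stride repeats */
        } rpt_entry;

        static rpt_entry rpt[RPT_ENTRIES];

        /* Record one access by the load at `pc` to `addr`; prefetch ahead
         * when the stride prediction holds. */
        static void rpt_access(uintptr_t pc, uintptr_t addr)
        {
            rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
            if (e->pc != pc) {                     /* new load: allocate entry */
                e->pc = pc; e->last_addr = addr; e->stride = 0; e->confident = 0;
                return;
            }
            intptr_t stride = (intptr_t)(addr - e->last_addr);
            if (stride != 0 && stride == e->stride) {
                e->confident = 1;                  /* stride confirmed: prefetch */
                __builtin_prefetch((const void *)(addr + stride), 0, 1);
            } else {
                e->confident = 0;
                e->stride = stride;                /* retrain on the new stride */
            }
            e->last_addr = addr;
        }

        int main(void)
        {
            static double v[1 << 16];
            double s = 0.0;
            for (int i = 0; i < (1 << 16); i += 8) {     /* strided sweep */
                /* A fixed tag stands in for the PC of the load below. */
                rpt_access((uintptr_t)&rpt_access, (uintptr_t)&v[i]);
                s += v[i];
            }
            printf("%f\n", s);
            return 0;
        }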

    Latenzzeitverbergung in datenparallelen Sprachen = [Latency Hiding in Data-Parallel Languages]

    The unfavorable ratio of communication to computation performance on almost all parallel computers, which manifests itself in communication latencies of several hundred to a thousand processor clock cycles, in many cases prevents the efficient execution of communication-intensive, fine-grained data-parallel programs. To address this problem, this thesis investigates latency-hiding techniques that mask the network's communication time with prefetch operations. The proposed approach, VSCAP (Software Controlled Access Pipelining with Vector commands), extends existing techniques with vector commands and can hide the incurred latencies almost completely for a large number of applications. My contributions are: the modeling of VSCAP, an extension of SCAP with vector commands; the design of concepts by which communication requests in data-parallel programs can be transformed into the data pipelines of the VSCAP scheme; and the implementation of these concepts and their integration into the prototype compiler KarHPFn. The latency-hiding performance of VSCAP was investigated through modeling and runtime tests of 25 programs, among them 3 complete applications. The results are: a demonstration of the practical applicability of VSCAP (and, as a special case, SCAP) on a real machine; a calculation of the degree of latency hiding achieved by VSCAP, with the model confirmed by automatically generated programs; confirmation, by modeling and measurement, that VSCAP is faster than SCAP by a factor equal to the vector length L; the first compiler for parallel architectures with a shared address space that uses only prefetch operations for communication; a demonstration, using HPF as the example, that data-parallel applications can be translated automatically, transparently to the programmer, and efficiently into programs that use the VSCAP scheme for communication; performance of KarHPFn-generated VSCAP comparable to that of the highly optimized communication library on the Cray T3E, and for dynamic communication patterns even a more than 6-fold runtime gain for VSCAP; and a 3-fold to more than 5-fold runtime gain of KarHPFn-generated VSCAP over Portland Group HPF in tests of three applications (Veltran, FIRE and PDE1) on up to 128 processors with identical HPF sources, and for programs with heavy communication even more than a factor of 15.
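
    The claimed factor-L speedup of VSCAP over SCAP can be made plausible with a back-of-envelope model (a sketch with assumed symbols, not the thesis's own formulas): for N remote elements, per-element issue overhead o for a scalar prefetch, per-element compute time c, and latency fully hidden by the pipeline, a vector command of length L amortizes the issue overhead over L elements,

        T_{\mathrm{SCAP}} \approx N\,(o + c), \qquad
        T_{\mathrm{VSCAP}} \approx N\,(o/L + c), \qquad
        \frac{T_{\mathrm{SCAP}}}{T_{\mathrm{VSCAP}}} \;\to\; L \quad \text{when } o \gg c.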

    Porting the Sisal functional language to distributed-memory multiprocessors

    Parallel computing has become increasingly ubiquitous in recent years, and the sizes of the application problems used to solve real-world problems continue to grow. Distributed-memory multiprocessors have been regarded as a viable, scalable, and economical architecture for building large-scale parallel machines. While these parallel machines provide raw computational capability, programming such large-scale machines is often very difficult due to many practical issues, including parallelization, data distribution, workload distribution, and remote memory latency. This thesis proposes to address the programmability and performance issues of distributed-memory machines using the Sisal functional language. Programs written in Sisal are automatically parallelized, scheduled, and run on distributed-memory multiprocessors with no programmer intervention. Specifically, the proposed approach consists of the following steps. Given a program written in Sisal, the front-end Sisal compiler generates a directed acyclic graph (DAG) to expose the parallelism in the program. The DAG is partitioned and scheduled based on loop parallelism. The scheduled DAG is then translated into C programs with machine-specific parallel constructs. The parallel C programs are finally compiled by the target machine's compilers to generate executables. A distributed-memory parallel machine, the 80-processor ETL EM-X, was chosen for the experiments, and the entire procedure has been implemented on the EM-X multiprocessor. Four problems were selected for the experiments: bitonic sorting, search, dot product, and Fast Fourier Transform. Preliminary execution results indicate that automatic parallelization of Sisal programs based on loop parallelism is effective: the speedup for these four problems ranges from 17 to 60 on 64 processors of the EM-X. The results further indicate that programming distributed-memory multiprocessors in a functional language indeed frees programmers from low-level programming details while allowing them to focus on algorithmic performance improvement.
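
    The machine-specific parallel constructs emitted for the EM-X are not shown in the abstract, so the C fragment below is only a hedged illustration of the shape of loop-parallel generated code, using POSIX threads and the dot-product benchmark; the thread count stands in for the partition produced by the scheduler and is an assumption of the example.

        /* Hedged sketch: a loop-parallel dot product in portable C with POSIX
         * threads, illustrating the kind of partitioned loop a Sisal back end
         * might emit. The EM-X constructs and runtime are not reproduced here. */
        #include <pthread.h>
        #include <stdio.h>

        #define N        (1 << 20)
        #define NTHREADS 8             /* stand-in for the scheduled partition count */

        static double x[N], y[N];
        static double partial[NTHREADS];

        typedef struct { int id; } task_arg;

        static void *dot_chunk(void *p)
        {
            int id = ((task_arg *)p)->id;
            size_t lo = (size_t)id * N / NTHREADS;
            size_t hi = (size_t)(id + 1) * N / NTHREADS;
            double s = 0.0;
            for (size_t i = lo; i < hi; i++) s += x[i] * y[i];
            partial[id] = s;           /* reduction finished by the main thread */
            return NULL;
        }

        int main(void)
        {
            for (size_t i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

            pthread_t tid[NTHREADS];
            task_arg arg[NTHREADS];
            for (int t = 0; t < NTHREADS; t++) {
                arg[t].id = t;
                pthread_create(&tid[t], NULL, dot_chunk, &arg[t]);
            }
            double dot = 0.0;
            for (int t = 0; t < NTHREADS; t++) {
                pthread_join(tid[t], NULL);
                dot += partial[t];
            }
            printf("dot = %f\n", dot);
            return 0;
        }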

    Scalable Parallel Computers for Real-Time Signal Processing

    We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations on different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies, including high-performance CPU chips and system interconnects, distributed memory architectures, and various latency-hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. We then compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs) to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Cray T3D at the Cray Eagan Center, the Cray T3E, and the ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introduced.
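
    As a reference point for the analytically defined scalability attributes mentioned above, the usual starting definitions are the following (a sketch of the standard textbook forms, not necessarily the exact attributes defined in the paper), with T(n) the execution time on n processors and \alpha the sequential fraction of the workload:

        S(n) = \frac{T(1)}{T(n)}, \qquad
        E(n) = \frac{S(n)}{n}, \qquad
        S_{\mathrm{Amdahl}}(n) = \frac{1}{\alpha + (1 - \alpha)/n}.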

    Application of Pfortran and Co-Array Fortran in the parallelization of the GROMOS96 molecular dynamics module

    After at least a decade of parallel tool development, parallelization of scientific applications remains a significant undertaking. Typically, parallelization is a specialized activity supported only partially by the programming tool set, with the programmer handling parallel issues in addition to sequential ones. The details of concern range from algorithm design down to low-level data movement. The aim of parallel programming tools is to automate the latter without sacrificing performance and portability, allowing the programmer to focus on algorithm specification and development. We present our use of two similar parallelization tools, Pfortran and Cray's Co-Array Fortran, in the parallelization of the GROMOS96 molecular dynamics module. Our parallelization started from the GROMOS96 distribution's shared-memory implementation of the replicated algorithm, but used little of that existing parallel structure; consequently, it was close to starting from the sequential version. We found the intuitive extensions offered by Pfortran and Co-Array Fortran helpful in the rapid parallelization of the project. We present performance figures for both the Pfortran and Co-Array Fortran parallelizations, showing linear speedup within the range expected for these parallelization methods.
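
    The parallelizations themselves are in Pfortran and Co-Array Fortran and are not reproduced here; the C + MPI fragment below is only a hedged illustration of the replicated algorithm the work starts from: every process holds all coordinates, computes forces for its share of atom pairs, and a global sum combines the force arrays. The pair loop, the toy force law, and the round-robin work split are assumptions of the example, not GROMOS96 code.

        /* Hedged sketch of a replicated-data molecular dynamics force step in
         * C with MPI, illustrating the pattern only; the paper's code uses
         * Pfortran and Co-Array Fortran, not MPI. */
        #include <mpi.h>
        #include <stdio.h>

        #define NATOMS 512

        static double pos[NATOMS][3];     /* replicated coordinates */
        static double frc[NATOMS][3];     /* per-rank partial forces */

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            for (int i = 0; i < NATOMS; i++)
                for (int d = 0; d < 3; d++) { pos[i][d] = i + 0.1 * d; frc[i][d] = 0.0; }

            /* Each rank handles a round-robin share of the outer pair loop. */
            for (int i = rank; i < NATOMS; i += size) {
                for (int j = i + 1; j < NATOMS; j++) {
                    for (int d = 0; d < 3; d++) {
                        double r = pos[j][d] - pos[i][d];
                        double f = 1e-3 * r;          /* toy pair force, not GROMOS */
                        frc[i][d] += f;
                        frc[j][d] -= f;
                    }
                }
            }

            /* Replicated algorithm: sum the partial force arrays on every rank. */
            MPI_Allreduce(MPI_IN_PLACE, frc, NATOMS * 3, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);

            if (rank == 0)
                printf("f[0] = (%g, %g, %g)\n", frc[0][0], frc[0][1], frc[0][2]);
            MPI_Finalize();
            return 0;
        }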

    Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers

    The internal representation of numerical data and the speed of their manipulation to generate the desired result, through efficient utilisation of the central processing unit, memory, and communication links, are essential aspects of all high-performance scientific computation. Machine parameters, in particular, reveal the accuracy and error bounds of a computation, which are required for performance tuning of codes. This paper reports the diagnosis of machine parameters, measurements of the computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed-memory computers. The hierarchical memory structure is illustrated with block-copying and loop-unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache- and register-blocking techniques result in their optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces the cache-inefficiency loss, which is known to be proportional to the number of processors. Among the Linux clusters ANUP16, HPC22, and HPC64, measurements of intrinsic parameters and an application benchmark run of a multi-block Euler code show that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers, with the added advantage of speed and a high degree of parallelism.
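
    As an illustration of the cache- and register-blocking idea applied to vector-matrix operations, the C fragment below walks the matrix in blocks sized to stay cache-resident and keeps the running sum for each row in a register across the inner loop; the block sizes are illustrative guesses, not the tuned values from the paper.

        /* Hedged sketch of cache and register blocking for y = A * x.
         * BI and BJ are illustrative block sizes (and divide N exactly). */
        #include <stdio.h>

        #define N  1024
        #define BI 64          /* row block: keeps a slice of y hot */
        #define BJ 256         /* column block: keeps a slice of x in cache */

        static double a[N][N], x[N], y[N];

        int main(void)
        {
            for (int i = 0; i < N; i++) {
                x[i] = 1.0; y[i] = 0.0;
                for (int j = 0; j < N; j++) a[i][j] = 0.001 * (i + j);
            }

            /* Blocked matrix-vector product. */
            for (int ii = 0; ii < N; ii += BI)
                for (int jj = 0; jj < N; jj += BJ)
                    for (int i = ii; i < ii + BI; i++) {
                        double s = y[i];               /* register accumulator */
                        for (int j = jj; j < jj + BJ; j++)
                            s += a[i][j] * x[j];
                        y[i] = s;
                    }

            printf("y[0] = %f\n", y[0]);
            return 0;
        }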

    Cost-Effective Clustering

    Small Beowulf clusters can effectively serve as personal or group supercomputers. In such an environment, a cluster can be optimally designed for a specific problem (or a small set of codes). We discuss how theoretical analysis of the code and benchmarking on similar hardware lead to optimal systems. Comment: 7 pages, 2 figures (one in color). Color version of the paper to be published as part of the proceedings of CCP2000 (Brisbane) in a special issue of Computer Physics Communications.