169 research outputs found

    HPC Rankings Based on Real Applications

    Extended abstract: Performance benchmarks are used to stress-test the hardware and software of large-scale computing systems. The Standard Performance Evaluation Corporation (SPEC) has developed a benchmark suite, SPEC ACCEL, consisting of test codes representative of kernels in large applications. This project ranks the published ACCEL results according to different criteria. The goal is to prepare a ranking website for the work-in-progress, real-world SPEC HPG benchmark suite, HPC2021, which is scheduled for release in the 2020-2021 time frame.
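
    As an illustration of the kind of ranking the project performs, the following is a minimal Python sketch; the record fields and criteria are hypothetical stand-ins, not the actual SPEC result format or the site's code.

```python
# Minimal sketch (not the project's actual site code): rank published
# benchmark results by a chosen criterion. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Result:
    system: str        # submitting system/vendor
    score: float       # reported benchmark score (higher is better)
    energy_kwh: float  # hypothetical secondary criterion

def rank(results, key, reverse=True):
    """Return the results ordered by the given criterion."""
    return sorted(results, key=key, reverse=reverse)

if __name__ == "__main__":
    published = [
        Result("SystemA", 42.0, 3.1),
        Result("SystemB", 55.5, 4.8),
    ]
    for place, r in enumerate(rank(published, key=lambda r: r.score), 1):
        print(place, r.system, r.score)
```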

    Parallel programming environment for OpenMP

    We present our effort to provide a comprehensive parallel programming environment for the OpenMP parallel directive language. This environment includes a parallel programming methodology for the OpenMP programming model and a set of tools (Ursa Minor and InterPol) that support this methodology. Our toolset provides automated and interactive assistance to parallel programmers in the time-consuming tasks of the proposed methodology. The features provided by our tools include performance and program-structure visualization, interactive optimization, support for performance modeling, and performance advising for finding and correcting performance problems. The presented evaluation demonstrates that our environment offers significant support in general parallel tuning efforts and that the toolset facilitates many common tasks in OpenMP parallel programming in an efficient manner.
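
    The abstract mentions support for performance modeling; the sketch below shows a simple Amdahl's-law-style speedup estimate of the kind such modeling might build on. It is only an illustration, not the actual model used by Ursa Minor or InterPol.

```python
# Simple Amdahl's-law speedup estimate: only `parallel_fraction` of the
# runtime scales with the number of threads. Illustrative only; this is
# not the model implemented by the Ursa Minor / InterPol tools.
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)

if __name__ == "__main__":
    for t in (2, 4, 8, 16):
        print(t, round(amdahl_speedup(0.9, t), 2))
```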

    Compiler Optimization Orchestration for Peak Performance

    Although compile-time optimizations generally improve program performance, degradations caused by individual techniques are to be expected. Feedback-directed optimizations have recently begun to address this issue, by factoring runtime information into the decision process of which compiler optimization to apply where and when. While improvements for small sets of optimization techniques have been demonstrated, little empirical knowledge exists on the performance behavior of the large number of today's optimization techniques. This is especially true for the interaction of such techniques, which we have found to be of significant importance in navigating the search space of the best combination of techniques. The contribution of this paper is in (1) providing such empirical knowledge and (2) developing algorithms for efficiently navigating and pruning the search space. To this end, we evaluate the optimization techniques of GCC on both a Pentium IV machine and a SPARC II machine, by measuring the performance of the SPEC CPU2000 benchmarks under different compiler flags. We analyze the performance losses that result from individual optimizations. We then present three heuristic algorithms that search for the best combination of compiler techniques using measured runtime as feedback.
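
    One plausible form of such a feedback-driven search is an iterative flag-elimination heuristic, sketched below. This is an illustrative sketch under stated assumptions, not the paper's exact algorithms: `bench.c`, the wall-clock measurement, and the candidate flag list are stand-ins for a real SPEC CPU2000 setup.

```python
# Hedged sketch of one feedback-directed search heuristic (greedy flag
# elimination). Assumes `gcc` is installed and `bench.c` is a stand-in
# benchmark whose runtime is the feedback signal.
import subprocess
import time

CANDIDATE_FLAGS = ["-funroll-loops", "-finline-functions", "-ftree-vectorize"]

def measure(flags):
    """Compile bench.c with the given flags and return its wall-clock runtime."""
    subprocess.run(["gcc", "-O2", *flags, "bench.c", "-o", "bench"], check=True)
    start = time.perf_counter()
    subprocess.run(["./bench"], check=True)
    return time.perf_counter() - start

def eliminate(flags):
    """Greedily drop any flag whose removal makes the measured runtime faster."""
    best = measure(flags)
    for f in list(flags):
        trial = [x for x in flags if x != f]
        t = measure(trial)
        if t < best:
            flags, best = trial, t
    return flags, best

if __name__ == "__main__":
    print(eliminate(CANDIDATE_FLAGS))
```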

    Implementing Tomorrow’s Programming Languages

    Compilers are the critical translators that convert a human-readable program into the code understood by the machine. While this transformation is already sophisticated today, tomorrow's compilers face a tremendous challenge. There is a demand to provide languages that are much higher level than today's C, Fortran, or Java. On the other hand, tomorrow's machines are more complex than today's; they involve multiple cores and may span the planet via compute Grids. How can we expect compilers to provide efficient implementations? I will describe a number of related research efforts that try to tackle this problem. Composition builds a way towards higher-level programming languages. Automatic translation of shared-address-space models to distributed-memory architectures may lead to higher productivity than current message-passing paradigms. Advanced symbolic analysis techniques equip compilers with capabilities to reason about programs in abstract terms. Last but not least, through auto-tuning, compilers can make effective decisions even though there may be insufficient information at compile time.

    An Algorithm for Register-Synchronized Precomputation In Intelligent Memory Systems

    This paper presents a novel compiler algorithm for selecting program slices that prefetch load values concurrently with program execution. The algorithm is evaluated in the context of an intelligent memory system. The architecture consists of a main processor and a simple memory processor. The intelligent memory system pre-executes program slices and forwards the values of critical loads to the main processor ahead of their use. The compiler algorithm selects program slices for memory-processor execution and inserts synchronization instructions that synchronize the main and memory processors. Experimental results of the generated code on a cycle-accurate simulator show a speedup of up to 1.33 (1.13 on average) over an aggressively latency-optimized system running fully optimized code.
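
    Conceptually, a slice for a critical load can be found by walking def-use chains backwards from the load, so that only the instructions its address depends on are pre-executed. The sketch below illustrates that idea; it is a simplified illustration, not the paper's algorithm, and the instruction names and the `defs_of` map are hypothetical.

```python
# Conceptual backward-slice selection (not the paper's algorithm):
# follow def-use edges backwards from a critical load instruction.
def backward_slice(defs_of, critical_load):
    """defs_of maps an instruction to the instructions that define its inputs."""
    slice_set, work = set(), [critical_load]
    while work:
        inst = work.pop()
        if inst in slice_set:
            continue
        slice_set.add(inst)
        work.extend(defs_of.get(inst, ()))
    return slice_set

# Hypothetical def-use chains: the load's address comes from a short
# address-computation chain, which is exactly what the slice captures.
defs = {"load r3,[r2]": ["add r2,r1,#8"], "add r2,r1,#8": ["mov r1,base"]}
print(backward_slice(defs, "load r3,[r2]"))
```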

    Data Forwarding Through In-Memory Precomputation Threads

    In modern architectures, memory access latency is an increasingly performance-limiting factor. To reduce this latency, we propose the concepts and implementation of a new technique that uses an in-memory processor to precompute future, critical load addresses and forward the computed values to the main processor. The technique is called IMPT, for In-Memory Precomputation-based forwarding Threads. IMPT combines the advantages of precomputation-based techniques with the low memory access latency of processing-in-memory. To evaluate IMPT, we use a cycle-accurate simulation of an aggressive out-of-order processor with accurate simulation of bus and memory contention. The results show a performance gain of up to 1.47 (1.21 on average) over an aggressive superscalar processor. The average load access latency decreases by up to 55% (32% on average).
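
    The sketch below is only a software analogy of the forwarding idea: a helper thread runs ahead, computes values the main loop will need, and forwards them through a queue. IMPT itself is a hardware/memory-side mechanism and is not implemented this way.

```python
# Software analogy of precomputation-based forwarding (not the IMPT hardware):
# a helper thread precomputes values and forwards them ahead of their use.
import queue
import threading

def expensive_load(i):
    return i * i  # stand-in for a long-latency memory access

def precompute(q, n):
    for i in range(n):
        q.put(expensive_load(i))  # forward the value before the consumer needs it

def main_loop(q, n):
    total = 0
    for _ in range(n):
        total += q.get()  # ideally the value is already waiting
    return total

if __name__ == "__main__":
    q = queue.Queue()
    threading.Thread(target=precompute, args=(q, 100), daemon=True).start()
    print(main_loop(q, 100))
```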

    Parallel Architectures and How to Program Them

    This article discusses issues in designing and exploiting parallel architectures as seen from a research group that has been focusing on the development of compilers for such machines. For scientific and engineering applications, Fortran languages have been the primary means of expressing algorithms. Current programmers of high-performance parallel machines have to make a choice between sequential, portable programs and efficient, parallel codes that come at a significantly higher development cost. The much needed compiler technology for parallelizing sequential programs automatically is currently being developed, but has not yet delivered tools that consistently yield efficient code. Programmers have to specify parallelism explicitly to make up for this shortcoming. Currently, users are faced with two rather different parallel programming models, both reflecting the underlying machine organization: shared-address-space machines provide a name space for data that needs to be seen by multiple parallel tasks; message-passing machines provide separate address spaces, and data needs to be exchanged by explicitly sending and receiving messages. The question of which model will dominate in the long term is one of the issues currently being debated.
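
    The two models can be contrasted with a small example. The sketch below uses Python's standard library (rather than the Fortran and message-passing settings the article has in mind) purely to illustrate a shared-address-space update versus explicit message exchange between separate address spaces.

```python
# Illustration of the two programming models the article contrasts:
# a shared-address-space update vs. explicit message passing.
from multiprocessing import Process, Value, Pipe

def shared_worker(counter):
    with counter.get_lock():
        counter.value += 1          # all tasks see one shared name space

def message_worker(conn):
    conn.send(1)                    # data must be sent explicitly
    conn.close()

if __name__ == "__main__":
    counter = Value("i", 0)
    p = Process(target=shared_worker, args=(counter,)); p.start(); p.join()

    parent, child = Pipe()
    q = Process(target=message_worker, args=(child,)); q.start()
    print(counter.value, parent.recv())   # prints: 1 1
    q.join()
```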