
    A Study of Energy and Locality Effects using Space-filling Curves

    The cost of energy is becoming an increasingly important driver for the operating cost of HPC systems, adding yet another facet to the challenge of producing efficient code. In this paper, we investigate the energy implications of trading computation for locality using Hilbert and Morton space-filling curves with dense matrix-matrix multiplication. The advantage of these curves is that they exhibit an inherent tiling effect without requiring specific architecture tuning. By accessing the matrices in the order determined by the space-filling curves, we can trade computation for locality. The index computation overhead of the Morton curve is found to be balanced against its locality and energy efficiency, while the overhead of the Hilbert curve outweighs its improvements on our test system. Comment: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)
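
    The tiling effect described above comes from the way a Morton (Z-order) curve interleaves the bits of the row and column indices, so that the elements of a small block end up close together in the traversal. The following is a minimal C++ sketch of bit-interleaved Morton indexing, not the paper's implementation; the helper names (part1by1, morton_index) are chosen here for illustration.

    #include <cstdint>
    #include <cstdio>

    // Spread the low 16 bits of x apart, inserting a zero between each bit
    // (helper for Morton encoding).
    static uint32_t part1by1(uint32_t x) {
        x &= 0x0000ffff;
        x = (x | (x << 8)) & 0x00ff00ff;
        x = (x | (x << 4)) & 0x0f0f0f0f;
        x = (x | (x << 2)) & 0x33333333;
        x = (x | (x << 1)) & 0x55555555;
        return x;
    }

    // Morton (Z-order) index of element (row, col): bits of row and col interleaved.
    // Nearby (row, col) pairs map to nearby linear indices, giving an implicit tiling.
    static uint32_t morton_index(uint32_t row, uint32_t col) {
        return (part1by1(row) << 1) | part1by1(col);
    }

    int main() {
        // Print the Z-order index of each element of a 4x4 tile;
        // note the nested 2x2 block structure of the resulting ordering.
        for (uint32_t r = 0; r < 4; ++r) {
            for (uint32_t c = 0; c < 4; ++c)
                printf("%2u ", morton_index(r, c));
            printf("\n");
        }
        return 0;
    }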

    Efficient dot product over word-size finite fields

    We want to achieve efficiency for the exact computation of the dot product of two vectors over word-size finite fields. We therefore compare the practical behaviors of a wide range of implementation techniques using different representations. The techniques used include floating point representations, discrete logarithms, tabulations, Montgomery reduction, delayed modulus
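
    As an illustration of one of the techniques listed above, the sketch below shows a delayed-modulus dot product over Z/pZ: 64-bit products of word-size elements are accumulated, and the reduction modulo p is postponed until the accumulator could overflow. This is a generic sketch, not the authors' library code; the function name dot_mod_p and the choice of prime p are assumptions made for the example.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Dot product over Z/pZ with a delayed modulus: each product of two elements
    // below p fits in 64 bits, so we only reduce the accumulator when the next
    // addition could overflow, and once more at the end.
    uint32_t dot_mod_p(const std::vector<uint32_t>& a,
                       const std::vector<uint32_t>& b, uint32_t p) {
        const uint64_t bound = UINT64_MAX - (uint64_t)(p - 1) * (p - 1);  // overflow headroom
        uint64_t acc = 0;
        for (size_t i = 0; i < a.size(); ++i) {
            uint64_t prod = (uint64_t)a[i] * b[i];   // exact product, at most (p-1)^2
            if (acc > bound) acc %= p;               // delayed reduction, only when needed
            acc += prod;
        }
        return (uint32_t)(acc % p);                  // single final reduction
    }

    int main() {
        uint32_t p = 2147483647;  // a word-size prime (2^31 - 1), an illustrative choice
        std::vector<uint32_t> a = {1000000000, 123456789, 7};
        std::vector<uint32_t> b = {2000000000, 987654321, 11};
        printf("%u\n", dot_mod_p(a, b, p));
        return 0;
    }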

    Auto-tuning compiler options for HPC


    Vectorized register tiling

    In recent years, commercial compilers (icc, gcc) have invested considerable effort in exploiting the SIMD capabilities and the memory hierarchy of current processors. However, the few compilers that can automatically exploit these features achieve unsatisfactory results in most cases. Programmers therefore often need to apply optimizations to the source code by hand, write code manually in assembly, or use compiler built-in functions (such as intrinsics) to achieve high performance. In this work, we present source-to-source transformations that help commercial compilers exploit the memory hierarchy and generate efficient SIMD code. Results from our experiments show that our solutions achieve performance comparable to hand-optimized, vendor-supplied numerical libraries (written in assembly).
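
    To make the idea of vectorized register tiling concrete, the sketch below shows a hand-written 4x4 micro-kernel for single-precision C += A*B using SSE intrinsics, so the four rows of the output tile stay in vector registers across the inner loop. It is a generic illustration of the technique, not the transformations proposed in the work; the kernel name and the row-major layout are assumptions.

    #include <immintrin.h>
    #include <cstdio>

    // 4x4 register-tiled micro-kernel for C += A * B (row-major, single precision).
    // The four accumulators c0..c3 hold one 4-wide row of the C tile each and stay
    // in SSE registers for the whole loop over the shared dimension k.
    void kernel_4x4(const float* A, const float* B, float* C, int n, int k) {
        __m128 c0 = _mm_loadu_ps(C + 0 * n);
        __m128 c1 = _mm_loadu_ps(C + 1 * n);
        __m128 c2 = _mm_loadu_ps(C + 2 * n);
        __m128 c3 = _mm_loadu_ps(C + 3 * n);
        for (int p = 0; p < k; ++p) {
            __m128 b = _mm_loadu_ps(B + p * n);          // one 4-wide row of the B panel
            c0 = _mm_add_ps(c0, _mm_mul_ps(_mm_set1_ps(A[0 * k + p]), b));
            c1 = _mm_add_ps(c1, _mm_mul_ps(_mm_set1_ps(A[1 * k + p]), b));
            c2 = _mm_add_ps(c2, _mm_mul_ps(_mm_set1_ps(A[2 * k + p]), b));
            c3 = _mm_add_ps(c3, _mm_mul_ps(_mm_set1_ps(A[3 * k + p]), b));
        }
        _mm_storeu_ps(C + 0 * n, c0);
        _mm_storeu_ps(C + 1 * n, c1);
        _mm_storeu_ps(C + 2 * n, c2);
        _mm_storeu_ps(C + 3 * n, c3);
    }

    int main() {
        const int n = 4, k = 4;
        float A[16], B[16], C[16] = {0};
        for (int i = 0; i < 16; ++i) { A[i] = 1.0f; B[i] = 2.0f; }
        kernel_4x4(A, B, C, n, k);
        printf("%.1f\n", C[0]);   // 4 * (1 * 2) = 8.0
        return 0;
    }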

    Psort: automated code tuning

    This thesis describes the design and implementation of an automated code tuner for psort, a fast sorting library for large datasets. Our work, motivated by the need to guarantee high performance while keeping the cost to the end user low, provides a reusable and portable framework that can easily be extended to automatically tune virtually every portion of the source code, including code that has not yet been written. Experiments show that our system produces code that is significantly faster than the original code, suggesting that psort should include it among its tools.
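
    A minimal sketch of the empirical-tuning idea follows: enumerate a small search space for a tuning parameter (here, the chunk size of a chunked merge sort), time each variant on representative data, and keep the fastest. This is a generic auto-tuning loop in C++, not psort's actual framework; the helper chunked_sort and the candidate chunk sizes are illustrative assumptions.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Candidate variant: sort fixed-size chunks first, then merge them bottom-up.
    // The chunk size is the tuning knob whose best value is found empirically.
    static void chunked_sort(std::vector<int>& v, size_t chunk) {
        for (size_t i = 0; i < v.size(); i += chunk)
            std::sort(v.begin() + i, v.begin() + std::min(i + chunk, v.size()));
        for (size_t width = chunk; width < v.size(); width *= 2)
            for (size_t i = 0; i + width < v.size(); i += 2 * width)
                std::inplace_merge(v.begin() + i, v.begin() + i + width,
                                   v.begin() + std::min(i + 2 * width, v.size()));
    }

    int main() {
        std::vector<int> data(1 << 20);
        for (size_t i = 0; i < data.size(); ++i)
            data[i] = (int)((i * 2654435761u) & 0xffffff);   // pseudo-random input

        size_t best_chunk = 0;
        double best_ms = 1e300;
        for (size_t chunk : {1u << 10, 1u << 12, 1u << 14, 1u << 16}) {  // search space
            std::vector<int> v = data;                                    // fresh copy per trial
            auto t0 = std::chrono::steady_clock::now();
            chunked_sort(v, chunk);
            auto t1 = std::chrono::steady_clock::now();
            double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
            if (ms < best_ms) { best_ms = ms; best_chunk = chunk; }      // keep the fastest
        }
        printf("best chunk size: %zu (%.2f ms)\n", best_chunk, best_ms);
        return 0;
    }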

    Learning from the Success of MPI

    The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model. Comment: 12 pages, 1 figure