A Study of Energy and Locality Effects using Space-filling Curves
The cost of energy is becoming an increasingly important driver for the
operating cost of HPC systems, adding yet another facet to the challenge of
producing efficient code. In this paper, we investigate the energy implications
of trading computation for locality using Hilbert and Morton space-filling
curves with dense matrix-matrix multiplication. The advantage of these curves
is that they exhibit an inherent tiling effect without requiring specific
architecture tuning. By accessing the matrices in the order determined by the
space-filling curves, we can trade computation for locality. The index
computation overhead of the Morton curve is found to be balanced against its
locality and energy efficiency, while the overhead of the Hilbert curve
outweighs its improvements on our test system.
Comment: Proceedings of the 2014 IEEE International Parallel & Distributed
Processing Symposium Workshops (IPDPSW)
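To illustrate the index computation that the abstract above trades against locality, here is a minimal sketch of a Morton (Z-order) index obtained by bit interleaving. The function names `morton_index` and `morton_order` and the `bits` parameter are hypothetical and chosen for illustration; this is not the paper's implementation, and the Hilbert curve (which needs a more involved computation) is not shown.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of row and col into a Morton index."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)      # column bits go to even positions
        index |= ((row >> b) & 1) << (2 * b + 1)  # row bits go to odd positions
    return index


def morton_order(n: int):
    """Return the (row, col) pairs of an n x n matrix sorted by Morton index."""
    cells = ((r, c) for r in range(n) for c in range(n))
    return sorted(cells, key=lambda rc: morton_index(rc[0], rc[1]))


if __name__ == "__main__":
    # Visiting a 4 x 4 matrix in this order walks 2 x 2 blocks one after another,
    # which is the implicit tiling effect mentioned in the abstract.
    print(morton_order(4))
```

The per-element loop over bits is the kind of overhead the abstract refers to; practical implementations often replace it with lookup tables or bit-manipulation tricks.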
Efficient dot product over word-size finite fields
We want to achieve efficiency for the exact computation of the dot product of
two vectors over word-size finite fields. We therefore compare the practical
behaviors of a wide range of implementation techniques using different
representations. The techniques used include floating point representations,
discrete logarithms, tabulations, Montgomery reduction, delayed modulus
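As an illustration of one of the listed techniques, the following is a minimal sketch of a dot product modulo a prime p with delayed modulus: products are accumulated without reduction, and the modulus is applied only once per block. It assumes inputs already reduced to the range [0, p) and models an unsigned 64-bit accumulator through the hypothetical constant `ACC_LIMIT`; the function name `dot_delayed_mod` is also an assumption, and the code is not taken from the paper.

```python
ACC_LIMIT = 2**64 - 1  # assumed capacity of an unsigned 64-bit accumulator


def dot_delayed_mod(a, b, p):
    """Dot product of a and b modulo p, reducing once per block of products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    max_product = (p - 1) ** 2
    if max_product == 0:
        return 0  # p == 1: every value is congruent to 0
    # Number of products that fit on top of an already reduced accumulator (< p).
    block = (ACC_LIMIT - (p - 1)) // max_product
    if block < 1:
        raise ValueError("p is too large for the assumed accumulator width")
    acc = 0
    for start in range(0, len(a), block):
        for x, y in zip(a[start:start + block], b[start:start + block]):
            acc += x * y  # no reduction here: stays within ACC_LIMIT by construction
        acc %= p          # one modulus per block instead of one per product
    return acc


if __name__ == "__main__":
    print(dot_delayed_mod([1, 2, 3], [4, 5, 6], 7))
```

Python integers do not overflow, so the block size here only mirrors the bound a fixed-width implementation in C or similar would have to respect.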
Vectorized register tiling
In recent years, much effort has gone into making commercial compilers (icc, gcc) exploit the SIMD capabilities and the memory hierarchy of current processors efficiently. However, the few compilers that can exploit these features automatically achieve unsatisfactory results in most cases. Programmers therefore often need to apply the optimizations to the source code by hand, write the code in assembly, or use compiler built-in functions (such as intrinsics) to achieve high performance. In this work, we present source-to-source transformations that help commercial compilers exploit the memory hierarchy and generate efficient SIMD code. Our experimental results show that our solutions achieve performance as good as that of hand-optimized, vendor-supplied numerical libraries (written in assembly).
Peer Reviewed. Preprint
Psort: automated code tuning
This thesis describes the design and implementation of an automated code tuner for psort, a fast sorting library for large datasets. Our work, motivated by the need to guarantee high performance while keeping the cost to the end user low, provides a reusable and portable framework that can be easily extended to automatically tune virtually every portion of the source code, including code that has not yet been written. Experiments show that our system produces code that is significantly faster than the original code, suggesting that psort should include it among its tools.
Learning from the Success of MPI
The Message Passing Interface (MPI) has been extremely successful as a
portable way to program high-performance parallel computers. This success has
occurred in spite of the view of many that message passing is difficult and
that other approaches, including automatic parallelization and directive-based
parallelism, are easier to use. This paper argues that MPI has succeeded
because it addresses all of the important issues in providing a parallel
programming model.
Comment: 12 pages, 1 figure