Cache-aware Performance Modeling and Prediction for Dense Linear Algebra
Countless applications cast their computational core in terms of dense linear
algebra operations. These operations can usually be implemented by combining
the routines offered by standard linear algebra libraries such as BLAS and
LAPACK, and typically each operation can be obtained in many alternative ways.
Interestingly, identifying the fastest implementation -- without executing it
-- is a challenging task even for experts. An equally challenging task is that
of tuning each routine to performance-optimal configurations. Indeed, the
problem is so difficult that even the default values provided by the libraries
are often considerably suboptimal; as a solution, normally one has to resort to
executing and timing the routines, driven by some form of parameter search. In
this paper, we discuss a methodology to solve both problems: identifying the
best performing algorithm within a family of alternatives, and tuning
algorithmic parameters for maximum performance; in both cases, we do not
execute the algorithms themselves. Instead, our methodology relies on timing
and modeling the computational kernels underlying the algorithms, and on a
technique for tracking the contents of the CPU cache. In general, our
performance predictions allow us to tune dense linear algebra algorithms to
within a few percent of the best attainable results, thus allowing computational
scientists and code developers alike to efficiently optimize their linear
algebra routines and codes.
Comment: Submitted to PMBS1
Performance Modeling and Prediction for Dense Linear Algebra
This dissertation introduces measurement-based performance modeling and
prediction techniques for dense linear algebra algorithms. As a core principle,
these techniques avoid executions of such algorithms entirely, and instead
predict their performance through runtime estimates for the underlying compute
kernels. For a variety of operations, these predictions make it possible to
quickly select the fastest algorithm configurations from available alternatives.
We consider
two scenarios that cover a wide range of computations:
To predict the performance of blocked algorithms, we design
algorithm-independent performance models for kernel operations that are
generated automatically once per platform. For various matrix operations,
instantaneous predictions based on such models both accurately identify the
fastest algorithm, and select a near-optimal block size.
For performance predictions of BLAS-based tensor contractions, we propose
cache-aware micro-benchmarks that take advantage of the highly regular
structure inherent to contraction algorithms. At merely a fraction of a
contraction's runtime, predictions based on such micro-benchmarks identify the
fastest combination of tensor traversal and compute kernel.
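
As a rough illustration of the first scenario, the sketch below predicts the runtime of a blocked algorithm by summing per-kernel estimates and then picks the block size with the smallest prediction. Everything in it is an assumption made for illustration: the choice of a right-looking blocked Cholesky factorization, the function names, and the closed-form kernel models with their constants. Models of the kind the dissertation describes would be generated from measurements on the target platform.

```python
# Hypothetical kernel models: each maps operand sizes to an estimated runtime
# in seconds. The constants are invented; real models would be fitted to timings.
def gemm_time(m, n, k):
    # Assumed form: fixed call overhead plus a flop-proportional term.
    return 2e-6 + 2.0 * m * n * k / 2.0e10

def trsm_time(m, n):
    # Triangular solve with an m-by-m triangle and n right-hand sides.
    return 2e-6 + m * m * n / 1.0e10

def potf2_time(b):
    # Unblocked Cholesky factorization of a b-by-b diagonal block.
    return 1e-6 + b ** 3 / 3.0 / 2.0e9

def predict_blocked_cholesky(n, b):
    """Sum kernel estimates over all steps of a right-looking blocked Cholesky."""
    total = 0.0
    for j in range(0, n, b):
        jb = min(b, n - j)    # size of the current diagonal block
        rest = n - j - jb     # number of rows below the diagonal block
        total += potf2_time(jb)
        if rest > 0:
            total += trsm_time(jb, rest)
            total += gemm_time(rest, rest, jb)  # stands in for the trailing update
    return total

def best_block_size(n, candidates):
    """Return the candidate block size with the smallest predicted runtime."""
    return min(candidates, key=lambda b: predict_blocked_cholesky(n, b))
```

Calling `best_block_size(1000, [16, 32, 64, 128, 256])` evaluates only the models, so the algorithm itself is never executed, which is the property the abstract emphasizes.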
A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels
It is universally known that caching is critical to attain high-performance
implementations: In many situations, data locality (in space and time) plays a
bigger role than optimizing the (number of) arithmetic floating point
operations. In this paper, we show evidence that at least for linear algebra
algorithms, caching is also a crucial factor for accurate performance modeling
and performance prediction.
Comment: Submitted to the Ninth International Workshop on Automatic
Performance Tuning (iWAPT2014)
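
One way to picture how caching can enter a prediction for a sequence of kernels is sketched below. It is not the method of the paper: the least-recently-used replacement policy, the class and function names, and the split into a "warm" and a "cold" timing per kernel are all assumptions chosen to keep the example small.

```python
from collections import OrderedDict

class LRUCacheModel:
    """Track which operands are assumed resident in a cache of fixed capacity."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = OrderedDict()  # operand name -> size in bytes, oldest first

    def access(self, name, size):
        """Record an access; return True if the operand was already resident."""
        hit = name in self.resident
        if hit:
            self.resident.move_to_end(name)  # mark as most recently used
        else:
            self.resident[name] = size
        # Evict least recently used operands until the contents fit.
        while self.resident and sum(self.resident.values()) > self.capacity:
            self.resident.popitem(last=False)
        return hit

def predict_sequence(calls, cache, warm_time, cold_time):
    """Sum per-call estimates, using the warm timing only if every operand hits."""
    total = 0.0
    for kernel, operands in calls:
        # Touch every operand (no short-circuit) so the cache state stays consistent.
        hits = [cache.access(name, size) for name, size in operands]
        total += warm_time[kernel] if all(hits) else cold_time[kernel]
    return total
```

With a 100-byte cache, the call sequence A(40), A(40), B(80), A(40) is charged cold, warm, cold, cold, because loading B evicts A. A prediction that ignored the cache state would charge every call the same.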
On the Performance Prediction of BLAS-based Tensor Contractions
Tensor operations are surging as the computational building blocks for a
variety of scientific simulations, and the development of high-performance
kernels for such operations is known to be a challenging task. While for
operations on one- and two-dimensional tensors there exist standardized
interfaces and highly-optimized libraries (BLAS), for higher dimensional
tensors neither standards nor highly-tuned implementations exist yet. In this
paper, we consider contractions between two tensors of arbitrary dimensionality
and take on the challenge of generating high-performance implementations by
resorting to sequences of BLAS kernels. The approach consists in breaking the
contraction down into operations that only involve matrices or vectors. Since
in general there are many alternative ways of decomposing a contraction, we are
able to methodically derive a large family of algorithms. The main contribution
of this paper is a systematic methodology to accurately identify the fastest
algorithms in the bunch, without executing them. The goal is instead
accomplished with the help of a set of cache-aware micro-benchmarks for the
underlying BLAS kernels. The predictions we construct from such benchmarks
allow us to reliably single out the best-performing algorithms in a tiny
fraction of the time taken by the direct execution of the algorithms.
Comment: Submitted to PMBS1
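
To make the decomposition idea concrete, the sketch below computes one small contraction, C[a][b][c] = sum over k of A[a][k] * B[k][b][c], in two alternative ways, each as a sequence of matrix products over slices of B. The specific contraction, the function names, and the pure-Python `matmul` standing in for a BLAS GEMM call are illustrative assumptions, not taken from the paper.

```python
def matmul(X, Y):
    """Plain matrix product on nested lists; stands in for a BLAS GEMM call."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def contract_loop_over_c(A, B):
    """One matrix product per value of c, using the k-by-b slice B[:, :, c]."""
    na, nk, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for c in range(nc):
        slice_c = [[B[k][b][c] for b in range(nb)] for k in range(nk)]
        P = matmul(A, slice_c)  # a-by-b result for this c
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = P[a][b]
    return C

def contract_loop_over_b(A, B):
    """One matrix product per value of b, using the k-by-c slice B[:, b, :]."""
    na, nk, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for b in range(nb):
        slice_b = [[B[k][b][c] for c in range(nc)] for k in range(nk)]
        P = matmul(A, slice_b)  # a-by-c result for this b
        for a in range(na):
            C[a][b] = list(P[a])
    return C
```

Both functions return the same tensor, but they traverse B differently and issue matrix products of different shapes. That difference is why members of such a family of algorithms can differ in speed, and why selecting among them without running them is useful.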
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than in other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to combine a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).
Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA.)
Scheduling data flow program in xkaapi: A new affinity based Algorithm for Heterogeneous Architectures
Efficient implementations of parallel applications on heterogeneous hybrid
architectures require a careful balance between computations and communications
with accelerator devices. Even if most of the communication time can be
overlapped by computations, it is essential to reduce the total volume of
communicated data. The literature therefore abounds with ad-hoc methods to
reach that balance, but these are architecture- and application-dependent. We
propose here a generic mechanism to automatically optimize the scheduling
between CPUs and GPUs, and compare two strategies within this mechanism: the
classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new,
parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which
consists in grouping the tasks by affinity before running a fast dual
approximation. We ran experiments on a heterogeneous parallel machine with six
CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra
kernels from the PLASMA library have been ported on top of the Xkaapi runtime.
We report their performance. The results show that HEFT and DADA perform well
under various experimental conditions, but that DADA performs better for larger
systems and larger numbers of GPUs and, in most cases, generates much lower data
transfers than HEFT to achieve the same performance.
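
For readers unfamiliar with HEFT, the following is a heavily simplified sketch of its two phases: ranking tasks by upward rank, then placing each task on the resource that gives the earliest finish time. It is an illustration under stated assumptions, not the Xkaapi implementation. Communication costs are ignored entirely (although they are the focus of the abstract), the insertion-based slot search of the full algorithm is omitted, task costs are assumed strictly positive, and the function name and data layout are invented here. DADA is not sketched.

```python
def heft_schedule(costs, succ):
    """Simplified HEFT without communication costs.

    costs[task][proc] is the runtime of task on proc; succ[task] lists the
    successors of task. Returns {task: (proc, start, finish)}.
    """
    rank = {}

    def upward_rank(t):
        # Average cost of t plus the largest upward rank among its successors.
        if t not in rank:
            avg = sum(costs[t]) / len(costs[t])
            rank[t] = avg + max((upward_rank(s) for s in succ.get(t, [])),
                                default=0.0)
        return rank[t]

    # With strictly positive costs, decreasing rank is a valid topological order.
    order = sorted(costs, key=upward_rank, reverse=True)

    preds = {t: [] for t in costs}
    for t, successors in succ.items():
        for s in successors:
            preds[s].append(t)

    nproc = len(next(iter(costs.values())))
    proc_free = [0.0] * nproc  # time at which each resource becomes idle
    schedule = {}
    for t in order:
        ready = max((schedule[p][2] for p in preds[t]), default=0.0)
        best = None
        for p in range(nproc):
            start = max(ready, proc_free[p])
            finish = start + costs[t][p]
            if best is None or finish < best[2]:
                best = (p, start, finish)
        schedule[t] = best
        proc_free[best[0]] = best[2]
    return schedule
```

A small diamond-shaped task graph with two resources, for example `costs = {"a": [2.0, 4.0], "b": [3.0, 1.0], "c": [3.0, 1.0], "d": [2.0, 2.0]}` and `succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}`, is enough to exercise both phases.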
Analytic Performance Modeling and Analysis of Detailed Neuron Simulations
Big science initiatives are trying to reconstruct and model the brain by
attempting to simulate brain tissue at larger scales and with increasingly more
biological detail than previously thought possible. The exponential growth of
parallel computer performance has been supporting these developments, and at
the same time maintainers of neuroscientific simulation code have strived to
optimally and efficiently exploit new hardware features. Current
state-of-the-art software for the simulation of biological networks has so far been
developed using performance engineering practices, but a thorough analysis and
modeling of the computational and performance characteristics, especially in
the case of morphologically detailed neuron simulations, is lacking. Other
computational sciences have successfully used analytic performance engineering
and modeling methods to gain insight on the computational properties of
simulation kernels, aid developers in performance optimizations and eventually
drive co-design efforts, but to our knowledge a model-based performance
analysis of neuron simulations has not yet been conducted.
We present a detailed study of the shared-memory performance of
morphologically detailed neuron simulations based on the Execution-Cache-Memory
(ECM) performance model. We demonstrate that this model can deliver accurate
predictions of the runtime of almost all the kernels that constitute the neuron
models under investigation. The gained insight is used to identify the main
governing mechanisms underlying performance bottlenecks in the simulation. The
implications of this analysis on the optimization of neural simulation software
and eventually co-design of future hardware architectures are discussed. In
this sense, our work represents a valuable conceptual and quantitative
contribution to understanding the performance properties of biological network
simulations.
Comment: 18 pages, 6 figures, 15 tables
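
As background on the ECM model, a common textbook formulation combines in-core and data-transfer times as max(T_OL, T_nOL + T_data), where T_OL is the in-core time that can overlap with data transfers, T_nOL the time that cannot, and T_data the sum of the transfer times between adjacent memory levels up to where the data resides. The sketch below evaluates that formula for each memory level. It assumes fully non-overlapping transfers, and the function name and input numbers are hypothetical. The paper's own model inputs are not reproduced here.

```python
def ecm_predictions(t_ol, t_nol, transfers):
    """Return predicted cycles per unit of work for data in each memory level.

    t_ol:      in-core cycles that overlap with data transfers
    t_nol:     in-core cycles that do not overlap with data transfers
    transfers: cycles for each transfer between adjacent levels, ordered from
               the level closest to the core outward (e.g. L1-L2, L2-L3, L3-memory)
    """
    predictions = []
    t_data = 0.0
    # First entry: data already in the innermost cache, so no transfers.
    predictions.append(max(t_ol, t_nol + t_data))
    for t in transfers:
        t_data += t  # transfers are assumed to serialize with each other
        predictions.append(max(t_ol, t_nol + t_data))
    return predictions
```

For instance, `ecm_predictions(4, 2, [2, 3, 5])` gives one value per level. The kernel stays core-bound while T_OL dominates, and becomes transfer-bound once the accumulated transfer time exceeds it.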
Power efficient job scheduling by predicting the impact of processor manufacturing variability
Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that do not consider such variability caused by manufacturing issues lead to performance degradations and wasted power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability.
In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that, compared to contemporary scheduling policies used on production clusters, they decrease job turnaround time by up to 31% while saving up to 5.5% energy.
Postprint (author's final draft)