19,696 research outputs found
Cache-aware Performance Modeling and Prediction for Dense Linear Algebra
Countless applications cast their computational core in terms of dense linear
algebra operations. These operations can usually be implemented by combining
the routines offered by standard linear algebra libraries such as BLAS and
LAPACK, and typically each operation can be obtained in many alternative ways.
Interestingly, identifying the fastest implementation -- without executing it
-- is a challenging task even for experts. An equally challenging task is that
of tuning each routine to performance-optimal configurations. Indeed, the
problem is so difficult that even the default values provided by the libraries
are often considerably suboptimal; as a solution, normally one has to resort to
executing and timing the routines, driven by some form of parameter search. In
this paper, we discuss a methodology to solve both problems: identifying the
best performing algorithm within a family of alternatives, and tuning
algorithmic parameters for maximum performance; in both cases, we do not
execute the algorithms themselves. Instead, our methodology relies on timing
and modeling the computational kernels underlying the algorithms, and on a
technique for tracking the contents of the CPU cache. In general, our
performance predictions allow us to tune dense linear algebra algorithms within
few percents from the best attainable results, thus allowing computational
scientists and code developers alike to efficiently optimize their linear
algebra routines and codes.Comment: Submitted to PMBS1
Performance Modeling and Prediction for Dense Linear Algebra
This dissertation introduces measurement-based performance modeling and
prediction techniques for dense linear algebra algorithms. As a core principle,
these techniques avoid executions of such algorithms entirely, and instead
predict their performance through runtime estimates for the underlying compute
kernels. For a variety of operations, these predictions allow to quickly select
the fastest algorithm configurations from available alternatives. We consider
two scenarios that cover a wide range of computations:
To predict the performance of blocked algorithms, we design
algorithm-independent performance models for kernel operations that are
generated automatically once per platform. For various matrix operations,
instantaneous predictions based on such models both accurately identify the
fastest algorithm, and select a near-optimal block size.
For performance predictions of BLAS-based tensor contractions, we propose
cache-aware micro-benchmarks that take advantage of the highly regular
structure inherent to contraction algorithms. At merely a fraction of a
contraction's runtime, predictions based on such micro-benchmarks identify the
fastest combination of tensor traversal and compute kernel
A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels
It is universally known that caching is critical to attain high- performance
implementations: In many situations, data locality (in space and time) plays a
bigger role than optimizing the (number of) arithmetic floating point
operations. In this paper, we show evidence that at least for linear algebra
algorithms, caching is also a crucial factor for accurate performance modeling
and performance prediction.Comment: Submitted to the Ninth International Workshop on Automatic
Performance Tuning (iWAPT2014
Are You Networked Yet? On Dialogues within European Judicial Networks
This dissertation incorporates two research projects: performance modeling and prediction for dense linear algebra algorithms, and high-performance computing on clouds. The first project is focused on dense matrix computations, which are often used as computational kernels for numerous scientific applications. To solve a particular mathematical operation, linear algebra libraries provide a variety of algorithms. The algorithm of choice depends, obviously, on its performance. Performance of such algorithms is affected by a set of parameters, which characterize the features of the computing platform, the algorithm implementation, the size of the operands, and the data storage format. Because of this complexity, predicting algorithmic performance is a challenging task, to the point that developers are often forced to rely on extensive trial-and-error technique. We approach this problem from a different perspective. Instead of performing exhaustive tests, we introduce two techniques for modeling performance: one is based on measurements and requires a limited number of sampling tests, while the other needs neither the execution of the algorithms nor parts of them. Both techniques employ a bottom-up approach: we first model the performance of the BLAS kernels. Then, using these results we create models for higher-level algorithms, e.g. the LU factorization, that are built on top of the BLAS kernels. As a result, by using the hierarchical and modular structure of linear algebra libraries, we develop two techniques for hierarchical performance modeling; both techniques yielding accurate predictions. The second project is concerned with high-performance computing on commercial cloud environments. We empirically study the computational efficiency of compute-intensive scientific applications in such an environment, where resources are shared under high contention. Although high performance is occasionally obtained, contention for the CPU and cache space degrades the expected performance and introduces significant variance. Using the matrix-matrix multiplication kernel of BLAS and the LINPACK benchmark, we show that the underutilization of resources substantially improves the expected performance. For instance, for a number of cluster configurations, the solution is reached considerably faster when the available resources are underutilized. Finally, since the performance measurements of scientific applications show high fluctuation, we propose alternatives, such as expected performance, expected execution time, and cost per GFlop, to the standard definitions of efficiency
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to articulate a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).Comment: Accepted at IPDPS10. (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA.
- …