
    Cache-aware Performance Modeling and Prediction for Dense Linear Algebra

    Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation -- without executing it -- is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, one normally has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best-performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms to within a few percent of the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes. Comment: Submitted to PMBS1
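    The following is a minimal Python sketch of the core idea (not the authors' tool): instead of executing each algorithmic variant, time its underlying kernels once and rank the variants by the sum of the per-kernel estimates. The variant definitions and the use of numpy routines as stand-ins for BLAS kernels are illustrative assumptions.

    ```python
    # Minimal sketch: predict each variant's runtime by summing per-kernel
    # estimates instead of executing the full algorithm. Kernel choices and
    # variant definitions below are illustrative assumptions.
    import time
    import numpy as np

    def time_kernel(fn, *args, reps=5):
        """Median wall-clock time of one kernel call."""
        samples = []
        for _ in range(reps):
            t0 = time.perf_counter()
            fn(*args)
            samples.append(time.perf_counter() - t0)
        return float(np.median(samples))

    n = 1024
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    # One timing per kernel, reused for every variant that calls it.
    kernel_cost = {
        "gemm": time_kernel(np.dot, A, B),
        "trsm": time_kernel(np.linalg.solve, A, B),  # stand-in for a triangular solve
    }

    # Hypothetical variants expressed as kernel call sequences.
    variants = {
        "variant_1": ["gemm", "trsm"],
        "variant_2": ["gemm", "gemm"],
    }

    predicted = {v: sum(kernel_cost[k] for k in calls) for v, calls in variants.items()}
    best = min(predicted, key=predicted.get)
    print(f"predicted fastest: {best} ({predicted[best]:.4f} s)")
    ```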

    Performance Modeling and Prediction for Dense Linear Algebra

    This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions make it possible to quickly select the fastest algorithm configuration from the available alternatives. We consider two scenarios that cover a wide range of computations: to predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.
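    As a rough illustration of block-size selection from kernel models, the Python sketch below predicts the total runtime of a generic blocked factorization by summing assumed per-kernel cost models over its iterations, then picks the block size with the smallest prediction. The models (gemm_model, panel_model) and their sustained rates are hypothetical placeholders, not the dissertation's automatically generated models.

    ```python
    # Minimal sketch, assuming per-kernel cost models already exist: choose the
    # block size whose predicted total runtime is smallest, without running the
    # blocked algorithm itself.
    def gemm_model(m, n, k):
        # assumed model: time = flops / sustained rate (here 50 GFlop/s)
        return 2.0 * m * n * k / 50e9

    def panel_model(m, b):
        # assumed model for an O(m * b^2) panel factorization at 10 GFlop/s
        return 2.0 * m * b * b / 10e9

    def predict_blocked_runtime(n, b):
        """Sum modeled kernel times over the iterations of a blocked factorization."""
        total = 0.0
        for j in range(0, n, b):
            rest = n - j - b
            total += panel_model(n - j, b)            # factor the current panel
            if rest > 0:
                total += gemm_model(rest, rest, b)    # trailing-matrix update
        return total

    n = 4096
    best_b = min(range(32, 513, 32), key=lambda b: predict_blocked_runtime(n, b))
    print("predicted best block size:", best_b)
    ```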

    A Study on the Influence of Caching: Sequences of Dense Linear Algebra Kernels

    It is universally known that caching is critical to attaining high-performance implementations: in many situations, data locality (in space and time) plays a bigger role than optimizing the (number of) arithmetic floating-point operations. In this paper, we show evidence that, at least for linear algebra algorithms, caching is also a crucial factor for accurate performance modeling and performance prediction. Comment: Submitted to the Ninth International Workshop on Automatic Performance Tuning (iWAPT2014).
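    A minimal Python sketch of the effect studied here, assuming a last-level cache smaller than 64 MiB: the same matrix-vector kernel is timed once with its operands just touched (warm cache) and once after streaming through a large buffer to evict them (cold cache). Sizes are illustrative assumptions.

    ```python
    # Warm- vs. cold-cache timing of the same kernel: cache state alone changes
    # the measurement, which is why performance models must account for it.
    import time
    import numpy as np

    n = 512
    A = np.random.rand(n, n)
    x = np.random.rand(n)
    evictor = np.random.rand(64 * 1024 * 1024 // 8)  # ~64 MiB, larger than a typical LLC

    def timed_gemv():
        t0 = time.perf_counter()
        A @ x
        return time.perf_counter() - t0

    A @ x                      # touch operands: warm-cache measurement follows
    warm = timed_gemv()

    evictor.sum()              # stream through a large buffer to flush the cache
    cold = timed_gemv()

    print(f"warm: {warm*1e6:.1f} us   cold: {cold*1e6:.1f} us")
    ```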

    Performance Modeling and Prediction for Dense Linear Algebra Algorithms

    This dissertation incorporates two research projects: performance modeling and prediction for dense linear algebra algorithms, and high-performance computing on clouds. The first project focuses on dense matrix computations, which are often used as computational kernels for numerous scientific applications. To solve a particular mathematical operation, linear algebra libraries provide a variety of algorithms. The algorithm of choice depends, obviously, on its performance. The performance of such algorithms is affected by a set of parameters that characterize the features of the computing platform, the algorithm implementation, the size of the operands, and the data storage format. Because of this complexity, predicting algorithmic performance is a challenging task, to the point that developers are often forced to rely on extensive trial and error. We approach this problem from a different perspective. Instead of performing exhaustive tests, we introduce two techniques for modeling performance: one is based on measurements and requires a limited number of sampling tests, while the other requires executing neither the algorithms nor parts of them. Both techniques employ a bottom-up approach: we first model the performance of the BLAS kernels; then, using these results, we create models for higher-level algorithms, e.g., the LU factorization, that are built on top of the BLAS kernels. By exploiting the hierarchical and modular structure of linear algebra libraries, we thus develop two techniques for hierarchical performance modeling, both of which yield accurate predictions.

    The second project is concerned with high-performance computing in commercial cloud environments. We empirically study the computational efficiency of compute-intensive scientific applications in such an environment, where resources are shared under high contention. Although high performance is occasionally obtained, contention for the CPU and cache space degrades the expected performance and introduces significant variance. Using the matrix-matrix multiplication kernel of BLAS and the LINPACK benchmark, we show that underutilizing the available resources substantially improves the expected performance: for a number of cluster configurations, the solution is reached considerably faster when the resources are underutilized. Finally, since the performance measurements of scientific applications show high fluctuation, we propose alternatives, such as expected performance, expected execution time, and cost per GFlop, to the standard definitions of efficiency.
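    As a small worked illustration of the alternative metrics proposed above (expected performance, expected execution time, cost per GFlop), the Python sketch below derives them from a set of fluctuating runtimes; all numbers (matrix size, timings, instance price) are assumed for illustration only.

    ```python
    # Deriving expected performance, expected time, and cost per GFlop from
    # fluctuating measurements of one n x n DGEMM on a shared cloud node.
    # All inputs below are assumed values, not measured data.
    import numpy as np

    n = 2000
    flops = 2.0 * n**3                                # flop count of one DGEMM
    times = np.array([0.61, 0.74, 0.65, 1.02, 0.68])  # assumed runtimes (s)
    price_per_hour = 0.10                             # assumed instance price ($/h)

    gflops = flops / times / 1e9
    expected_perf = gflops.mean()                     # expected performance (GFlop/s)
    expected_time = times.mean()                      # expected execution time (s)
    cost_per_gflop = price_per_hour / 3600 * expected_time / (flops / 1e9)

    print(f"expected performance: {expected_perf:.2f} GFlop/s")
    print(f"expected time: {expected_time:.3f} s")
    print(f"cost per GFlop: ${cost_per_gflop:.2e}")
    ```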

    QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Previous studies have reported that common dense linear algebra operations do not achieve speedup when using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers still strongly predominate in high-performance computing, and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed-memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication-Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's). Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA).
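    The following Python sketch illustrates the Communication-Avoiding QR idea the paper builds on, in a shared-memory setting: each block of a tall-and-skinny matrix is factored locally, and the small R factors are combined in a pairwise reduction tree, so only small messages would cross sites. This shows the algorithmic structure only, not the paper's QCG-OMPI deployment.

    ```python
    # TSQR-style reduction: local QRs on row blocks, then pairwise merging of
    # the small R factors. With P blocks, only O(log P) small merge steps would
    # require inter-site communication in a distributed setting.
    import numpy as np

    def tsqr_r(A, num_blocks):
        """Return the R factor of tall-skinny A via a TSQR-style reduction tree."""
        blocks = np.array_split(A, num_blocks, axis=0)
        rs = [np.linalg.qr(b, mode="r") for b in blocks]   # local QRs, no communication
        while len(rs) > 1:                                 # pairwise reduction tree
            rs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode="r")
                  for i in range(0, len(rs), 2)]
        return rs[0]

    A = np.random.rand(10000, 50)
    R = tsqr_r(A, num_blocks=8)
    # R matches a direct QR up to per-row signs:
    R_ref = np.linalg.qr(A, mode="r")
    print(np.allclose(np.abs(R), np.abs(R_ref)))
    ```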