Generating Fast Sparse Matrix Vector Multiplication From a High Level Generic Functional IR
Usage of high-level intermediate representations promises the generation of fast code from a high-level description, improving the productivity of developers while achieving the performance traditionally only reached with low-level programming approaches.
High-level IRs come in two flavors: 1) domain-specific IRs designed only for a specific application area; or 2) generic high-level IRs that can be used to generate high-performance code across many domains. Developing generic IRs is more challenging but offers the advantage of reusing a common compiler infrastructure across various applications.
In this paper, we extend a generic high-level IR to enable efficient computation with sparse data structures. Crucially, we encode the sparse representation using reusable dense building blocks already present in the high-level IR. We use a form of dependent types to model sparse matrices in CSR format, expressing explicitly the relationship between multiple dense arrays that separately store the row lengths, the column indices, and the non-zero values of the matrix.
We achieve high performance compared to low-level sparse library code using our extended generic high-level code generator. On an Nvidia GPU, we outperform the highly tuned Nvidia cuSPARSE implementation of sparse matrix-vector multiplication (SpMV) across 28 sparse matrices of varying sparsity by 1.7× on average.
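As a rough illustration of the idea (not the paper's IR or its generated code), a CSR matrix expressed as three plain dense arrays, with SpMV as a per-row dense reduction, looks like this in C. A conventional row-pointer layout is used here for concreteness; the paper's dependent-type encoding instead relates arrays of per-row lengths, column indices, and values.

```c
#include <stddef.h>

/* CSR stored as dense arrays (illustrative layout, not the paper's encoding). */
typedef struct {
    size_t        n_rows;
    const size_t *row_ptr;  /* n_rows + 1 entries: where each row's data starts */
    const size_t *col_idx;  /* column index of every stored non-zero            */
    const double *values;   /* the non-zero values                              */
} csr_matrix;

/* y = A * x: each row is an ordinary dense dot product over its stored entries,
 * i.e. the sparse kernel is built from reusable dense building blocks. */
void spmv(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n_rows; ++i) {
        double acc = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            acc += A->values[k] * x[A->col_idx[k]];
        y[i] = acc;
    }
}
```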
Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach
Heterogeneous accelerators often disappoint. They provide
the prospect of great performance, but only deliver it when
using vendor-specific optimized libraries or domain-specific
languages. This requires considerable modification of legacy code,
hindering the adoption of heterogeneous computing.
This paper develops a novel approach to automatically
detect opportunities for accelerator exploitation. We focus
on calculations that are well supported by established APIs:
sparse and dense linear algebra, stencil codes and generalized
reductions and histograms. We call them idioms and use a
custom constraint-based Idiom Description Language (IDL)
to discover them within user code. Detected idioms are then
mapped to BLAS libraries, cuSPARSE and clSPARSE, and two
DSLs: Halide and Lift.
We implemented the approach in LLVM and evaluated
it on the NAS and Parboil sequential C/C++ benchmarks,
where we detect 60 idiom instances. In those cases where
idioms are a significant part of the sequential execution time,
we generate code that achieves 1.26× to over 20× speedup
on integrated and external GPUs.
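As a hypothetical example (the loop, function names, and the BLAS target below are illustrative, not taken from the paper), an idiom of this kind and the library call it could be rewritten to might look as follows:

```c
#include <cblas.h>

/* A plain loop nest computing y = A*x over a dense row-major matrix.  A
 * constraint-based idiom description in the spirit of IDL would recognise this
 * generalized reduction as a dense matrix-vector product... */
void gemv_loop(int m, int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        y[i] = 0.0;
        for (int j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    }
}

/* ...and replace it with a single library call; CBLAS dgemv is used here as a
 * stand-in target (the paper maps detected idioms to BLAS libraries, cuSPARSE,
 * clSPARSE, Halide and Lift). */
void gemv_blas(int m, int n, const double *A, const double *x, double *y)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n, 1.0, A, n, x, 1, 0.0, y, 1);
}
```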
Parallel Unsmoothed Aggregation Algebraic Multigrid Algorithms on GPUs
We design and implement a parallel algebraic multigrid method for isotropic
graph Laplacian problems on multicore Graphics Processing Units (GPUs). The
proposed AMG method is based on the aggregation framework. The setup phase of
the algorithm uses a parallel maximal independent set algorithm in forming
aggregates and the resulting coarse level hierarchy is then used in a K-cycle
iteration solve phase with an ℓ1-Jacobi smoother. Numerical tests of a
parallel implementation of the method for graphics processors are presented to
demonstrate its effectiveness.
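For orientation only (this is a generic sequential sketch, not the paper's GPU kernels or its exact smoother scaling), one damped-Jacobi relaxation sweep on a CSR matrix has the following shape in C:

```c
#include <stddef.h>

/* One sweep of x_new = x + omega * D^{-1} (b - A x) on a CSR matrix, the role
 * the smoother plays inside the K-cycle.  Assumes every row stores a nonzero
 * diagonal entry; omega is a damping parameter chosen by the solver. */
void jacobi_sweep(size_t n, const size_t *row_ptr, const size_t *col_idx,
                  const double *vals, const double *b, const double *x,
                  double *x_new, double omega)
{
    for (size_t i = 0; i < n; ++i) {          /* rows are independent: one GPU thread per row */
        double diag = 0.0, r = b[i];
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            if (col_idx[k] == i) diag = vals[k];
            r -= vals[k] * x[col_idx[k]];     /* accumulate the residual b - A x */
        }
        x_new[i] = x[i] + omega * r / diag;
    }
}
```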
USLV: Unspanned Stochastic Local Volatility Model
We propose a new framework for modeling stochastic local volatility, with
potential applications to modeling derivatives on interest rates, commodities,
credit, equity, FX etc., as well as hybrid derivatives. Our model extends the
linearity-generating unspanned volatility term structure model by Carr et al.
(2011) by adding a local volatility layer to it. We outline efficient numerical
schemes for pricing derivatives in this framework for a particular four-factor
specification (two "curve" factors plus two "volatility" factors). We show that
the dynamics of such a system can be approximated by a Markov chain on a
two-dimensional space (Z_t,Y_t), where coordinates Z_t and Y_t are given by
direct (Kronecker) products of values of pairs of curve and volatility factors,
respectively. The resulting Markov chain dynamics on such partly "folded" state
space enables fast pricing by the standard backward induction. Using a
nonparametric specification of the Markov chain generator, one can accurately
match arbitrary sets of vanilla option quotes with different strikes and
maturities. Furthermore, we consider an alternative formulation of the model in
terms of an implied time change process. The latter is specified
nonparametrically, again enabling accurate calibration to arbitrary sets of
vanilla option quotes.
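As a generic sketch of the pricing step only (the flattened state indexing, the precomputed transition matrix, and the constant per-step discount factor are simplifying assumptions, not the paper's calibrated chain generator), backward induction on a discrete Markov chain looks like this in C:

```c
#include <stdlib.h>
#include <string.h>

/* V_t(i) = df * sum_j P[i][j] * V_{t+1}(j), rolled back from a terminal payoff.
 * In the model above, state i would index the folded two-dimensional grid of
 * (Z_t, Y_t) values. */
void backward_induction(size_t n, size_t n_steps, const double *P /* n x n */,
                        const double *payoff, double df, double *V0)
{
    double *next = malloc(n * sizeof *next);
    double *cur  = malloc(n * sizeof *cur);
    memcpy(next, payoff, n * sizeof *next);           /* terminal condition V_T */
    for (size_t t = 0; t < n_steps; ++t) {
        for (size_t i = 0; i < n; ++i) {
            double e = 0.0;
            for (size_t j = 0; j < n; ++j)
                e += P[i * n + j] * next[j];          /* conditional expectation */
            cur[i] = df * e;                          /* discount one step back  */
        }
        double *swap = next; next = cur; cur = swap;
    }
    memcpy(V0, next, n * sizeof *next);
    free(next); free(cur);
}
```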
Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering
The resurgence of machine learning has increased the demand for
high-performance basic linear algebra subroutines (BLAS), which have long
depended on libraries to achieve peak performance on commodity hardware.
High-performance BLAS implementations rely on a layered approach that consists
of tiling and packing layers, for data (re)organization, and micro kernels that
perform the actual computations. The creation of high-performance micro kernels
requires significant development effort to write tailored assembly code for
each architecture. This hand optimization task is complicated by the recent
introduction of matrix engines (IBM's POWER10 MMA, Intel's AMX, and Arm's ME) that
deliver high-performance matrix operations. This paper presents a compiler-only
alternative to the use of high-performance libraries by incorporating, to the
best of our knowledge and for the first time, the automatic generation of the
layered approach into LLVM, a production compiler. Modular design of the
algorithm, such as the use of LLVM's matrix-multiply intrinsic for a clear
interface between the tiling and packing layers and the micro kernel, makes it
easy to retarget the code generation to multiple accelerators. The use of
intrinsics enables a comprehensive performance study. On processors without
hardware matrix engines, the tiling and packing layers deliver performance up
to 22x faster than PLuTo, a widely used polyhedral optimizer, for small
matrices (Intel) and more than 6x faster for large matrices (POWER9). The
performance also approaches that of high-performance libraries: it is only 34%
slower than OpenBLAS and on par with Eigen for large matrices. With MMA in POWER10 this solution is, for
large matrices, over 2.6x faster than the vector-extension solution, matches
Eigen performance, and achieves up to 96% of BLAS peak performance.
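To make the layered structure concrete (the tile sizes, data layout, and plain-C micro kernel below are illustrative; the paper generates these pieces inside LLVM and lowers the micro kernel to matrix-engine intrinsics where available), a minimal packing-plus-micro-kernel GEMM might look like this:

```c
#include <stddef.h>

enum { MR = 4, NR = 4 };   /* illustrative register-block sizes */

/* Micro kernel: accumulate an MR x NR block of C from packed panels of A and B.
 * packedA holds k columns of MR values; packedB holds k rows of NR values. */
static void micro_kernel(size_t k, const double *packedA, const double *packedB,
                         double *C, size_t ldc)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                C[i * ldc + j] += packedA[p * MR + i] * packedB[p * NR + j];
}

/* Packing layer: copy slices of A and B into contiguous buffers so the micro
 * kernel streams through unit-stride memory. */
static void pack_A(size_t k, const double *A, size_t lda, double *packedA)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            packedA[p * MR + i] = A[i * lda + p];
}

static void pack_B(size_t k, const double *B, size_t ldb, double *packedB)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            packedB[p * NR + j] = B[p * ldb + j];
}

/* Tiling layer: C (m x n) += A (m x k) * B (k x n), all row-major, assuming for
 * simplicity that m and n are multiples of MR and NR.  Real implementations
 * reuse packed panels across iterations; this sketch repacks for clarity. */
void gemm_layered(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    double packedA[k * MR], packedB[k * NR];   /* C99 VLA buffers, for brevity */
    for (size_t jc = 0; jc < n; jc += NR) {
        pack_B(k, &B[jc], n, packedB);
        for (size_t ic = 0; ic < m; ic += MR) {
            pack_A(k, &A[ic * k], k, packedA);
            micro_kernel(k, packedA, packedB, &C[ic * n + jc], n);
        }
    }
}
```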
Polyhedral+Dataflow Graphs
This research presents an intermediate compiler representation that is designed for optimization, and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both regular and irregular scientific domains.
The intermediate representation can be integrated into existing compiler infrastructures. A specification language, implemented as a domain-specific language in C++, describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representations. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, sparse matrix operations, tensor decomposition, and structured multigrid methods.
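As a small illustration of the storage/schedule trade-off such a graph makes explicit (written in plain C, not the paper's C++ specification language), fusing a producer statement with its consumer shrinks a temporary array to a scalar:

```c
#include <stddef.h>

/* Unfused schedule: statements S1 and S2 run as separate loops, so the value
 * produced by S1 must live in a full temporary array between them. */
void unfused(size_t n, const double *x, double *y)
{
    double tmp[n];                       /* temporary storage: n doubles (VLA for brevity) */
    for (size_t i = 0; i < n; ++i)       /* S1 */
        tmp[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; ++i)       /* S2 */
        y[i] = tmp[i] + 1.0;
}

/* Fused schedule: S1 and S2 share an iteration, so the temporary is a scalar.
 * Making this storage difference visible in the graph is the kind of decision
 * the representation is meant to guide. */
void fused(size_t n, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double t = 2.0 * x[i];           /* temporary storage: one scalar */
        y[i] = t + 1.0;
    }
}
```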