Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Sympiler is a domain-specific code generator that optimizes sparse matrix
computations by decoupling the symbolic analysis phase from the numerical
manipulation stage in sparse codes. The computation patterns in sparse
numerical methods are guided by the input sparsity structure and the sparse
algorithm itself. In many real-world simulations, the sparsity pattern changes
little or not at all. Sympiler takes advantage of these properties to
symbolically analyze sparse codes at compile-time and to apply inspector-guided
transformations that enable applying low-level transformations to sparse codes.
As a result, the Sympiler-generated code outperforms highly-optimized matrix
factorization codes from commonly-used specialized libraries, obtaining average
speedups over Eigen and CHOLMOD of 3.8X and 1.5X respectively. Comment: 12 page
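
To illustrate what decoupling symbolic analysis from numeric work can look like, here is a minimal Python sketch of a sparse lower-triangular solve Lx = b with a sparse right-hand side. It is not Sympiler's generated code: Sympiler performs the symbolic step at compile time and emits specialized code, whereas this sketch only separates the two phases at run time. The function names, the CSC layout, and the assumption that each column stores its diagonal entry first are illustrative choices, not taken from the paper.

```python
def symbolic_reach(col_ptr, row_idx, rhs_nonzeros):
    """Symbolic phase: uses only the sparsity pattern of L (CSC) and of b.

    Returns the columns that the solve has to visit, in an order that
    respects dependences (column j before any row i > j with L[i][j] != 0).
    """
    n = len(col_ptr) - 1
    visited = [False] * n
    postorder = []
    for start in rhs_nonzeros:
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, col_ptr[start])]  # (column, next position to scan)
        while stack:
            j, p = stack[-1]
            end = col_ptr[j + 1]
            while p < end and visited[row_idx[p]]:
                p += 1
            if p < end:
                child = row_idx[p]
                visited[child] = True
                stack[-1] = (j, p + 1)
                stack.append((child, col_ptr[child]))
            else:
                stack.pop()
                postorder.append(j)
    postorder.reverse()  # reverse postorder is a topological order
    return postorder


def numeric_solve(col_ptr, row_idx, vals, b, reach):
    """Numeric phase: touches only the columns listed in `reach`.

    Assumes (illustratively) that the diagonal is the first entry of
    each column.
    """
    x = list(b)
    for j in reach:
        x[j] /= vals[col_ptr[j]]
        for p in range(col_ptr[j] + 1, col_ptr[j + 1]):
            x[row_idx[p]] -= vals[p] * x[j]
    return x


# L = [[2, 0, 0],
#      [1, 1, 0],
#      [0, 3, 4]] in CSC form, diagonal first in every column.
col_ptr = [0, 2, 4, 5]
row_idx = [0, 1, 1, 2, 2]
vals = [2.0, 1.0, 1.0, 3.0, 4.0]

reach = symbolic_reach(col_ptr, row_idx, rhs_nonzeros=[1])
x = numeric_solve(col_ptr, row_idx, vals, [0.0, 2.0, 0.0], reach)
```

Because the reach set depends only on the sparsity pattern, it can be computed once and reused for every solve that shares that pattern.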
Automatic Generation of Efficient Sparse Tensor Format Conversion Routines
This paper shows how to generate code that efficiently converts sparse
tensors between disparate storage formats (data layouts) such as CSR, DIA, ELL,
and many others. We decompose sparse tensor conversion into three logical
phases: coordinate remapping, analysis, and assembly. We then develop a
language that precisely describes how different formats group together and
order a tensor's nonzeros in memory. This lets a compiler emit code that
performs complex remappings of nonzeros when converting between formats. We
also develop a query language that can extract statistics about sparse tensors,
and we show how to emit efficient analysis code that computes such queries.
Finally, we define an abstract interface that captures how data structures for
storing a tensor can be efficiently assembled given specific statistics about
the tensor. Disparate formats can implement this common interface, thus letting
a compiler emit optimized sparse tensor conversion code for arbitrary
combinations of many formats without hard-coding for any specific combination.
Our evaluation shows that the technique generates sparse tensor conversion
routines with performance between 1.00× and 2.01× that of hand-optimized
versions in SPARSKIT and Intel MKL, two popular sparse linear algebra
libraries. And by emitting code that avoids materializing temporaries, which
both libraries need for many combinations of source and target formats, our
technique outperforms those libraries by 1.78× to 4.01× for CSC/COO to
DIA/ELL conversion. Comment: Presented at PLDI 202
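
The three phases named in the abstract (coordinate remapping, analysis, assembly) can be sketched for one concrete pair of formats, COO to ELL. This is a hand-written Python illustration under stated assumptions, not code emitted by the paper's compiler: the function name `coo_to_ell`, the list-of-lists ELL layout, and the use of -1 as a padding column index are all hypothetical choices made for readability.

```python
def coo_to_ell(n_rows, rows, cols, vals):
    """Convert COO triples to a padded ELL layout in three phases."""
    # Phase 1, coordinate remapping: give every nonzero a slot index k
    # that counts how many nonzeros of the same row came before it.
    seen = [0] * n_rows
    slots = []
    for i in rows:
        slots.append(seen[i])
        seen[i] += 1

    # Phase 2, analysis: the only statistic ELL needs is the maximum
    # number of nonzeros in any row, which fixes the padded width.
    width = max(seen, default=0)

    # Phase 3, assembly: allocate the padded arrays once, then scatter
    # each nonzero straight into its slot without a temporary format.
    ell_cols = [[-1] * width for _ in range(n_rows)]  # -1 marks padding (assumption)
    ell_vals = [[0.0] * width for _ in range(n_rows)]
    for i, j, v, k in zip(rows, cols, vals, slots):
        ell_cols[i][k] = j
        ell_vals[i][k] = v
    return ell_cols, ell_vals


# A 3x3 matrix with nonzeros (0,0)=1, (2,1)=4, (0,2)=2, (1,1)=3.
ell_cols, ell_vals = coo_to_ell(
    3, rows=[0, 2, 0, 1], cols=[0, 1, 2, 1], vals=[1.0, 4.0, 2.0, 3.0]
)
# ell_cols == [[0, 2], [1, -1], [1, -1]]
# ell_vals == [[1.0, 2.0], [3.0, 0.0], [4.0, 0.0]]
```

Other target formats would swap in a different remapping and a different statistic (for example, per-row counts for CSR), while the overall three-phase shape stays the same.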
Scheduling Transformation and Dependence Tests for Recursive Programs
Scheduling transformations reorder the execution of operations in a program to improve locality and/or parallelism. The polyhedral model provides a general framework for performing instance-wise scheduling transformations for regular programs, reordering the iterations of loops that operate over dense arrays through transformations like tiling. There is no analogous framework for recursive programs, despite recent interest in optimizations like tiling and fusion for recursive applications. This paper presents PolyRec, the first general framework for applying scheduling transformations (like inlining, interchange, and code motion) to nested recursive programs and reasoning about their correctness. We describe the phases of PolyRec (representing dynamic instances, applying transformations, reasoning about correctness) and show that PolyRec is able to apply sophisticated, composed transformations to complex, nested recursive programs and improve performance through enhanced locality.
SparseAuto: An Auto-Scheduler for Sparse Tensor Computations Using Recursive Loop Nest Restructuring
Automated code generation and performance optimizations for sparse tensor
algebra are crucial, since sparse tensor algebra has become essential in many
real-world applications like quantum computing, physics, chemistry, and machine
learning.
General sparse tensor algebra compilers are not always versatile enough to
generate asymptotically optimal code for sparse tensor contractions. This paper
shows how to optimize and generate asymptotically better schedules for complex
tensor expressions using kernel fission and fusion. We present a generalized
loop transformation to achieve loop nesting for minimized memory footprint and
reduced asymptotic complexity.
Furthermore, we present an auto-scheduler whose partially ordered set-based
cost model uses both time and auxiliary memory complexities in
its pruning stages. In addition, we highlight the use of SMT solvers in sparse
auto-schedulers to prune the Pareto frontier of schedules to the smallest
number of possible schedules with user-defined constraints available at compile
time. Finally, we show that our auto-scheduler can select asymptotically better
schedules that use our compiler transformation to generate optimized code. Our
results show that the auto-scheduler achieves orders of magnitude speedup
compared to the TACO-generated code for several real-world tensor algebra
computations on different real-world inputs.
A polyhedral compilation framework for loops with dynamic data-dependent bounds
We study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamic data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions: for example, the substitution of closed forms for induction variables remains applicable, removing the loop-carried data dependences induced by termination conditions. We propose an automatic compilation approach to parallelize and optimize dynamic counted loops. Our approach relies on affine relations only, as implemented in state-of-the-art polyhedral libraries. Revisiting a state-of-the-art framework to parallelize arbitrary while loops, we introduce additional control dependences on data-dependent predicates. Our method goes beyond the state of the art in fully automating the process, specializing the code generation algorithm to the case of dynamic counted loops and avoiding the introduction of spurious loop-carried dependences. We conduct experiments on representative irregular computations, from dynamic programming, computer vision and finite element methods to sparse matrix linear algebra. We validate that the method is applicable to general affine transformations for locality optimization, vectorization and parallelization.
Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures
Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix-Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
Doctor of Philosophy
Sparse matrix codes are found in numerous applications ranging from iterative numerical solvers to graph analytics. Achieving high performance on these codes has, however, been a significant challenge, mainly due to array access indirection, for example, of the form A[B[i]]. Indirect accesses make precise dependence analysis impossible at compile-time, and hence prevent many parallelizing and locality optimizing transformations from being applied. The expert user relies on manually written libraries to tailor the sparse code and data representations best suited to the target architecture from a general sparse matrix representation. However, libraries have limited composability, address very specific optimization strategies, and have to be rewritten as new architectures emerge. In this dissertation, we explore the use of the inspector/executor methodology to accomplish the code and data transformations to tailor high performance sparse matrix representations. We devise and embed abstractions for such inspector/executor transformations within a compiler framework so that they can be composed with a rich set of existing polyhedral compiler transformations to derive complex transformation sequences for high performance. We demonstrate the automatic generation of inspector/executor code, which orchestrates code and data transformations to derive high performance representations for the Sparse Matrix Vector Multiply kernel in particular. We also show how the same transformations may be integrated into sparse matrix and graph applications such as Sparse Matrix Matrix Multiply and Stochastic Gradient Descent, respectively. The specific constraints of these applications, such as problem size and dependence structure, necessitate unique sparse matrix representations that can be realized using our transformations. Computations such as Gauss-Seidel, with loop-carried dependences at the outermost loop, necessitate different strategies for high performance.
Specifically, we organize the computation into level sets or wavefronts of irregular size, such that iterations of a wavefront may be scheduled in parallel but different wavefronts have to be synchronized. We demonstrate automatic code generation of high performance inspectors that do explicit dependence testing and level set construction at runtime, as well as high performance executors, which are the actual parallelized computations. For the above sparse matrix applications, we automatically generate inspector/executor code comparable in performance to manually tuned libraries.
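
The level-set (wavefront) idea can be sketched with a hand-written inspector and executor. The example below uses a sparse lower-triangular forward substitution in CSR form as a stand-in, on the assumption that it is a fair proxy because it has the same kind of outer-loop-carried dependence the abstract attributes to Gauss-Seidel. It is not the dissertation's generated code; the function names are hypothetical, and the executor runs each wavefront serially where a real one would run its rows in parallel.

```python
def inspect_wavefronts(row_ptr, col_idx):
    """Inspector: run-time dependence test plus level-set construction.

    Row i depends on row j whenever the matrix has a nonzero at (i, j)
    with j < i, so level(i) = 1 + max(level(j)) over those j.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    wavefronts = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lvl in enumerate(level):
        wavefronts[lvl].append(i)
    return wavefronts


def execute_wavefronts(row_ptr, col_idx, vals, b, wavefronts):
    """Executor: rows inside one wavefront are mutually independent.

    The inner `for i in rows` loop is the one a parallel runtime could
    distribute; a barrier is implied between consecutive wavefronts.
    """
    x = [0.0] * len(b)
    for rows in wavefronts:
        for i in rows:
            acc = b[i]
            diag = None
            for p in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[p]
                if j == i:
                    diag = vals[p]
                else:
                    acc -= vals[p] * x[j]
            x[i] = acc / diag
    return x


# L = [[2, 0, 0, 0],
#      [0, 1, 0, 0],
#      [1, 0, 4, 0],
#      [0, 3, 1, 2]] in CSR form.
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
vals = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 2.0]

wavefronts = inspect_wavefronts(row_ptr, col_idx)
# wavefronts == [[0, 1], [2], [3]]
x = execute_wavefronts(row_ptr, col_idx, vals, [2.0, 3.0, 9.0, 14.0], wavefronts)
# x == [1.0, 3.0, 2.0, 1.5]
```

The inspector cost is paid once per sparsity pattern, so it pays off when the same matrix structure is solved repeatedly, as in an iterative solver.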
An approach for code generation in the Sparse Polyhedral Framework
Applications that manipulate sparse data structures contain memory reference patterns that are unknown at compile time due to indirect accesses such as A[B[i]]. To exploit parallelism and improve locality in such applications, prior work has developed a number of Run-Time Reordering Transformations (RTRTs). This paper presents the Sparse Polyhedral Framework (SPF) for specifying RTRTs and compositions thereof and algorithms for automatically generating efficient inspector and executor code to implement such transformations. Experimental results indicate that the performance of automatically generated inspectors and executors competes with the performance of hand-written ones when further optimization is done. We thank Jon Roelofs for his implementation of the IEGenCC tool, which converts C programs into the specification format IEGen expects as input. We thank Christopher Krieger, Andrew Stone, Tomofumi Yuki, and anonymous reviewers for their careful reading and suggestions. This work was sponsored by NSF CAREER Grant CCF-0746693, DOE Early Career Grant DE-SC3956, the CSCAPES Institute DOE Grant 7F-00323, and the CACHE project DOE Grant DE-SC04030. Available online 4 March 2016. 24 month embargo.
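
As a concrete picture of a run-time reordering transformation, the sketch below hand-codes one well-known data reordering, consecutive packing, as an inspector/executor pair for a loop that reads A[B[i]]. It is only an illustration: it does not use IEGen or the SPF specification format, the function names are hypothetical, and the choice of consecutive packing is an assumption about which RTRT makes the simplest example.

```python
def inspect_cpack(index, n_data):
    """Inspector: walk the index array in iteration order and assign new
    locations in first-touch order, so data used together sits together.

    Returns sigma, where sigma[old_location] = new_location.
    """
    sigma = [-1] * n_data
    next_slot = 0
    for k in index:
        if sigma[k] == -1:
            sigma[k] = next_slot
            next_slot += 1
    # Locations the loop never touches go after the packed ones.
    for k in range(n_data):
        if sigma[k] == -1:
            sigma[k] = next_slot
            next_slot += 1
    return sigma


def apply_reordering(sigma, data, index):
    """Move the data and rewrite the index array to match sigma."""
    new_data = [None] * len(data)
    for old, new in enumerate(sigma):
        new_data[new] = data[old]
    new_index = [sigma[k] for k in index]
    return new_data, new_index


def executor(data, index):
    """Executor: the original loop body, unchanged, reading data[index[i]]."""
    return [data[index[i]] for i in range(len(index))]


A = [10, 11, 12, 13]
B = [3, 0, 3, 2]

sigma = inspect_cpack(B, len(A))          # [1, 3, 2, 0]
A2, B2 = apply_reordering(sigma, A, B)    # A2 == [13, 10, 12, 11], B2 == [0, 1, 0, 2]
assert executor(A2, B2) == executor(A, B)
```

The executor is the same loop before and after; only the arrays it is handed change, which is what lets several such reorderings be composed.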