Automatically Harnessing Sparse Acceleration
Sparse linear algebra is central to many scientific programs, yet compilers
fail to optimize it well. High-performance libraries are available, but
adoption costs are significant. Moreover, libraries tie programs into
vendor-specific software and hardware ecosystems, creating non-portable code.
In this paper, we develop a new approach based on our specification Language
for implementers of Linear Algebra Computations (LiLAC). Rather than requiring
the application developer to (re)write every program for a given library, the
burden is shifted to a one-off description by the library implementer. The
LiLAC-enabled compiler uses this to insert appropriate library routines without
source code changes.
LiLAC provides automatic data marshaling, maintaining state between calls and
minimizing data transfers. Appropriate places for library insertion are
detected in compiler intermediate representation, independent of source
languages.
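As a hedged illustration of the kind of pattern this targets (the kernel below is a
generic CSR sparse matrix-vector product, not LiLAC specification syntax, and the
named replacement libraries are assumptions), a plain loop nest like the following
is what a LiLAC-enabled compiler could recognize in its intermediate representation
and redirect to a tuned routine:

/* Generic CSR sparse matrix-vector multiply, y = A * x.
 * A LiLAC-enabled compiler could detect this loop nest in IR and replace
 * it with a call into a tuned library (e.g., an MKL or cuSPARSE SpMV)
 * without any change to the application source. Illustrative sketch only. */
void spmv_csr(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* indirect access through index array col */
        y[i] = sum;
    }
}

The data marshaling described above would then keep the library-side copies of
rowptr, col, and val alive across repeated calls, so the matrix is not re-packaged
or re-transferred on every invocation.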
We evaluated our approach on large-scale scientific applications written in FORTRAN;
standard C/C++ and FORTRAN benchmarks; and C++ graph analytics kernels. Across
heterogeneous platforms, applications, and data sets we show speedups of 1.1x to
over 10x without user intervention.
Comment: Accepted to CC 202
Automatic Sparse Computation Parallelization by Utilizing Domain-Specific Knowledge in Data Dependence Analysis
Sparse vectors, matrices, and tensors are commonly used to
compress the nonzero values of large data sets manipulated in data analytics,
scientific simulations, and machine learning computations.
As with general computations, parallelizing loops in sparse computations,
i.e., codes that manipulate sparse structures, is essential to efficiently utilize
available parallel architectures.
Sparse computations often exhibit partial parallelism
in loops that are sequential in the corresponding dense computation,
because the data dependences are themselves sparse: they arise from indirect
memory accesses through index arrays (e.g., a column index array such as col).
Such dependences can only be discovered at runtime, when the contents of the index arrays are available.
Consequently, performance programmers typically use the inspector/executor strategy
to take advantage of partial parallelism in sparse computations.
The programmer implements inspector code that builds an iteration dependence
graph at runtime, from which wavefronts of iterations are extracted
and fed into a parallel version of the computation called the executor.
The executor runs the wavefronts sequentially to respect the sparse dependences,
while executing the iterations inside each wavefront in parallel.
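A minimal hand-written sketch of this strategy, using a sparse lower-triangular
solve in CSR form as the running example (the kernel, array names, and OpenMP
scheduling are illustrative assumptions, not generated code):

#include <omp.h>

/* Inspector: derive a wavefront (level) for each row from the runtime
 * contents of the index arrays. Row i may only start once every earlier
 * row j = col[k] (j < i) that it reads has finished. */
static int inspect(int n, const int *rowptr, const int *col, int *level)
{
    int max_level = 0;
    for (int i = 0; i < n; i++) {
        int lvl = 0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            if (col[k] < i && level[col[k]] + 1 > lvl)
                lvl = level[col[k]] + 1;
        level[i] = lvl;
        if (lvl > max_level) max_level = lvl;
    }
    return max_level + 1;                  /* number of wavefronts */
}

/* Executor: wavefronts run one after another to respect the sparse
 * dependences; rows within a wavefront are independent and run in parallel. */
static void execute(int n, const int *rowptr, const int *col,
                    const double *val, const double *b, double *x,
                    const int *level, int nlevels)
{
    for (int w = 0; w < nlevels; w++) {
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < n; i++) {
            if (level[i] != w) continue;   /* unbucketed scan keeps the sketch short */
            double sum = b[i], diag = 1.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
                if (col[k] == i) diag = val[k];
                else             sum -= val[k] * x[col[k]];
            }
            x[i] = sum / diag;
        }
    }
}

A production inspector would additionally bucket each wavefront's iterations into
contiguous lists so the executor does not rescan every row per wavefront; the point
here is only the division of labor between the runtime inspector and the parallel executor.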
To automate the generation of the inspector and executor code,
compiler-based loop-carried data dependence analysis is needed.
However, straightforward, automatically generated inspectors
typically have significantly higher overhead than hand-written, optimized ones.
Consequently, the specific problem that I address in this dissertation is
how to automate the strategies that expert programmers use to generate
efficient runtime inspectors for parallelizing sparse computations.
The overarching contribution of this dissertation is
an approach for encoding properties of individual index arrays,
and relationships between index arrays, as
universally quantified constraints and using them
in compiler-based data dependence analysis.
The dependence analysis is then evaluated in the context of
finding wavefront parallelism in sparse computations.
More specifically,
one contribution is an approach that automatically
uses index array properties
to prove more data dependences unsatisfiable,
removing the need to inspect them at runtime.
Other contributions are methods that use the same properties to simplify
compile-time-satisfiable dependences by finding equalities and subset relationships,
enabling the generation of faster runtime inspectors.
The last contribution is a set of compile-time methods for
expanding opportunities for array privatization in sparse computations
by treating an array as private if its contents start and end
each iteration with the same values.
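For illustration (a hypothetical kernel, not an example from the dissertation),
the workspace w below is all zeros when each outer iteration begins and is
restored to all zeros before the iteration ends, so under this criterion it can
be privatized, giving each thread its own copy and removing the apparent
cross-iteration dependence:

/* w must be zero-initialized before the loop; each iteration scatters into w,
 * consumes it, and zeroes the touched entries again, so w starts and ends
 * every iteration with the same (all-zero) contents and is privatizable. */
void per_row_norms(int n, const int *rowptr, const int *col,
                   const double *val, double *out, double *w /* size n, zeroed */)
{
    for (int i = 0; i < n; i++) {          /* candidate parallel loop */
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            w[col[k]] += val[k];           /* scatter row i into the workspace */
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            s += w[col[k]] * w[col[k]];    /* consume the workspace */
        out[i] = s;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            w[col[k]] = 0.0;               /* restore: w ends as it started */
    }
}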
Evaluation results show that my approach finds seven
fully parallel loops in seven sparse computations where a
previous compiler-based approach could not, and
efficiently extracts partial parallelism from the outermost
loops of five out of six sparse computations.
Automating Wavefront Parallelization for Sparse Matrix Computations
This paper presents a compiler and runtime framework for parallelizing sparse matrix computations that have loop-carried dependences. Our approach automatically generates a runtime inspector to collect data dependence information and achieves wavefront parallelization of the computation, where iterations within a wavefront execute in parallel, and synchronization is required across wavefronts. A key contribution of this paper involves dependence simplification, which reduces the time and space overhead of the inspector. This is implemented within a polyhedral compiler framework, extended for sparse matrix codes. Results demonstrate the feasibility of using automatically-generated inspectors and executors to optimize ILU factorization and symmetric Gauss-Seidel relaxations, which are part of the Preconditioned Conjugate Gradient (PCG) computation. Our implementation achieves a median speedup of 2.97x on 12 cores over the reference sequential PCG implementation, significantly outperforms PCG parallelized using Intel's Math Kernel Library (MKL), and is within 6% of the median performance of manually-parallelized PCG.
Funding: Scientific Discovery through Advanced Computing (SciDAC) program, U.S. Department of Energy Office of Advanced Scientific Computing Research [DE-SC0006947]; NSF [CNS-1302663, CCF-1564074].
Sparse Matrix Code Dependence Analysis Simplification at Compile Time
Analyzing array-based computations to determine data dependences is useful
for many applications including automatic parallelization, race detection,
computation and communication overlap, verification, and shape analysis. For
sparse matrix codes, array data dependence analysis is made more difficult by
the use of index arrays that make it possible to store only the nonzero entries
of the matrix (e.g., in A[B[i]], B is an index array). Here, dependence
analysis is often stymied by such indirect array accesses due to the values of
the index array not being available at compile time. Consequently, many
dependences cannot be proven unsatisfiable or determined until runtime.
Nonetheless, index arrays in sparse matrix codes often have properties such as
monotonicity of index array elements that can be exploited to reduce the amount
of runtime analysis needed. In this paper, we contribute a formulation of array
data dependence analysis that includes encoding index array properties as
universally quantified constraints. This makes it possible to leverage existing
SMT solvers to determine whether such dependences are unsatisfiable and
significantly reduces the number of dependences that require runtime analysis
in a set of eight sparse matrix kernels. Another contribution is an algorithm
for simplifying the remaining satisfiable data dependences by discovering
equalities and/or subset relationships. These simplifications are essential to
make a runtime-inspection-based approach feasible.
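To make the encoding concrete, here is a small illustrative sketch (array names
and notation are assumptions chosen for illustration, not taken verbatim from the
paper). Non-decreasing monotonicity of a CSR row-pointer array can be written as
a universally quantified constraint, and conjoining it with a candidate dependence
that would require a single nonzero position k to belong to two different rows
yields an unsatisfiable system, so that dependence needs no runtime inspection:

% Assumed index-array property: the row-pointer array is non-decreasing.
\forall i_1, i_2 :\; 0 \le i_1 \le i_2 \le n \;\Longrightarrow\; \mathit{rowptr}(i_1) \le \mathit{rowptr}(i_2)

% Candidate loop-carried dependence: nonzero position k lies in row i_1 and in a later row i_2.
\exists i_1, i_2, k :\; i_1 < i_2 \;\wedge\; \mathit{rowptr}(i_1) \le k < \mathit{rowptr}(i_1 + 1) \;\wedge\; \mathit{rowptr}(i_2) \le k < \mathit{rowptr}(i_2 + 1)

Since i_1 + 1 \le i_2 forces \mathit{rowptr}(i_1 + 1) \le \mathit{rowptr}(i_2), the two
interval constraints on k contradict each other, so an SMT solver can discharge this
dependence entirely at compile time.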