858 research outputs found
Performance Analysis and Optimization of Sparse Matrix-Vector Multiplication on Modern Multi- and Many-Core Processors
This paper presents a low-overhead optimizer for the ubiquitous sparse
matrix-vector multiplication (SpMV) kernel. Architectural diversity among
different processors together with structural diversity among different sparse
matrices lead to bottleneck diversity. This justifies an SpMV optimizer that is
both matrix- and architecture-adaptive through runtime specialization. To this
direction, we present an approach that first identifies the performance
bottlenecks of SpMV for a given sparse matrix on the target platform either
through profiling or by matrix property inspection, and then selects suitable
optimizations to tackle those bottlenecks. Our optimization pool is based on
the widely used Compressed Sparse Row (CSR) sparse matrix storage format and
has low preprocessing overheads, making our overall approach practical even in
cases where fast decision making and optimization setup is required. We
evaluate our optimizer on three x86-based computing platforms and demonstrate
that it is able to distinguish and appropriately optimize SpMV for the majority
of matrices in a representative test suite, leading to significant speedups
over the CSR and Inspector-Executor CSR SpMV kernels available in the latest
release of the Intel MKL library.Comment: 10 pages, 7 figures, ICPP 201
Julia: A Fresh Approach to Numerical Computing
Bridging cultures that have often been distant, Julia combines expertise from
the diverse fields of computer science and computational science to create a
new approach to numerical computing. Julia is designed to be easy and fast.
Julia questions notions generally held as "laws of nature" by practitioners of
numerical computing:
1. High-level dynamic programs have to be slow.
2. One must prototype in one language and then rewrite in another language
for speed or deployment, and
3. There are parts of a system for the programmer, and other parts best left
untouched as they are built by the experts.
We introduce the Julia programming language and its design --- a dance
between specialization and abstraction. Specialization allows for custom
treatment. Multiple dispatch, a technique from computer science, picks the
right algorithm for the right circumstance. Abstraction, what good computation
is really about, recognizes what remains the same after differences are
stripped away. Abstractions in mathematics are captured as code through another
technique from computer science, generic programming.
Julia shows that one can have machine performance without sacrificing human
convenience.Comment: 37 page
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
Fast Linear Programming through Transprecision Computing on Small and Sparse Data
A plethora of program analysis and optimization techniques rely on linear programming at their heart. However, such techniques are often considered too slow for production use. While today’s best solvers are optimized for complex problems with thousands of dimensions, linear programming, as used in compilers, is typically applied to small and seemingly trivial problems, but to many instances in a single compilation run. As a result, compilers do not benefit from decades of research on optimizing large-scale linear programming. We design a simplex solver targeted at compilers. A novel theory of transprecision computation applied from individual elements to full data-structures provides the computational foundation. By carefully combining it with optimized representations for small and sparse matrices and specialized small-coefficient algorithms, we (1) reduce memory traffic, (2) exploit wide vectors, and (3) use low-precision arithmetic units effectively. We evaluate our work by embedding our solver into a state-of-the-art integer set library and implement one essential operation, coalescing, on top of our transprecision solver. Our evaluation shows more than an order-of-magnitude speedup on the core simplex pivot operation and a mean speedup of 3.2x (vs. GMP) and 4.6x (vs. IMath) for the optimized coalescing operation. Our results demonstrate that our optimizations exploit the wide SIMD instructions of modern microarchitectures effectively. We expect our work to provide foundations for a future integer set library that uses transprecision arithmetic to accelerate compiler analyses.ISSN:2475-142
Sparse Automatic Differentiation for Large-Scale Computations Using Abstract Elementary Algebra
Most numerical solvers and libraries nowadays are implemented to use
mathematical models created with language-specific built-in data types (e.g.
real in Fortran or double in C) and their respective elementary algebra
implementations. However, built-in elementary algebra typically has limited
functionality and often restricts flexibility of mathematical models and
analysis types that can be applied to those models. To overcome this
limitation, a number of domain-specific languages with more feature-rich
built-in data types have been proposed. In this paper, we argue that if
numerical libraries and solvers are designed to use abstract elementary algebra
rather than language-specific built-in algebra, modern mainstream languages can
be as effective as any domain-specific language. We illustrate our ideas using
the example of sparse Jacobian matrix computation. We implement an automatic
differentiation method that takes advantage of sparse system structures and is
straightforward to parallelize in MPI setting. Furthermore, we show that the
computational cost scales linearly with the size of the system.Comment: Submitted to ACM Transactions on Mathematical Softwar
On Optimal Partitioning For Sparse Matrices In Variable Block Row Format
The Variable Block Row (VBR) format is an influential blocked sparse matrix
format designed to represent shared sparsity structure between adjacent rows
and columns. VBR consists of groups of adjacent rows and columns, storing the
resulting blocks that contain nonzeros in a dense format. This reduces the
memory footprint and enables optimizations such as register blocking and
instruction-level parallelism. Existing approaches use heuristics to determine
which rows and columns should be grouped together. We adapt and optimize a
dynamic programming algorithm for sequential hypergraph partitioning to produce
a linear time algorithm which can determine the optimal partition of rows under
an expressive cost model, assuming the column partition remains fixed.
Furthermore, we show that the problem of determining an optimal partition for
the rows and columns simultaneously is NP-Hard under a simple linear cost
model.
To evaluate our algorithm empirically against existing heuristics, we
introduce the 1D-VBR format, a specialization of VBR format where columns are
left ungrouped. We evaluate our algorithms on all 1626 real-valued matrices in
the SuiteSparse Matrix Collection. When asked to minimize an empirically
derived cost model for a sparse matrix-vector multiplication kernel, our
algorithm produced partitions whose 1D-VBR realizations achieve a speedup of at
least 1.18 over an unblocked kernel on 25% of the matrices, and a speedup of at
least 1.59 on 12.5% of the matrices. The 1D-VBR representation produced by our
algorithm had faster SpMVs than the 1D-VBR representations produced by any
existing heuristics on 87.8% of the test matrices
Adapt or Become Extinct!:The Case for a Unified Framework for Deployment-Time Optimization
The High-Performance Computing ecosystem consists of a large variety of execution platforms that demonstrate a wide diversity in hardware characteristics such as CPU architecture, memory organization, interconnection network, accelerators, etc. This environment also presents a number of hard boundaries (walls) for applications which limit software development (parallel programming wall), performance (memory wall, communication wall) and viability (power wall). The only way to survive in such a demanding environment is by adaptation. In this paper we discuss how dynamic information collected during the execution of an application can be utilized to adapt the execution context and may lead to performance gains beyond those provided by static information and compile-time adaptation. We consider specialization based on dynamic information like user input, architectural characteristics such as the memory hierarchy organization, and the execution profile of the application as obtained from the execution platform\u27s performance monitoring units. One of the challenges of future execution platforms is to allow the seamless integration of these various kinds of information with information obtained from static analysis (either during ahead-of-time or just-in-time) compilation. We extend the notion of information-driven adaptation and outline the architecture of an infrastructure designed to enable information flow and adaptation through-out the life-cycle of an application
- …