GHOST: Building blocks for high performance sparse linear algebra on
  heterogeneous systems by Kreutzer, Moritz et al.
ar
X
iv
:1
50
7.
08
10
1v
3 
 [c
s.D
C]
  1
5 F
eb
 20
16
Noname manuscript No.
(will be inserted by the editor)
GHOST: Building Blocks for High Performance
Sparse Linear Algebra on Heterogeneous Systems
Moritz Kreutzer · Jonas Thies · Melven
Ro¨hrig-Zo¨llner · Andreas Pieper · Faisal
Shahzad · Martin Galgon · Achim
Basermann · Holger Fehske · Georg Hager ·
Gerhard Wellein
the date of receipt and acceptance should be inserted later
Abstract While many of the architectural details of future exascale-class high per-
formance computer systems are still a matter of intense research, there appears to be
a general consensus that they will be strongly heterogeneous, featuring “standard”
as well as “accelerated” resources. Today, such resources are available as multicore
processors, graphics processing units (GPUs), and other accelerators such as the Intel
Xeon Phi. Any software infrastructure that claims usefulness for such environments
must be able to meet their inherent challenges: massive multi-level parallelism, topol-
ogy, asynchronicity, and abstraction. The “General, Hybrid, and Optimized Sparse
Toolkit” (GHOST) is a collection of building blocks that targets algorithms dealing
with sparse matrix representations on current and future large-scale systems. It imple-
ments the “MPI+X” paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous paral-
lelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the
details of its design with respect to the challenges posed by modern heterogeneous
M. Kreutzer · F. Shahzad · G. Hager
Erlangen Regional Computing Center, Friedrich-Alexander-Universita¨t Erlangen-Nu¨rnberg, 91058 Erlan-
gen, Germany
E-mail: {moritz.kreutzer, faisal.shahzad, georg.hager}@fau.de
J. Thies · M. Ro¨hrig-Zo¨llner · A. Basermann
German Aerospace Center (DLR), Simulation and Software Technology, 51147 Ko¨ln, Germany
E-mail: {jonas.thies, melven.roehrig-zoellner, achim.basermann}@dlr.de
A. Pieper · H. Fehske
Institute of Physics, Ernst-Moritz-Arndt-Universita¨t Greifswald, 17489 Greifswald, Germany
E-mail: {pieper, fehske}@physik.uni-greifswald.de
Martin Galgon
Bergische Universita¨t Wuppertal, 42097 Wuppertal, Germany
E-mail: galgon@math.uni-wuppertal.de
G. Wellein
Department of Computer Science, Friedrich-Alexander-Universita¨t Erlangen-Nu¨rnberg, 91058 Erlangen,
Germany
E-mail: gerhard.wellein@fau.de
2 Moritz Kreutzer et al.
supercomputers and recent algorithmic developments. Implementation details which
are indispensable for achieving high efficiency are pointed out and their necessity is
justified by performance measurements or predictions based on performance mod-
els. The library code and several applications are available as open source. We also
provide instructions on how to make use of GHOST in existing software packages,
together with a case study which demonstrates the applicability and performance of
GHOST as a component within a larger software stack.
Keywords sparse linear algebra, heterogeneous computing, software library, task
parallelism, large scale computing
1 Introduction and related work
1.1 Sparse solvers on heterogeneous hardware
Users of modern supercomputers are facing several obstacles on their way to highly
efficient software. Out of those, probably the most prominent is the ever increasing
level of parallelism in hardware architectures. Increasing the parallelism on the chips
– both in terms of the number of cores as well as inside the core itself – is currently
the only way to increase the maximum performance while keeping the energy con-
sumption at a reasonable level. The parallelization of hardware architectures peaks in
the use of accelerators, coprocessors, or graphics processing units (GPUs) for gen-
eral purpose computations. Those devices trade off core sophistication against a very
high core count, achieving an extremely high level of parallelism with unmatched
peak floating point performance per Watt. Today, more than 20% of all TOP500 [50]
systems are heterogeneous and the accelerators in those installations account for al-
most a third of the entire aggregated TOP500 performance. This evolution has led
to the emergence of a large scientific community dealing with various aspects of ac-
celerator programming and a considerable amount of accelerator-enabled software
packages. However, “heterogeneous software” often means “accelerator software”.
A fact which is frequently not being considered is that also the CPU part of a het-
erogeneous system can contribute significantly to a program’s performance. On the
other hand, even in simpler CPU-only machines, properties like memory hierarchy,
ccNUMA effects and thread affinity play an important role in high performance code
development. The addition of accelerators further increases the level of complexity
in the system topology. At the same time, the world’s largest compute clusters exhibit
a steady growth in terms of cores and nodes as a result of continuously increasing re-
quirements from application users. This poses challenges to algorithms and software
in terms of scalability.
There is a wide range of applications which require computation with very large
sparse matrices. The development of sparse linear and eigenvalue solvers that achieve
extreme parallelism therefore remains an important field of research. One example is
the analysis of novel materials like graphene [37] and topological insulators [45] in
the field of solid state physics, which is a driving force in the GHOST development
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 3
within the ESSEX project1. These eigenvalue problems require the computational
power of full petascale systems for many hours, so it is crucial to achieve optimal
performance at all levels.
Recent work in the area of sparse matrix algorithms can roughly be subdivided
into three categories. Methods that increase computational intensity (i.e. reduce/avoid
communication), methods that hide communication by overlapping it with computa-
tion, and fully asynchronous algorithms. To the first category belong, among others,
block Krylov methods and the communication avoiding GMRES (CA-GMRES [10])
method, which require optimized block vector kernels. The second category includes
the pipelined CG and GMRES methods [16]. An example for the third category
is the asynchronous ILU preconditioner by Chow and Patel [9]. Methods from the
latter two categories benefit from an easy-to-use tasking model that delivers high
performance. The novel implementation of ILU methods in [9] replaces the poorly
scaling forward/backward substitution by a matrix polynomial, increasing the per-
formance requirements of the sparse matrix-vector multiplication in preconditioned
Krylov methods.
1.2 Related work
There is a large interest in efficient heterogeneous software driven by the develop-
ments in modern supercomputer architectures described above. Many efforts follow
a task-parallel approach, which strives to map a heterogeneous workload to hetero-
geneous hardware in an efficient way. The most prominent software package imple-
menting task-parallel heterogeneous execution is MAGMA [27]. A major drawback
of MAGMA is the absence of built-in MPI support, i.e., users have to implement
MPI parallelism around MAGMA on their own. Under the hood, MAGMA uses
the StarPU runtime system as proposed by Augnoett et al. [2] for automatic task-
based work scheduling on heterogeneous systems. Another significant attempt to-
wards heterogeneous software is ViennaCL [42]. Being based on CUDA, OpenCL
or OpenMP, this software package can execute the same code on a wide range of
compute architectures. However, concurrent use of different architectures for a sin-
gle operation is not supported. Besides, ViennaCL has limited support for complex
numbers, which is problematic for many applications. The same applies to the C++
framework LAMA [25], a library based on MPI+OpenMP/CUDA with special focus
on large sparse matrices. PETSc [5] is an MPI-parallel library for the scalable solu-
tion of scientific applications modeled by partial differential equations. Its intended
programming model is pure MPI, with MPI+X support for GPUs (‘X’ being CUDA
or OpenCL) and some limited support for threading. It also lacks support of hetero-
geneous computation of single operations. Another library containing sparse iterative
solvers and preconditioners is PARALUTION [35]. However, the multi-node and
complex number support is restricted to the non-free version of this software. The
Trilinos packages Kokkos and Tpetra [3] implement an MPI+X approach similar to
the one used in GHOST. Being implemented in C++, they clearly separate the MPI
1 http://http://blogs.fau.de/essex
4 Moritz Kreutzer et al.
level (Tpetra package) from the node level (Kokkos package), whereas GHOST can
benefit from tighter integration for, e.g., improved asynchronous MPI communica-
tion (cf. Section 4.2). In Section 6.1 we will provide a performance comparison of
GHOST vs. Trilinos for an eigenvalue solver.
While all of these libraries certainly improve the accessibility of heterogeneous
hardware to a wide range of applications, they do not fit our purpose of extreme
scale eigenvalue computations in an optimal way. In particular, we believe that a
single library for building blocks integrating well-tuned kernels, communication on
all levels and good performance on heterogeneous systems ‘out of the box’ is key
to satisfying the needs of scientists who are trying to tackle problems at the edge of
‘what can be done’.
1.3 Contribution
In this work, we present the software package GHOST (General, Hybrid and Op-
timized Sparse Toolkit). As summarized in Section 1.2, there is a range of efforts
towards efficient sparse linear algebra on heterogeneous hardware driven by modern
hardware architectural developments. GHOST can be classified as an approach to-
wards a highly scalable, and truly heterogeneous sparse linear algebra toolkit with
a key target in the development process being optimal performance on all parts of
heterogeneous systems. In close collaboration with experts from the application side
we focus on a few key operations often needed in sparse eigenvalue solvers and pro-
vide highly optimized and performance-engineered implementations for those. We
show that disruptive changes of data structures may be necessary to achieve effi-
ciency on modern CPUs and accelerators featuring wide single instruction multiple
data (SIMD) units and multiple cores.
One may argue about whether performance should be the primary goal in a CS&E
software library and whether it is worth the effort to optimize a few core operations
for a two-digit percentage gain in performance. Our efforts are targeted at large-scale
supercomputers in the petaflop range and beyond, and computing time is a valuable
resource there. Even a performance gain below an order of magnitude can become
significant in terms of time, energy, and money spent on the large scale. Needless to
say, the kernels we provide can be used on smaller clusters or single workstations as
well. GHOST does not give up generality or extensibility for this purpose, rather we
aim to provide performance-optimized (guided by performance models) kernels for
some commonly used algorithms (e. g. the kernel polynomial method [24], the block
Jacobi-Davidson method [41] or Chebyshev filter diagonalization [38]). The success-
ful implementations of these methods (which are very popular in fields like material
physics and quantum chemistry) will serve as blueprints for other techniques such as
advanced preconditioners needed in other CS&E disciplines. In the application areas
we consider right now, methods such as incomplete factorization or multigrid can
usually not be applied straightforwardly. The matrices that appear may not have an
interpretation as physical quantities discretized on a mesh, they may be completely
indefinite, and they may have relatively small diagonal entries and/or random ele-
ments [13].
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 5
A key feature of GHOST is the transparency to the user when it comes to het-
erogeneous execution. In contrast to other heterogeneous software packages (cf. Sec-
tion 1.2), GHOST uses a data-parallel approach for work distribution among compute
devices. While a task-parallel approach is well-suited for workloads with complex de-
pendency graphs, the data-parallel approach used by GHOST may be favorable for
uniform workloads (i.e., algorithms where all parts have similar resource require-
ments) or algorithms where an efficient task-parallel implementation is unfeasible.
On the process level, GHOST’s tasking mechanism still allows for flexible work dis-
tribution beyond pure data parallelism.
GHOST unifies optimized low-level kernels whose development process is being
guided by performance modelling into a high-level toolkit which allows resource-
efficient execution on modern heterogeneous hardware. Note that for uniform work-
loads, performance models like the roofline model [53] are a suitable tool to check
an implementation’s efficiency. In recent work, GHOST has proven to scale up to the
petaflop level, extending the scaling studies presented in [24]. A list of challenges we
are addressing specifically and the corresponding sections in this paper can be given
as follows:
(i) Emerging asynchronous sparse solver algorithms are in need of a light-weight,
affinity-aware and threading-friendly task-based execution model. In this con-
text, the high relevance of OpenMP should be noted, which requires the tasking
model to be compatible to OpenMP-parallel codes. See Section 4.2.
(ii) Existing software rarely uses all components of heterogeneous systems in an
efficient manner. See Sections 4.1 and 5.1.
(iii) The potential performance of compute libraries is often limited by the require-
ment of high generality, leading to a lack of application-specific kernels. See
Section 5.3.
(iv) The possibilities for application developers to feed their knowledge into com-
pute libraries for higher performance are often limited. See Section 5.4.
(v) The applicability of optimization techniques like vector blocking is often lim-
ited due to restrictions in existing data structures. Fundamental changes to data
structures are often hard to integrate in existing software packages. See Sec-
tions 5.1 and 5.2.
GHOST is available as a BSD-licensed open source download [15]. Along with
it, a list of sample application based on GHOST (e.g., a Conjugate Gradient solver
and a Lanczos eigensolver) can be downloaded. On top of that, the iterative solver
package and sister project of GHOST named PHIST [36] can use GHOST to execute
more sophisticated algorithms like, e.g., the block Jacobi-Davidson eigensolver as
described in [41], and blocked versions of the MinRes and GMRES linear solvers.
1.4 Testbed
All experiments in this paper have been conducted on the Emmy2 cluster located at
the Erlangen Regional Computing Center. Table 1 summarizes the hardware com-
2 http://www.rrze.fau.de/dienste/arbeiten-rechnen/hpc/systeme/emmy-cluster.shtml
6 Moritz Kreutzer et al.
Alias Model Clock SIMD Cores/ b Ppeak
(MHz) (Bytes) SMX (GB/s) (Gflop/s)
CPU Intel Xeon E5-2660 v2 2200 32 10 50 176
GPU Nvidia Tesla K20m 706 128 . . . 512∗ 13 150 1174
PHI Intel Xeon Phi 5110P 1050 64 60 150 1008
Table 1: Relevant properties of all architectures used in this paper. The attainable
memory bandwidth as measured with the STREAM [29] benchmark is denoted by
b and Ppeak is the theoretical peak floating point performance. Turbo mode was acti-
vated on the CPU and the GPU was configured with ECC enabled.
∗: SIMD processing is done by 32 threads. Hence, the SIMD width in bytes depends
on the data type in use: 128 bytes is valid for 4-byte (single precision floating point)
data while 512 bytes corresponds to complex double precision data.
ponents used in this cluster. The Intel C/C++ compiler in version 14 and CUDA in
version 6.5 have been used for compilation. Intel MKL 11.1 was used as the BLAS
library on the CPU.
2 Design principles
In this section, fundamental design decisions of the GHOST development are dis-
cussed and justified. This includes the support of certain hardware architectures as
well as fundamental parallelization paradigms.
2.1 Supported architectures and programming models
Many modern compute clusters comprise heterogeneous nodes. Usually, such nodes
consist of one or more multi-core CPUs and one or more accelerator cards. In the
TOP500 [50] systems of November 2014, 96% of all accelerator performance stems
from NVIDIA Fermi/Kepler GPUs or Intel Xeon Phi (to be called “PHI”) copro-
cessors. Hence, we decided to limit the accelerator support in GHOST to those two
architectures. Instead of implementing support for any kind of hardware architec-
ture, our primary goal is to stick to the dominant platforms and to develop properly
performance-engineered code for those.
Although GPUs and PHIs share the name accelerators, there are significant dif-
ferences in how those devices are operated. GPUs can only be driven in accelerator
mode, i.e., data transfers and compute kernels must be launched explicitly from a
main program running on a host CPU. The PHI can be operated in accelerator mode,
too. However, in addition to this, the PHI can also be driven in native mode, i.e., in
the same way as a multicore CPU would be used. In GHOST only native execution
on the PHI is supported, i.e., the PHI hosts its own process. Hence, the PHI can be
considered as a CPU node on its own. With regard to the PHI as a multi-(many-)core
CPU, it has to be taken into account that serial code may run at very low performance
due to its very simple core architecture.
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 7
MEM 1
20
21
SOCKET 1
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
MEM 0
0
1
SOCKET 0
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Intel Xeon PhiNvidia GPU
(a) Heterogeneous node
MEM 1
20
21
SOCKET 1
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
MEM 0
0
1
SOCKET 0
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Intel Xeon PhiNvidia GPU
Process 0
Process 1 Process 3
Process 2
(b) Process placement
Fig. 1: Heterogeneous compute node and sensible process placement as suggested by
GHOST.
2.2 Parallelism in GHOST
For illustration of the principles, we consider a heterogeneous node as shown in
Fig. 1a. This node contains two multicore CPU sockets with ten cores and two-way
hyper-threading each. In total, there are 20 hardware threads or processing units (PUs)
available per socket. In addition to that, one GPU and one PHI are attached to the
node as accelerators. Note that a node with two different accelerator architectures is
unlikely to be installed in a production system.
In terms of parallelization, GHOST implements the “MPI+X” paradigm, i.e.,
coarse-grained parallelism is done by means of MPI accompanied with fine-grained
and device-specific parallelization mechanisms (“X”). One may certainly omit the
“X” part and go with plain MPI altogether if the hardware can be efficiently utilized
in this way; the plain fact that modern hardware exhibits complex topologies does not
mean that a hybrid programming model is required in all cases, and it may sometimes
even be counterproductive. However, interesting opportunities in terms of load bal-
ancing, additional levels of parallelism, communication hiding, etc., arise from com-
bining MPI with a threading model [40]. This is why GHOST supports OpenMP for
the “X” component on CPUs. Further down the hardware hierarchy, implicit vector-
ization by compiler-friendly code and pragmas as well as explicit vectorization using
Single Instruction Multiple Data (SIMD) intrinsics provide efficient single-threaded
code. On Nvidia GPUs, CUDA is used as the “X” parallelization layer.
In general, parallelism in compute applications can be categorized into data and
task parallelism. The term data parallelism describes a number of workers operat-
ing on the same data set, each having assigned a certain amount of work. The term
task parallelism describes workers working on independent tasks at the same time.
GHOST implements both data (between processes) as described in Section 4.1 and
task parallelism (inside a process) as analyzed in Section 4.2.
In many cases, algorithms from sparse linear algebra are centered around a single
and potentially large sparse system matrix. Hence, the distribution of work in GHOST
is done in a matrix-centered way. More precisely, the system matrix is distributed
8 Moritz Kreutzer et al.
row-wise across the MPI processes. The amount of work per process can either be
expressed by the number of rows or the number of nonzero elements. Details on the
implementation are given in Section 4.1.
3 Available data structures
There are two major data structures in GHOST, namely sparse and dense matri-
ces (ghost sparsemat and ghost densemat, respectively). Dense vectors are
represented as dense matrices with a single column. Both data structures implement
a row-wise distribution among MPI processes. We do not support 2D partitionings
of these data structures or direct conversion routines between them, but this may be
added in the future.
3.1 Sparse matrices
GHOST supports the SELL-C-σ sparse matrix storage format as introduced in [24].
Note that this is not necessarily a restriction, as the well-known CRS storage format
can be expressed as SELL-1-1. Further special cases of SELL-C-σ will be listed
in Section 5.1. More details on sparse matrix storage are given in Section 5.1. As
mentioned in Section 2.2, the sparse system matrix is the central data structure in
GHOST.
A significant performance bottleneck for highly scalable sparse solvers may be
the generation of the system matrix. In GHOST this matrix can be stored in a file,
either in the Matrix Market [28] format or a binary format which resembles the CRS
data format. However, the scalability of this approach is intrinsically limited. The pre-
ferred method of matrix construction in GHOST is via a callback function provided
by the user, which allows to construct a matrix row by row. The function must have
the following signature:
int mat(ghost_gidx row, ghost_lidx *len, ghost_gidx *col, void *val, void *arg);
GHOST passes the global matrix row to the function. The user should then store the
number of non-zeros of this row in the argument rowlen and the column indices
and values of the non-zeros in col and val. Any further arguments can be passed in
arg. The maximum number of non-zeros must be set in advance such that GHOST
can reserve enough space for the col and val arrays.
There are several reasons which make it necessary to permute rows of the sparse
system matrix. A global (inter-process) permutation of matrix rows can be applied
in order to minimize the communication volume of, e.g., the sparse matrix vector
multiplication (SpMV) kernel and to enforce more cache-friendly memory access
patterns. Currently, GHOST can be linked against PT-SCOTCH [8] for this purpose.
A matrix’ row lengths and column indices are passed to PT-SCOTCH which results
in a permutation vector on each process containing global indices. Afterwards, the
matrix is assembled on each process according to the global permutation. Our exper-
iments revealed that this approach is limited in terms of scalability. For that reason,
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 9
we are going to include support for more global permutation schemes that improve
communication reduction in future work, such as the parallel hypergraph partitioner
as implemented in Zoltan [12].
In addition to the global permutation, a local (intra-process) permutation can be
applied, e.g., to minimize the storage overhead of the SELL-C-σ sparse matrix format
(cf. Section 5.1. Another potential reason for a local matrix permutation is row col-
oring. GHOST has the possibility to permute a sparse matrix according to a coloring
scheme obtained from ColPack [14]. This kind of re-ordering may be necessary for
the parallelization of, e.g., the Kaczmarz [21] algorithm or a Gauß-Seidel smoother
as present in the HPCG benchmark.
Note that an application-based permutation, e.g., by optimizing the numbering
of nodes in a mesh-based problem, usually leads to better overall performance and
should be preferred over an a posteriori permutation, e.g., with PT-SCOTCH. In
GHOST the former can be achieved by the user by a sensible implementation of
the matrix construction via the callback interface.
3.2 Dense matrices
GHOST is a framework for sparse linear algebra. Dense matrices are mainly oc-
curring as dense vectors (dense matrices with a single column) or blocks of dense
vectors (to be referred to as block vectors). Section 5.2 will cover the aspect of block
vectors in more detail. Block vectors can be considered as tall and skinny dense ma-
trices, i.e., dense matrices with a large row count (in the dimension of the sparse
system matrix) but relatively few (at most a few hundred) columns. Furthermore, a
ghost densemat can be used to represent small local or replicated matrices, e.g.
the result of an inner product of two (distributed) block vectors.
O   r C	
 t
S
 e t
Fig. 2: Views of a dense matrix.
Instead of allocating its own memory, a
dense matrix can also be created as a view of
another dense matrix or a view of arbitrary
data in memory. This makes it easily possi-
ble to let a function work on a sub-matrix
or a subset of vectors in a larger block vec-
tor without having to copy any data. Addi-
tionally, by viewing “raw” data in memory it
is possible to integrate GHOST into existing
code (cf. Section 6). A potential disadvantage of using non-GHOST data structures is
the violation of data alignment which may result in a performance loss. GHOST im-
plements different kinds of views, as shown in Fig. 2. In general, compact views allow
vectorized computation with the matrix data. This is not the case for scattered views
due to the “gaps” in memory layout in the leading dimension caused by columns not
included in the view. In this case, it may be favorable to create a compact clone of the
scattered view before executing the computation.
Dense matrices can be chosen to be stored in a (locally) row- or column-major
manner. In many cases, row-major storage (which corresponds to interleaved storage
10 Moritz Kreutzer et al.
of block vectors) yields better performance and should be preferred over column-
major storage (cf. Section 5.2). On the other hand, column-major storage may be
required for easy integration of GHOST into existing software. GHOST offers mech-
anisms to change the storage layout either in-place or out-of-place, while copying a
block vector.
4 Runtime features
In this section we describe runtime features which are deeply woven into the software
architecture and constitute GHOST’s unique feature set. In contrast to the so-called
performance features which will be introduced in Section 5, they are fundamentally
built into the library and hard to apply to other approaches.
4.1 Transparent and data-parallel heterogeneous execution
The distribution of work among the heterogeneous components is done on a per-
process basis where each process (MPI rank) is bound to a fixed set of PUs. This
allows flexible scaling and adaption to various kinds of heterogeneous systems. The
sets of PUs on a single node are disjoint, i.e., the compute resources are exclusively
available to a process. For the example node shown in Fig. 1a, the minimum amount
of processes for heterogeneous execution on the full node is three.
Application developers are frequently confused by the ccNUMA memory struc-
ture of modern compute nodes and how to handle it to avoid performance penalties.
Although the required programming strategies are textbook knowledge today, estab-
lishing perfect local memory access may be tricky if a multithreaded process spans
multiple ccNUMA domains even if proper thread-core affinity is in place and parallel
first-touch initialization is performed [18]. A simple way to avoid ccNUMA problems
is to create one process per multicore CPU socket, which would result in a process
count of four as illustrated for the example node in Fig. 1b. Processes 0 and 2 cover
one CPU socket each. Process 1 drives the GPU. As this has to be done in accelerator
mode, this process also occupies one core of the host system. Note that this core is
located on the socket whose PCI Express bus the GPU is attached to and thus has to
be subtracted from Process 0’s CPU set. Process 3 is used for the PHI. The process
can directly be located on the accelerator which is used in native mode, i.e., no host
resources are used for driving the PHI.
For each numerical function in the GHOST library, implementations for the dif-
ferent architectures are present. However, the choice of the specific implementation
of a kernel does not have to be made by the user. Consequently, in almost all usage
cases, no changes to the code are necessary when switching between different hard-
ware architectures. An exception of this rule is, e.g., the creation of a dense matrix
view from plain data: If the dense matrix is located on a GPU, the plain data must be
valid GPU memory.
An intrinsic property of heterogeneous systems is that the components differ in
terms of performance. For efficient heterogeneous execution it is important that the
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 11
Proc. GB/s
0
1
2 50
150
50
3
150
MM 
20
21
 1
22
23
24
25
26
2
28
29
30
31
32
33
34
35
36
3
38
39
MM 
0
1
 0
2
3
4
5
6

8
9
10
11
12
13
14
15
16
1
18
19
I ff fiflffiNvidia GPU
Process 0
Process 1 Process 3
Process 2
(1)
(2)
(3)
Fig. 3: Heterogeneous row-wise distribution of a sparse matrix. Step (1) is the deter-
mination of process weights according to the device’s peak memory bandwidths. In
step (2), a partial sparse matrix is created on each process. In order to avoid integer
overflows and storage overhead in the communication buffers, the column indices of
elements in the remote matrix part (pale colors) are compressed in step (3).
performance differences get reflected in the work distribution. In GHOST the under-
lying sparse system matrix gets divided on a row-wise basis among all processes. For
example, if component A is expected to have twice the performance of component B,
process A will get assigned a twice as large chunk (either in terms of rows or in terms
of nonzero elements) of the system matrix as process B. Figure 3 illustrates the row-
wise distribution of a sparse matrix among the example processes shown in Fig. 1b.
As the performance of sparse solvers is often bound by main memory bandwidth, the
device-specific maximum attainable bandwidth, as given in Table 1, has been chosen
as the work distribution criterion in this example. Note that an arbitrary work share
for each process/architecture can easily be specified at runtime.
Internally, each process gets assigned a type which allows to define the compute
platform used by an executable. Valid types are CPU and GPU. The type can be set
explicitly at runtime either via API calls or by specifying an environment variable. If
multiple processes are launched on a node containing CPUs and GPUs, the type gets
selected automatically if not explicitly specified. In this case, Process 0 is always of
type CPU, initially covering all CPUs in the node. Processes 1 to N are of type GPU
where N is equal to the number of GPUs attached to the node. For each GPU process
getting added to a node, a small CPU set (usually a single core) gets subtracted from
Process 0’s resources. If any more than (1 + “Number of GPUs”) processes get placed
on a node, the addition of any further processes causes a division of Process 0’s CPU
set into equally sized smaller CPU sets. A good number of processes to put on a node
is (“Number of CPUs” + “Number of GPUs”), which is an easy way to avoid NUMA
locality problems by having one process per CPU socket.
In the following we demonstrate the heterogeneous execution capabilities on our
example node using a simple program which measures the SpMV performance for
a given matrix and storage format (downloadable from the GHOST website [15]).
In this case, we used the Janna/ML Geer matrix 3 (dimension n = 1,504,002,
3 http://www.cise.ufl.edu/research/sparse/matrices/Janna/ML_Geer.html
12 Moritz Kreutzer et al.
number of non-zeros nnz = 110,686,677) stored in SELL-32-1. Performance will be
reported in Gflop/s, with 1 Gflop/s corresponding to a minimum memory bandwidth
of 6 GByte/s. This relation is founded on the minimum code balance of the SpMV
kernel. If we want to perform computations on the CPU only and use one process per
CPU socket, the type has to be set explicitly and a suitable number of processes has
to be launched on the host:
> GHOST TYPE=CPU mpiexec -nopin -np 2 -ppn 2 ./spmvbench -v -m ML_Geer.mtx \
-f SELL-32-1
[GHOST] PERFWARNING: The number of MPI ranks (1) on this node is not optimal!
Suggested number: 3 (2 NUMA domains + 1 CUDA device)
[GHOST] PERFWARNING: There is 1 Xeon Phi in the set of active nodes but only 0
are used!
Region | Calls | P_max | P_skip10
-----------------------------------------
spmv (GF/s) | 100 | 1.64e+01 | 1.64e+01
The overall number of processes is set via the -np flag and the number of processes
per host is set using -ppn. Note that automatic thread/process pinning by the MPI
startup script has been suppressed by -nopin. This should always be done to avoid
conflicts with GHOST’s resource management. The maximum performance over all
100 runs is given in P max. P skip10 shows the average performance over all but
the first ten iterations. The performance warnings (omitted in the following listings)
issued by GHOST indicate that the node is not used to its full extent. The suggested
process count of three is in accordance with the knowledge about the node architec-
ture; each node contains two CPU sockets and one GPU. The Intel PHI attached to
this node has to be considered as a node on its own. The achieved performance of
16.4 Gflop/s matches the prediction of a simple roofline model for this algorithm and
two CPU sockets. If the example program should use the GPU for computation, the
following command has to be invoked:
> GHOST TYPE=GPU ./spmvbench -v -m ML_Geer.mtx -f SELL-32-1
Region | Calls | P_max | P_skip10
-----------------------------------------
spmv (GF/s) | 100 | 2.28e+01 | 2.27e+01
From the single-device runs we can easily deduce that the GPU execution was 2.75
times as fast as the execution on a single CPU socket. In the current implementation
the weight, i.e., the amount of work, assigned to each process in heterogeneous runs,
has to be specified manually. Future work will include automatic weight selection
based on micro-benchmarks and dynamic adaption of weights at runtime (cf. 7.2).
Starting the example program using three processes and a work ratio between CPU
and GPU of 1:2.75 yields the following:
> mpiexec -nopin -np 3 -ppn 3 ./spmvbench -v -m ML_Geer.mtx -f SELL-32-1 \
-w 1:2.75
[GHOST] PE0 INFO: Setting GHOST type to CPU.
[GHOST] PE1 INFO: Setting GHOST type to GPU.
[GHOST] PE2 INFO: Setting GHOST type to CPU.
Region | Calls | P_max | P_skip10
-----------------------------------------
spmv (GF/s) | 100 | 3.11e+01 | 3.09e+01
The information log messages indicate that the process types have automatically been
set as described above. The achieved performance is less than the accumulated single-
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 13
device performances. This is due to the MPI communication of input vector data
which is done in each SpMV iteration. For testing purposes, it is possible to suppress
the communication by selecting an appropriate “pseudo SpMV” routine. Note that
this does not give the correct result for the SpMV operation if the input vector data
changes between successive iterations.
> mpiexec -nopin -np 3 -ppn 3 ./spmvbench -v -m ML_Geer.mtx -f SELL-32-1 \
-w 1:2.75 -s nocomm
...
Region | Calls | P_max | P_skip10
-----------------------------------------
spmv (GF/s) | 100 | 3.85e+01 | 3.73e+01
Now, the heterogeneous performance sums up to the accumulated single-device per-
formances. In order to include the node’s PHI in the computation, the library and
executable have to be compiled for the MIC architecture, resulting in an additional
executable file ./spmvbench.mic. For setting up heterogeneous execution using
the PHI, the following has to be done:
> # Assemble machine file for three MPI ranks on the host and one on the PHI
> echo -e "$(hostname -s):3\n$(hostname -s)-mic0:1" > machinefile
> export I_MPI_MIC=1 # Enable MPI on the PHI
> export I_MPI_MIC_POSTFIX=.mic # Specify the postfix of the MIC executable
Using all parts of the heterogeneous node for computing the “pseudo SpMV”, the
following is obtained:
> mpiexec -nopin -np 4 -machinefile machinefile \
./spmvbench -v -m ML_Geer.mtx -f SELL-32-1 -w 1:2.75:2.75 -s nocomm
...
Region | Calls | P_max | P_skip10
-----------------------------------------
spmv (GF/s) | 100 | 5.64e+01 | 5.47e+01
The PHI got assigned the same share of work as the GPU. Note that there may still
be optimization potential regarding the load balance. The total node performance
adds up to approximately 55 Gflop/s, which indicates a good use of the aggregated
memory bandwidth of all resources. If the actual SpMV function is computed, the
optimal weights are slightly different, due to a higher communication effort on the
GPU and PHI.
> mpiexec -nopin -np 4 -machinefile machinefile \
./spmvbench -v -m ML_Geer.mtx -f SELL-32-1 -w 1:2.3:2.1
...
Region | Calls | P_max | P_skip10
-----------------------------------------
spmv (GF/s) | 100 | 3.97e+01 | 3.88e+01
Note that the inclusion of the PHI barely leads to a performance benefit over the
CPU+GPU run. This is due to the small amount of work on each device and an
increasing dominance of communication over the slow PCI express bus as a result
of this. The impact of communication may be reduced by matrix re-ordering (cf.
Section 3.1) or more sophisticated communication mechanisms (using GPUdirect,
pipelined communication, etc., cf. Section 7.2).
14 Moritz Kreutzer et al.
4.2 Affinity-aware resource management
Although GHOST follows a data-parallel approach across processes for heteroge-
neous execution, work is organized in tasks on the process level. The increasing
asynchronicity of algorithms together with the necessity for sensible hardware affin-
ity and the avoidance of resource conflicts constitute the need for a unit of work
which is aware of the underlying hardware has to be used: a GHOST task. Affinity
and resource management is implemented by means of the hwloc library [7] which
is, besides a BLAS library, the only build dependency of GHOST.
There are existing solutions for task parallelism. Apart from the ones named in
Section 1.2, OpenMP tasks or Intel’s Threading Building Blocks can be mentioned
here. However, a crucial requirement in the design phase of GHOST was to support
affinity-aware OpenMP parallelism in user-defined tasks. As we could not find a light-
weight existing solution which meets our requirements, we decided to implement
an appropriate tasking model from scratch. For example, both Intel TBB and Cilk
Plus warn about using those frameworks together with OpenMP in their user manu-
als. This is due to potential performance issues caused by core over-subscription. As
OpenMP is in widespread use in scientific codes, this limitation disqualifies the inte-
gration of TBB and Cilk Plus tasks for many existing applications. Note that GHOST
tasks lack a list of features compared to existing solutions, such as intelligent resolu-
tion of dependencies in complex scenarios. Yet, for most of our usage scenarios they
work well enough. In our opinion, a holistic performance engineering approach is
a key to optimal performance for complex scenarios. Thus, we decided to make the
resource management a part of the GHOST library.
Generally speaking, a task’s work can be an arbitrary user-defined function in
which OpenMP can be used without having to worry about thread affinity or resource
conflicts. The threads of a task are pinned to exclusive compute resources, if not
specified otherwise. GHOST tasks are used in the library itself, e.g., for explicitly
overlapping communication and computation in a parallel SpMV (see below). How-
ever, the mechanism is designed in a way that allows easy integration of the tasking
capabilities also into user code. The user-relevant properties of a task are as follows:
typedef struct {
void * (*func) (void *); /* callback function where work is done */
void * arg; /* arguments to the work function */
void * ret; /* return value of the work function */
struct ghost_task **depends; /* list of tasks on which this task depends */
int ndepends; /* length of dependency list */
int nthreads; /* number of threads for this task */
int numanode; /* preferred NUMA node for this task */
ghost_task_flags flags; /* flags to configure this task */
} ghost_task;
The user-defined callback function containing the task’s work and its arguments have
to be provided in the func and arg fields. The function’s return value will be
stored by GHOST in ret. If the execution of this task depends on the completion
of ndepends other tasks, those can be specified as a list of tasks called depends.
The number of threads for the task has to be specified in nthreads. Usually, a suit-
able number of PUs will be reserved for this task. The field numanode specifies the
preferred NUMA node of PUs reserved for this task, which is important for situations
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 15
where different tasks work with the same data in main memory in a process which
spans several NUMA nodes. In this situation, and assuming a NUMA first touch pol-
icy, one can enforce a task which works on specific data to run on the same NUMA
node as the task which initialized this data. The flags can be a combination of the
following:
typedef enum {
GHOST_TASK_DEFAULT = 0; /* no special properties */
GHOST_TASK_PRIO_HIGH = 1; /* enqueue task to head of the task queue */
GHOST_TASK_NUMANODE_STRICT = 2; /* run task _only_ on the given NUMA node */
GHOST_TASK_NOT_ALLOW_CHILD = 4; /* disallow child tasks to use task’s PUs */
GHOST_TASK_NOT_PIN = 8; /* neither reserve PUs nor pin threads */
} ghost_task_flags;
pthread_cond_wait()
GHOSTApplication
#pragma omp parallel
s

 
!

 
"
d
 t
h
re
a
d
s
pthread_create()
pthread_cond_signal()
sched_setaffinity()
pumap_set_busy()
task->ret = task->func(task->arg)
w
#
"
$
e
rs
a
%
&
s

 
!
'
(

re
a
d
sched_setaffinity()
pumap_set_idle()
#pragma omp parallel
pthread_cond_wait()
ghost_task_enqueue()
ghost_init()
T)*+ p),)--.-/*0
m
a
4
%
5
6
th
re
a
d
7)0. p/88.9 :;,+er threads as before
<
p.8
=>
p),)--.- ,.
?
/;8 /8 :;,+
@A
8
BD
/;8
ghost_finalize()
pthread_join()
exit(EXIT_SUCCESS)
Fig. 4: Program flow of an exam-
ple application using a single GHOST
task for asynchronous task paral-
lelism.
Figure 4 shows a simple flow chart of
the execution of a GHOST application using
task parallelism with a single task. In the ini-
tialization phase GHOST creates a number
of shepherd threads which will immediately
wait on a condition. As a task gets enqueued,
this condition gets signalled which causes an
arbitrary shepherd thread to be woken up.
Note that the enqueue() function returns
immediately. A decision whether this task
can be executed is now made by the shep-
herd thread based on the task’s resource re-
quirements. In case the task can be executed,
an initial OpenMP parallel region is opened
in which all threads of the task get pinned to
their exclusive PU and each PU is set to busy
in the pumap. The task’s work function now
is called by the shepherd thread and is exe-
cuted in parallel to the user code which fol-
lowed the call to enqueue(). Due to phys-
ical persistence of OpenMP threads, a par-
allel region in the task function will be ex-
ecuted by the same threads as those which
have been pinned by GHOST. Note that this
persistence is not required by any standard.
However, our experiments have shown that
the most relevant OpenMP implementations
GOMP and Intel OpenMP work like this,
which makes the assumption of persistent OpenMP threads realistic in practice. As-
suming that this assumption is invalid, one could still confine the children of a shep-
herd thread to, e.g., a specific NUMA domain, by setting the shepherd thread’s affin-
ity. This is possible since children created via fork() inherit their parent’s CPU
affinity mask. Note that this fallback mechanism involves a coarser level of affin-
ity than the original approach and might be inappropriate for scenarios where a task
spans several physical affinity (e.g., NUMA) domains. After completion of the task’s
16 Moritz Kreutzer et al.
work, the PUs are freed and threads are unpinned in another OpenMP parallel region.
At finalization time, the shepherd threads are terminated.
It is possible to create nested tasks, i.e., tasks running inside other tasks. A par-
ent task can be configured such that none of its children steals resources of it by
specifying the GHOST TASK NOT ALLOW CHILD flag. If this flag is not set, it is
expected that parents wait for their child tasks and thus, the children can occupy the
parent’s resources (as demonstrated in the task-mode SpMV example below). In the
simplest case, there is only a single task which includes all the work done in the en-
tire application. This “main task” should be created for all GHOST applications for
controlled thread placement and the avoidance of resource conflicts. Moreover, while
conducting performance analyses using hardware performance counters, controlled
placement of threads is inevitable for making sense of the measurements. On top of
this, tasks can be used to implement task-level parallelism by having several tasks
running concurrently. Due to the fact that starting a task is a non-blocking operation,
asynchronous execution of work is inherently supported by GHOST tasks. Normally,
each task uses its own set of resources (= PUs) which is not shared with other tasks.
However, as mentioned above, a task can also be requested to not reserve any compute
resources. The PUs and their busy/idle state are managed process-wide in a bitmap
called pumap. The PUs available to GHOST can be set at the initialization phase.
This feature can be used, e.g., for integration with third-party resource managers that
deliver a CPU set to be used.
A realistic usage scenario for task level parallelism is communication hiding via
explicit overlap of communication and computation. This can be done, e.g., in a paral-
lel SpMV routine, which will be called task-mode SpMV. The following code snippet
shows the implementation of a task-mode SpMV using GHOST tasks.
ghost_task *curTask, *localcompTask, *commTask;
/* get the task which I am currently in (will be split into child tasks) */
ghost_task_cur(&curTask);
/* create a heavy-weight task for computation of the local matrix part */
ghost_task_create(&localcompTask, &localcompFunc, &localcompArg,
curTask->nthreads-1, GHOST_NUMANODE_ANY);
/* create a light-weight task for communication */
ghost_task_create(&commTask, &commFunc, &commArg, 1, GHOST_NUMANODE_ANY);
/* task-parallel execution of communication and local computation */
ghost_task_enqueue(commTask); ghost_task_enqueue(localcompTask);
ghost_task_wait(commTask); ghost_task_wait(localcompTask);
/* use the current (parent) task for remote computation */
remotecompFunc(remotecompArgs);
/* destroy the child tasks and proceed with current (parent) task */
ghost_task_destroy(localcompTask); ghost_task_destroy(commTask);
In this example, a main task is being split up into two child tasks. Communica-
tion and local computation are being overlapped explicitly. In principle, this could
also be done via non-blocking MPI calls. However, experience has shown that even
nowadays some MPI implementations do not fulfill non-blocking communication re-
quests in an asynchronous way. This has been discussed in various publications where
also several attempts to solving this problem have been proposed by, among others,
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 17
No overlap
Naïve overlap
GHOST overlap
0.0e+00 5.0e-03 1.0e-02 1.5e-02 2.0e-02
Time contribution (s)
Init Combined computationCommunication
Local comp. + comm. Remote comp.Init
Init Local comp. + comm. Remote comp.
Fig. 5: Runtime contributions in different SpMV variants. The “No Overlap” mode
communicates input vector data synchronously before it computes the full SpMV.
The “Overlap” modes communicate input vector data and at the same time compute
the process-local part of the SpMV. Here, the “Naı¨ve” version relies on the asyn-
chronicity of non-blocking MPI calls, whereas the “GHOST” version uses explicit
overlap by means of GHOST tasks.
Wittmann et al. [54] and Denis [11]. Thus, in order to create an assured overlap,
independent of the MPI library, GHOST’s tasking mechanism could be used. Note
that in many application scenarios based on GHOST tasks, an MPI implementation
supporting the MPI THREAD MULTIPLE compatibility level is required.
Figure 5 depicts the potential performance gain by using GHOST tasks. In this
example, 100 parallel SpMV operations on 4 CPU-only compute nodes using the
vanHeukelum/cage154 test case (n = 5,154,859, nnz = 99,199,551) stored in
SELL-32-1024 have been performed. Note that both overlapped variants require a
splitting of the process-local matrix into a local and a remote part, where the remote
part contains entries with column indices who require communication of input vector
data. Important observations are:
(i) Overlapping communication and computation pays off in this case. The run-
time for the two overlapped variants is significantly lower than for the non-
overlapped variant. Note that this may not always be the case: The overlapped
versions require the result vector to be stored twice and the cost of this may be
higher than the benefit from communication hiding.
(ii) The MPI library apparently features asynchronous point-to-point communica-
tion routines for this problem. The execution time for overlapped local compu-
tation and communication indicates that those operations are really overlapping.
Note that this may not be the case in general, even for this MPI library. It is as
well possible that the communication volume is below the “eager limit” and
larger messages would not be transferred asynchronously.
(iii) Affinity matters. Although one would not expect the task mode variant to per-
form any better than the MPI-overlapped variant, the execution time for local
computation and communication is lower for the version using GHOST tasks.
This can be explained by explicit thread placement.
4 http://www.cise.ufl.edu/research/sparse/matrices/vanHeukelum/cage15
18 Moritz Kreutzer et al.
5 Performance features
In this section we present several features of GHOST that constitute a unique fea-
ture set leading to high performance for a wide range of applications. The goal of
GHOST is neither to provide a “Swiss army knife” for sparse matrix computations
nor to re-invent the wheel. Instead, existing implementations are used and integrated
into the GHOST interface whenever possible and feasible. In contrast to the runtime
features presented in Section 4, the described performance features may be available
in other libraries as well. In order to justify their implementation in GHOST, short
benchmarks or performance models will be shown to demonstrate the potential or
measurable benefit over standard solutions.
5.1 Sparse matrix storage
For the SpMV operation, the choice of a proper sparse matrix storage format is a
crucial ingredient for high performance. In order to account for the heterogeneous
design of GHOST and simplify heterogeneous programming, SELL-C-σ is chosen
to be the only storage format implemented in GHOST. This is no severe restriction,
since SELL-C-σ can “interpolate” between several popular formats (see below). We
briefly review the format here. A detailed and model-guided performance analysis of
the SpMV kernel using the SELL-C-σ format can be found in [23].
SELL-C-σ features the hardware-specific tuning parameter C and the matrix-
specific tuning parameter σ . The sparse matrix is cut into chunks, each containing C
rows where C should be a multiple of the hardware’s SIMD width. In a heterogeneous
environment, the relevant SIMD width should be the maximum SIMD width over all
architectures. For instance, considering our example node’s properties as given in
Table 1 and using 4-byte values (single precision) and indices, the minimum value of
C should be
[max(32,64,128) Bytes]/[4 Bytes] = 32.
The rows in a chunk are padded with trailing zeros to the length of the chunk’s
longest row. The chunk entries are stored column-/diagonal-wise. Additionally, in
order to avoid excessive storage overhead for matrices with strongly varying row
lengths, σ rows are sorted according to their nonzero count before chunk assembly.
As this is a local operation, it can be trivially parallelized (which is also done in
GHOST). Note that, due to its general formulation, a range of further storage formats
can be represented by SELL-C-σ :
– SELL-1-1 =̂ CRS/CSR
– SELL-n-1 =̂ ITPACK/ELLPACK [34]
– SELL-C-1 =̂ Sliced ELLPACK [31]
Thus, a considerable selection of known sparse matrix storage formats is supported by
GHOST. A single storage format for all architectures greatly facilitates truly hetero-
geneous programming and enables quick (matrix) data migration without conversion
overhead.
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 19
0 0.5 1 1.5 2 2.5 3 3.5
Nvidia Tesla K20c
             ("Kepler")
Intel Xeon Phi 5110P
     ("Knights Corner")
Intel Xeon E5-2680
    ("Sandy Bridge")
avg
avg
avg
Fig. 6: Performance of SpMV using the unified SELL-C-σ storage format relative
to vendor-supplied libraries with the device-specific data formats CRS for CPU/PHI
and HYB for GPU (figure taken from [23]).
Figure 6 shows the relative performance of the SELL-C-σ SpMV against the
vendor-supplied libraries Intel MKL and NVIDIA cuSPARSE using their device-
specific sparse matrix storage formats (CRS in MKL and HYB in cuSPARSE). It
turns out that the performance of SELL-C-σ is on par with or better than the standard
formats for most test matrices.
For easy integration in existing software stacks, GHOST allows to construct a
SELL-C-σ matrix from raw CRS data, i.e., row pointers, column indices, and val-
ues. A common case in CS&E applications is the subsequent appearance of multiple
sparse matrices with the same sparsity pattern but different values. Let us assume that
we want to use GHOST and SELL-C-σ for computations with a CRS matrix obtained
from another source. Obviously, gathering row lengths and column indices as well as
the assembly of communication data structures and permutation vectors only has to
be done for the first read-in in this case. Given the ML Geer matrix (cf. Section 4.1)
present in CRS, we want to perform SpMV using GHOST with SELL-32-128 on two
CPU sockets with one MPI rank each. We find that an initial complete construction
of this matrix in GHOST (including communication buffer setup and SELL permu-
tation) costs as much as 48 SpMV operations. Note that the communication buffer
setup, which has to be done independently of the library or the sparse matrix for-
mat, accounts for 78% of the construction time. Each subsequent matrix construction
only needs to update the matrix values. Hence, all values need to be read from the
CRS data and written to the SELL-C-σ matrix. Taking into account the additional
read operation due to the write-allocate of the SELL-C-σ matrix, we have at least
3×nnz matrix elements to transfer. The relative cost depends on the matrix data type.
Considering double precision data (and 32-bit indices), subsequent CRS-to-SELL-C-
σ conversions should cost as much as two SpMV operations. This performance can
also be observed in GHOST. A possible future feature may be sparse matrix views.
Using a view, a SELL-1-1 matrix could just point to existing CRS data and GHOST
could be used for computation with existing matrices at no conversion cost.
Note that GHOST differentiates between local and global indices. Even for sparse
system matrices of moderately large size, it may be necessary to have 64-bit inte-
ger numbers for global indices. However, for the process-local part of the system
matrix, 32-bit integers may still be sufficient. As data movement should be mini-
mized, especially for (often bandwidth-bound) sparse solvers, it is possible to config-
ure GHOST with 64-bit indices for global quantities (ghost gidx) and with 32-bit
indices for local quantities (ghost lidx). Thus, the column and row indices of the
20 Moritz Kreutzer et al.
entire process-local sparse matrix can be stored using 32-bit integers. Note that com-
pression of the remote column indices as shown in Fig. 3 is inevitable in this case.
Considering the minimum amount of data transfers for the SpMV operation, using
32-bit instead of 64-bit column indices for the sparse matrix results in a reduction
of data transfers between 16 % (complex double precision data) and 33 % (single
precision data).
It is also possible to incorporate matrix-free methods into GHOST. The SpMV
routine is stored as a function pointer in the ghost sparsemat. A user can replace
this function pointer by a custom function that performs the SpMV in any (possibly
matrix-free) way, while GHOST handles other kernels and communication of vector
data.
5.2 Block vectors
The architectural performance bottleneck for sparse linear algebra computations is
the main memory bandwidth for a wide range of algorithms. Hence, reducing the
movement of data through the memory interface often improves performance. One
well-known way to achieve this is to process multiple vectors at once in a SpMV
operation if allowed by the algorithm. Classic block algorithms are, e.g., the block
Conjugate Gradient (CG) method proposed by O’Leary et al. [33] and the block GM-
RES method introduced by Vital [51]. The continued relevance of this optimization
technique is seen in recent publications, e.g., by Ro¨hrig-Zo¨llner et al. [41] in which
the authors present a block Jacobi-Davidson method. Vector blocking is also very
relevant in the field of eigenvalue solvers for many inner eigenpairs. For example,
the FEAST algorithm [39] and Chebyshev filter diagonalization [44] profit from us-
ing block vector operations. Basic work on potential performance benefits from us-
ing block vectors has been conducted by Gropp et al. [17], where a performance
model for the Sparse Matrix Multiple Vector Multiplication algorithm (SpMMV) has
been established. Support for block vectors (which are also represented by objects
of ghost densemat) has been implemented for many mathematical operations in
GHOST.
Generally speaking, block vectors resemble tall and skinny dense matrices, i.e.,
matrices with many rows and few columns. Although they are represented by gen-
eral dense matrices, it has turned out that existing BLAS implementations tend to
deliver poor performance in numerical kernels using tall and skinny dense matrices.
This is the reason why selected tall and skinny matrix kernels have been implemented
directly in GHOST. Vectorized and fully unrolled versions of those kernels are au-
tomatically generated at compile time for some predefined small dimensions. See
Section 5.4 for details on code generation and its impact on performance.
Let V (n×m) and W (n× k) be tall and skinny dense matrices where m,k ≪
n. They are distributed in a row-wise manner among the processes, similar to the
system matrix in Fig. 3. X should be an m× k matrix which is redundantly stored
on each process. Three functions using tall and skinny dense matrices are currently
implemented in GHOST:
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 21
(a) W ←V X (b) X ←V T W
Fig. 7: Speedup of custom tall and skinny matrix kernels over
Intel MKL on a single CPU socket. V is n×m, W is n×k and
X is m× k, where m,k ≪ n.
1 2 3 4 5 6 7 8
Block vector width
0
10
20
30
40
50
60
Pe
rfo
rm
an
ce
 (G
flo
p/s
)
Row-major
Column-major
Fig. 8: SpMMV
performance of
row- and col-major
block vectors.
– X ← αV TW +β X
Tall skinny matrix transposed times tall skinny matrix (corresponds to inner prod-
uct of block vectors): ghost tsmttsm()
– W ← αVX +βW
Tall skinny matrix times small matrix: ghost tsmm()
– V ← αVX +βV
In-place version of ghost tsmm(): ghost tsmm inplace()
One may assume that a mature library like the Intel MKL yields optimal per-
formance for a widely-used kernel like general dense matrix matrix multiplication
(GEMM) for matrices of any shape. However, this is not the case as we demonstrate
in Fig. 7. Similar observations concerning performance drawback of MKL in the
context of tall and skinny dense matrices have been made by Anderson et al. [1]. The
GEMM kernel with (not too small) square matrices can reach a modern processor’s
peak floating point performance if properly optimized. The architectural performance
bottleneck is the CPU’s compute capability in this case. However, this does not hold
for tall and skinny matrices. The possibilities of blocking are limited due to the small
matrix dimensions in this case. This results in a GEMM kernel which should ideally
be memory-bound (as long as the dimension n is sufficiently large). In Fig. 7 it can be
observed that the GHOST versions of both kernels are at least on par with the MKL
performance for relatively small dimensions. The potential speedup can be up to 30×
for some matrix sizes. Note that if the general function ghost gemm() is called,
it first checks whether a suitable specialized function is applicable before calling the
BLAS library.
It is a known concern in extreme-scale computing that reduction operations are
susceptible to truncation errors. In GHOST, the computation of the inner product of
two tall skinny matrices (ghost tsmttsm()) is one of the kernels where problems
of this kind may occur for very large n. This motivated the addition of a variant of
this kernel which uses Kahan summation [22]. Depending on the width of the block
vectors m and k, and hence the computational intensity of the inner product, the over-
head from the additional floating point operations is small or negligible compared to
standard summation [19]. However, the improvement in accuracy may be significant
which could result in a lower iteration count for some iterative algorithms and such
22 Moritz Kreutzer et al.
a smaller time to solution. This has been demonstrated, e.g., by Mizukami [30] for
CG-like methods.
In order to achieve a transparent support of block vectors in GHOST we imple-
mented several BLAS level 1 routines and equipped them with block vector support.
Currently, the list of block vector operations includes axpy, axpby, scal, and
dot. Each of those operations works vector-wise. In addition, versions of axpy,
axpby, and scal with varying scalar factors for each vector in the block have been
added: vaxpy, vaxpby, vscal. Obviously, as block vectors are also represented
as dense matrices, all of those operations could be realized using existing BLAS level
3 routines. For example, the vscal kernel could be replaced by a multiplication with
a diagonal matrix containing the scaling factors on its diagonal. However, this would
come at additional cost by transferring zeros, which we want to avoid. Additionally,
the concerns about the efficiency of BLAS level 3 operations for tall and skinny dense
matrices apply here as well.
Figure 8 shows benchmark results for the SpMMV kernel, comparing row-major
and column-major storage of block vectors with increasing width. Storing the block
vector in row-major corresponds to interleaved storage. As expected, the performance
for row-major storage surpasses the performance of column-major storage due to a
better data access pattern. This is well known, and both vendor-supplied sparse linear
algebra libraries (Intel MKL and NVIDIA cuSparse) support row-major block vectors
in their SpMMV kernel.
5.3 Kernel fusion
Many sparse iterative solvers consist of a central SpMV routine accompanied by sev-
eral BLAS level 1/2 functions. It is thus useful to augment the SpMV with more oper-
ations according to our needs. The general sparse matrix vector multiplication func-
tion y = α(A− γI)x+β y encompasses many of the practical use cases. In GHOST
this operation can be chained with the computation of the dot products of 〈y,y〉, 〈x,y〉,
and 〈x,x〉 as well as the BLAS level 1 operation z = δ z+ηy. This approach is similar
to the well-known optimization technique of kernel fusion. Similar thoughts led to the
addition of the so-called BLAS 1.5/2.5 operators AXPY DOT, GE SUM MV, GEMVT,
TRMVT, and GEMVER to the updated BLAS standard [6]. Siek et al. [46] observed the
application specificity of these BLAS x.5 operators and made an attempt towards a
domain-specific compiler to generate arbitrarily fused kernels consisting of different
BLAS calls. This work has been continued by Nelson et al. [32], who plan to adapt
their framework towards sparse matrices in future work. Recently, the idea of kernel
fusion has gained new attention in the GPU programming community ([43,49,52]).
The in- and output vectors of the augmented SpMMV kernel are of the type gen-
eral ghost densemat. Hence, they may also be (views of) block vectors. The
values of α , β , γ , δ and η and the storage location of computed dot products are
passed to the function via a ghost spmv opts structure, which results in a sin-
gle interface function for any kind of (augmented) sparse matrix (multiple) vector
multiplication. Note that each augmentation on top of the standard SpM(M)V can
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 23
be enabled separately. In the following we show a small example of how to use this
function.
/* default options for SpMV operation */
ghost_spmv_opts opts = GHOST_SPMV_OPTS_INITIALIZER;
/* compute y = Ax where y and x may be block vectors */
ghost_spmv(y,A,x,opts);
/* compute y = (A− γI)x with different γ for each vector in the block */
double shift[nvecs] = { /* init */ }; /* shift for each vector in the block */
opts.gamma = shift;
opts.flags |= GHOST_SPMV_VSHIFT;
ghost_spmv(y,A,x,opts);
/* compute y = (A− γI)x−2y with different γ for each vector, chained with xH y */
double dot[3*nvecs], beta = -2.0;
opts.beta = &beta;
opts.dot = dot;
opts.flags |= GHOST_SPMV_AXPBY|GHOST_SPMV_DOT_XY;
ghost_spmv(y,A,x,opts);
Both, the use of block vectors and kernel fusion, have large potential regarding perfor-
mance, depending on the algorithm. For example, for the Kernel Polynomial Method,
a method for computing the eigenvalue density of quantum systems as analyzed in
[24], a 2.5-fold performance gain for the overall solver could be achieved by us-
ing block vectors and augmenting the SpMV. Fused kernels are likely to be more
cumbersome from an implementation point of view than fine-grained kernels. For
instance, fusing the SpMMV operation with block vector dot products on a GPU
leads to complex data access patterns which make an efficient implementation hard
to achieve (see [24] for details). Due to the potentially high complexity of fused ker-
nels and fundamental architectural differences, hand-optimized implementations for
each target architecture can hardly be avoided if the focus is on high efficiency in
heterogeneous settings.
A significant design decision for scientific computing libraries is whether and
how to use task and data parallelism. A task-parallel approach for work distribution
between heterogeneous devices, as implemented in MAGMA [27], may conflict with
the presented optimization technique of kernel fusion so that some optimization po-
tential is left unused. In cases where the potential benefits of task parallelism are
limited, such as the sparse matrix algorithms targeted by GHOST, data parallelism
with kernel fusion may thus be favored over task parallelism.
5.4 Low-level implementation and code generation
GHOST is implemented with the goal of efficient execution from a single core to the
petaflop level. Modern CPUs feature SIMD units which cause code vectorization to
be a crucial ingredient for efficient core-level code. For kernels with sufficient sim-
plicity, automatic vectorization is likely to be done by the compiler. If this is not the
case, GHOST addresses this issue by using compiler pragmas, or SSE, AVX or MIC
compiler intrinsics for explicit vectorization. Benchmarks on one CPU showing the
impact of vectorization on SpMV performance can be seen in Fig. 9. Here, we used
24 Moritz Kreutzer et al.
1 2 3 4 5 6 7 8 9 10
Number of cores
0
4
8
12
16
20
Pe
rfo
rm
an
ce
 (G
flo
p/s
)
SELL-4-128 (AVX intrinsics)
SELL-4-128 (plain C)
CRS (plain C)
Fig. 9: Intra-socket performance on
a single CPU showing the impact of
vectorization on SpMV performance
for different storage formats.
1 2 3 4 5 6 7 8
Block vector width
0
10
20
30
40
50
60
Pe
rfo
rm
an
ce
 (G
flo
p/s
)
Hard-coded block vector width
Arbitrary block vector width
Fig. 10: The impact of hard-coded
loop length on the SpMMV perfor-
mance with increasing block vector
width on a single CPU.
the Sinclair/3Dspectralwave5 matrix (n = 680,943, nnz = 30,290,827) in
complex double precision. A first observation is that all three variants reach the same
maximum performance when using the full socket. Due to the the bandwidth-bound
nature of the SpMV kernel this limit corresponds to the CPU’s maximum memory
bandwidth. However, the faster saturation of the explicitly vectorized SELL kernel
allows to use less cores to reach the same performance. The spare cores can be used
for working on independent tasks (cf. Section 4.2) or they can be switched off in
order to save energy. Hence, good vectorization should always be a goal, even for
bandwidth-bound kernels. Note that this is especially true on accelerator hardware,
where the width of vector units is typically larger than on standard hardware (cf.
Table 1).
An obstacle towards efficient code often observed by application developers is
lacking performance of existing program libraries due to their inherent and indis-
pensable requirement of generality. Often, better performance could be achieved if
performance-critical components were tailored to the application. Obviously, this
goal is opposing the original goal of program libraries, namely general applicabil-
ity. An important feature of GHOST for achieving high performance is code gener-
ation. At compile time, the user can specify common dimensions of data structures
for which versions of highly-optimized kernels will be compiled. A prominent ex-
ample is the width of block vectors, i.e., the number of vectors in the block. This
number typically is rather small, potentially leading to overhead due to short loops in
numerical kernels.
The positive impact of hard-coded loop lengths on the performance of SpMMV
with increasing block vector width is demonstrated in Fig. 10. The hardware and
problem setting is the same as described above for Fig. 9, i.e., the performance for
5 http://www.cise.ufl.edu/research/sparse/matrices/Sinclair/3Dspectralwave.html
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 25
one vector is the same as the saturated performance of Fig. 9. If we configure the
block vector widths 1, . . . ,8 at compile time a significant performance benefit can be
achieved compared to a variant where none of them is configured. This is due to more
optimization possibilities for the compiler due to simpler code and a lower impact of
loop overheads. Note that for both variants the SELL chunk height C was configured
at compile time.
Code generation in GHOST serves two purposes. First, marked code lines can
be duplicated with defined variations. Second, it is possible to generate variations
of functions, similar to C++ templates. Assuming that we need a fully unrolled and
AVX-vectorized kernel for scaling a dense matrix with a scalar value. The template
code using GHOST code generation markers reads as follows:
#GHOST SUBST NVECS $BLOCKDIM1
void scale_unroll_avx_NVECS(ghost_densemat *a, double scal) {
double *val = (double *)a->val;
__m256d s = _mm256_set1_pd(scal);
#GHOST UNROLL#__m256d tmp@ = _mm256_setzero_pd();#(NVECS+3)/4
for (ghost_lidx row=0; row < a->traits.nrows; row++) {
#GHOST UNROLL#tmp@ = _mm256_load_pd(&val[row*a->stride+@*4]);#(NVECS+3)/4
#GHOST UNROLL#tmp@ = _mm256_mul_pd(tmp@,s);#(NVECS+3)/4
#GHOST UNROLL#_mm256_store_pd(&val[row*a->stride+@*4],tmp@);#(NVECS+3)/4
}
}
For simplicity, we assume appropriately padded and aligned input data and we omit
thread parallelization in this example. Assuming that block vector widths 4 and 8 are
specified in the build system, the following two files are generated:
void scale_unroll_avx_4(ghost_densemat *a, double scal) {
double *val = (double *)a->val;
__m256d s = _mm256_set1_pd(scal);
__m256d tmp0 = _mm256_setzero_pd();
for (ghost_lidx row=0; row < a->traits.nrows; row++) {
tmp0 = _mm256_load_pd(&val[row*a->stride+0*4]);
tmp0 = _mm256_mul_pd(tmp0,s);
_mm256_store_pd(&val[row*a->stride+0*4],tmp0);
}
}
void scale_unroll_avx_8(ghost_densemat *a, double scal) {
double *val = (double *)a->val;
__m256d s = _mm256_set1_pd(scal);
__m256d tmp0 = _mm256_setzero_pd();
__m256d tmp1 = _mm256_setzero_pd();
for (ghost_lidx row=0; row < in->traits.nrows; row++) {
tmp0 = _mm256_load_pd(&val[row*a->stride+0*4]);
tmp1 = _mm256_load_pd(&val[row*a->stride+1*4]);
tmp0 = _mm256_mul_pd(tmp0,s);
tmp1 = _mm256_mul_pd(tmp1,s);
_mm256_store_pd(&val[row*a->stride+0*4],tmp0);
_mm256_store_pd(&val[row*a->stride+1*4],tmp1);
}
}
Note that in this example the factor for code duplication (NVECS+3)/4 depends
on the function variant. This disqualifies the straightforward use of C++ templates
for function variation, as values of the template parameters would have to be known
before compilable (i.e., with evaluated GHOST UNROLL statements) code is present.
26 Moritz Kreutzer et al.
While this might still be possible with more sophisticated C++ techniques, it cannot
be denied that the possibility to look at the high-level intermediate representation as
shown in the example above is an advantage of our approach over C++ metaprogram-
ming. Note that it is not always possible to replace the generation of code line variants
by loops, e.g., for the declaration of variables.
Fallback implementations exist for all compute kernels. This guarantees general
applicability of GHOST functions. The degree of specialization gets diminished until
a suitable implementation is found, which probably implies a performance loss. For
example, if a kernel is not implemented using vectorization intrinsics and hard-coded
small loop dimensions, a version with one arbitrary loop dimension is searched. If
none of the small loop dimensions is available in an explicitly vectorized kernel,
GHOST checks for the existence of a high-level language implementation with hard-
coded small loop dimensions, and so on. In case no specialized version has been built
into GHOST the library will select the highly-general fallback version.
6 Using GHOST in existing iterative solver packages
In this section we want to briefly discuss how GHOST can be used with existing
sparse solver libraries. A characteristic feature of typical iterative solvers for sparse
linear systems or eigenvalue problems is that they require only the application of the
matrix to a given vector. It is therefore good practice to separate the implementation
of such methods from the data structure and details of the SpMV.
One approach that originated in the days of Fortran 77 is the ‘Reverse Commu-
nication Interface’ (RCI). The control flow passes back and forth between the solver
routine and the calling program unit, which receives instructions about which opera-
tions are to be performed on which data in memory. While this programming model
is still widely used in, e.g., ARPACK [26] and even in modern libraries such as Intel
MKL [20], it is awkward and error-prone by today’s standards. Another idea is to
use callback functions for selected operations. For example, the eigensolver package
PRIMME [47] requires the user to provide the SpMMVM and a reduction operation
on given data.
Neither RCI nor simple callbacks can make optimal use of GHOST. Obviously
such software could only make use of accelerators by means of offloading inside a
function scope. If no special attention is paid to data placement, this is typically inef-
ficient due to the slow PCI express bus between CPU and device. Even on the CPU,
GHOST preferably would control memory allocation itself to achieve alignment and
NUMA-aware placement of data. Another drawback is the restriction to data struc-
tures as prescribed by such solvers. For instance, the required storage order of block
vectors is typically column-major, which may be also inefficient (cf. Section 5.2).
The Trilinos package Anasazi [4] takes a different approach. It requires the user
to implement what we call a ‘kernel interface’, an extended set of callback functions
that are the only way the solver can work with matrices and vectors. New (block)
vectors are created by cloning an existing one via the kernel interface. Thus, memory
allocation stays on the user side and can be done, e.g., on a GPU, with a custom data
layout, or applying further optimizations.
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 27
In the iterative solver package PHIST [36] we use a similar kernel interface which
is written in plain C. It does not require a very general vector view concept and has
some functionality for executing (parts of) kernels asynchronously as GHOST tasks.
PHIST also provides GHOST adapters for Anasazi and the linear solver library Belos
from Trilinos, with the restriction that views of permuted columns of a block vector
do not work with row-major storage. This is not a grave restriction as the feature is
– to our knowledge – hardly used in the packages. For a performance study of the
block Jacobi-Davidson method implemented in PHIST (using GHOST), see [41].
6.1 Case study: An eigensolver with Trilinos and GHOST
We have demonstrated the applicability and performance of GHOST in a number of
publications. In [41], we have presented and implemented a block Jacobi-Davidson
method using PHIST & GHOST on up to 512 dual-socket CPU nodes. A fully hetero-
geneous GHOST implementation of the kernel polynomial method which we scaled
up to 1024 CPU+GPU nodes has been demonstrated in [24]. In the meantime, we
have continued our scaling studies of this application to 4096 heterogeneous nodes.
Recent work includes the implementation of Chebyshev filter diagonalization, for
which we show performance data on up to 512 dual-socket CPU nodes in [38].
While all of the presented work has been conducted within activities closely re-
lated to the GHOST project, we see that it is of special interest for a broader po-
tential user base how GHOST could integrate in existing CS&E software stacks. In
the following we want to demonstrate the applicability and performance of GHOST
using the Krylov-Schur method [48] for finding a few eigenvalues of large sparse
matrices. An implementation of this method is available in the Anasazi package [4]
of Trilinos. As mentioned in the previous section, PHIST can serve as an interface
layer between algorithmic packages like Anasazi and kernel libraries like GHOST
or Tpetra (+Kokkos). Developers can thus work at a high level of abstraction and
have the option to switch between kernel implementations. For this study, we are
using version 11.12.1 of Trilinos and an MPI+X approach with OpenMP paralleliza-
tion on the socket level. The test case is the non-symmetric MATPDE6 problem. It
represents a five-point central finite difference discretization of a two-dimensional
variable-coefficient linear elliptic equation on an n× n grid with Dirichlet boundary
conditions. The ten eigenvalues with largest real part are sought using a search space
of twenty vectors. The convergence criterion is a residual tolerance of 10−6. We set
the random number seed in GHOST in a way which guarantees consistent iteration
counts between successive runs.
GHOST integrates well with Anasazi and is straightforward to use on this level.
Moreover we show in Fig. 11 that GHOST surpasses Tpetra both in terms of per-
formance and scalability. On a single node one can save about 16% of runtime for
the entire solver. Figure 11a reveals a higher parallel efficiency of GHOST. Con-
sequently, the better node-level performance gets amplified on larger node counts,
resulting in a 42% runtime saving on 64 nodes. For weak scaling, similar conclu-
sions can be drawn from Fig. 11b. At the largest node count, the parallel efficiency of
6 http://math.nist.gov/MatrixMarket/data/NEP/matpde/matpde.html
28 Moritz Kreutzer et al.
1 4 16 64
Number of nodes
0
50
100
150
200
Ti
m
e 
to
 so
lu
tio
n 
(s)
GHOST time
Tpetra time
0
25
50
75
100
Pa
ra
lle
l e
ffi
ci
en
cy
 (%
)
GHOST efficiency
Tpetra efficiency
857
857
923
1077
857
802
868 835
(a) Strong scaling runtime (left axis) and paral-
lel efficiency (right axis) for n = 212.
1 4 16 64
Number of nodes
0
20
40
60
80
100
Pa
ra
lle
l e
ffi
ci
en
cy
 (%
)
0
200
400
600
800
1000
Ti
m
e 
to
 so
lu
tio
n 
(s)
857
857 1341
1198
1616
1957
2771
2771
(b) Weak scaling runtime (left axis) and parallel
efficiency (right axis) for n = 212,...,15.
Fig. 11: Scaling behavior of GHOST and Tpetra on up to 64 dual-socket CPU nodes
for Anasazi’s implementation of the Krylov-Schur method. The annotations show the
number of iterations until convergence. The computed parallel efficiencies consider
changed iteration counts.
GHOST is ten percentage points above Tpetra. Relevant GHOST features used in the
presented runs are resource management (thread pinning), SpMV with SELL-C-σ
and auto-generated kernels for tall & skinny dense matrix multiplications. Note that
even higher performance could possibly be obtained by exploiting advanced algorith-
mic optimizations available with GHOST, such as kernel fusion, block operations and
communication hiding. However, those would potentially require a re-formulation of
the algorithm which is not what we wanted to demonstrate here.
7 Conclusion and outlook
7.1 Conclusions
GHOST is a novel and promising attempt towards highly scalable heterogeneous
sparse linear algebra software. It should not be considered a comprehensive library
but rather a toolbox featuring approaches to the solution of several problems which
we have identified as relevant on modern hardware in the context of sparse solvers.
A crucial component of highly efficient software, especially in the complex envi-
ronment of heterogeneous systems, is sensible resource management. Our flexible,
transparent, process-based and data-parallel approach for heterogeneous execution is
accompanied by a lightweight and affinity-aware tasking mechanism, which reflects
the requirements posed by modern algorithms and hardware architectures. During
the ongoing development, we have observed that high performance is the result of a
mixture of ingredients. First, algorithmic choices and optimizations have to be made
considering the relevant hardware bottlenecks. In the context of sparse solvers, where
minimizing data movement is often the key to higher efficiency, this includes, e.g.,
vector blocking and kernel fusion. Second, while implementing those algorithms, it
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 29
is crucial to have an idea of upper performance bounds. This can be accomplished
by means of performance models, which form a substantial element of our devel-
opment process. This is demonstrated in [23] and [24]. An optimal implementation
may come at the cost of fundamental changes to data structures, e.g., storing dense
matrices as row- instead of column-major or changing the sparse matrix storage for-
mat from CRS to SELL-C-σ . During the ongoing development it has turned out that
often the generality of the interface has to be traded in for high performance. There
are several ways to relax this well-known dilemma. Very promising is, e.g., a close
collaboration between library and application developers with the possibility for the
latter to feed their application-specific knowledge into the library. In GHOST this
idea is implemented by automatic code generation.
7.2 Outlook
In its current state, GHOST has no provision for exploiting matrix symmetry. Obvi-
ously, there is large potential for increased of performance if symmetric (or Hermi-
tian) matrices were treated as such. The implementation is challenging, but cannot
be avoided in the long run. Bringing sparse solvers to a very large scale is often lim-
ited by malicious sparse matrix patterns which lead to communication dominating
the runtime. This can be ameliorated by bandwidth reduction of the sparse matrix. A
goal for further development is the evaluation and implementation of additional band-
width reduction algorithms like, e.g., hypergraph partitioning [12]. Furthermore, the
optimization of heterogeneous MPI communication, e.g., using GPUdirect which by-
passes the host memory in GPU-GPU communication, should be investigated in or-
der to improve the communication performance. Future architectural developments,
like deeper memory hierarchies and a tighter integration of “standard” and “accel-
erated” resources require rethinking existing performance models and possibly new
implementations. Currently, the heterogeneous work distribution weights have to be
specified by the user, mostly based on knowledge about the involved hardware archi-
tectures and their capabilities. In future work, micro-benchmarks will be integrated
into GHOST that allow automatic determination of device-specific work weights. On
top of that, another important goal for future development is dynamic and automatic
load balancing during an iterative solver’s runtime. Currently, the sparse matrix por-
tion for each process is fixed during the entire runtime. By using the SELL-C-σ stor-
age format, it will be straightforward to communicate matrix data at runtime between
heterogeneous devices to overcome load imbalances.
Acknowledgements This work was supported by the German Research Foundation (DFG) through the
Priority Program 1648 “Software for Exascale Computing” (SPPEXA) under project ESSEX (“Equipping
Sparse Solvers for Exascale”). We would like to thank Intel Germany and Nvidia for providing test sys-
tems for benchmarking. Special thanks go to Andreas Alvermann for providing sparse matrix generation
functions for testing and everyone else who contributed to GHOST, directly or indirectly.
30 Moritz Kreutzer et al.
References
1. Anderson, M., Ballard, G., Demmel, J., Keutzer, K.: Communication-avoiding QR decomposition for
GPUs. In: Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pp. 48–58
(2011). DOI:10.1109/IPDPS.2011.15
2. Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task schedul-
ing on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experi-
ence 23(2), 187–198 (2011). DOI:10.1002/cpe.1631
3. Baker, C.G., Heroux, M.A.: Tpetra, and the use of generic programming in scientific computing. Sci.
Program. 20(2), 115–128 (2012). DOI:10.1155/2012/693861
4. Baker, C.G., Hetmaniuk, U.L., Lehoucq, R.B., Thornquist, H.K.: Anasazi software for the numerical
solution of large-scale eigenvalue problems. ACM Trans. Math. Softw. 36(3), 13:1–13:23 (2009).
DOI:10.1145/1527286.1527287
5. Balay, S., Abhyankar, S., Adams, M.F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Eijkhout,
V., Gropp, W.D., Kaushik, D., Knepley, M.G., McInnes, L.C., Rupp, K., Smith, B.F., Zampini, S.,
Zhang, H.: PETSc Web page (2015). URL http://www.mcs.anl.gov/petsc
6. Blackford, L.S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman,
L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., Whaley, R.C.: An updated set of basic linear
algebra subprograms (BLAS). ACM Transactions on Mathematical Software 28, 135–151 (2001).
DOI:10.1145/567806.567807
7. Broquedis, F., Clet-Ortega, J., Moreaud, S., Furmento, N., Goglin, B., Mercier, G., Thibault, S.,
Namyst, R.: Hwloc: A generic framework for managing hardware affinities in HPC applications.
In: Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-
based Processing, PDP ’10, pp. 180–186. IEEE Computer Society, Washington, DC, USA (2010).
DOI:10.1109/PDP.2010.67
8. Chevalier, C., Pellegrini, F.: PT-Scotch: A tool for efficient parallel graph ordering. Parallel Comput.
34(6-8), 318–331 (2008). DOI:10.1016/j.parco.2007.12.001
9. Chow, E., Patel, A.: Fine-grained parallel incomplete factorization. SIAM J. Sci. Comp. 37(2), 169–
193 (2015). DOI:10.1137/140968896
10. Demmel, J., Hoemmen, M., Mohiyuddin, M., Yelick, K.: Avoiding communication in sparse matrix
computations. In: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Sym-
posium on, pp. 1–12 (2008). DOI:10.1109/IPDPS.2008.4536305
11. Denis, A.: POSTER: a generic framework for asynchronous progression and multithreaded commu-
nications. In: Cluster Computing (CLUSTER), 2014 IEEE International Conference on, pp. 276–277
(2014). DOI:10.1109/CLUSTER.2014.6968752
12. Devine, K., Boman, E., Heaphy, R., Bisseling, R., Catalyurek, U.: Parallel hypergraph partitioning for
scientific computing. In: Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th
International, pp. 10 pp.– (2006). DOI:10.1109/IPDPS.2006.1639359
13. Galgon, M., Kra¨mer, L., Thies, J., Basermann, A., Lang, B.: On the parallel iterative solution of linear
systems arising in the FEAST algorithm for computing inner eigenvalues. Parallel Computing 49,
153–163 (2015). DOI:10.1016/j.parco.2015.06.005
14. Gebremedhin, A.H., Nguyen, D., Patwary, M.M.A., Pothen, A.: Colpack: Software for graph coloring
and related problems in scientific computing. ACM Trans. Math. Softw. 40(1), 1:1–1:31 (2013).
DOI:10.1145/2513109.2513110
15. GHOST: General, Hybrid, and Optimized Sparse Toolkit. URL http://tiny.cc/GHOST. Ac-
cessed: February 2016
16. Ghysels, P., Ashby, T.J., Meerbergen, K., Vanroose, W.: Hiding global communication latency in the
GMRES algorithm on massively parallel machines. SIAM J. Sci. Comp. 35(1), C48–C71 (2013).
DOI:10.1137/12086563X
17. Gropp, W.D., Kaushik, D.K., Keyes, D.E., Smith, B.F.: Towards realistic performance bounds for
implicit CFD codes. In: Proceedings of Parallel CFD99, pp. 233–240. Elsevier (1999)
18. Hager, G., Wellein, G.: Introduction to High Performance Computing for Scientists and Engineers,
1st edn. CRC Press, Inc., Boca Raton, FL, USA (2010)
19. Hofmann, J., Fey, D., Eitzinger, J., Hager, G., Wellein, G.: Performance analy-
sis of the Kahan-enhanced scalar product on current multicore processors URL
http://arxiv.org/abs/1505.02586. Accepted for PPAM 2015, the 11th Interna-
tional Conference on Parallel Processing and Applied Mathematics, September 3-6, 2015, Krakow,
Poland
GHOST: High Performance Sparse Linear Algebra on Heterogeneous Systems 31
20. Intel Math Kernel Library. URL https://software.intel.com/en-us/intel-mkl. Ac-
cessed: February 2016
21. Kaczmarz, S.: Angena¨herte Auflo¨sung von Systemen linearer Gleichungen. Bulletin International de
l’Acade´mie Polonaise des Sciences et des Lettres 35, 355–357 (1937)
22. Kahan, W.: Pracniques: Further remarks on reducing truncation errors. Commun. ACM 8(1), 40
(1965). DOI:10.1145/363707.363723
23. Kreutzer, M., Hager, G., Wellein, G., Fehske, H., Bishop, A.R.: A unified sparse matrix data format
for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units.
SIAM J. Sci. Comput. 36(5), C401–C423 (2014). DOI:10.1137/130930352
24. Kreutzer, M., Pieper, A., Hager, G., Wellein, G., Alvermann, A., Fehske, H.: Performance
engineering of the kernel polynomal method on large-scale cpu-gpu systems. In: Parallel
and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pp. 417–426 (2015).
DOI:10.1109/IPDPS.2015.76
25. LAMA: Library for accelerated mathematical applications. URL http://www.libama.org.
Accessed: February 2016
26. Lehoucq, R., Sorensen, D., Yang, C.: ARPACK Users’ Guide. Society for Industrial and Applied
Mathematics (1998). DOI:10.1137/1.9780898719628
27. MAGMA: Matrix algebra on GPU and multicore architectures. URL
http://icl.cs.utk.edu/magma/. Accessed: February 2016
28. Matrix Market Exchange Format. URL http://math.nist.gov/MatrixMarket/formats.html#MMformat.
Accessed: February 2016
29. McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers.
IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter pp. 19–
25 (1995)
30. Mizukami, E.: The accuracy of floating point summations for CG-like methods. Technical Report 486
(1997). URL ftp://ftp.cs.indiana.edu/pub/techreports/TR486.pdf
31. Monakov, A., Lokhmotov, A., Avetisyan, A.: Automatically tuning sparse matrix-vector multiplica-
tion for GPU architectures. In: Y. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, X. Martorell (eds.)
High Performance Embedded Architectures and Compilers, Lecture Notes in Computer Science, vol.
5952, pp. 111–125. Springer Berlin Heidelberg (2010). DOI:10.1007/978-3-642-11515-8 10
32. Nelson, T., Belter, G., Siek, J.G., Jessup, E., Norris, B.: Reliable generation of high-performance
matrix algebra. ACM Trans. Math. Softw. 41(3), 18:1–18:27 (2015). DOI:10.1145/2629698
33. O’Leary, D.P.: The block conjugate gradient algorithm and related methods. Linear Algebra and
its Applications 29(0), 293 – 322 (1980). DOI:10.1016/0024-3795(80)90247-5. Special Volume
Dedicated to Alson S. Householder
34. Oppe, T.C., Kincaid, D.R.: The performance of ITPACK on vector computers for solving large sparse
linear systems arising in sample oil reseervoir simulation problems. Communications in Applied
Numerical Methods 3(1), 23–29 (1987). DOI:10.1002/cnm.1630030106
35. PARALUTION. URL http://www.paralution.com. Accessed: February 2016
36. PHIST: Pipelined Hybrid-parallel Iterative Solver Toolkit. URL
https://bitbucket.org/essex/phist. Accessed: February 2016
37. Pieper, A., Heinisch, R.L., Wellein, G., Fehske, H.: Dot-bound and dispersive states in graphene quan-
tum dot superlattices. Phys. Rev. B 89, 165,121 (2014). DOI:10.1103/PhysRevB.89.165121
38. Pieper, A., Kreutzer, M., Galgon, M., Alvermann, A., Fehske, H., Hager, G., Lang, B., Wellein, G.:
High-performance implementation of Chebyshev filter diagonalization for interior eigenvalue compu-
tations (2015). URL http://arxiv.org/abs/1510.05895. Preprint
39. Polizzi, E.: Density-matrix-based algorithm for solving eigenvalue problems. Phys. Rev. B 79,
115,112 (2009). DOI:10.1103/PhysRevB.79.115112
40. Rabenseifner, R., Hager, G., Jost, G.: Hybrid mpi/openmp parallel programming on clusters of multi-
core smp nodes. In: Parallel, Distributed and Network-based Processing, 2009 17th Euromicro Inter-
national Conference on, pp. 427–436 (2009). DOI:10.1109/PDP.2009.43
41. Ro¨hrig-Zo¨llner, M., Thies, J., Kreutzer, M., Alvermann, A., Pieper, A., Basermann, A., Hager, G.,
Wellein, G., Fehske, H.: Increasing the performance of the Jacobi–Davidson method by blocking.
SIAM Journal on Scientific Computing 37(6), C697–C722 (2015). DOI:10.1137/140976017
42. Rupp, K., Rudolf, F., Weinbub, J.: ViennaCL - A High Level Linear Algebra Library for GPUs and
Multi-Core CPUs. In: Intl. Workshop on GPUs and Scientific Applications, pp. 51–56 (2010)
43. Rupp, K., Weinbub, J., Ju¨ngel, A., Grasser, T.: Pipelined iterative solvers with kernel fusion for graph-
ics processing units (2014). URL http://arxiv.org/abs/1410.4054. Preprint
32 Moritz Kreutzer et al.
44. Schofield, G., Chelikowsky, J.R., Saad, Y.: A spectrum slicing method for the Kohn-Sham problem.
Computer Physics Communications 183(3), 497–505 (2012). DOI:10.1016/j.cpc.2011.11.005
45. Schubert, G., Fehske, H., Fritz, L., Vojta, M.: Fate of topological-insulator surface states under strong
disorder. Phys. Rev. B 85, 201,105 (2012). DOI:10.1103/PhysRevB.85.201105
46. Siek, J., Karlin, I., Jessup, E.: Build to order linear algebra kernels. In: Parallel and Dis-
tributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pp. 1–8 (2008).
DOI:10.1109/IPDPS.2008.4536183
47. Stathopoulos, A., McCombs, J.R.: PRIMME: preconditioned iterative multimethod eigensolver–
methods and software description. ACM Trans. Math. Softw. 37(2), 1–30 (2010).
DOI:10.1145/1731022.1731031
48. Stewart, G.W.: A Krylov–Schur algorithm for large eigenproblems. SIAM Journal on Matrix Analysis
and Applications 23(3), 601–614 (2002). DOI:10.1137/S0895479800371529
49. Tabik, S., Ortega, G., Garzn, E.: Performance evaluation of kernel fusion BLAS routines on the
GPU: iterative solvers as case study. The Journal of Supercomputing 70(2), 577–587 (2014).
DOI:10.1007/s11227-014-1102-4
50. TOP500 Supercomputer Sites as of November 2015. URL http://www.top500.org. Accessed:
February 2016
51. Vital, B.: Etude de quelques mthodes de rsolution de problmes linaires de grande taille sur multipro-
cessor. Ph.D. thesis, Universit de Rennes, Rennes (1990)
52. Wahib, M., Maruyama, N.: Scalable kernel fusion for memory-bound GPU applications. In: Pro-
ceedings of the International Conference for High Performance Computing, Networking, Storage and
Analysis, SC ’14, pp. 191–202. IEEE Press, Piscataway, NJ, USA (2014). DOI:10.1109/SC.2014.21
53. Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance model for
multicore architectures. Commun. ACM 52(4), 65–76 (2009). DOI:10.1145/1498765.1498785
54. Wittmann, M., Hager, G., Zeiser, T., Wellein, G.: Asynchronous MPI for the masses (2013). URL
http://arxiv.org/abs/1302.4280. Preprint
