Semi-External Memory Sparse Matrix Multiplication for Billion-Node
  Graphs by Zheng, Da et al.
1Semi-External Memory Sparse Matrix
Multiplication for Billion-Node Graphs
Da Zheng1, Disa Mhembere1, Vince Lyzinski2, Joshua T. Vogelstein3, Carey E. Priebe2, and Randal Burns1
1Department of Computer Science, Johns Hopkins University
2Department of Applied Mathematics and Statistics, Johns Hopkins University
3Department of Biomedical Engineering, Johns Hopkins University
Abstract—Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory
of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense
matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and dense
matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the
in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limitations of memory.
Furthermore, on a single large parallel machine, our SEM-SpMM operates as fast as the distributed implementations of Trilinos using
five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data
on SSDs. SEM-SpMM achieves almost 100% performance of IM-SpMM on graphs when the dense matrix has more than four
columns; it achieves at least 65% performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis
tasks—PageRank, eigensolving, and non-negative matrix factorization—and show that our SEM implementations significantly advance
the state of the art.
Index Terms—Sparse matrix multiplication, semi-external memory, billion-node graphs, SSDs.
F
1 INTRODUCTION
Sparse matrix multiplication is an important computation
with a wide variety of applications in scientific computing,
machine learning, and graph analysis. For example, matrix
factorization algorithms on a sparse matrix, such as singular
value decomposition (SVD) [1] and non-negative matrix fac-
torization (NMF) [2], require sparse matrix multiplication.
Graph analysis algorithms such as PageRank [3] can be
formulated as sparse matrix multiplication or generalized
sparse matrix multiplication [4]. Some of the algorithms,
such as PageRank and SVD, require sparse matrix vector
multiplication. Others, such as NMF, require sparse matrix
dense matrix multiplication.
It is challenging to implement sparse matrix multipli-
cation efficiently. It has very low computation density and
its performance is limited by memory access. Sparse matrix
multiplication usually achieves only a small fraction of the
peak performance of a modern processor [5].
Sparse matrices derived from graphs impose additional
challenges. Many graphs have a power-law distribution in
vertex degree. This produces a power-law distribution in
the number of non-zero entries per column or row in the
associated matrix, which results in load imbalance in par-
allel decompositions of sparse matrix multiplication. Graph
matrices are much sparser and have more random distribu-
tion than traditional sparse matrices that arise in scientific
computing. As such, many optimizations for traditional
sparse matrices, such as register blocking and prefetching
[5] increase storage size, bring more data to CPU cache
unnecessarily and hurt performance. The commonly used
sparse matrix storage formats such as CSR (compressed
sparse row) and CSC (compressed sparse column) for graph
matrices leads to significant CPU cache misses in sparse
matrix multiplication because graphs are sparse and large.
We focus on sparse matrices derived from graphs be-
cause they are more difficult to compute efficiently than
general sparse matrices and because some of the largest ma-
trices and most important analysis tasks operate on graph
datasets, e.g. social networks and Web graphs. Facebook’s
social network has billions of vertices and the Web Data
Commons graph [6] has 3.6 billion vertices and 128 billions
edges. We perform sparse matrix multiplication for graph
analysis such as community detection with NMF and spec-
tral analysis with SVD.
Current research focuses on sparse matrix multiplication
in memory for small matrices and scaling to a large sparse
matrix in a large cluster, where the aggregate memory is
sufficient to store the sparse matrix [5], [7], [8]. Distributed
solutions for sparse matrix multiplication lead to signifi-
cant network communication and network bandwidth is
usually the bottleneck. As such, this operation requires
a fast network to achieve performance. A supercomputer
or a large cluster with a fast network is inaccessible or
too expensive for many users. In addition, the distributed
solution imposes challenges in achieving load balancing on
power-law graphs.
On the other hand, a current trend for hardware design
scales up a single machine, rather than scaling out to many
networked machines. These machines typically have mul-
ar
X
iv
:1
60
2.
02
86
4v
3 
 [c
s.D
C]
  1
4 O
ct 
20
16
2tiple processors with many CPU cores and a large amount
of memory. They are also equipped with fast flash memory
such as solid-state drives (SSDs) to further extend memory
capacity. This conforms to the node design for supercomput-
ers [9]. Recent finding, from Frank McSherry [10], [11] and
our prior work [12], show that the largest graph analytics
tasks can be done on a small fraction of the hardware, at
less cost, as fast, and using less energy on a single shared-
memory node, rather than a distributed compute engine.
Our findings in this paper reveal that sparse matrices have
the same structure and benefits and that the largest graph
matrices can be processed efficiently on a single scale-up
compute node.
We explore a solution that scales sparse matrix dense
matrix multiplication (SpMM) on a multi-core machine
with commodity SSDs and perform SpMM in semi-external
memory (SEM). The concept of semi-external memory arose
as a functional computing approach for graphs [13] in which
the vertex state of a graph is stored in memory and the edges
accessed from external memory. We introduce a similar
construct for SpMM in which one or more columns of a
dense matrix are kept in memory and the sparse matrix is
accessed from external memory. In semi-external memory,
we assume that the memory of a machine is sufficient to
keep at least one column of the input dense matrix but is
insufficient to hold the sparse matrix and the dense matrices.
Given fast SSDs, we demonstrate that the SEM solution uses
the resources of a multi-core machine well and achieves
performance that exceeds the state-of-the-art in-memory
implementations.
We overcome many technical challenges to construct a
sparse matrix multiplication implementation on SSDs to
achieve performance. Specifically, SSDs are an order of
magnitude slower in throughput and multiple orders of
magnitude slower in latency than DRAM. Furthermore,
sequential I/O of SSDs is still much faster than random I/O
[14] and reads are faster than writes. In addition, SSDs wear
out when we write data to them and random writes further
shorten the lives of SSDs [15]. As such, our solution needs to
sequentialize I/O access and reduce I/O, especially writes.
Semi-external memory provides a scalable and efficient
SpMM solution that meets the I/O challenges. We fur-
ther incorporate substantial computation optimizations to
achieve performance of SpMM on graphs in a NUMA (non-
uniform memory architecture) machine. During the compu-
tation, each thread streams its own partitions of the sparse
matrix from SSDs, maximizing I/O throughput and avoid-
ing thread synchronization. We buffer all intermediate com-
putation results in local memory, to reduce remote memory
access, and stream the output matrix to SSDs at most once,
minimizing writes to SSDs. We design a very compact
sparse matrix format to accelerate retrieving a sparse matrix
from SSDs. Semi-external memory has memory constraints.
As such, we deploy only computation optimizations that
require a small memory footprint, e.g., fine-grain dynamic
task scheduling to balance loads on power-law graphs and
cache blocking to increase CPU cache hits.
Given the sparsity and the enormous size of graphs,
the dense matrices involved in SpMM are massive and our
semi-external memory solution adapts to dense matrices
with different numbers of columns. When the dense matrix
is larger than memory, we split it vertically into multiple
partitions so that each partition can fit in memory. As such,
the minimum memory requirement of our solution is O(r),
where r is the number of rows in the input dense matrix.
By keeping more columns in the dense matrix in memory,
we reduce I/O from SSDs and SEM-SpMM becomes CPU
bound, instead of I/O bound, on fast SSDs.
We develop three important applications in scientific
computing and graph analysis with our SEM-SpMM:
PageRank [3], eigensolver [16] and non-negative matrix
factorization [2]. Each of them requires SpMM with different
numbers of columns in dense matrices, resulting in differ-
ent strategies of placing data in memory. With the three
applications, we demonstrate data placement choices for
different memory capacities in a machine and the impact
of the memory size on the performance of the applications.
The main contributions of this paper are:
• We scale SpMM in semi-external memory using asym-
metric I/O to SSDs and we deploy runtime optimiza-
tions that meet memory constraints.
• We deploy a compact sparse matrix format to reduce
the amount of I/O and alleviate I/O as the bottleneck
of the system.
• We deploy a fine-grain dynamic load balancing scheme
for sparse matrices with power-law distribution in non-
zero entries.
• We optimize SpMM for dense matrices with various
numbers of columns and extends SEM-SpMM to large
dense matrices that cannot fit in memory by vertically
partitioning the dense matrices.
Our result shows that for real-world sparse graphs, our
SEM-SpMM achieves almost 100% performance of our in-
memory implementation on a large parallel machine with
48 CPU cores when the dense matrix has more than four
columns. Even for SpMV (sparse-matrix vector multipli-
cation: the special case in which the dense matrix has a
single column), our SEM implementation achieves at least
65% performance of our in-memory implementation and
outperforms Trilinos [17] and MKL [18] by a factor of 2− 9.
We conclude that semi-external memory coupled with SSDs
delivers an efficient solution for large-scale sparse matrix
multiplication. It serves as a building block and offers new
design possibilities for large-scale data analysis, replacing
memory with larger, cheaper, more energy-efficient SSDs
and processing bigger problems on fewer machines. The
code for SpMM and SpMV has been released as open source
as part of the FlashX system at http://flashx.io.
2 RELATED WORK
Recent sparse matrix multiplication studies focus on in-
memory optimizations. Williams et al. [5] describe opti-
mizations for sparse matrix vector multiplication (SpMV)
in multicore architectures. Yoo et al. [7] and Boman et
al. [8] optimize distributed SpMV for large scale-free graphs
with 2D partitioning to reduce communication between
machines. In contrast, sparse matrix dense matrix multi-
plication (SpMM) receives less attention from the high-
performance computing community. Even though SpMM
can be implemented with SpMV, SpMV fails to explore
data locality in SpMM. Aktulga et al. [19] optimize SpMM
3with cache blocking. Koanantakool et al. [20] experiment
with different parallel algorithms for sparse matrix dense
matrix multiplication and analyze their communication cost
in distributed memory. We advance SpMM with a focus on
optimizations for semi-external memory.
Compressed row storage (CSR) and compressed column
storage (CSC) formats are commonly used sparse matrix for-
mats in many numeric libraries, such as Intel MKL [18] and
Trilinos [17]. However, these formats are not designed for
graphs. Sparse matrix multiplication with these formats on
graphs incurs many random memory accesses. Sparsity [21]
designs a format that encodes both register blocking and
cache blocking to increase data reuse in the CPU cache for
sparse matrix multiplication. Register blocking requires ex-
plicit storage of zero values in register blocks. This strategy
wastes space and computation for graphs because graphs
are sparse and a vertex in a graph can connect to arbitrary
other vertices. Buluc et al. [22] further advance sparse matrix
format by doubly compressed sparse column (DCSC) for
hypersparse submatrices after 2D partitioning on a sparse
matrix. This format significantly reduces the storage size of
a 2D-partitioned sparse matrix. Our format further reduces
the storage size of a sparse matrix.
Abello et al. [13] introduced the semi-external mem-
ory algorithmic framework for graphs. Pearce et al. [23]
implement several semi-external memory graph traversal
algorithms for SSDs. FlashGraph [12] adopted the concept
and performs graph algorithms with vertex state in memory
and edge lists on SSDs. This work extends the semi-external
memory concept to matrix operations.
GridGraph [24] is a general-purpose graph processing
framework on a single machine and scales to large graphs
using disks. It performs 2D partitioning on a graph to reduce
CPU cache misses in graph analysis and deploys 2-level
hierarchical partitioning to reduce I/O. This graph engine
is designed for graph algorithms expressed as sparse matrix
vector multiplication and keeps both graphs and vertex
computation state on disks. In contrast, our work focuses on
optimizations on sparse matrix dense matrix multiplication
and performs this operation in semi-external memory to
reduce I/O, especially writes, to SSDs. GridGraph runs
more slowly than in-memory linear algebra libraries such
as Intel MKL [18] and Trilinos Tpetra [17].
Array databases such as SciDB [25] and Rasdaman [26]
support sparse matrix mutliplication. They perform matrix
multiplication using CSR formats [27], similar to Intel MKL
and Trilinos. There has also been preliminary research on
accelerating matrix multiplication with GPUs [28], showing
that speedups are limited by I/O and setup costs and that
benefits are limited to compute dense operations.
MapReduce [29] is also commonly used for processing
large graphs. PEGASUS [30] is a popular graph processing
engine built on MapReduce and expresses graph algorithms
as a generalized form of sparse matrix-vector multiplication.
GBase [31] is a graph database built on MapReduce. It
optimizes the graph storage for graph queries and analysis
and perform graph queries and analysis with sparse matrix
vector multiplication. Other work [32], [33], [34] imple-
ment non-negative matrix factorization (NMF) with MapRe-
duce. Although these MapReduce-based implementations
can scale to very large graphs, they run orders of magnitude
more slowly than optimized in-memory solutions, such as
Trilinos and Intel MKL.
Zhou et al. [35] implement an LOBPCG [36] eigensolver
in an SSD cluster. Their implementation targets nuclear
many-body Hamiltonian matrices, which are much denser
and have smaller dimensions than many sparse graphs.
Therefore, their solution stores the sparse matrix on SSDs
and keep the entire vector subspace in RAM. In contrast,
our SEM-SpMM handles dense matrices of different sizes
and our eigensolver stores both the sparse matrix and the
vector subspace on SSDs.
The Trilinos project [17] has a large collection of numer-
ical libraries. The Tpetra library provides efficient sparse
matrix operations such as sparse matrix multiplication. The
matrix implementations in Trilinos are optimized for dis-
tributed memory.
Intel Math Kernel Library [18] is an efficient and parallel
linear algebra library with matrix operations optimized for
Intel platforms. It provides an efficient sparse matrix multi-
plication for regular sparse matrices. In contrast, our sparse
matrix multiplication optimizes for power-law graphs with
near-random vertex connection.
3 SPARSE MATRIX MULTIPLICATION
Sparse matrix dense matrix multiplication (SpMM) gener-
ates many random memory accesses and its performance
is limited by random memory throughput of DRAM. We
perform SpMM in semi-external memory (SEM) to enable
nearly in-memory performance, while achieving scalability
in proportion to the ratio of non-zero entries to rows or
columns in a sparse matrix.
3.1 Semi-external memory
Our definition of semi-external memory for SpMM keeps
the sparse matrix on SSDs and the input dense matrix or
some columns of the input dense matrix in memory. We may
keep the entire output matrix in memory if it is sufficiently
small or stream it to SSDs or the subsequent computation.
In some applications, such as non-negative matrix factor-
ization (Section 4), even the input dense matrix does not fit
in memory. In this case, we partition the input dense matrix
vertically so that each partition has complete columns of
the original input dense matrix that fit in memory. For
each vertical partition, we perform SpMM in semi-external
memory and stream the output matrix to SSDs.
3.2 Sparse matrix format
We design a new compact sparse matrix format that in-
creases CPU cache hits and reduces the amount of data
read from SSDs. A compact format is performance critical
because in SEM-SpMM the I/O to SSDs is most often the
bottleneck.
To increase CPU cache hits, we deploy cache blocking
[21] and store non-zero entries of a sparse matrix in tiles,
t × t submatrices (Figure 1). When a tile is small, the
rows from the input and output dense matrices involved in
multiplication with the tile are always kept in the CPU cache
during the computation. The optimal tile size should fill the
CPU cache with the rows from the dense matrices and is
4Tile
row id col idx
...tile header non-zero valuesrow header 1
row 
header 2 COO
t
t
Fig. 1: The storage format of a sparse matrix. In this example,
non-zero entries of the sparse matrix are organized into 9
tiles. Tiles are stored in the row-major order and a tile stores
non-zero entries in SCSR+COO format.
affected by the number of columns of the dense matrices. To
handle dense matrices with different numbers of columns,
we deploy both static cache blocking and dynamic cache
blocking. We generate sparse matrices with a relatively
small tile size and rely on the runtime system to optimize
for different numbers of columns (Section 3.4). However,
a small tile size potentially increases the storage size of a
sparse matrix. In semi-external memory, the dense matrices
usually have a small number of columns in sparse matrix
multiplication. We use the tile size of 16K× 16K by default
to balance the matrix storage size and the adaptibility to
different numbers of columns.
We use a compact format to store non-zero entries and
refer to this format as SCSR (Super Compressed Row Stor-
age) (Figure 1). In sparse matrices encoding graphs, many
rows in a tile do not have any non-zero entries. The CSR
or CSC formats waste space because they require an entry
for each row/column in the row/column index. Doubly
compressed sparse column (DCSC) [22] is proposed to avoid
this problem. However, even DCSC wastes space because
it requires the storage of pointers to columns and fails to
reference to non-zero values with small integers (e.g., two-
byte integers). SCSR is designed to further shrink the storage
size of a tile. This format keeps data only for rows with
non-zero entries in a tile. Each non-empty row has a row
header that only contains an identifier to indicate the row
number, followed by column indices. To determine the end
of a row, the most significant bit of the identifier is always
set to 1, while the most significant bit of a column index
entry is always set to 0. Owing to the small size of a tile, we
use two bytes to store a row number and a column index
entry, which further reduces the storage size. Each non-zero
entry requires at most four bytes to indicate its location in
a matrix. Because the most significant bit is used to indicate
the beginning of a row, this format supports a maximum tile
size of 32K × 32K .
SCSR is much more compact than DCSC [22]. Figure 2
shows that SCSR uses 45%-70% of the storage size used
by DCSC for large real-world graphs (Table 1). For a tile
with nnr non-empty rows and nnz non-zero entries, SCSR
requires 2 × nnr bytes for row ids, 2 × nnz bytes for
column indices of non-zero entries and c × nnz bytes for
 0
 0.2
 0.4
 0.6
 0.8
 1
Twitter Friendster Page
SC
SR
/D
CS
C
Fig. 2: The ratio of the storage size required by SCSR and
DCSC [22] format for real-world graphs.
non-zero values, where c is the number of bytes for a non-
zero value. As such, SSCSR = 2 × nnr + (2 + c) × nnz.
In contrast, DCSC requires much more metadata for a tile
with nnc non-empty columns and nnz non-zero entries.
SDCSC = (2 + 2 + 4) × nnc + (2 + c) × nnz. In a t × t
tile, nnr ≤ nnz ≤ nnr × t. On average, nnr ≈ nnc in
most sparse matrices. As such, for a binary sparse matrix,
0.4 ≤ SSCSR/SDCSC < 1. SCSR saves more space in a
sparse matrix with larger nnr/nnz.
Inside each cache tile of the SCSR, we use the coordinate
format (COO) for the rows that have only a single non-zero
entry. For the adjacency matrices of real-world graphs, many
rows in a cache tile have only one non-zero entry. Iterating
over single-entry rows in the SCSR format requires to test
the end of a row for every non-zero entry, which leads to
many conditional jumps. In contrast, COO is more suitable
for storing these single-entry rows. It does not increase
the storage size but significantly reduces the number of
conditional jump instructions. As a result, we combine SCSR
with COO and store non-zero entries in the COO format
behind the row headers of SCSR (Figure 1).
3.3 Dense matrices
In many applications, the dense matrices in SpMM are tall
and skinny with millions or even billions of rows and a
small number of columns. Because many real-world graphs
are very large and sparse, the dense matrices may be
too large to fit in memory. Thus, for the general case of
SpMM, we split the dense matrices into vertical partitions
of the memory size. To increase data locality in SpMM,
the elements in each vertical partition are stored in row-
major order (Figure 3 (a)). When performing SpMM in semi-
external memory, we load the input dense matrix or one of
its vertical partitions into memory.
For a non-uniform memory architecture (NUMA), we
partition the input dense matrix or one of its vertical par-
titions horizontally and store horizontal partitions evenly
across NUMA nodes (Figure 3(b)). Striping partitions evenly
across all NUMA nodes helps to fully utilize the bandwidth
of memory. We assign multiple contiguous rows in a row
interval with 2i rows to a NUMA node. The row interval
size is multiple of the tile size of a sparse matrix so that
multiplication on a tile accesses rows only from a single row
interval.
5NUMA
node 0
NUMA
node 1
NUMA
node 0
NUMA
node 1
(a) On SSDs (b) In memory
Fig. 3: Dense matrices are partitioned vertically, so that each
vertical partition fits in memory. A vertical partition stores
data in row-major order. When a vertical partition is loaded
into memory, it is partitioned horizontally into many row
intervals that are striped across NUMA nodes.
pt
s
t
s
n
s
Fig. 4: The execution of multiplication on tiles in a sparse
matrix and the input dense matrix in a single thread. A
thread reads st tile rows and multiplies them with the input
dense matrix. s = CPU cache2×p causes the dense matrices
to fill the CPU cache. The arrows indicate the order of
multiplication on tiles with the input dense matrix.
3.4 Parallel Execution
We parallelize sparse matrix multiplication and deploy only
computation optimizations with a small memory footprint.
Memory is precious resource in semi-external memory be-
cause memory should be used to keep more columns in the
input dense matrix to reduce I/O from SSDs. Algorithm
1 and Figure 4 illustrate the execution of sparse matrix
multiplication in semi-external memory.
Semi-external memory favors horizontal partitioning on
a sparse matrix for parallelization because this partitioning
scheme minimizes writes to remote memory and SSDs.
Horizontal partitioning requires only one thread to allocate
local memory buffers for computation on a tile row. All
intermediate computation results on tiles are merged into
the local memory buffers. As such, we write the output
matrix at most once to SSDs and there are no remote
memory writes. In contrast, the vertical partitioning and
2D partitioning [20] would require each thread to maintain
a local memory buffer for the same tile rows to aggregate
writes to SSDs and remote memory.
Algorithm 1 Parallel execution of sparse matrix dense ma-
trix multiplication.
1: procedure SPARSEMATRIXMULTIPLY(spm, inM )
Input: spm, a n× n sparse matrix on SSDs
Input: inM , a n× p dense matrix in memory
Output: outM , a n× p dense matrix on SSDs
2: ~trQ← get tile row ids(spm)
3: outM ← zeros SSD(n, p)
4: parfor thread ∈ ~threads do
5: ProcessT ileRows(spm, ~trQ, inM, outM)
6: end parfor
7:
8: procedure PROCESSTILEROWS(spm, ~trQ, inM , outM )
Input: spm, a n× n sparse matrix on SSDs
Input: ~trQ, a queue that contains all tile row ids
Input: inM , a n× p input dense matrix in memory
In-Out: outM , a n× p output dense matrix on SSDs
9: t← get tile size(spm)
10: numTRs← CPU cache size2×p×t
11: while | ~trQ| > 0 do
12: if | ~trQ| <= #threads then
13: numTRs← 1
14: ~ids← get tile rows( ~trQ, numTRs)
15: trs← read tilerow async(spm, ~ids)
16: res←MulTRows(trs, inM) when trs is ready
17: write rows async(outM, res)
18:
19: procedure MULTROWS(trs, inM )
Input: trs, a s× n submatrix in the sparse matrix
Input: inM , n× p dense matrix
Output: outBuf , a s× p dense matrix
20: outBuf ← zeros(s, p)
21: t← get tile size(trs)
22: for k = 0; k < nt ; k = k +
s
t do
23: for i = 0; i < st ; i = i+ 1 do
24: for j = 0; j < st ; j = j + 1 do
25: tile← trs[i, k + j]
26: lInMat← rows from inM for tile
27: lOutMat← rows from outBuf for tile
28: lOutMat+ = tile ∗ lInMat
We deploy a fine-grain dynamic load balancing scheme
for parallel computation with a global task queue trQ (Al-
gorithm 1). Each thread gets one computation task (tile
rows) at a time from trQ with get tile rows and performs
computation (MulTRows). At the beginning, a thread gets
larger computation tasks that contains multiple tile rows;
as the computation approaches completion, a thread gets
smaller tasks that contain only one tile row. This design
reduces concurrent access to the global data structure while
realizing good load balance.
The algorithm performs asynchronous I/O and merge
writes from multiple threads into larger ones. Large writes
ensure sustainable write throughput to SSDs and in-
crease their duration. To assist in I/O merging, we use
get tile rows() to control a global execution order that en-
sures that all threads are processing contiguous tile rows
and the results from the tile rows are located closely on
6SSDs. We write computation results from MulTRows to SSDs
asynchronously with write rows async(), which merges the
writes from different threads.
To better utilize the CPU cache, MulTRows, reorganizes
t×t tiles from multiple contiguous tile rows into s×s blocks,
where s = CPU cache2×p (Figure 4). We process all tiles in a
s × s block before moving to the next block. The tile size
of a sparse matrix is relatively small. As such, this execu-
tion order helps to reuse data in the CPU cache from the
computation on the previous tile. MulTRows also maintains
a local memory buffer outBuf to store the computation
results, which minimizes remote memory access. Once the
computation in MulTRows is complete, outBuf contains
complete results.
We vectorize the computation on rows of dense matrices.
In SpMM, we multiply a non-zero entry from a tile with all
elements in a row of the input dense matrix and add the
results to the corresponding row of the output dense matrix.
We perform these operations with vector CPU instructions,
such as AVX [37], to enable more efficient memory access
and computation.
3.5 I/O optimizations
Semi-external memory sparse matrix multiplication streams
a sparse matrix from SSDs. When we access fast SSDs
sequentially, the overhead of operating systems, such as
thread context switches and memory allocation, becomes
noticeable. We tackle these obstacles to maximize I/O
throughput.
We issue asynchronous I/O and poll for I/O to avoid
thread context switches. I/O polling prevents a thread
from being switched out after it completes all available
computation. Without polling, when a thread issues an I/O
request and waits for I/O completion, the operating system
switches the thread out; the operating system reschedules
the thread for execution once I/O is complete. However,
there is latency for thread rescheduling and the latency from
frequent rescheduling can cause noticeable performance
degradation on a high-speed SSD array.
When accessing a sparse matrix or a dense matrix from
SSDs, we maintain a set of memory buffers for I/O access
to reduce the overhead of memory allocation. We use large
I/O to access matrices on SSDs to increase I/O throughput.
Large memory allocation is expensive because the operating
system usually allocates a large memory buffer with mmap()
and populates the buffer with physical pages. Therefore,
we keep a set of memory buffers allocated previously and
reuse them for new I/O requests. For accessing a sparse
matrix, tile rows usually have different sizes, so we resize
a previously allocated memory buffer if it is too small for a
new I/O request.
3.6 The impact of the memory size on I/O
More memory reduces I/O in semi-external memory. The
minimum memory requirement for SEM-SpMM is nc + t,
where n is the number of rows of the input dense matrix,
c is the element size in bytes, t is the number of threads
processing the sparse matrix and  is the buffer size for
the sparse matrix and the output dense matrix. When a
machine does not have sufficient memory to keep the entire
input dense matrix in memory, we need multiple passes on
the sparse matrix to complete the computation. Reducing
memory consumption is essential to achieve performance
in semi-external memory. By keeping more columns of the
input dense matrix in memory, we reduce the number of
I/O passes.
Although we can cache part of a sparse matrix, keeping
more columns of the input dense matrix in memory saves
more I/O than using the same amount of memory for the
sparse matrix. Assume the input dense matrix has n rows
and p columns. Again, c is the element size in bytes. The
storage size of the sparse matrix is E bytes and the memory
size is M bytes. We further assume we use M ′ bytes to keep
some columns of the dense matrices in memory (M ′ < M ,
ncp mod M ′ ≡ 0) and the remaining memory (M −M ′) to
cache the sparse matrix. The amount of data in the sparse
matrix read from SSDs is
IOin =
ncp
M ′
[E − (M −M ′)]
Because E > M in semi-external memory, we minimize
IOin by maximizing M ′, i.e., using more memory for the
input dense matrix.
As the number of columns in memory from the input
dense matrix increases, the bottleneck of the system may
switch. When we keep only one column of the input dense
matrix in memory, the system is usually I/O bound; when
we keep more columns of the dense matrix in memory, the
system will become CPU bound and the I/O complexity
does not affect its performance.
4 APPLICATIONS
We apply sparse matrix multiplication to three important
applications widely used in graph analysis and machine
learning: PageRank [3], eigensolver [16] and non-negative
matrix factorization [2]. Each application demonstrates a
different strategy of using memory for SpMM.
4.1 PageRank
PageRank is an algorithm to rank the Web pages by using
hyperlinks between Web pages. It was first used by Google
and is identified as one of the top 10 data mining algo-
rithms [38]. PageRank is a representative of a set of graph
algorithms that can be expressed with SpMM or generalized
SpMM, in which the dense matrices have only one column.
Other important examples are label propagation [39] and
belief propagation [40]. The algorithm runs iteratively and
its update rule for each Web pages in an iteration is
PR(u) =
1− d
N
+ d
∑
v∈B(u)
PR(v)
L(v)
where B(u) denotes the neighbor list of vertex u and L(v)
denotes the out-degree of vertex v.
4.2 Eigensolver
An eigensolver is another commonly used application that
requires sparse matrix multiplication. Many algorithms [41],
[42], [43] and frameworks [16], [44], [45] have been devel-
oped to solve a large eigenvalue problem.
7We take advantage of the Anasazi eigensolver frame-
work [16] and replace its original matrix operations with
our SEM-SpMM. To compute eigenvalues of a n× n sparse
matrix, many eigenvalue algorithms construct a vector sub-
space with a sequence of sparse matrix multiplications and
each vector in the subspace has the length of n. Due to the
sparsity of real-world graphs, the vector subspace is large
and we keep the vectors on SSDs. The Anasazi eigensolvers
have block extension to update multiple vectors in the
subspace simultaneously, which results in sparse matrix
dense matrix multiplication. The most efficient Anasazi
eigensolver on sparse graphs is the KrylovSchur eigensolver
[43], which updates a small number of vectors (1-4) in the
subspace simultaneously. Zheng et al. [46] provides the
details of extending the Anasazi eigensolver.
4.3 Non-negative matrix factorization
Non-negative matrix factorization (NMF) [2] finds two non-
negative low-rank matrices W and H to approximate a
matrix A ≈ WH . NMF has many applications in machine
learning and data mining. A well-known example is collab-
orative filtering [47] in recommender systems. NMF can also
be applied to graphs to find communities [48], [49].
Many algorithms are designed to solve NMF and here
we describe an algorithm [2] that requires a sequence of
sparse matrix dense matrix multiplications. The algorithm
uses multiplicative update rules and updates matrices W
and H alternately. In each iteration, the algorithm first fixes
W to update H and then fixes H to update W .
Haµ ← Haµ
(WTA)aµ
(WTWH)aµ
,Wia ←Wia (AH
T )ia
(WHHT )ia
We apply SEM-SpMM to NMF differently based on the
memory size and the number of columns in W and H .
Due to the sparsity of a graph, W and H may require
storage as large as the sparse matrix and can no longer
fit in memory. Therefore, we partition W and H vertically
and run multiple sparse matrix multiplications to compute
WTA and AHT , if the memory is not large enough.
5 EXPERIMENTAL EVALUATION
We evaluate the performance of our SEM-SpMM on mul-
tiple real-world billion-scale graphs including a web-page
graph with 3.4 billion vertices. We first measure the perfor-
mance of SEM-SpMM and compare it with our in-memory
implementation (IM-SpMM), which is simply the SEM-
SpMM implementation with the sparse matrix in mem-
ory. We also compare SEM-SpMM with state-of-the-art in-
memory implementations in Intel MKL (mkl dcsrmm) and
Trilinos Tpetra. We use Intel MKL 2015 and Trilinos v12.0.1
for the experiments. We demonstrate the effectiveness of
CPU and I/O optimizations on SEM-SpMM and evaluate
the overall performance of the applications in Section 4.
We conduct experiments on a NUMA machine with four
Intel Xeon E7-4860 processors, clocked at 2.6 GHz, and 1TB
memory of DDR3-1600. Each processor has 12 cores. The
machine has three LSI SAS 9300-8e host bus adapters (HBA)
connected to a SuperMicro storage chassis, in which 24 OCZ
Intrepid 3000 SSDs are installed. The 24 SSDs together are
Graph datasets # Vertices # Edges Directed
Twitter [50] 42M 1.5B Yes
Friendster [51] 65M 1.7B No
Page graph [6] 3.4B 129B Yes
RMAT-40 [52] 100M 3.7B Yes & No
RMAT-160 [52] 100M 14B Yes & No
TABLE 1: Graph data sets. We construct a directed and
undirected version for both RMAT-40 and RMAT-160.
 0
 0.2
 0.4
 0.6
 0.8
 1
Twitter Friendster
Page graph
RMAT-40
RMAT-160
R
el
at
iv
e 
pe
rfo
rm
an
ce
1col 2cols 4cols 8cols
(a) The runtime performance of SEM-SpMM, normalized to IM-
SpMM for the dense matrix with the same number of columns.
 0
 2
 4
 6
 8
 10
Twitter Friendster
Page graph
RMAT-40
RMAT-160
G
B/
s
(b) The I/O throughput generated by SEM-SpMM.
Fig. 5: The performance and the I/O throughput of
SEM-SpMM with dense matrices of different numbers of
columns.
capable of delivering 12 GB/s for read and 10 GB/s for write
at maximum. The machine runs Linux kernel v3.13.0. We
use 48 threads for all implementations.
We use the adjacency matrices of the graphs in Table 1
for performance evaluation. The smallest graph we use has
42 million vertices and 1.5 billion edges. The largest graph
is the Page graph with 3.4 billion vertices and 129 billion
edges. We generate two synthetic graphs with R-Mat [52]
to fill the size gap between the smallest and largest graphs.
We construct a directed and undirected version for each of
the synthetic graphs because some applications in Section
4 run on directed graphs and others run on undirected
graphs. The real-world datasets are publicly available and
the synthetic datasets are generated with the boost RMAT
generator1. We always use the undirected version of the
synthetic graphs for the performance evaluation of sparse
matrix multiplication.
5.1 SEM-SpMM vs. IM-SpMM
We compare the performance of SEM-SpMM with IM-
SpMM to investigate the performance penalty of scaling
1. We use the parameters of a = 0.57, b = 0.19, c = 0.19, d = 0.05.
8SpMM with SSDs. In this case, the dense matrices have a
small number of columns and are stored in memory.
There is only a small performance penalty for semi-
external memory (Figure 5a). The performance gap between
IM-SpMM and SEM-SpMM is affected by randomness of
vertex connection. The gap is smaller if vertex connection
in a graph is more random. The Page graph is relatively
well clustered, so SpMM on this graph is less CPU-bound
than others. Even for the Page graph, SEM-SpMM gets 65%
performance of IM-SpMM. The other factor of affecting the
performance gap is the number of columns in the dense
matrices. The gap gets smaller as the number of columns in
the dense matrices increases.
We further measure the average I/O throughput gener-
ated by SEM-SpMM to indicate the bottleneck of the system
(Figure 5b). These experiments vary the number of columns
in the dense matrix from 1 (Sparse Matrix Vector multi-
plication, SpMV) to 8. SpMV on the Page graph saturates
the I/O bandwidth of SSDs and is clearly bottlenecked by
I/O. SpMV on other graphs (except Twitter) also generates
high I/O throughput, which consumes memory bandwidth
and potentially interferes with random memory access in
SpMV. SEM-SpMV on the Twitter graph takes only about
0.5 second to complete and, thus, startup overhead has
significant impact on the average I/O throughput. As the
number of columns in the dense matrix increases, SEM-
SpMM becomes more CPU-bound and requires a small
number of columns to achieve close to 100% performance
of IM-SpMM.
Multiple factors affect CPU cache hits in sparse matrix
multiplication and the performance gap between IM-SpMV
and SEM-SpMV (Figure 6). We illustrate some of the factors
with stochastic block model (SBM) [53], a random graph
generation model with well-defined clusters widely used
in the graph community. In this experiment, we generate
SBM graphs with 100 million vertices and 3 billion edges.
If the vertices in a graph are randomly ordered, SpMV on
the graph generates many random memory access. Thus,
the system is bottlenecked by memory bandwidth and the
performance gap between IM-SpMV and SEM-SpMV is
small. On a graph with vertices ordered based on cluster
structures, the number of clusters and the ratio of edges
inside and outside clusters have significant impact on the
performance gap. As the number of clusters increases, the
size of clusters gets smaller and there are fewer random
memory accesses inside clusters. Similarly, more edges in-
side clusters also lead to fewer random memory accesses. In
both cases, the performance gap grows.
5.2 Comparison with other in-memory SpMM
In this section, we compare our SpMM implementation
with the Intel MKL and Trilinos Tpetra. Intel MKL runs
on shared-memory machines. Trilinos Tpetra runs in both
shared memory and distributed memory, so we measure its
performance in our 48-core NUMA machine and an EC2
cluster. We run Tpetra in the largest EC2 instances r3.8xlarge
with 16 physical CPU cores and 244GB of RAM each. The
EC2 instances are connected with 10Gbps network in the
same placement group. MKL and Tpetra cannot run on the
Page graph on our NUMA machine because their memory
consumption on this graph exceeds its memory capacity.
 0
 0.2
 0.4
 0.6
 0.8
 1
10 100 1000 10000
SE
M
/IM
#clusters
IN/OUT=1,clustered
IN/OUT=5,clustered
IN/OUT=1,unclustered
IN/OUT=5,unclustered
Fig. 6: The relative performance of SEM-SpMV on graphs
generated with stochastic block model, normalized to IM-
SpMV on the same graphs. IN/OUT indicates the ratio
of edges inside and outside clusters. A graph can have
vertices ordered based on the cluster structures (clustered)
and randomly ordered (unclustered).
 0
 0.2
 0.4
 0.6
 0.8
 1
Twitter Friendster RMAT-40 RMAT-160
R
el
at
iv
e 
pe
rfo
rm
an
ce
SEM-SpMM MKL Trilinos
(a) SpMV
 0
 0.2
 0.4
 0.6
 0.8
 1
Twitter Friendster RMAT-40 RMAT-160
R
el
at
iv
e 
pe
rfo
rm
an
ce
(b) SpMM with a dense matrix of 8 columns.
Fig. 7: The performance of different sparse matrix multipli-
cation implementations on the 48-core machine normalized
to IM-SpMM for the same graphs.
Our IM-SpMM and SEM-SpMM significantly outper-
form Intel MKL and Trilinos Tpetra on the natural graphs
on our NUMA machine for both SpMV and SpMM (Figure
7). Even though the Tpetra implementation is optimized for
SpMV, our SEM-SpMM constantly outperforms Tpetra by a
factor of 2− 3 even in SpMV. MKL has better optimizations
for SpMM than Tpetra. Our SEM-SpMM is still almost
twice as fast as MKL in SpMM with a dense matrix of
eight columns. Our SpMM implementation benefits all of
the in-memory optimizations to achieve high computation
efficiency and has substantial I/O optimizations to reduce
the performance gap between IM-SpMM and SEM-SpMM
(Section 5.4). Specifically, the fine-grain dynamic load bal-
ancer achieves better load balancing than MKL and Tpetra.
Cache blocking significantly reduces CPU cache misses,
whereas MKL and Tpetra store a sparse matrix in CSC or
CSR format and have more CPU cache misses.
9 0
 100
 200
 300
 400
 500
 600
 700
 800
1col 8cols
M
em
 (G
B)
SEM-SpMM
IM-SpMM
MKL
Trilinos
Fig. 8: Memory consumption of different SpMM implemen-
tations on RMAT-160.
 0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
 1.4
Friendster Twitter RMAT-40 RMAT-160
R
el
at
iv
e 
pe
rfo
rm
an
ce
2x EC2
4x EC2
8x EC2
16x EC2
IM-EC2
SEM
(a) SpMV
 0
 0.2
 0.4
 0.6
 0.8
 1
 1.2
 1.4
Friendster Twitter RMAT-40 RMAT-160
R
el
at
iv
e 
pe
rfo
rm
an
ce
(b) SpMM with a dense matrix of 8 columns.
Fig. 9: The performance of SEM-SpMM on our 48-core
machine (SEM) and Trilinos Tpetra on EC2 clusters (2xEC2,
4xEC2, 8xEC2 and 16xEC2), normalized to IM-SpMM on
our 48-core machine for the same graphs. We also show the
performance of IM-SpMM on one of the EC2 instance (IM-
EC2) where Tpetra runs.
SEM-SpMM consumes a small fraction of memory when
compared with IM-SpMM and other SpMM implementa-
tions (Figure 8). SEM-SpMM consumes memory for the
input dense matrix and per-thread memory buffers for the
sparse matrix and the output dense matrix. The memory
used by memory buffers in each thread is significant but is
relatively constant for different graph sizes. We show the
memory consumption on one of the large graphs RMAT-160
(Figure 8). SEM-SpMM uses about one tenth of the memory
used by IM-SpMM, while IM-SpMM consumes much less
memory than MKL and Tpetra owing to its compact format
for sparse matrices.
Our SEM-SpMM on a large parallel machine achieves
comparable performance and, in many cases, outperforms
Trilinos Tpetra using five times as much processing power,
especially on real-world graphs (Figure 9). In this exper-
iment, we run our SpMM implementation on both our
NUMA machine with 48 CPU cores and one of the EC2
 0
 0.2
 0.4
 0.6
 0.8
 1
1 2 4 8 16 32R
el
at
iv
e 
pe
rfo
rm
an
ce
#columns of the input dense matrix that are kept in memory
Twitter
Friendster
RMAT-40
RMAT-160
Fig. 10: The performance of SEM-SpMM with a dense matrix
of 32 columns relative to IM-SpMM, when the number of
columns of the input dense matrix kept in memory varies.
machines with 16 CPU cores. Owing to the compact for-
mat for a sparse matrix, our IM-SpMM runs on all of the
graphs on an EC2 instance and achieves around half of
the performance of our IM-SpMM on our NUMA machine.
When Tpetra runs on 16 EC2 instances, it has 5 times as
many CPU cores as our NUMA machine. Tpetra cannot
run SpMV on RMAT-160 on two EC2 nodes. Tpetra uses
many more computation resources and still barely reaches
the same performance as our IM-SpMM and SEM-SpMM
on our NUMA machine. One of the main reasons that
our SpMM implementation performs much better on real-
world graphs is that these graphs are more likely to cause
load imbalance. The dynamic scheduling of our SpMM
implementation balances load much better than Tpetra.
5.3 SEM-SpMM with a large input dense matrix
We further measure the performance of SEM-SpMM with a
large input dense matrix, in which neither the sparse matrix
nor the dense matrices can fit in memory. In this experiment,
we measure the performance of multiplying a sparse matrix
with a vertically partitioned dense matrix and the input and
output dense matrices are stored on SSDs. We study the
impact of memory size on the performance of SEM-SpMM
by artificially varying the number of columns that can fit in
memory. SEM-SpMM accesses the sparse matrix with direct
I/O and, thus, varying the number of columns stored in
memory does not affect I/O access to the sparse matrix.
IM-SpMM does not need to partition the dense matrices
vertically. We do not show the result on the Page graph
because the dense matrix with 32 columns for this graph
cannot fit in memory.
As more columns in the input dense matrix can fit
in memory, the performance of SEM-SpMM constantly in-
creases (Figure 10). When the memory can fit over four
columns of the input dense matrix, SEM-SpMM gets over
50% of the performance of IM-SpMM. Even when only one
column of the input dense matrix can fit in memory, SEM-
SpMM still gets 25% of the in-memory performance. When
the entire input dense matrix can fit in memory, we get
about 80% of the in-memory performance.
Two main factors lead to performance loss in SEM-
SpMM when the input dense matrix cannot fit in memory.
We illustrate the contribution of four potential overheads in
SEM-SpMM on the Friendster graph (Figure 11). The main
performance loss comes from the loss of data locality in
10
 0
 10
 20
 30
 40
 50
 60
1 2 4 8 16 32
R
un
tim
e 
(s)
#columns of the input dense matrix that are kept in memory
Base
Vert-part
SpM-EM
Out-EM
In-EM
Fig. 11: The overhead breakdown of SEM-SpMM on the
Friendster graph with a dense matrix of 32 columns when
the number of columns in the input dense matrix kept in
memory varies.
SpMM caused by vertical partitioning of the input dense
matrix (Vert-part). Partitioning the dense matrix into one-
column matrices contributes 60% of performance loss. It
drops quickly when the vertical partition size increases.
Keeping the sparse matrix on SSDs (SpM-EM) also con-
tributes some performance loss when the dense matrix is
partitioned into small matrices. The overhead almost goes
away when more than four columns of the dense matrix
can fit in memory. The overhead of streaming the output
dense matrix to SSDs (Out-EM) and reading the input dense
matrix to memory (In-EM) is less significant and remains the
same for different memory sizes.
5.4 Optimizations on sparse matrix multiplication
Accelerating SEM-SpMM requires both computation and
I/O optimizations. We first evaluate the effectiveness of
computation optimizations by deploying them on IM-
SpMM. We further show the effectiveness of I/O optimiza-
tions on SEM-SpMM with all computation optimizations.
The number of columns in the dense matrix affects the
effectiveness of the optimizations. As such, we illustrate
their effectiveness on SpMM for dense matrices with one
column (SpMV) up to eight columns.
Here we illustrate the most significant computation op-
timizations from Section 3. We start with an in-memory
implementation that performs sparse matrix multiplication
on a sparse matrix in the CSR format and apply the opti-
mizations incrementally in the following order:
• dispatch partitions of a sparse matrix to threads dynam-
ically to balance load (Load balance),
• partition dense matrices for NUMA (NUMA),
• organize the non-zero entries in a sparse matrix into
tiles to increase CPU cache hits (Cache blocking),
• use CPU vectorization instructions to accelerate arith-
metic computation (Vec),
All of these optimizations have positive effects on sparse
matrix multiplication and all optimizations together speed
up SpMM by 3−5 times (Figure 12). The degree of effective-
ness varies between different graphs and different numbers
of columns in the dense matrices. The largest performance
boost is from cache blocking, especially for SpMV. This is
expected because vertices in these graph can connect to ar-
bitrary vertices, which leads many random memory access
 0
 1
 2
 3
 4
 5
SpMV(F) SpMM(F) SpMV(T) SpMM(T)
Sp
ee
du
p
Base
Load balance
NUMA
Cache blocking
Vec
Fig. 12: The speedup of each computation optimization for
SpMV and SpMM on the Friendster (F) and Twitter (T)
graph. The input dense matrix in SpMM has 8 columns.
 0
 0.5
 1
 1.5
 2
 2.5
Friendster Page graph
Sp
ee
du
p
Base
SCSR
Buf-pool
IO-poll
Fig. 13: The speedup of I/O optimizations for SpMV on the
Friendster graph and the Page graph.
in sparse matrix multiplication. Cache blocking significantly
increases CPU cache hits to reduce random memory access.
CPU vectorization is only effective on SpMM because it
optimizes computation on a row of the dense matrix. With
all optimizations, we have a fast in-memory implementation
for both sparse matrix vector multiplication and sparse
matrix dense matrix multiplication.
We evaluate I/O optimizations on SEM-SpMV against
a base implementation that has all of the computation op-
timizations and use doubly compressed sparse row format
(DCSR) to store tiles of a sparse matrix. We illustrate their
effectiveness on the Friendster graph and the Page graph.
The first one represents a graph that is not well clustered;
the other one is clustered with domain names. We apply the
I/O optimizations in the following order:
• use SCSR to reduce data read from SSDs (SCSR),
• reduce memory allocation overhead for I/O with per-
thread buffer pools (buf-pool),
• reduce the number of thread context switches for I/O
accesses with I/O polling (IO-poll),
The I/O optimizations lead to substantial speedup over
the base implementation, but behave very differently on
these two graphs (Figure 13). On the unclustered graph
(Friendster), SCSR requires a much smaller storage size than
DCSR (Figure 2) and thus achieves significant speedup. The
Page graph, on the other hand, is well clustered and DCSR
already achieves a small storage size. SCSR further reduces
the storage size, but is less significant. SpMV on the Page
graph has less random memory access and is I/O-bound
even on a large SSD array. Buf-pool and IO-poll increases I/O
throughput and, thus, improves performance. In contrast,
11
Conv Conv (I/O) SpMV SpMV (I/O)
Page graph 86.00 s 9.33 GB/s 26.92 s 10.60 s
RMAT-160 19.81 s 9.51 GB/s 7.96 s 10.27 s
TABLE 2: The speed and average I/O throughput of format
conversion from CSR to SCSR on large graphs, compared
with SEM-SpMV.
SEM-SpMV with the Friendster graph in the SCSR format
already achieves almost 80% of IM-SpMV and, thus, further
I/O optimizations have less noticeable speedup.
Format conversion from CSR to SCSR has low cost with
linear time complexity and is bottlenecked by SSDs on our
largest graphs (Table 2). It reads the CSR image once and
write the SCSR image back once, both sequentially. This
results in the minimum amount of I/O and I/O bounds
the computation. Format conversion generates only one-
time overhead. Most of the applications such as PageRank
and eigensolving require many iterations of sparse matrix
multiplication to compute accurate results. As such, the
format conversion overhead is amortized.
5.5 Application performance
We evaluate the performance of our implementations of
the applications in Section 4. We show the effectiveness of
additional memory for these applications and compare their
performance with state-of-the-art implementations.
5.5.1 PageRank
We evaluate the performance of our SpMM-based PageR-
ank implementation (SpMM-PageRank). This implementa-
tion requires the input vector to be in memory, but it is
optional to keep the output vector and the degree vector
in memory. PageRank is a benchmarking graph algorithm
implemented by many graph processing frameworks. We
compare the performance of SpMM-PageRank with state-
of-the-art implementations in FlashGraph [12], a semi-
external memory graph engine, and GraphLab Create, the
next generation of PowerGraph [54]. The PageRank im-
plementation in FlashGraph computes approximate PageR-
ank values while SpMM-PageRank and GraphLab Create
compute exact PageRank values. We run GraphLab Create
completely in memory and FlashGraph in semi-external
memory. GraphLab Create is not able to compute PageRank
on the Page graph. We use FlashGraph v0.3 and a trial
version of GraphLab Create v1.9.
SpMM-PageRank in memory and in semi-external mem-
ory both significantly outperform the implementations in
FlashGraph and GraphLab Create (Figure 14). Our SpMM
is highly optimized for both CPU and I/O. Even though
SpMM-PageRank performs more computation than Flash-
Graph, it performs the computation much more efficiently
and reads less data from SSDs than FlashGraph. SpMM-
PageRank and the implementation in GraphLab create per-
forms the same computation, but SpMM-PageRank per-
forms the computation much more efficiently.
The experiment results also show that keeping more
vectors in memory has modest performance improvement
for SpMM-PageRank. As such, SpMM-PageRank only needs
to keep one vector in memory, which results in very small
memory consumption.
 1
 10
 100
 1000
Twitter RMAT-40 RMAT-160
Page graph
R
un
tim
e 
(s)
IM
SEM-1vec
SEM-2vec
SEM-3vec
FlashGraph
GraphLab
Fig. 14: The runtime of SpMM-PageRank in 30 iterations.
The SEM implementation keeps different numbers of vec-
tors in memory (SEM-1vec, SEC-2vec, SEM-3vec). We com-
pare them with the implementations in FlashGraph and
GraphLab Create.
 1
 10
 100
Friendster RMAT-40 RMAT-160
Page Graph
R
un
tim
e 
(m
in)
Trilinos
SEM-min
SEM-max
IM
Fig. 15: The runtime of our SEM KrylovSchur, our in-
memory eigensolver and the Trilinos eigensolvers when
computing eight eigenvalues. SEM-min keeps the entire
vector subspace on SSDs and SEM-max keeps the entire
vector subspace in memory.
5.5.2 Eigensolver
We evaluate the performance of our SEM KrylovSchur
eigensolver and compare its performance with our in-
memory eigensolver and the Trilinos KrylovSchur eigen-
solver. Usually, spectral analysis only requires a very small
number of eigenvalues, so we compute eight eigenvalues
in this experiment. We run the eigensolvers on the smaller
undirected graphs in Table 1. To evaluate the scalability
of the SEM eigensolver, we compute singular value de-
composition (SVD) on the Page graph. Among all of the
eigensolvers, only our SEM eigensolver is able to compute
eigenvalues on the Page graph.
For computing 8 eigenvalues, our SEM eigensolver
achieves performance comparable to our in-memory eigen-
solver and the Trilinos eigensolver and can scale to very
large graphs (Figure 15). Unlike PageRank, an eigensolver
has many more vector or dense matrix operations. As such,
the memory size has noticeable impact on performance. For
the setting with the minimum memory consumption, it has
at least 45% performance of our in-memory eigensolver;
when keeping the entire subspace in memory, it has almost
the same performance as our in-memory eigensolver.
5.5.3 NMF
We evaluate the performance of our NMF implementation
(SEM-NMF) on the directed graphs in Table 1. The dense
12
 0
 50
 100
 150
 200
 250
 300
 350
Twitter RMAT-40 RMAT-160
R
un
tim
e 
(s)
SEM-2cols
SEM-8cols
SEM-32cols
IM
SmallK
Fig. 16: The runtime per iteration of SEM-NMF on directed
graphs. We vary the number of columns in the dense matri-
ces that are kept in memory to evaluate effect of the memory
size on the performance of SEM-NMF.
matrices for NMF can be as large as the sparse matrix. As
such, we experiment with the effect of the memory size
on the performance of SEM-NMF by varying the num-
ber of columns in memory from the dense matrices. We
also compare the performance of SEM-NMF with a high-
performance NMF implementation SmallK [55], built on top
of the numeric library Elemental [56]. We factorize each
of the graphs into two n × k non-negative dense matrices
and we use k = 16 because 16 is the largest k that SmallK
supports for the graphs in Table 1. We use SmallK v1.6 and
Elemental v0.85.
We significantly improve the performance of SEM-NMF
by keeping more columns of the input dense matrix in
memory (Figure 16). The performance improvement is more
significant when the number of columns that fit in memory
is small. When we keep eight columns of the input dense
matrix in memory, SEM-NMF achieves over 60% of the
performance of the in-memory implementation.
SEM-NMF significantly outperforms other NMF imple-
mentations. SmallK is the closest competitor. We run the
same NMF algorithm in SmallK and SEM-NMF outperforms
SmallK by a large factor on all graphs (Figure 16).
6 CONCLUSIONS
We present an alternative solution for scaling sparse matrix
dense matrix multiplication (SpMM) to large sparse matri-
ces by utilizing commodity SSDs in a large parallel machine.
We perform this operation in semi-external memory (SEM),
in which we keep the sparse matrix on SSDs and the
dense matrices in memory. Semi-external memory increases
scalability in proportion to the ratio of non-zero entries to
rows or columns in a sparse matrix, while achieving in-
memory performance. We demonstrate that our approach
provides a promising alternative to distributed computation
for large-scale data analysis.
With substantial memory and I/O optimizations, our
SEM-SpMM achieves high efficiency while scaling to large
graphs with billions of vertices and hundreds of billions of
edges. It significantly outperforms the Intel MKL and Trili-
nos implementations. We run our implementation in mem-
ory (IM-SpMM) to quantify the overhead of keeping data
on SSDs. SEM-SpMM achieves almost 100% performance of
IM-SpMM on some graphs when the dense matrices have
more than four columns, and achieves at least 65% of the
performance of IM-SpMM on all graphs even when the
dense matrix has only one column.
For a machine with insufficient memory to keep the
entire input dense matrix in memory, we partition the dense
matrix vertically and run SEM-SpMM multiple times. In this
case, the main overhead of SEM-SpMM comes from the loss
of data locality caused by vertical partitioning on the dense
matrix. However, given sufficient memory to keep a small
number of columns of the input dense matrix, we achieve
performance comparable to IM-SpMM.
We apply our sparse matrix multiplication to three im-
portant applications: PageRank, eigendecomposition and
non-negative matrix factorization. We demonstrate how ad-
ditional memory should be used in semi-external memory
in each application. We further demonstrate that each of our
implementations significantly outperforms state of the art
and scales to large graphs.
Our SSD-based solution also achieves high energy ef-
ficiency even though we have not measured energy con-
sumption explicitly. SSDs are energy-efficient storage media
[57] compared with RAM and hard drives. When processing
large datasets, our solution only uses a single machine and
requires a relatively small amount of memory. In contrast,
a distributed solution requires many more machines and
much more aggregate memory in order to process datasets
of the same size. As such, our solution introduces an energy-
efficient architecture for large-scale data analysis tasks.
ACKNOWLEDGMENTS
This work is partially supported by NSF ACI-1261715,
DARPA GRAPHS N66001-14-1-4028 and DARPA SIMPLEX
program through SPAWAR contract N66001-15-C-4041.
REFERENCES
[1] G. Golub and W. Kahan, “Calculating the singular values and
pseudo-inverse of a matrix,” Journal of the Society for Industrial and
Applied Mathematics Series B Numerical Analysis, vol. 2, no. 2, pp.
205–224, 1965.
[2] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix
factorization,” in Advances in Neural Information Processing Systems
13, 2001.
[3] S. Brin and L. Page, “The anatomy of a large-scale hypertextual
web search engine,” in Proceedings of the Seventh International
Conference on World Wide Web 7, 1998.
[4] T. Mattson, D. Bader, J. Berry, A. Buluc, J. Dongarra, C. Falout-
sos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner,
C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt,
M. Stonebraker, S. Wallach, and A. Yoo, “Standards for graph
algorithm primitives,” in High Performance Extreme Computing
Conference (HPEC), 2013 IEEE, 2013.
[5] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel,
“Optimization of sparse matrix-vector multiplication on emerging
multicore platforms,” in Proceedings of the 2007 ACM/IEEE Confer-
ence on Supercomputing, 2007.
[6] “Web graph,” http://webdatacommons.org/hyperlinkgraph/,
Accessed 4/18/2014.
[7] A. Yoo, A. H. Baker, R. Pearce, and V. E. Henson, “A scalable eigen-
solver for large scale-free graphs using 2D graph partitioning,”
in Proceedings of 2011 International Conference for High Performance
Computing, Networking, Storage and Analysis, 2011.
[8] E. G. Boman, K. D. Devine, and S. Rajamanickam, “Scalable
matrix computations on large scale-free graphs using 2D graph
partitioning,” in Proceedings of the International Conference on High
Performance Computing, Networking, Storage and Analysis, 2013.
13
[9] J. A. Ang, R. F. Barrett, R. E. Benner, D. Burke, C. Chan, J. Cook,
D. Donofrio, S. D. Hammond, K. S. Hemmert, S. M. Kelly, H. Le,
V. J. Leung, D. R. Resnick, A. F. Rodrigues, J. Shalf, D. Stark,
D. Unat, and N. J. Wright, “Abstract machine models and proxy
architectures for exascale computing,” in Proceedings of the 1st
International Workshop on Hardware-Software Co-Design for High
Performance Computing, 2014.
[10] F. McSherry, M. Isard, and D. G. Murray, “Scalability! but at what
cost?” in 15th Workshop on Hot Topics in Operating Systems (HotOS
XV), 2015.
[11] “Scalability! but at what cost?” http://www.frankmcsherry.
org/graph/scalability/cost/2015/01/15/COST.html, Accessed
9/3/2016.
[12] D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and
A. S. Szalay, “FlashGraph: Processing billion-node graphs on an
array of commodity SSDs,” in 13th USENIX Conference on File and
Storage Technologies (FAST 15), 2015.
[13] J. Abello, A. L. Buchsbaum, and J. R. Westbrook, “A functional
approach to external graph algorithms,” in Algorithmica, 1998.
[14] D. Zheng, R. Burns, and A. S. Szalay, “Toward millions of file sys-
tem IOPS on low-cost, commodity hardware,” in Proceedings of the
International Conference on High Performance Computing, Networking,
Storage and Analysis, 2013.
[15] C. Min, K. Kim, H. Cho, S.-W. Lee, and Y. I. Eom, “SFS: Random
write considered harmful in solid state drives,” in Proceedings of
the 10th USENIX Conference on File and Storage Technologies, 2012.
[16] C. G. Baker, U. L. Hetmaniuk, R. B. Lehoucq, and H. K. Thorn-
quist, “Anasazi software for the numerical solution of large-scale
eigenvalue problems,” ACM Trans. Math. Softw., vol. 36, no. 3, pp.
13:1–13:23, 2009.
[17] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu,
T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T.
Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M.
Willenbring, A. Williams, and K. S. Stanley, “An overview of the
Trilinos project,” ACM Trans. Math. Softw., 2005.
[18] “Intel math kernel library,” https://software.intel.com/en-us/
intel-mkl, Accessed 1/24/2016.
[19] H. M. Aktulga, A. Buluc¸, S. Williams, and C. Yang, “Optimizing
sparse matrix-multiple vectors multiplication for nuclear configu-
ration interaction calculations,” in Proceedings of the 2014 IEEE 28th
International Parallel and Distributed Processing Symposium, 2014.
[20] P. Koanantakool, A. Azad, A. Buluc¸, D. Morozov, S.-Y. Oh,
L. Oliker, and K. Yelick, “Communication-avoiding parallel
sparse-dense matrix-matrix multiplication,” in Proceedings of the
2016 IEEE 30th International Parallel and Distributed Processing Sym-
posium, 2016.
[21] E.-J. Im, K. Yelick, and R. Vuduc, “Sparsity: Optimization frame-
work for sparse matrix kernels,” Int. J. High Perform. Comput. Appl.,
2004.
[22] A. Buluc and J. R. Gilbert, “On the representation and multiplica-
tion of hypersparse matrices,” in Parallel and Distributed Processing,
2008. IPDPS 2008. IEEE International Symposium on, 2008.
[23] R. Pearce, M. Gokhale, and N. M. Amato, “Multithreaded asyn-
chronous graph traversal for in-memory and semi-external mem-
ory,” in Proceedings of the ACM/IEEE International Conference for
High Performance Computing, Networking, Storage and Analysis, 2010.
[24] X. Zhu, W. Han, and W. Chen, “GridGraph: Large-scale graph
processing on a single machine using 2-level hierarchical parti-
tioning,” in 2015 USENIX Annual Technical Conference (USENIX
ATC 15), 2015.
[25] M. Stonebraker, P. Brown, A. Poliakov, and S. Raman, “The archi-
tecture of scidb,” in Proceedings of the 23rd International Conference
on Scientific and Statistical Database Management, 2011.
[26] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann,
“The multidimensional database system rasdaman,” in Proceedings
of the 1998 ACM SIGMOD International Conference on Management
of Data, 1998.
[27] D. Kernert, F. Ko¨hler, and W. Lehner, “Slacid - sparse linear algebra
in a column-oriented in-memory database system,” in Proceed-
ings of the 26th International Conference on Scientific and Statistical
Database Management, 2014.
[28] F. Liu, K. Lee, I. Roy, V. Talwar, S. Chen, J. Chang, and P. Ran-
ganathan, “GPU accelerated array queries: The good, the bad, and
the promising,” HP Laboratories, Tech. Rep. HPL-2014-50, 2014.
[29] J. Dean and S. Ghemawat, “MapReduce: Simplified data pro-
cessing on large clusters,” in Proceedings of the 6th Conference on
Symposium on Opearting Systems Design & Implementation, 2004.
[30] U. Kang, C. E. Tsourakakis, and C. Faloutsos, “PEGASUS: A peta-
scale graph mining system implementation and observations,” in
Proceedings of the 2009 Ninth IEEE International Conference on Data
Mining, 2009.
[31] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos, “GBASE: A
scalable and general graph management system,” in Proceedings
of the 17th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2011.
[32] R. Liao, Y. Zhang, J. Guan, and S. Zhou, “CloudNMF: A mapre-
duce implementation of nonnegative matrix factorization for
large-scale biological datasets,” Genomics, Proteomics & Bioinformat-
ics, vol. 12, no. 1, pp. 48 – 51, 2014.
[33] J. Yin, L. Gao, and Z. Zhang, “Scalable nonnegative matrix factor-
ization with block-wise updates,” in ECML PKDD, 2014.
[34] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang, “Distributed
nonnegative matrix factorization for web-scale dyadic data analy-
sis on mapreduce,” in Proceedings of the 19th International Conference
on World Wide Web, 2010.
[35] Z. Zhou, E. Saule, H. Aktulga, C. Yang, E. Ng, P. Maris, J. Vary,
and U. Catalyurek, “An out-of-core eigensolver on SSD-equipped
clusters,” in Cluster Computing (CLUSTER), 2012 IEEE International
Conference on, 2012.
[36] P. Arbenz, U. L. Hetmaniuk, R. B. Lehoucq, and R. S. Tuminaro, “A
comparison of eigensolvers for large-scale 3D modal analysis us-
ing AMG-preconditioned iterative methods,” International Journal
for Numerical Methods in Engineering, 2005.
[37] C. Lomont, “Introduction to intel advanced vector
extensions,” https://software.intel.com/en-us/articles/
introduction-to-intel-advanced-vector-extensions, 2011.
[38] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda,
G. McLachlan, A. Ng, B. Liu, P. Yu, Z.-H. Zhou, M. Steinbach,
D. Hand, and D. Steinberg, “Top 10 algorithms in data mining,”
Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
[39] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled
data with label propagation,” Carnegie Mellon University, Tech.
Rep., 2002.
[40] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Exploring artificial
intelligence in the new millennium.” San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 2003, ch. Understanding Belief
Propagation and Its Generalizations, pp. 239–269.
[41] C. Lanczos, “An iteration method for the solution of the eigen-
value problem of linear differential and integral operators,” Journal
of Research of the National Bureau of Standards, 1950.
[42] D. Calvetti, L. Reichel, and D. C. Sorensen, “An implicitly restarted
lanczos method for large symmetric eigenvalue problems,” ETNA,
vol. 2, pp. 1–21, 1994.
[43] G. W. Stewart, “A KrylovSchur algorithm for large eigenprob-
lems,” SIAM Journal on Matrix Analysis and Applications, 2002.
[44] R. Lehoucq, D. Sorensen, and C. Yang, ARPACK Users’ Guide.
Society for Industrial and Applied Mathematics, 1998.
[45] V. Hernandez, J. E. Roman, and V. Vidal, “SLEPc: A scalable and
flexible toolkit for the solution of eigenvalue problems,” ACM
Trans. Math. Softw., 2005.
[46] D. Zheng, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay,
“An SSD-based eigensolver for spectral analysis on billion-node
graphs,” CoRR, vol. abs/1602.01421, 2016.
[47] X. Su and T. M. Khoshgoftaar, “A survey of collaborative filtering
techniques,” Adv. in Artif. Intell., vol. 2009, pp. 4:2–4:2, Jan. 2009.
[48] J. Yang and J. Leskovec, “Overlapping community detection at
scale: A nonnegative matrix factorization approach,” in Proceedings
of the Sixth ACM International Conference on Web Search and Data
Mining, 2013.
[49] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding, “Community dis-
covery using nonnegative matrix factorization,” Data Min. Knowl.
Discov., vol. 22, no. 3, pp. 493–521, 2011.
[50] H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social
network or a news media?” in Proceedings of the 19th International
Conference on World Wide Web, 2010.
[51] J. Yang and J. Leskovec, “Defining and evaluating network com-
munities based on ground-truth,” in Proceedings of the ACM
SIGKDD Workshop on Mining Data Semantics, 2012.
[52] D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive
model for graph mining,” in In SDM, 2004.
[53] D. L. Sussman, M. Tang, D. E. Fishkind, and C. E. Priebe, “A con-
sistent adjacency spectral embedding for stochastic blockmodel
graphs,” Journal of the American Statistical Association, 2012.
14
[54] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin,
“PowerGraph: Distributed graph-parallel computation on natural
graphs,” in Proceedings of the 10th USENIX Conference on Operating
Systems Design and Implementation, 2012.
[55] R. Boyd, B. Drake, D. Kuang, and H. Park, “Smallk is a
C++/Python high-performance software library for nonnegative
matrix factorization (nmf) and hierarchical and flat clustering
using the nmf; current version 1.2.0,” http://smallk.github.io/,
2014.
[56] J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and
N. A. Romero, “Elemental: A new framework for distributed
memory dense matrix computations,” ACM Trans. Math. Softw.,
vol. 39, no. 2, pp. 13:1–13:24, 2013.
[57] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah, “Analyzing the
energy efficiency of a database server,” in Proceedings of the 2010
ACM International Conference on Management of Data, 2010.
Da Zheng received the BS degree in computer
science from Zhejiang University, China, in 2006,
the MS degree in computer science from E´cole
polytechnique fe´de´rale de Lausanne, in 2009.
Since 2010, he is a PhD student of computer sci-
ence at Johns Hopkins University. His research
interests include high-performance computing,
large-scale data analysis systems, programming
languages and large-scale machine learning.
Disa Mhembere received the BS degree in elec-
trical engineering from Morgan State University,
in 2009, the MS degree in engineering manage-
ment in 2012 and the MS degree in computer
science in 2015 from Johns Hopkins University.
He is a PhD student of computer science at
Johns Hopkins University. His research interests
include large-scale graph analysis and cluster-
ing.
Vince Lyzinski received the BSc degree in
mathematics from the University of Notre Dame
in 2006, the MSc degree in mathematics from
John Hopkins University (JHU) in 2007, the
MScE degree in applied mathematics and statis-
tics from JHU in 2010, and the PhD degree in
applied mathematics and statistics from JHU in
2013. From 2013 to 2014, he was a postdoctoral
fellow in the Applied Mathematics and Statistics
Department at JHU. Since 2014, he has been
a senior research scientist at the JHU Human
Language Technology Center of Excellence and an Assistant Research
Professor in the AMS Department at JHU. His research interests in-
clude graph matching, statistical inference on random graphs, pattern
recognition, dimensionality reduction, stochastic processes, and high-
dimensional data analysis.
Joshua T. Vogelstein received the BS degree
from the Department of Biomedical Engineering
(BME) at Washington University in St. Louis,
in 2002, the MS degree from the Department
of Applied Mathematics and Statistics at Johns
Hopkins University (JHU), in 2009, and the PhD
degree from the Department of Neuroscience
at JHU in 2009. He was a postdoctoral fellow
in AMS at JHU from 2009 until 2011, when he
was appointed an assistant research scientist,
and became a member of the Institute for Data
Intensive Science and Engineering. He spent two years at Information
Initiative at Duke, before becoming Assistant Professor in BME at JHU,
and core faculty in the Institute for Computational Medicine and the
Center for Imaging Science. His research interests include computa-
tional statistics, focusing on ultrahigh-dimensional and non-Euclidean
neuroscience data, especially connectomics.
Carey E. Priebe received the BS degree in
mathematics from Purdue University, in 1984,
the MS degree in computer science from San
Diego State University, in 1988, and the PhD
degree in information technology (computational
statistics) from George Mason University, in
1993. From 1985 to 1994, he worked as a
mathematician and scientist in the US Navy
research and development laboratory system.
Since 1994, he has been a professor in the
Department of Applied Mathematics and Statis-
tics, Johns Hopkins University (JHU). His research interests include
computational statistics, kernel and mixture estimates, statistical pattern
recognition, statistical image analysis, dimensionality reduction, model
selection, and statistical inference for high-dimensional and graph data.
He is a lifetime member of the Institute of Mathematical Statistics, an
elected member of the International Statistical Institute, and a fellow of
the American Statistical Association.
Randal Burns received the BS degree in Geo-
physics from Stanford University, in 1993, the
MS degree in computer science from Univer-
sity of California, Santa Cruz, in 1997, and the
PhD degree in in computer science from the
University of California, Santa Cruz, in 2000.
Since 2001, he has been been a professor in
the Department of Computer Science at Johns
Hopkins University, where he directs the Hopkins
Storage Systems Lab, serves on the advisory
board of Kavli Neuroscience Discovery Institute,
is a memory of the Institute for Data Intensive Engineering & Science,
and is a founder of NeuroData and the Open Connectome Project. His
research focuses on the management and analysis of large datasets
used in scientific applications.
