An SSD-based eigensolver for spectral analysis on billion-node graphs by Zheng, Da et al.
An SSD-based eigensolver for spectral analysis on billion-node graphs
Da Zheng, Randal Burns
Department of Computer Science
Johns Hopkins University
Joshua Vogelstein
Institute for Computational Medicine
Department of Biomedical Engineering
Johns Hopkins University
Carey E. Priebe
Department of Applied Mathematics and Statistics
Johns Hopkins University
Alexander S. Szalay
Department of Physics and Astronomy
Johns Hopkins University
Abstract
Many eigensolvers such as ARPACK and Anasazi have
been developed to compute eigenvalues of a large sparse
matrix. These eigensolvers are limited by the capacity
of RAM. They run in memory of a single machine for
smaller eigenvalue problems and require the distributed
memory for larger problems.
In contrast, we develop an SSD-based eigensolver
framework called FlashEigen, which extends Anasazi
eigensolvers to SSDs, to compute eigenvalues of a graph
with hundreds of millions or even billions of vertices in
a single machine. FlashEigen performs sparse matrix
multiplication in a semi-external memory fashion, i.e.,
we keep the sparse matrix on SSDs and the dense ma-
trix in memory. We store the entire vector subspace on
SSDs and reduce I/O to improve performance through
caching the most recent dense matrix. Our result shows
that FlashEigen is able to achieve 40%-60% performance
of its in-memory implementation and has performance
comparable to the Anasazi eigensolvers on a machine
with 48 CPU cores. Furthermore, it is capable of scaling
to a graph with 3.4 billion vertices and 129 billion edges.
It takes about four hours to compute eight eigenvalues of
the billion-node graph using 120 GB memory.
1 Introduction
Spectral analysis [] is a fundamental tool for both graph
analysis and other areas of data mining. Essentially, it
computes eigenvalues and eigenvectors of graphs to infer
properties of graphs. spectral clustering [17, 22], trian-
gle counting [24]. Many real-world graphs are massive:
Facebook’s social network has billions of vertices and
today’s web graph is even much larger.
It is computationally expensive to compute all eigen-
values and eigenvectors of a large matrix. The computa-
tion complexity is O(n3) on a square matrix [19], where
n is the number of rows and columns of the matrix. When
the size of a matrix grows to millions or even billions of
rows and columns, it becomes prohibitive to compute all
eigenvalues and eigenvectors.
Numerous algorithms [14, 7, 21, 4] have been devel-
oped to compute a small number of eigenpairs. The cur-
rent popular eigensolver packages such as ARPACK [15]
and Anasazi [5] have state-of-art eigensolvers capable of
computing a few eigenvalues with certain properties such
as the eigenvalues of the largest or smallest magnitude.
All of these eigensolvers perform a sequence of sparse
matrix multiplication to construct and update a vector
subspace S ∈Rn×m, where n is the size of the eigenprob-
lem and m is the subspace size [4]. In addition, they
perform matrix operations on the vector subspace. When
computing eigenpairs of a graph at the billion scale, nei-
ther the sparse matrix that represents a graph nor the vec-
tor subspace fits in the RAM of a single machine.
It is challenging to implement an efficient kernel of
sparse matrix multiplication for many real-world graphs.
Sparse matrix multiplication on these graphs induces
many small random memory access due to near-random
vertex connection. It may suffer from load imbalancing
because of the power-law distribution in vertex degree.
Furthermore, graphs cannot be clustered or partitioned
effectively [16] to localize access.
Large-scale eigendecomposition is generally solved in
a large cluster [5, 9], where the aggregate memory is
sufficient to store the sparse matrix and the vector sub-
space. Sparse matrix multiplication on graphs in dis-
tributed memory leads to significant network communi-
cation and is usually bottlenecked by the network. As
such, this operation requires fast network to achieve per-
formance. However, a supercomputer or a large clus-
ter with fast network communication is not accessible to
many people.
We build FlashEigen, an external-memory eigen-
solver, on top of a user-space filesystem called SAFS
[29] to solve a large eigenproblem with SSDs in a sin-
gle machine. Instead of developing a new eigensolver
ar
X
iv
:1
60
2.
01
42
1v
3 
 [c
s.D
C]
  2
6 F
eb
 20
16
from scratch, we leverage the Anasazi framework and
implement SSD-based matrix operations for the frame-
work. FlashEigen is specifically optimized for the Block
Krylov-Schur [21] eigensolver because it is the fastest
and generates the least I/O among the Anasazi eigen-
solvers when computing eigenvalues of many power-law
graphs. The Anasazi eigensolvers implement a block ex-
tension, which enables the eigensolvers to update multi-
ple vectors in the subspace in each iteration to increase
computation density and amortize I/O overhead. As a re-
sult, the eigensolvers require sparse matrix dense matrix
multiplication (SpMM). Given a graph with hundreds of
millions of vertices or even billions of vertices, the vector
subspace constructed by the eigensolvers requires very
large storage size, usually much larger than the sparse
matrix. Therefore, FlashEigen stores the entire vector
subspace on SSDs.
Although SSDs can deliver high IOPS and sequen-
tial I/O throughput, we have to overcome many technical
challenges to construct an external-memory eigensolver
with performance comparable to in-memory eigen-
solvers. First, it is challenging to achieve the maximal
I/O throughput, on the order of ten gigabytes, from a
large array of commodity SSDs, due to many overheads
from the operating system. Even achieving the maxi-
mal I/O throughput, SSDs are still an order of magni-
tude slower than DRAM in throughput. Furthermore,
SSDs wear out after we write data to them. For exam-
ple, some enterprise SSDs [18] only allows one DWPD
(diskful writes per day). Writing too much data to these
SSDs drastically shortens their lives and increases oper-
ation cost.
We perform SpMM in a semi-external memory (SEM)
fashion, which keeps the sparse matrix on SSDs and
dense matrices in memory. This operation streams rows
in the sparse matrix to memory and multiplies with
the dense matrix in memory, which generates sequen-
tial I/O and allows us to yield maximal I/O throughput
from SSDs. While maximizing the I/O throughput, we
also compress the sparse matrix to further accelerate re-
trieving the sparse matrix from SSDs. The SEM strat-
egy incorporates well with in-memory optimizations for
SpMM. We deploy multiple in-memory optimizations
specifically designed for power-law graphs. For exam-
ple, we assign partitions of the sparse matrix dynami-
cally to threads for load balancing, deploy cache block-
ing to increase CPU cache hits, partition and evenly dis-
tribute the dense matrix to NUMA nodes to fully utilize
the memory bandwidth of a NUMA machine.
We deploy optimizations on dense matrix operations
to reduce I/O and fully utilize the I/O bandwidth of the
SSDs. Thanks to the block extension of the Anasazi
eigensolvers, we group multiple vectors together into a
column-major dense matrix and store each dense matrix
in a separate SAFS file for efficient I/O access to any
vectors in the subspace. When accessing a large number
of dense matrices in a single operation, we group dense
matrices to constrain memory consumption of the matrix
operation to improve its scalability. To reduce I/O and
alleviate SSD wear out, we use deploy lazy evaluation
and cache the most recent dense matrix in the subspace.
Our result shows that for many real-world sparse
graphs, the SSD-based eigensolver is able to achieve
40%-60% performance of its in-memory implementation
and has performance comparable to the Anasazi eigen-
solver on a machine with 48 CPU cores for computing
various numbers of eigenvalues. We further demonstrate
that the SSD-based eigensolver is capable of scaling to a
graph with 3.4 billion vertices and 129 billion edges. It
takes about 4 hours to compute eight eigenvalues of the
billion-node graph and use 120 GB memory. We con-
clude that our solution offers new design possibilities for
large-scale eigendecomposition, replacing memory with
larger and cheaper SSDs and processing bigger problems
on fewer machines.
2 Related Work
Anasazi [5] is an eigensolver framework in the Trilinos
project [10]. This framework implements block exten-
sion of multiple eigensolver algorithms such as Block
Krylov-Schur method [21], Block Davidson method [4]
and LOBPCG [4]. This is a very flexible framework
that allows users to redefine sparse matrix dense matrix
multiplication and dense matrix operations. By default,
Anasazi uses the matrix implementations in Trilinos that
run in the distributed memory.
Arpack [15] is another state-of-art eigensolver com-
monly used by multiple numeric computation frame-
works such as Matlab. This eigensolver implements the
implicitly restarted arnoldi method [20]. Arpack only al-
lows users to redefine sparse matrix vector multiplica-
tion. Its dense matrix operations by default run in serial.
Sparse matrix vector multiplication (SpMV) and
sparse matrix dense matrix multiplication (SpMM) are
an important operation in numeric computation and are
well studied in the literature. For example, Williams et.
al [26] described multiple optimizations for sparse ma-
trix vector multiplication in multicore architecture. Yoo
et. al [28] and Boman et. al [6] optimized SpMV for
large scale-free graphs using 2D graph partitioning. Ak-
tulga et. al [2] optimized sparse matrix dense matrix mul-
tiplication with cache blocking. In contrast, we further
advance sparse matrix dense matrix multiplication with
a focus on optimizations for external memory.
FlashGraph [30] is a general graph analysis frame-
work. It performs graph algorithms in a semi-external
memory fashion [1], i.e., it keeps vertex state in memory
2
FlashEigen
Anasazi eigensolvers
Sparse matrix Dense matrix
SAFS
SSD SSD SSD SSD SSD SSD
Figure 1: The architecture of FlashEigen.
and edge lists on SSDs. It is specifically optimized for
graph algorithms that has a fraction of vertices running
in each iteration. This design prevents FlashGraph from
performing some optimizations for sparse matrix multi-
plication as shown in this paper.
HEIGEN [12] is an eigensolver implemented with
MapReduce [8] to compute eigenpairs for spectral graph
analysis. HEIGEN implements a basic Lanczos algo-
rithm [14] with selective orthogonalization from scratch.
In contrast, our approach extends the state-of-art imple-
mentations to SSDs. By integrating many SSDs to a sin-
gle machine, our approach can compete with a cluster.
Zhou et al. [31] implemented an LOBPCG [4] eigen-
solver in an SSD cluster. Their implementation tar-
gets nuclear many-body Hamiltonian matrices, which are
much denser and have smaller dimensions than many
sparse graphs. Therefore, their solution stores the sparse
matrix on SSDs and keep the entire vector subspace in
RAM. Their solution also focus on optimizations in the
distributed environment. In contrast, we store both the
sparse matrix and the vector subspace on SSDs due to the
large number of vertices in our target graphs. We focus
on external-memory optimizations in a single machine.
3 Design
FlashEigen is an external-memory eigensolver frame-
work optimized for any fast I/O devices such as a large
SSD array to compute eigenvalues of sparse graphs (Fig-
ure 1). We build FlashEigen on top of SAFS, a user-
space filesystem, to fully utilize the I/O throughput of
a large SSD array. Instead of implementing eigen-
value algorithms from scratch, we integrate FlashEigen
with the Anasazi framework to compute eigenvalues.
The Anasazi framework implements multiple state-of-
art eigenvalue algorithms and provides users a flexi-
ble programming interface to redefine both sparse and
dense matrix operations required by the eigenvalue algo-
rithms. FlashEigen stores both sparse and dense matrices
in SAFS and focuses on optimizing the matrix operations
required by the Anasazi eigensolvers for SSDs.
3.1 Eigensolver algorithm
Algorithm 1 Pseudo code of a generic eigenvalue algo-
rithm that compute eigenvalues of a square matrix A with
n rows and columns.
1: for i = 0, 1, ..., until convergence do
2: (1) Update the subspace S ∈ Rn×m,
3: (2) Solve the projected eigenproblem ST ASy =
ST Syθ .
4: (3) Compute the residual: r = Kx− xθ , where
5: x = Sy (Ritz vector), θ = ρ(x) (Ritz value).
6: (4) Test the convergence of a Ritz pair (x,ρ(x)).
7: end for
The state-of-art eigenvalue algorithms compute eigen-
values with iterative methods. Algorithm 1 shows the
general steps used by the eigenvalue algorithms. Step (1)
constructs a vector subspace S ∈ Rn×m, where n is the
number of rows and columns of a sparse matrix and m
is the number of vectors in the subspace. When com-
puting eigenvalues of a sparse graph, two key operations
in this step are sparse matrix multiplication to construct
the subspace and reorthogonalization to correct floating-
point rounding errors. Step (2) projects the large sparse
matrix to a much smaller matrix with only m rows and m
columns, which can be solved by other eigensolvers such
as the one in LAPACK [3]. Step (3) projects the solu-
tion of the small eigenvalue problem back to the original
eigenvalue problem. Step (4) tests whether the projected
solution fulfills the precision requirement given by users.
If not, the algorithm adjusts the vector subspace, returns
to step (1) and continues the process.
The block extenion implemented in the Anasazi eigen-
solvers updates multiple vectors in the subspace in an it-
eration. This optimization leads to sparse matrix dense
matrix multiplication, which increases computation den-
sity to improve performance, as well as a set of dense
matrix operations instead of vector operations. The num-
ber of vectors to be updated together is determined by the
block size, denoted by b. The subspace size m is b×NB,
where NB is the number of blocks.
3.2 SAFS
SAFS [29] is a user-space filesystem for a high-speed
SSD array in a NUMA (non-uniform memory architec-
ture) machine. It is implemented as a library and runs
in the address space of its application. It is deployed on
top of the Linux native filesystem. SAFS was originally
designed for optimizing small I/O accesses. However,
sparse matrix multiplication and dense matrix operations
generate much fewer but much larger I/O. Therefore, we
provide additional optimizations to maximize sequential
I/O throughput from a large SSD array.
3
The latency of a thread context switch becomes notice-
able on a high-speed SSD array under a sequential I/O
workload and it becomes critical to avoid thread context
switch to gain I/O performance. If the computation in ap-
plication threads does not saturate CPU, SAFS will put
the application threads into sleep while they are waiting
for I/O. This results in many thread context switches and
underutilization of both CPU and SSDs. To saturate I/O,
an application thread issues asynchronous I/O and poll
for I/O to avoid thread context switches after completing
all computation available to it.
To better support access to many relatively small files
simultaneously, SAFS stripes data in a file across SSDs
with a different striping order for each file. This strat-
egy stores data from multiple files evenly across SSDs
and improves I/O utialization. Due to the sequential I/O
workload, FlashEigen stripes data across SSDs with a
large block size, on the order of megabytes, to increase
I/O throughput and potentially reduce write amplifica-
tion on SSDs [23]. Such a large block size may cause
storage skew for small files on a large SSD array if ev-
ery file stripes data in the same order. Using the same
striping order for all files may also cause skew in I/O ac-
cess. Therefore, SAFS generates a random striping order
for each file to evenly distribute I/O among SSDs when
a file is created. SAFS stores the striping order with the
file for future data retrieval.
3.3 Sparse matrix multiplication
Sparse matrix multiplication is one of the most computa-
tionally expensive operations in an eigensolver. Apply-
ing this operation on real-world graphs usually leads to
many random memory accesses and its performance is
usually limited by the random memory performance of
DRAM. The block extension in the Anasazi eigensolvers
enables sparse matrix dense matrix multiplication to in-
crease data locality and computation density and improve
overall performance of an eigensolver.
To scale sparse matrix multiplication to a sparse graph
with billions of vertices, we perform this operation in
semi-external memory (SEM). That is, we keep dense
matrices in memory and the sparse matrix on SSDs. This
strategy enables nearly in-memory performance while
achieving the scalability in proportion to the ratio of
edges to vertices in a graph.
3.3.1 The sparse matrix format
The state-of-art numeric libraries store a sparse matrix
in compressed row storage (CSR) or compressed col-
umn storage (CSC) format. However, these formats incur
many CPU cache misses in sparse matrix multiplication
on many real-world graphs due to their nearly random
Tile
Super tile
partition 0
partition 1
partition 2
Figure 2: The format of a sparse matrix in FlashEigen.
row id col idx
...tile header
16B
2B 2n1B
non-zero valuesrow header 1
row 
header 2 COO
Figure 3: The storage format of a tile in a sparse matrix.
vertex connection. They also require a relatively large
storage size. For a graph with billions of edges, we have
to use eight bytes to store the row and column indices.
For semi-external memory sparse matrix multiplication,
SSDs may become the bottleneck if a sparse matrix has
a large storage size. Therefore, we need to use an alter-
native format for sparse matrices to increase CPU cache
hits and reduce the amount of data read from SSDs.
To increase CPU cache hits, we deploy cache block-
ing [11] and store non-zero entries of a sparse matrix in
tiles (Figure 2). When a tile is small, the rows from the
input and output dense matrices involved in the multi-
plication with the tile are always kept in the CPU cache
during the multiplication. The optimal tile size should
fill the CPU cache with the rows of the dense matrices
involved in the multiplication with the tile and is affected
by the number of columns of the dense matrices, which
is chosen by users. Instead of generating a sparse ma-
trix with different tile sizes optimized for different num-
bers of columns in the dense matrices, we use a relatively
small tile size and rely on the runtime system to opti-
mize for different numbers of columns (in section 3.3.3).
In the semi-external memory, we expect that the dense
matrices do not have more than eight columns in sparse
matrix multiplication. Therefore, we use the tile size of
16K×16K by default to balance the matrix storage size
and the adaptibility to different numbers of columns.
To reduce the overall storage size of a sparse matrix,
we use a compact format to store non-zero entries in a
tile. In very sparse matrices such as many real-world
graphs, many rows in a tile do not have any non-zero en-
tries. The CSR (CSC) format requires an entry for each
row (column) in the row (column) index. Therefore, the
CSR or CSC format wastes space when storing elements
in a tile. Instead, we only keep data for rows with non-
4
zero entries in a tile shown in Figure 3 and refer to this
format as SCSR (Super Compressed Row Storage). This
format maintains a row header for each non-empty row.
A row header has an identifier to indicate the row num-
ber, followed by column indices. The most significant
bit of the identifier is always set to 1, while the most sig-
nificant bit of a column index entry is always set to 0.
As such, we can easily distinguish a row identifier from
a column index entry and determine the end of a row.
Thanks to the small size of a tile, we use two bytes to
further store a row number and a column index entry to
reduce the storage size. Since the most significant bit is
used to indicate the beginning of a row, this format al-
lows a maximum tile size of 32K×32K.
For many real-world graphs, many rows in a tile have
only one non-zero entry, thanks to the sparsity of the
graphs and nearly random vertex connection. Iterating
over single-entry rows requires to test the end of a row for
every non-zero entry, resulting in many extra conditional
jump CPU instructions in sparse matrix multiplication.
In contrast, the coordinate format (COO) is more suitable
for storing these single-entry rows. It does not increase
the storage size but significantly reduces the number of
conditional jump instructions when we iterate them. As
a result, we hybrid SCSR and COO to store non-zero en-
tries in a tile with COO stored behind the row headers of
SCSR. All non-zero entries are stored together at the end
of a tile.
We organize tiles in a sparse matrix in tile rows and
maintain a matrix index for them. Each entry of the in-
dex stores the location of a tile row on SSDs to facilitate
random access to tile rows. This is useful for paralleliz-
ing sparse matrix multiplication. Because a tile contains
thousands of rows, the matrix index requires a very small
storage size even for a billion-node graph. We keep the
entire index in memory during sparse matrix multiplica-
tion.
3.3.2 The dense matrix format for SpMM
Dense matrices in sparse matrix multiplication are tall-
and-skinny (TAS) matrices with millions or even billions
of rows but only several columns. The number of rows
in a dense matrix is determined by the number of ver-
tices in a sparse graph and the number of columns is
determined by the block size in an eigensolver with the
block extension. The dense matrix is kept in memory for
semi-external memory (SEM) sparse matrix dense ma-
trix multiplication (SpMM), so the size of the dense ma-
trix governs the memory consumption of SpMM. Given
the limited amount of RAM in a machine, the number of
columns in a dense matrix has to be small.
For a non-uniform memory architecture (NUMA), we
partition the input dense matrix horizontally and store
(b) on SSDs(a) in memory
Row
Interval 0
Row
Interval 1
Row
Interval 2
Row
Interval 3
Figure 4: The data layout of tall-and-skinny (TAS) dense
matrices. A TAS dense matrix is partitioned horizon-
tally into many row intervals. (a) For an in-memory ma-
trix, row intervals are stored across NUMA nodes and
elements are stored in row-major order; (b) for an SSD-
based matrix, elements inside a row interval are stored in
column-major order.
partitions evenly across NUMA nodes to fully utilize the
bandwidth of memory and inter-processor links in sparse
matrix multiplication. The NUMA architecture is preva-
lent in today’s multi-processor servers, where each pro-
cessor connects to its own memory banks. As shown in
Figure 4 (a), we assign multiple contiguous rows in a
row interval to a partition, which is assigned to a NUMA
node. A row interval always has 2i rows for efficiently
locating a row with bit operations. The row interval size
is also multiple of the tile size of a sparse matrix so that
multiplication on a tile only needs to access rows from a
single row interval. We store elements in a row interval
in row-major order to increase data locality in SpMM.
3.3.3 Execution of sparse matrix multiplication
We perform sparse matrix dense matrix multiplication in
semi-external memory and optimize it for different num-
bers of columns in the dense matrices. Thanks to the
semi-external memory execution, sparse matrix multi-
plication streams data in the sparse matrix from SSDs,
which maximizes I/O throughput of the SSDs.
We partition a sparse matrix horizontally for paral-
lelization and assign multiple contiguous tile rows to
the same partition (Figure 2). The number of tile rows
assigned to a partition is determined at runtime based
on the number of columns in the input dense matrix.
A thread reads a partition of the sparse matrix asyn-
chronously from SSDs. Once a partition is ready in
memory, the worker thread multiplies the partition with
the input dense matrix. A thread processes one tile at
a time and stores the intermediate result in a buffer al-
5
located in the local memory to reduce remote memory
access.
To better utilize CPU cache, we process tiles of a par-
tition in super tiles (Figure 2). The tile size of a sparse
matrix is specified when the sparse matrix image is cre-
ated and is relatively small to handle different numbers of
columns in the dense matrices. A super tile is composed
of tiles from multiple tile rows and its size is determined
at runtime by three factors: the number of columns in
the dense matrices, the CPU cache size and the number
of threads that share the CPU cache. An optimal size
for a super tile fills the CPU cache with the rows from
the dense matrices involved in the computation with the
super tile.
Load balancing also plays a key role in sparse matrix
multiplication on many real-world graphs due to their
power-law distribution in vertex degree. In FlashEigen,
a worker thread first processes partitions originally as-
signed to the thread. When a worker thread finishes all
of its own partitions, it steals partitions that have not been
processed from other worker threads.
In spite of nearly random edge connection in a real-
world graph, there exists regularity that allows vector-
ization to improve performance in sparse matrix dense
matrix multiplication. For each non-zero entry, we need
to multiply it with the corresponding row from the input
dense matrix and add the result to the corresponding row
in the output dense matrix. These operations can be ac-
complished by the vector CPU instructions such as AVX
[?]. The current implementation relies on GCC’s auto-
vectorization to translate the C code to the vector CPU
instructions by predefining the matrix width in the code.
When accessing a sparse matrix on SSDs, we keep
a set of memory buffers for I/O access to reduce the
overhead of memory allocation. For a large spare ma-
trix, each tile row is fairly large, on the order of tens
of megabytes. The operating system usually allocate a
memory buffer for such an I/O size with mmap() and
populates the buffer with physical pages when the buffer
is used. It is computationally expensive to populate
large memory buffers frequently. When accessing high-
throughput I/O devices, such overhead can cause sub-
stantial performance loss. Therefore, we keep a set of
memory buffers allocated previously and reuse them for
new I/O requests. Because tile rows in a sparse matrix
usually have differnt sizes, we resize a previously allo-
cated memory buffer if it is too small for a new I/O re-
quest.
3.4 The vector subspace
The vector subspace required by an eigensolver is mas-
sive when the eigensolver computes eigenvalues of a
billion-node graph or computes many eigenvalues of a
name operation
MvTimesMatAddMv CC← α×AA×B+β ×CC
MvTransMv A← α× t(AA)×BB
MvScale1 BB← α×AA
MvScale2 BB← AA×diag(vec)
MvAddMv CC← α×AA+β ×BB
MvDot vec[i]← t(AA[, i])∗BB[, i]
MvNorm vec← norm col(AA)
CloneView AA[, idxs]
SetBlock AA[, idxs]← BB
MvRandom AA← rand init
ConvLayout AA← conv layout(BB)
Table 1: The dense matrix operations required by the
Anasazi eigensolvers. AA and BB represents a tall dense
matrix, A and B represents a small dense matrix, α and
β represents scalar variables.
multi-million-node graph. The number of vectors in the
subspace increases with the number of required eigen-
values. Furthermore, a larger number of vectors in the
subspace potentially improves the convergence rate of an
eigensolver. The storage size required by the subspace is
often larger than the sparse matrix for eigendecomposi-
tion on many real-world graphs. Therefore, FlashEigen
stores these vectors on SSDs.
FlashEigen implements a set of dense matrix opera-
tions shown in Table 1. The Anasazi framework pro-
vides a set of programming interfaces that expose the
vectors in the subspace to users as dense matrices and
allow users to redefine the dense matrices and the op-
erations on them. The first ten operations are the ones
required by the Anasazi framework. The most computa-
tionally expensive operations are the two matrix multipli-
cation operations: MvTimesMatAddMv and MvTransMv,
mainly used for reorthogonalization to fix floating-point
rounding errors. The eigensolvers use CloneView and
SetBlock to access individual columns of a dense ma-
trix, so we store the dense matrices in column major by
default. However, the sparse matrix dense matrix multi-
plication described in Section 3.3 requires a row-major
dense matrix to increase data locality. Thus, FlashEigen
adds another operation ConvLayout to convert data lay-
out in dense matrices, which converts a column-major
matrix to a row-major matrix when it is passed to the
SpMM operation.
External-memory dense matrix operations faces mul-
tiple challenges. Unlike sparse matrix multiplication,
these dense matrix operations are less memory intensive.
Even though SSDs are fast, their sequential I/O perfor-
mance is still an order of magnitude slower than RAM.
Furthermore, SSDs wears out after a certain amount of
write. Even enterprise SSDs [18] only allows a small
6
number of DWPD (diskful writes per day). Therefore,
FlashEigen optimizes dense matrix operations with three
goals: (i) maximizing I/O throughput of SSDs, (ii) mini-
mizing the amount of data read from SSDs, (iii) reducing
SSD wearout.
3.4.1 The storage format of the vector subspace
FlashEigen stores multiple vectors in a dense matrix
physically because Anasazi eigensolvers update multi-
ple vectors of the subspace in an iteration thanks to the
block extension. The number of vectors in a matrix
is determined by the block size of a block eigensolver.
As such, the subspace is composed of multiple tall-and-
skinny (TAS) dense matrices.
FlashEigen stores each TAS matrix in a separate SAFS
file to leverage the optimizations in SAFS. This guaran-
tees that I/O accesses to the dense matrices are evenly
distributed to all SSDs by SAFS, regardless of the num-
ber of SSDs and the subspace size. It also eases matrix
creation and deletion by simply creating and deleting an
SAFS file.
FlashEigen partitions a TAS matrix horizontally to as-
sist in parallelization and external-memory access. Fig-
ure 4 (b) illustrates the format of an external-memory
TAS matrix. Like a NUMA dense matrix in Figure 4
(a), an external-memory matrix is divided into multiple
row intervals and data in a row interval is stored con-
tiguously to generate large I/O requests, on the order of
megabytes. The size of a row interval is chosen accord-
ing to the number of columns in the matrix. Unlike a
NUMA dense matrix, elements in a row interval of an
external-memory matrix are stored in column-major or-
der for easily accessing individual columns.
3.4.2 Parallelize matrix operations
All large dense matrices in FlashEigen are tall and skinny
so we partition them horizontally for parallelization. All
matrix operations in Table 1 allow a worker thread to
process partitions of a matrix independently.
All of the matrix operations in Table 1 that outputs
TAS matrices are embarrassingly parallelizable on the
TAS matrices. In these operations, a row interval in
the output matrix only depends on the same row interval
from all of the TAS input matrices. Thus, the computa-
tion and data access to the TAS matrices are completely
independent between row intervals. To parallelize these
matrix operations, we assign one row interval at a time
to a worker thread. When a worker thread gets a row
interval, it owns the row interval and is responsible for
accessing the data in the row interval from all of the TAS
matrices and performing computation on them. When
processing a row interval, a worker thread does not need
to access data from another row interval on SSDs.
Some of the operations do not output TAS matrices
and their outputs depend on all row intervals of the in-
put matrices. For example, MvTransMv outputs a small
matrix. These operations can usually be split into two
sub-operations: the first one performs computation on
each row interval independently and outputs a small vec-
tor or a small matrix; the second one aggregates all of
the small vectors or matrices and outputs a single small
vector or matrix. The first sub-operation accounts for
most of computation in such an operation and requires
external-memory data access, so we parallelize it in the
same fashion as the operations that output TAS matrices.
3.4.3 External memory access
I/O access for dense matrix operations is relatively sim-
ple. All matrix partitions have the same size and each op-
eration requires to access all data in a TAS matrix, which
results in sequential I/O access. For most of the matrix
operations in Table 1, a worker thread reads data in a row
interval asynchronously and perform computation when
all data in the row interval is ready in memory. In this
section, we describe some optimizations that aim at re-
ducing memory consumption and achieving maximal I/O
throughput from SSDs.
Keeping data in a row interval from all of the input
TAS matrices potentially consumes a significant amount
of memory if a matrix operation involves in many TAS
matrices. Operations such as MvTimesMatAddMv and
MvTransMv frequently take as input many vectors in the
subspace, which is stored in multiple TAS dense matrices
physically. The actual number of TAS matrices involved
in the two operations varies between iterations and can
grow to as large as several hundred when an eigensolver
computes hundreds of eigenvalues. The storage size of a
row interval is usually configured on the order of tens of
megabytes to achieve I/O throughput from SSDs. When
there are hundreds of dense matrices in the subspace, we
may keep a substantial amount of data in memory. Read-
ing only part of a partition results in many small reads
and writes to SSDs because the matrices are organized in
column-major order.
Instead, we break a large group of TAS matrices into
multiple small groups to reduce memory consumption
for these matrix operations. Figure 5 illustrates this op-
timization on MvTimesMatAddMv and MvTransMv. For
MvTimesMatAddMv, we split the small dense matrix hor-
izontally and each group of TAS matrices gets a parti-
tion of the small dense matrix. Each group generates an
intermediate TAS matrix conceptually and we apply an
addition operation on all of the intermediate matrices to
generate the final result. As such, we perform computa-
tion on each group separately and only need to keep the
7
AA BAA BB CC
D
D
AA BB CC
D
D
B
DD
CC
BB
AA
BB
AA
DD
CC
cat
EE FF GG
A
B
(a) op1
(b) op3
EE
EE
EE
Figure 5: Break a large group of dense matrices in an
operation into multiple small groups to reduce memory
consumption. XX indicates a TAS matrix stored on SSDs
and X indicates a small matrix stored in memory.
data in a row interval from all of the TAS matrices in a
group. Therefore, memory consumption is determined
by the group size instead of the number of TAS matrices
involved in the operation. Materializing these intermedi-
ate matrices would result in large memory consumption
if we store them in memory, or a large amount of I/O
if we store them on SSDs. Instead, we only materialize
part of the intermediate matrices and passes the materi-
alized parts to the addition operation to generate the final
result. We apply a similar strategy to MvTransMv, which
requires each group to share the same TAS matrix in the
right operand. Each group generates a small matrix that
is small enough to be kept in memory. To minimize I/O,
we share I/O for accessing the TAS matrix in the right
operand.
Like sparse matrix multiplication, we maintain a per-
thread memory buffer pool for I/O access to reduce the
overhead of memory allocation when accessing dense
matrices on SSDs. To increase I/O throughput, we ac-
cess data in dense matrices on SSDs with large I/O re-
quests. Allocating a large memory buffer for each I/O
request causes the operating system to populate memory
with physical pages when the buffer is used and this is
a computationally expensive operation. Therefore, we
maintain a pool of memory buffers with physical pages
populated in advance and associate each I/O request with
a buffer from the set. We maintain such a pool for each
worker thread to minimize locking overhead.
3.4.4 Matrix caching
FlashEigen deploys two forms of matrix caching to re-
duce I/O to SSDs. In the first case, FlashEigen worker
threads cache part of a TAS matrix locally. In the other
case, FlashEigen can also cache the most recent TAS ma-
trix in the vector subspace.
Caching part of a TAS matrix read from SSDs benefits
some of the matrix operations. One example is the op-
timization for op3 in section 3.4.3, which breaks a large
group of dense matrices into multiple subgroups and the
matrix in the right operand is shared by all of the sub-
groups. A worker thread only needs to cache data in a
row interval of the right matrix that is currently being
processed because a thread processes one row interval at
a time and each row interval of the TAS matrices is pro-
cessed only once. Therefore, a worker thread can buffers
the data in a row interval of a matrix locally and access-
ing the buffered data doesn’t incur any locking overhead.
When we buffer the recent portions, we need to give
each matrix a data identifier to identify its data, so we can
recognize which portion of data can be reused. For cer-
tain operations, even though a new matrix is created, the
data inside remains the same. A typical example is trans-
pose. The identifier we give to each matrix should iden-
tify the data inside a matrix instead of individual matri-
ces, so a transposed matrix and its original matrix should
share the same identifier.
FlashEigen also caches the most recent TAS matrix in
the vector subspace if the RAM of a machine is sufficient
to accommodate the entire matrix. When a new matrix
in the vector subspace is generated by sparse matrix mul-
tiplication, an eigensolver needs to perform a sequence
of operations on it, which includes reorgonalization. By
caching the matrix in memory, we can significantly re-
duce the amount of data written to SSDs.
4 Experimental Evaluation
We evaluate the performance of the SSD-based
FlashEigen on multiple real-world billion-scale graphs
including a web-page graph with 3.4 billion vertices.
We first evaluate the performance of the two most com-
putationally intensive computation in the eigensolvers:
sparse matrix dense matrix multiplication and dense ma-
trix multiplication. We demonstrate the effectiveness of
the optimizations on the two operations and compare
the performance of our external-memory implementa-
tion with multiple in-memory implementations: (i) our
in-memory implementations, (ii) MKL and (iii) Trilinos.
We then evaluate the overall performance of FlashEigen
and compare with the original Anasazi eigensolvers. We
further demonstrate the scalability of FlashEigen on a
web graph of 3.4 billion vertices and 129 billion edges.
8
Graph datasets # Vertices # Edges Directed
Twitter [13] 42M 1.5B Yes
Friendster [27] 65M 1.7B No
KNN distance graph [] 62M 12B No
Page [25] 3.4B 129B Yes
Table 2: Graph data sets.
We conduct all experiments on a non-uniform memory
architecture machine with four Intel Xeon E7-4860 pro-
cessors, clocked at 2.6 GHz, and 1TB memory of DDR3-
1600. Each processor has 12 cores. The machine has
three LSI SAS 9300-8e host bus adapters (HBA) con-
nected to a SuperMicro storage chassis, in which 24 OCZ
Intrepid 3000 SSDs are installed. The 24 SSDs together
are capable of delivering 12 GB/s for read and 10 GB/s
for write at maximum. The machine runs Linux kernel
v3.13.0. We use 48 threads for most of experiments by
default.
We use the real-world graphs in Table 2 for evalua-
tion. The largest graph is the page graph with 3.4 billion
vertices and 129 billion edges. The smallest graph we
use has 42 million vertices and 1.5 billion edges. Twitter
and Friendster are social network graphs. The KNN dis-
tance graph is a symmetrized 100-nearest neighbor ad-
jacency graph with cosine distance as the edge weight.
The distance graph is generated over all frames of the
Babel Tagalog corpus, commonly used in speech recog-
nition. Majority of the vertices in this graph has degree
between 100 to 1000, so this graph does not follow the
power-law distribution in vertex degree. The page graph
is clustered by domain, generating good CPU cache hit
rates in sparse matrix dense matrix multiplication.
For performance evaluation, we implement an in-
memory implementation of both sparse matrix multipli-
cation and dense matrix multiplication. Therefore, our
FlashEigen eigensolver is able to run both in memory
and on SSDs. In the following sections, we denote the
eigensolver in memory by FE-IM and the eigensolver on
SSDs with FM-SEM.
4.1 Sparse matrix multiplication
This section evaluates the performance of the semi-
external-memory (SEM) implementation of SpMM in
FlashEigen. We first evaluate the effectiveness of the
optimizations on the SpMM. Then we compare its per-
formance with that of the Intel MKL and Trilinos imple-
mentations.
4.1.1 Optimizations on SpMM
We first deploy a set of memory optimizations to imple-
ment a fast in-memory sparse matrix multiplication and
then deploy a set of I/O optimizations to further speed
up this operation in semi-external memory. We apply the
memory and I/O optimizations incrementally to illustrate
their effectiveness.
For memory optimizations, we start with an imple-
mentation that performs sparse matrix multiplication on
a sparse matrix in the CSR format and apply the opti-
mizations incrementally in the following order:
• partition dense matrices for NUMA (NUMA),
• partition the sparse matrix in both dimensions into
tiles of 16K×16K (Cache blocking),
• organize multiple physical tiles into a super tile to
fill CPU cache (Super tile),
• use CPU vectorization instructions (Vec),
• allocate a local buffer to store the intermediate re-
sult of multiplication of tiles and the input dense
matrix(Local write),
• combine the SCSR and COO format to reduce
the number of conditional jump CPU instructions
(SCSR+COO),
Figure 6 shows that almost all of these optimizations
have positive effect on sparse matrix multiplication and
all optimizations together speed up the operation by
2− 4 times. The degree of effectiveness varies signifi-
cantly between different graphs and different numbers of
columns in the dense matrices. For example, the NUMA
optimization is more effective when the dense matrices
have more columns because more columns in the dense
matrices require more memory bandwidth. Cache block-
ing is very effective when the dense matrices have fewer
columns because it can effectively increase CPU cache
hits. When there are more columns in the dense matri-
ces, data locality improves and cache blocking becomes
less effective. When there are too many columns, the
rows from the input and output matrices can no longer
be in the CPU cache. Thus, it even has a little nega-
tive effect on the Friendster graph when the dense ma-
trices have 16 columns. However, we never use dense
matrices with more than four columns in sparse matrix
multiplication in the KrylovSchur eigensolver. With all
optimizations, we have a fast in-memory sparse matrix
dense matrix multiplication, denoted by FE-IM SpMM.
4.1.2 SpMM performance
We then evaluate the performance of the SEM SpMM
in FlashEigen and compare its performance with FE-IM
SpMM, the MKL implementation and the Trilinos im-
plementation (Figure 7). We only show the performance
of SpMM on the Friendster graph and the performance
on the other graphs is similar. The MKL and Trilinos
SpMM cannot run on the page graph.
Our SEM SpMM has performance comparable to FE-
IM SpMM, while both FE-IM SpMM and FE-SEM
9
0
2
4
6
8
10
12
14
16
18
F-1col F-4cols F-16colsT-1col T-4cols T-16cols
se
co
nd
s
Base
NUMA
Cache blocking
Super tile
Vec
Local write
SCSR+COO
Figure 6: The effectiveness of the SpMM optimizations
on the Friendster graph (F) and the Twitter graph (T) for
different numbers of columns in the dense matrices.
0
5
10
15
20
25
30
35
1 2 4 8 16
se
co
nd
s
The number of columns in the dense matrix
FE-IM
FE-SEM
MKL
Trilinos
Figure 7: The runtime of in-memory SpMM (FE-IM)
and SEM-SpMM (FE-SEM) in the FlashEigen, the MKL
and the Trilinos implementation on the Friendster graph.
SpMM outperforms the MKL and Trilinos implementa-
tions. On the Friendster graph, FE-SEM SpMM achieves
60% performance of FE-IM SpMM when the dense ma-
trix has only one column and the performance gap nar-
rows as the number of columns in the dense matrices in-
creases. The Trilinos SpMM is optimized for sparse ma-
trix vector multiplication (SpMV). But even for SpMV,
our IM-SpMM outperforms Trilinos by 36%. The MKL
SpMM performs better when the dense matrices have
more columns, but our implementations can still outper-
form MKL by 2−3 times in most settings.
4.2 Dense matrix multiplication
This section evaluates the performance of dense matrix
multiplication, the other computationally intensive op-
eration in an eigensolver. For dense matrix multiplica-
tion, we focus on I/O optimizations, so we evaluate the
effectiveness of the I/O optimizations on this operation.
Then we compare the performance of our in-memory and
external-memory implementations with the ones in Intel
MKL and Trilinos.
0
0.2
0.4
0.6
0.8
1
rmat-100M-40
rmat-100M-160
W Friendster
Page graph
Trilinos
SEM
(a) SpMV
0
0.2
0.4
0.6
0.8
1
rmat-100M-40
rmat-100M-160
W Friendster
Page graph
Trilinos
SEM
(b) SpMM
Figure 8: The performance of Trilinos and FlashEigen-
SEM sparse matrix multiplication relative to FlashEigen-
IM sparse matrix multiplication. In sparse matrix dense
multiplication (SpMM), there are four columns in the
dense matrix.
In eigendecomposition, there are two forms of dense
matrix multiplication. The first form, shown by op1
in Table 1, multiplies a tall-and-skinny dense matrix of
n×m with a small dense matrix of m× b and outputs
a tall-and-skinny dense matrix of n× b, where n is the
size of the eigenvalue problem, m is the number of exist-
ing vectors in the subspace and b is the block size of the
eigensolver. The second form, shown by op3, multiplies
a transpose of a tall-and-skinny dense matrix of n×m
with another tall-and-skinny dense matrix of n× b and
outputs a small matrix. In both forms, m varies from one
block size to the maximal subspace size specified by a
user and is increased by one block size in each iteration.
In the experiments, we set n to 60M, roughly the size of
eigenvalue problems in Table 2, b to 4 and vary m from 4
to 512. This setting of b and m is used in the experiments
of our external-memory KrylovSchur eigensolver in the
next section.
4.2.1 Optimizations on dense matrix multiplication
We illustrate the most effective I/O optimizations on
dense matrix multiplication. We start with a basic imple-
mentation of matrix multiplication in SAFS and deploy
the optimizations on SAFS and matrix multiplication in-
crementally in the following order:
• use different random striping orders for each file
(diff strip),
• use a per-thread buffer pool to allocate memory for
I/O (buf pool),
• use one I/O thread per NUMA node (1IOT),
• use I/O polling in worker threads (polling),
• use 8MB for the maximal block size in the kernel
(max block),
The effectiveness of the optimizations on external
memory dense matrix multiplication is shown in Figure
10
0
5
10
15
20
25
30
35
40
45
64 256
se
co
nd
s
The number of columns in the left matrix
Base
diff stripe
buf pool
1IOT
polling
max block
Figure 9: The effectiveness of I/O optimizations on
dense matrix multiplication.
0.015625
0.0625
0.25
1
4
16
64
4 8 16 32 64 128 256 512
se
co
nd
s
The number of columns
FE-IM
FE-SEM
MKL
Trilinos
Figure 10: The runtime of dense matrix multiplication
in op1. We compare in-memory (FE-IM) and external-
memory (FE-EM) implementations in FlashEigen with
the MKL and Trilinos implementations.
9. We only demonstrate their effectiveness on the second
form of dense matrix multiplication and their effective-
ness in the first form is very similar. As shown in Figure
9, using a per-thread buffer pool and a smaller number
of I/O threads has the most significant performance im-
provement. By combining all of the optimizations to-
gether, we increase the performance by a factor of up to
four. This also indicates the importance of reducing the
overhead of thread context switches in high-throughput
I/O access.
4.2.2 Dense matrix multiplication performance
Figure 10 show the performance of the in-memory and
external-memory dense matrix multiplication in the first
form. We omit the performance result of dense matrix
multiplication in the second form because it is very sim-
ilar to the first form and MKL cannot parallelize it. The
external-memory multiplication in FlashEigen is roughly
3-6 times slower than its in-memory counterpart. In
the dense matrix multiplication of the first form, the in-
memory implementation outperforms MKL and Trilinos
0
2
4
6
8
10
12
4 8 16 32 64 128 256 512
G
B
/s
The number of columns
op1 op3
Figure 11: The average I/O throughput in dense matrix
multiplication on the SSD array.
when the number of the columns get larger and MKL
does not have a parallel implementation for the dense
matrix multiplication of the second form.
The external-memory dense matrix multiplication has
almost saturated the I/O bandwidth of the SSDs (Fig-
ure 11). The average I/O throughput reaches 10.87GB/s
from the entire SSD array or 464MB/s per SSD,
while the peak throughput of an SSD is approaching
500MB/s, which is close to the maximal I/O through-
put of 540MB/s per SSD, advertised by the SSD vendor.
This also indicates that the SSDs are the bottleneck for
the dense matrix multiplication in the eigensolver. This
is not surprising because the sequential I/O performance
of SSDs is an order of magnitude smaller than RAM.
4.3 The KrylovSchur eigensolver in
FlashEigen
In this section, we focus on evaluating the perfor-
mance of the KrylovSchur eigensolvers in FlashEigen.
KrylovSchur is not only the fastest in-memory eigen-
solver on the graphs in Table 2 among all Anasazi eigen-
solvers, but also generates the least I/O, especially writes,
to SSDs. Reducing I/O is essential to achieve per-
formance and reduce SSD wearout. We evaluate our
external-memory eigensolver and compare it with its in-
memory counterpart as well as the original KrylovSchur
eigensolver in the Anasazi framework.
The KrylovSchur eigensolver has two important pa-
rameters: the subspace size and the block size. They
significantly affect the convergence of the KrylovSchur
eigensolver. A reasonably large subspace size and block
size accelerates its convergence. However, a larger sub-
space increases memory consumption of the eigensolver
as well as computation and I/O complexity in a single
matrix operation. A larger block size increases memory
consumption of FlashEigen.
For the experiments below, we select the values for
11
00.2
0.4
0.6
0.8
1
rmat-100M-40
rmat-100M-160
Friendster
W
Trilinos
EM
(a) 8 eigenvalues.
0
0.2
0.4
0.6
0.8
1
rmat-100M-40
rmat-100M-160
Friendster
W
Trilinos
EM
(b) 32 eigenvalues.
Figure 12: The performance of the Trilinos KrylovSchur
and FlashEigen-EM KrylovSchur relative to the
FlashEigen-IM KrylovSchur.
the two parameters that achieve the best runtime perfor-
mance for the eigensolvers. We assume that our setting is
not constrained by memory size available to the machine.
Because sparse matrix in Trilinos is not optimized for the
dense matrix with more than one column, we use 1 as the
block size and 2× ev as the number of blocks for most
of the graphs in the original KrylovSchur eigensolver,
where ev is the number of the eigenvalues to compute.
For the FlashEigen KrylovSchur eigensolver, we choose
the subspace size and the block size to reduce the amount
of I/O to SSDs. we use 1 as the block size and 2× ev
as the number of blocks when computing a small num-
ber of eigenvalues and use 4 as the block size and ev as
the number of blocks. Therefore, the FlashEigen eigen-
solver uses the subspace twice as large as the one used
by the original KrylovSchur eigensolver when comput-
ing a large number of eigenvalues. However, the eigen-
values of the graph W are very close to each other, so
we have to use a much larger subspace to have the eigen-
solver to converge or converge faster. For the W graph,
we use the subspace four times larger than the ones used
for other graphs. Therefore, we can trade off the runtime
performance with memory consumption in eigendecom-
position.
4.3.1 Performance of eigensolvers
We evaluate the performance of our SEM eigensolver
and compare its performance with our in-memory eigen-
solver and the original KrylovSchur eigensolver. We
compute different numbers of eigenvalues on the three
smaller graphs in Table 2. Only our SEM eigensolver
is able to compute eigenvalues on the page graph on the
1TB-memory machine.
Our SEM eigensolver achieves at least 40% per-
formance of our in-memory eigensolver, while the
in-memory eigensolver outperforms the original
KrylovSchur eigensolver (Figure 12). The SEM eigen-
solver is more efficient to compute a small number
#eigenvalues runtime memory read write
8 4.2 hours 120GB 145TB 4TB
Table 3: The performance and resource consumption of
computing eigenvalues on the page graph.
of eigenvalues and is able to achieve around 50%
performance of our in-memory eigensolver. SpMM and
reorthogonalization account for most of computation
time when computing a small number of eigenvalues, but
reorthogonalization eventually dominates the eigende-
composition for computing many eigenvalues. Because
external-memory dense matrix multiplication is several
times slower than the in-memory implementations,
reorthogonalization accounts for over 90% of runtime
in the SEM eigensolver for computing a large number
of eigenvalues. For many spectral analysis tasks, users
only require a small number of eigenvalues.
The SEM eigensolver uses a small fraction of memory
used by its in-memory counterparts and the original Trili-
nos eigensolver and its memory consumption remains
roughly the same as the number of eigenvalues computed
by the eigensolvers increases. A small memory con-
sumption provides two benefits. First, FlashEigen is able
to scale to a much larger eigenvalue problem. The sec-
ond benefit is that FlashEigen gives users more freedom
to choose the subspace size that gives the fastest conver-
gence in a given eigenvalue problem because SSDs sig-
nificantly increases memory available to the eigensolver.
We have to point out that our experiments assume that
we choose the subspace size that is not constrained by
the memory size in a machine.
4.3.2 Scale to billion-node graphs
We evaluate the scalability of FlashEigen with the page
graph with 3.4 billion vertices and 129 billion edges. Be-
cause the page graph is a directed graph, its adjacency
matrix is asymmetric and we perform singular value de-
composition (SVD) on the adjacency matrix instead of
simple eigendecomposition. For the page graph, we use
2 for the block size and 2× ev for the number of blocks
because sparse matrix multiplication is completely bot-
tlenecked by SSDs. Neither the in-memory eigensolver
nor the original Trilinos eigensolver is able to compute
eigenvalues on the page graph with 1TB RAM.
FlashEigen computes a fairly large number of eigen-
values within a reasonable amount of time and consumes
a fairly small amount of resources given the large size
of the eigenvalue problem. The average I/O throughput
during the computation of 8 eigenvalues is about 10GB/s,
which is very close to the maximal I/O throughput pro-
vided by the SSD array. Given the relative small mem-
12
ory footprint, we are able to scale FlashEigen to a much
larger eigenvalue problem on our 1TB-RAM machine.
5 Conclusions
We present an external-memory framework called
FlashEigen using a large array of commodity SSDs
to solve eigenvalue problems at the billion scale.
FlashEigen utilizes the Anasazi framework to get state-
of-art eigensolver implementations so that we can fo-
cus on optimizations on sparse matrix multiplication and
dense matrix multiplication on SSDs.
We implement a sparse matrix dense matrix multipli-
cation in the semi-external memory fashion. That is, we
keep the sparse matrix on SSDs and the dense matrices
in memory. We deploy a set of memory and I/O opti-
mizations so that the sparse matrix dense matrix multi-
plication has performance comparable to its in-memory
counterparts while significantly outperforming the MKL
and Trilinos implementation. The semi-external sparse
matrix multiplication is able to saturate either CPU or
SSDs or both, which suggests that we have achieved the
maximal performance from the existing hardware.
We further implement and optimize external-memory
matrix multiplication for FlashEigen. When comput-
ing eigenvalues on many real-world graphs, the storage
size required by the subspace in an eigensolver is mas-
sive. Therefore, we keep the subspace on the SSD array
and deploy a set of I/O optimizations on dense matrix
multiplication. With the I/O optimizations, we saturate
the SSDs in dense matrix multiplication and achieve the
maximal performance from the existing hardware. How-
ever, the SSD array is an order of magnitude slower than
RAM, so external-memory dense matrix multiplication
is about three or four times slower than the state-of-art
in-memory dense matrix multiplication.
We implement our external-memory eigensolver with
the semi-external memory sparse matrix multiplication
and the external-memory dense matrix multiplication.
Our experiments show that our external-memory eigen-
solver achieves at least 30% of the performance of its in-
memory counterparts. For a small number of eigenval-
ues, which is the most common case for spectral analysis,
our external-memory eigensolver has less than 50% per-
formance loss. We further show that our external- mem-
ory eigensolver is able to scale a graph with 3.4 billion
vertices and 129 billion edges. It finishes eigendecompo-
sition within a reasonable amount of time and consumes
a fairly small amount of resources. This suggests that
our external- memory eigensolver is able to scale to even
a larger eigenvalue problem in our 1TB-RAM machine.
We are able to scale to much larger graphs. Given
the same amount of resources, SSDs significantly ex-
tends the memory capacity and give users the freedom
of choosing the optimal subspace size for better con-
vergence. Our solution also works better for relatively
denser graphs. Our solution works better for computing
a small number of eigenvalues.
6 Acknowledgments
This work is partially supported by NSF ACI-1261715,
DARPA GRAPHS N66001-14-1-4028 and DARPA
SIMPLEX program through SPAWAR contract N66001-
15-C-4041.
References
[1] ABELLO, J., BUCHSBAUM, A. L., AND WESTBROOK, J. R. A
functional approach to external graph algorithms. In Algorith-
mica (1998), Springer-Verlag.
[2] AKTULGA, H. M., BULUC¸, A., WILLIAMS, S., AND YANG,
C. Optimizing sparse matrix-multiple vectors multiplication for
nuclear configuration interaction calculations. In Proceedings of
the 2014 IEEE 28th International Parallel and Distributed Pro-
cessing Symposium (2014).
[3] ANDERSON, E., BAI, Z., BISCHOF, C., BLACKFORD, S.,
DEMMEL, J., DONGARRA, J., DU CROZ, J., GREENBAUM, A.,
HAMMARLING, S., MCKENNEY, A., AND SORENSEN, D. LA-
PACK Users’ Guide, third ed. Society for Industrial and Applied
Mathematics, Philadelphia, PA, 1999.
[4] ARBENZ, P., HETMANIUK, U. L., LEHOUCQ, R. B., AND TU-
MINARO, R. S. A comparison of eigensolvers for large-scale
3D modal analysis using AMG-preconditioned iterative meth-
ods. International Journal for Numerical Methods in Engineering
(2005).
[5] ARBENZ, P., HETMANIUK, U. L., LEHOUCQ, R. B., AND TU-
MINARO, R. S. A comparison of eigensolvers for large-scale
3D modal analysis using AMG-preconditioned iterative meth-
ods. International Journal for Numerical Methods in Engineering
(2005).
[6] BOMAN, E. G., DEVINE, K. D., AND RAJAMANICKAM, S.
Scalable matrix computations on large scale-free graphs using 2D
graph partitioning. In Proceedings of the International Confer-
ence on High Performance Computing, Networking, Storage and
Analysis (2013).
[7] CALVETTI, D., REICHEL, L., AND SORENSEN, D. C. An im-
plicitly restarted lanczos method for large symmetric. . . ETNA 2
(1994), 1–21.
[8] DEAN, J., AND GHEMAWAT, S. MapReduce: Simplified data
processing on large clusters. In Proceedings of the 6th Confer-
ence on Symposium on Opearting Systems Design & Implemen-
tation - Volume 6 (2004).
[9] HERNANDEZ, V., ROMAN, J. E., AND VIDAL, V. SLEPc: A
scalable and flexible toolkit for the solution of eigenvalue prob-
lems. ACM Trans. Math. Softw. (2005).
[10] HEROUX, M. A., BARTLETT, R. A., HOWLE, V. E., HOEK-
STRA, R. J., HU, J. J., KOLDA, T. G., LEHOUCQ, R. B.,
LONG, K. R., PAWLOWSKI, R. P., PHIPPS, E. T., SALINGER,
A. G., THORNQUIST, H. K., TUMINARO, R. S., WILLEN-
BRING, J. M., WILLIAMS, A., AND STANLEY, K. S. An
overview of the Trilinos project. ACM Trans. Math. Softw. (2005).
[11] IM, E.-J., YELICK, K., AND VUDUC, R. Sparsity: Optimiza-
tion framework for sparse matrix kernels. Int. J. High Perform.
Comput. Appl. (2004).
13
[12] KANG, U., MEEDER, B., AND FALOUTSOS, C. Spectral analy-
sis for billion-scale graphs: Discoveries and implementation. In
Proceedings of the 15th Pacific-Asia Conference on Advances in
Knowledge Discovery and Data Mining - Volume Part II (2011).
[13] KWAK, H., LEE, C., PARK, H., AND MOON, S. What is twitter,
a social network or a news media? In Proceedings of the 19th
International Conference on World Wide Web (2010).
[14] LANCZOS, C. An iteration method for the solution of the eigen-
value problem of linear differential and integral operators. Jour-
nal of Research of the National Bureau of Standards (1950).
[15] LEHOUCQ, R., SORENSEN, D., AND YANG, C. ARPACK Users’
Guide. Society for Industrial and Applied Mathematics, 1998.
[16] LESKOVEC, J., LANG, K. J., DASGUPTA, A., AND MAHONEY,
M. W. Community structure in large networks: Natural clus-
ter sizes and the absence of large well-defined clusters. Internet
Mathematics 6, 1 (2009), 29–123.
[17] NG, A. Y., JORDAN, M. I., AND WEISS, Y. On spectral cluster-
ing: Analysis and an algorithm. In Advances in Neural Informa-
tion Processing Systems 14. 2002.
[18] OCZ Intrepid 3000. http://ocz.com/enterprise/
intrepid-3000-sata-ssd, Accessed 11/5/2015.
[19] PAN, V. Y., AND CHEN, Z. Q. The complexity of the matrix
eigenproblem. In Proceedings of the Thirty-first Annual ACM
Symposium on Theory of Computing (1999).
[20] SORENSEN, D. C. Implicitly restarted ARNOLDI/LANCZOS
methods for large scale eigenvalue calculations. Tech. rep., 1996.
[21] STEWART, G. W. A KrylovSchur algorithm for large eigenprob-
lems. SIAM Journal on Matrix Analysis and Applications (2002).
[22] SUSSMAN, D. L., TANG, M., FISHKIND, D. E., AND PRIEBE,
C. E. A consistent adjacency spectral embedding for stochastic
blockmodel graphs. Journal of the American Statistical Associa-
tion (2012).
[23] TANG, L., HUANG, Q., LLOYD, W., KUMAR, S., AND LI, K.
RIPQ: Advanced photo caching on flash for Facebook. In 13th
USENIX Conference on File and Storage Technologies (FAST 15)
(Feb. 2015).
[24] TSOURAKAKIS, C. Fast counting of triangles in large real net-
works without counting: Algorithms and laws. In Data Min-
ing, 2008. ICDM ’08. Eighth IEEE International Conference on
(2008).
[25] Web graph. http://webdatacommons.org/
hyperlinkgraph/, Accessed 4/18/2014.
[26] WILLIAMS, S., OLIKER, L., VUDUC, R., SHALF, J., YELICK,
K., AND DEMMEL, J. Optimization of sparse matrix-vector mul-
tiplication on emerging multicore platforms. In Proceedings of
the 2007 ACM/IEEE Conference on Supercomputing (2007).
[27] YANG, J., AND LESKOVEC, J. Defining and evaluating network
communities based on ground-truth. In Proceedings of the ACM
SIGKDD Workshop on Mining Data Semantics (2012).
[28] YOO, A., BAKER, A. H., PEARCE, R., AND HENSON, V. E.
A scalable eigensolver for large scale-free graphs using 2D graph
partitioning. In Proceedings of 2011 International Conference for
High Performance Computing, Networking, Storage and Analysis
(2011).
[29] ZHENG, D., BURNS, R., AND SZALAY, A. S. Toward millions
of file system IOPS on low-cost, commodity hardware. In Pro-
ceedings of the International Conference on High Performance
Computing, Networking, Storage and Analysis (2013).
[30] ZHENG, D., MHEMBERE, D., BURNS, R., VOGELSTEIN, J.,
PRIEBE, C. E., AND SZALAY, A. S. FlashGraph: Processing
billion-node graphs on an array of commodity SSDs. In 13th
USENIX Conference on File and Storage Technologies (FAST 15)
(2015).
[31] ZHOU, Z., SAULE, E., AKTULGA, H., YANG, C., NG, E.,
MARIS, P., VARY, J., AND CATALYUREK, U. An out-of-core
eigensolver on SSD-equipped clusters. In Cluster Computing
(CLUSTER), 2012 IEEE International Conference on (2012).
14
