Abstract-Sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication beyond memory capacity by implementing sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion; i.e., we keep the sparse matrix on commodity SSDs and the dense matrices in memory. Our SEM-SpMM incorporates many in-memory optimizations for large power-law graphs. It outperforms the in-memory implementations of Trilinos and Intel MKL and scales to billion-node graphs, far beyond the limits of memory capacity. Furthermore, on a single large parallel machine, our SEM-SpMM runs as fast as the distributed implementations of Trilinos that use five times as much processing power. We also run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% of the performance of IM-SpMM when the dense matrix has more than four columns; it achieves at least 65% of the performance of IM-SpMM on all inputs. We apply our SpMM to three important data analysis tasks, PageRank, eigensolving, and non-negative matrix factorization, and show that our SEM implementations significantly advance the state of the art.
INTRODUCTION
Sparse matrix multiplication is an important computation with a wide variety of applications in scientific computing, machine learning, and graph analysis. For example, matrix factorization algorithms on a sparse matrix, such as singular value decomposition (SVD) [1] and non-negative matrix factorization (NMF) [2] , require sparse matrix multiplication. Graph analysis algorithms such as PageRank [3] can be formulated as sparse matrix multiplication or generalized sparse matrix multiplication [4] . Some of the algorithms, such as PageRank and SVD, require sparse matrix vector multiplication. Others, such as NMF, require sparse matrix dense matrix multiplication.
It is challenging to implement sparse matrix multiplication efficiently. It has very low computation density and its performance is limited by memory access. Sparse matrix multiplication usually achieves only a small fraction of the peak performance of a modern processor [5] .
Sparse matrices derived from graphs impose additional challenges. Many graphs have a power-law distribution in vertex degree. This produces a power-law distribution in the number of non-zero entries per column or row in the associated matrix, which results in load imbalance in parallel decompositions of sparse matrix multiplication. Graph matrices are much sparser and have a more random non-zero structure than the traditional sparse matrices that arise in scientific computing. As such, many optimizations for traditional sparse matrices, such as register blocking and prefetching [5], increase the storage size, bring unnecessary data into the CPU cache and hurt performance. The commonly used sparse matrix storage formats, such as CSR (compressed sparse row) and CSC (compressed sparse column), lead to significant CPU cache misses in sparse matrix multiplication on graph matrices because graphs are sparse and large.
We focus on sparse matrices derived from graphs because they are more difficult to compute on efficiently than general sparse matrices and because some of the largest matrices and most important analysis tasks operate on graph datasets, e.g., social networks and Web graphs. Facebook's social network has billions of vertices and the Web Data Commons graph [6] has 3.6 billion vertices and 128 billion edges. We perform sparse matrix multiplication for graph analysis such as community detection with NMF and spectral analysis with SVD.
Current research focuses on sparse matrix multiplication in memory for small matrices and on scaling to a large sparse matrix in a large cluster, where the aggregate memory is sufficient to store the sparse matrix [5], [7], [8]. Distributed solutions for sparse matrix multiplication lead to significant network communication, and network bandwidth is usually the bottleneck. As such, this operation requires a fast network to achieve good performance. A supercomputer or a large cluster with a fast network is inaccessible or too expensive for many users. In addition, distributed solutions make it challenging to achieve load balance on power-law graphs.
On the other hand, a current trend in hardware design scales up a single machine, rather than scaling out to many networked machines. These machines typically have multiple processors with many CPU cores and a large amount of memory. They are also equipped with fast flash memory such as solid-state drives (SSDs) to further extend memory capacity. This conforms to the node design for supercomputers [9]. Recent findings from Frank McSherry [10], [11] and our prior work [12] show that the largest graph analytics tasks can be done on a small fraction of the hardware, at less cost, as fast, and using less energy on a single shared-memory node, rather than a distributed compute engine. Our findings in this paper reveal that sparse matrix multiplication enjoys the same benefits and that the largest graph matrices can be processed efficiently on a single scale-up compute node.
We explore a solution that scales sparse matrix dense matrix multiplication (SpMM) on a multi-core machine with commodity SSDs and performs SpMM in semi-external memory (SEM). The concept of semi-external memory arose as a functional computing approach for graphs [13] in which the vertex state of a graph is stored in memory and the edges are accessed from external memory. We introduce a similar construct for SpMM in which one or more columns of a dense matrix are kept in memory and the sparse matrix is accessed from external memory. In semi-external memory, we assume that the memory of a machine is sufficient to keep at least one column of the input dense matrix but is insufficient to hold the sparse matrix and the dense matrices. Given fast SSDs, we demonstrate that the SEM solution uses the resources of a multi-core machine well and achieves performance that exceeds state-of-the-art in-memory implementations.
We overcome many technical challenges to construct a sparse matrix multiplication implementation on SSDs that achieves good performance. Specifically, SSDs are an order of magnitude slower in throughput and multiple orders of magnitude slower in latency than DRAM. Furthermore, sequential I/O on SSDs is still much faster than random I/O [14] and reads are faster than writes. In addition, SSDs wear out when we write data to them and random writes further shorten their lifespan [15]. As such, our solution needs to sequentialize I/O access and reduce I/O, especially writes.
Semi-external memory provides a scalable and efficient SpMM solution that meets these I/O challenges. We further incorporate substantial computation optimizations to achieve high SpMM performance on graphs in a NUMA (non-uniform memory architecture) machine. During the computation, each thread streams its own partitions of the sparse matrix from SSDs, maximizing I/O throughput and avoiding thread synchronization. We buffer all intermediate computation results in local memory to reduce remote memory access, and stream the output matrix to SSDs at most once, minimizing writes to SSDs. We design a very compact sparse matrix format to accelerate retrieving the sparse matrix from SSDs. Semi-external memory has memory constraints. As such, we deploy only computation optimizations that require a small memory footprint, e.g., fine-grain dynamic task scheduling to balance loads on power-law graphs and cache blocking to increase CPU cache hits.
Given the sparsity and the enormous size of graphs, the dense matrices involved in SpMM are massive and our semi-external memory solution adapts to dense matrices with different numbers of columns. When the dense matrix is larger than memory, we split it vertically into multiple partitions so that each partition can fit in memory. As such, the minimum memory requirement of our solution is O(r), where r is the number of rows in the input dense matrix. By keeping more columns in the dense matrix in memory, we reduce I/O from SSDs and SEM-SpMM becomes CPU bound, instead of I/O bound, on fast SSDs.
We develop three important applications in scientific computing and graph analysis with our SEM-SpMM: PageRank [3] , eigensolver [16] and non-negative matrix factorization [2] . Each of them requires SpMM with different numbers of columns in dense matrices, resulting in different strategies of placing data in memory. With the three applications, we demonstrate data placement choices for different memory capacities in a machine and the impact of the memory size on the performance of the applications.
The main contributions of this paper are:
• We scale SpMM in semi-external memory using asymmetric I/O to SSDs and we deploy runtime optimizations that meet memory constraints.
• We deploy a compact sparse matrix format to reduce the amount of I/O and alleviate I/O as the bottleneck of the system.
• We deploy a fine-grain dynamic load balancing scheme for sparse matrices with power-law distribution in nonzero entries.
• We optimize SpMM for dense matrices with various numbers of columns and extend SEM-SpMM to large dense matrices that cannot fit in memory by vertically partitioning the dense matrices.
Our results show that for real-world sparse graphs, our SEM-SpMM achieves almost 100% of the performance of our in-memory implementation on a large parallel machine with 48 CPU cores when the dense matrix has more than four columns. Even for SpMV (sparse matrix vector multiplication, the special case in which the dense matrix has a single column), our SEM implementation achieves at least 65% of the performance of our in-memory implementation and outperforms Trilinos [17] and MKL [18] by a factor of 2-9. We conclude that semi-external memory coupled with SSDs delivers an efficient solution for large-scale sparse matrix multiplication. It serves as a building block and offers new design possibilities for large-scale data analysis, replacing memory with larger, cheaper, more energy-efficient SSDs and processing bigger problems on fewer machines. The code for SpMM and SpMV has been released as open source as part of the FlashX system at http://flashx.io.
RELATED WORK
Recent sparse matrix multiplication studies focus on in-memory optimizations. Williams et al. [5] describe optimizations for sparse matrix vector multiplication (SpMV) on multicore architectures. Yoo et al. [7] and Boman et al. [8] optimize distributed SpMV for large scale-free graphs with 2D partitioning to reduce communication between machines. In contrast, sparse matrix dense matrix multiplication (SpMM) receives less attention from the high-performance computing community. Even though SpMM can be implemented with SpMV, doing so fails to exploit the data locality available in SpMM. Aktulga et al. [19] optimize SpMM with cache blocking. Koanantakool et al. [20] experiment with different parallel algorithms for sparse matrix dense matrix multiplication and analyze their communication cost in distributed memory. We advance SpMM with a focus on optimizations for semi-external memory.
Compressed row storage (CSR) and compressed column storage (CSC) are the sparse matrix formats commonly used in many numeric libraries, such as Intel MKL [18] and Trilinos [17]. However, these formats are not designed for graphs. Sparse matrix multiplication with these formats on graphs incurs many random memory accesses. Sparsity [21] designs a format that encodes both register blocking and cache blocking to increase data reuse in the CPU cache for sparse matrix multiplication. Register blocking requires explicit storage of zero values in register blocks. This strategy wastes space and computation for graphs because graphs are sparse and a vertex in a graph can connect to arbitrary other vertices. Buluc et al. [22] further advance sparse matrix formats with doubly compressed sparse column (DCSC) for the hypersparse submatrices that arise after 2D partitioning of a sparse matrix. This format significantly reduces the storage size of a 2D-partitioned sparse matrix. Our format further reduces the storage size of a sparse matrix.
Abello et al. [13] introduced the semi-external memory algorithmic framework for graphs. Pearce et al. [23] implement several semi-external memory graph traversal algorithms for SSDs. FlashGraph [12] adopted the concept and performs graph algorithms with vertex state in memory and edge lists on SSDs. This work extends the semi-external memory concept to matrix operations.
GridGraph [24] is a general-purpose graph processing framework on a single machine and scales to large graphs using disks. It performs 2D partitioning on a graph to reduce CPU cache misses in graph analysis and deploys 2-level hierarchical partitioning to reduce I/O. This graph engine is designed for graph algorithms expressed as sparse matrix vector multiplication and keeps both graphs and vertex computation state on disks. In contrast, our work focuses on optimizations on sparse matrix dense matrix multiplication and performs this operation in semi-external memory to reduce I/O, especially writes, to SSDs. GridGraph runs more slowly than in-memory linear algebra libraries such as Intel MKL [18] and Trilinos Tpetra [17] .
Array databases such as SciDB [25] and Rasdaman [26] support sparse matrix multiplication. They perform matrix multiplication using CSR formats [27], similar to Intel MKL and Trilinos. There has also been preliminary research on accelerating matrix multiplication with GPUs [28], showing that speedups are limited by I/O and setup costs and that the benefits are limited to compute-dense operations.
MapReduce [29] is also commonly used for processing large graphs. PEGASUS [30] is a popular graph processing engine built on MapReduce that expresses graph algorithms as a generalized form of sparse matrix vector multiplication. GBase [31] is a graph database built on MapReduce. It optimizes the graph storage for graph queries and analysis and performs graph queries and analysis with sparse matrix vector multiplication. Other work [32], [33], [34] implements non-negative matrix factorization (NMF) with MapReduce. Although these MapReduce-based implementations can scale to very large graphs, they run orders of magnitude more slowly than optimized in-memory solutions, such as Trilinos and Intel MKL.
Zhou et al. [35] implement a LOBPCG [36] eigensolver on an SSD cluster. Their implementation targets nuclear many-body Hamiltonian matrices, which are much denser and have smaller dimensions than many sparse graphs. Therefore, their solution stores the sparse matrix on SSDs and keeps the entire vector subspace in RAM. In contrast, our SEM-SpMM handles dense matrices of different sizes and our eigensolver stores both the sparse matrix and the vector subspace on SSDs.
The Trilinos project [17] has a large collection of numerical libraries. The Tpetra library provides efficient sparse matrix operations such as sparse matrix multiplication. The matrix implementations in Trilinos are optimized for distributed memory.
Intel Math Kernel Library [18] is an efficient and parallel linear algebra library with matrix operations optimized for Intel platforms. It provides an efficient sparse matrix multiplication for regular sparse matrices. In contrast, our sparse matrix multiplication optimizes for power-law graphs with near-random vertex connection.
SPARSE MATRIX MULTIPLICATION
Sparse matrix dense matrix multiplication (SpMM) generates many random memory accesses and its performance is limited by random memory throughput of DRAM. We perform SpMM in semi-external memory (SEM) to enable nearly in-memory performance, while achieving scalability in proportion to the ratio of non-zero entries to rows or columns in a sparse matrix.
Semi-external memory
Our definition of semi-external memory for SpMM keeps the sparse matrix on SSDs and the input dense matrix or some columns of the input dense matrix in memory. We may keep the entire output matrix in memory if it is sufficiently small or stream it to SSDs or the subsequent computation.
In some applications, such as non-negative matrix factorization (Section 4), even the input dense matrix does not fit in memory. In this case, we partition the input dense matrix vertically so that each partition has complete columns of the original input dense matrix that fit in memory. For each vertical partition, we perform SpMM in semi-external memory and stream the output matrix to SSDs.
Sparse matrix format
We design a new compact sparse matrix format that increases CPU cache hits and reduces the amount of data read from SSDs. A compact format is performance critical because in SEM-SpMM the I/O to SSDs is most often the bottleneck.
To increase CPU cache hits, we deploy cache blocking [21] and store non-zero entries of a sparse matrix in tiles, t × t submatrices (Figure 1). When a tile is small, the rows from the input and output dense matrices involved in multiplication with the tile are always kept in the CPU cache during the computation. The optimal tile size should fill the CPU cache with the rows from the dense matrices and is affected by the number of columns of the dense matrices. To handle dense matrices with different numbers of columns, we deploy both static cache blocking and dynamic cache blocking: we generate sparse matrices with a relatively small tile size and rely on the runtime system to optimize for different numbers of columns (Section 3.4). However, a small tile size potentially increases the storage size of a sparse matrix. In semi-external memory, the dense matrices in sparse matrix multiplication usually have a small number of columns. We use a tile size of 16K × 16K by default to balance the matrix storage size and the adaptability to different numbers of columns.
We use a compact format to store non-zero entries and refer to this format as SCSR (Super Compressed Row Storage) (Figure 1). In sparse matrices encoding graphs, many rows in a tile do not have any non-zero entries. The CSR and CSC formats waste space because they require an entry for each row/column in the row/column index. Doubly compressed sparse column (DCSC) [22] was proposed to avoid this problem. However, even DCSC wastes space because it stores pointers to columns and fails to reference non-zero values with small integers (e.g., two-byte integers). SCSR is designed to further shrink the storage size of a tile. This format keeps data only for rows with non-zero entries in a tile. Each non-empty row has a row header that contains only an identifier to indicate the row number, followed by column indices. To determine the end of a row, the most significant bit of the identifier is always set to 1, while the most significant bit of a column index entry is always set to 0. Owing to the small size of a tile, we use two bytes to store a row number and a column index entry, which further reduces the storage size. Each non-zero entry requires at most four bytes to indicate its location in a matrix. Because the most significant bit is used to indicate the beginning of a row, this format supports a maximum tile size of 32K × 32K.
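To make the encoding concrete, the sketch below shows how the non-zero locations in one SCSR tile can be decoded from the packed two-byte words described above. The function and type names are illustrative rather than the actual FlashX API, and a binary sparse matrix (no stored non-zero values) is assumed for simplicity.

// A minimal sketch of scanning one SCSR tile, assuming the two-byte encoding
// described above: a row header with the most significant bit set to 1,
// followed by that row's column indices with the most significant bit 0.
#include <cstdint>
#include <cstddef>
#include <vector>

struct entry_t { uint16_t row; uint16_t col; };   // location inside a tile

std::vector<entry_t> scan_scsr_tile(const uint16_t *words, size_t num_words) {
    std::vector<entry_t> entries;
    uint16_t cur_row = 0;
    for (size_t i = 0; i < num_words; i++) {
        if (words[i] & 0x8000)             // MSB set: this word starts a new row
            cur_row = words[i] & 0x7fff;   // lower 15 bits hold the row number
        else                               // MSB clear: a column index in cur_row
            entries.push_back({cur_row, words[i]});
    }
    return entries;
}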
SCSR is much more compact than DCSC [22]. Figure 2 shows that SCSR uses 45%-70% of the storage size used by DCSC for large real-world graphs (Table 1). For a tile with nnr non-empty rows and nnz non-zero entries, SCSR requires 2 × nnr bytes for row ids, 2 × nnz bytes for column indices of non-zero entries and c × nnz bytes for non-zero values, where c is the number of bytes for a non-zero value. As such, S_SCSR = 2 × nnr + (2 + c) × nnz.
In contrast, DCSC requires much more metadata for a tile with nnc non-empty columns and nnz non-zero entries.
On average, nnr ≈ nnc in most sparse matrices. As such, for a binary sparse matrix, 0.4 ≤ S_SCSR/S_DCSC < 1. SCSR saves more space in a sparse matrix with a larger nnr/nnz. Inside each cache tile of SCSR, we use the coordinate format (COO) for the rows that have only a single non-zero entry. For the adjacency matrices of real-world graphs, many rows in a cache tile have only one non-zero entry. Iterating over single-entry rows in the SCSR format requires testing for the end of a row at every non-zero entry, which leads to many conditional jumps. In contrast, COO is more suitable for storing these single-entry rows. It does not increase the storage size but significantly reduces the number of conditional jump instructions. As a result, we combine SCSR with COO and store non-zero entries in the COO format behind the row headers of SCSR (Figure 1).
Dense matrices
In many applications, the dense matrices in SpMM are tall and skinny, with millions or even billions of rows and a small number of columns. Because many real-world graphs are very large and sparse, the dense matrices may be too large to fit in memory. Thus, for the general case of SpMM, we split the dense matrices into vertical partitions sized to fit in memory. To increase data locality in SpMM, the elements in each vertical partition are stored in row-major order (Figure 3(a)). When performing SpMM in semi-external memory, we load the input dense matrix or one of its vertical partitions into memory.
For a non-uniform memory architecture (NUMA), we partition the input dense matrix or one of its vertical partitions horizontally and store the horizontal partitions evenly across NUMA nodes (Figure 3(b)). Striping partitions evenly across all NUMA nodes helps to fully utilize memory bandwidth. We assign multiple contiguous rows in a row interval with 2^i rows to a NUMA node. The row interval size is a multiple of the tile size of a sparse matrix so that multiplication on a tile accesses rows only from a single row interval.
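To make the placement concrete, the sketch below shows one way to stripe row intervals of 2^i rows round-robin across NUMA nodes with Linux libnuma. The class and function names are illustrative and not the actual FlashX code.

// A hedged sketch of NUMA-aware placement of a tall, row-major dense matrix,
// assuming Linux libnuma (link with -lnuma). Row intervals of contiguous rows
// (a multiple of the sparse-matrix tile size) are striped across NUMA nodes.
#include <numa.h>
#include <algorithm>
#include <cstddef>
#include <vector>

struct numa_dense_matrix {
    size_t nrow, ncol;
    size_t interval_rows;            // 2^i rows, a multiple of the tile size
    std::vector<double *> parts;     // one buffer per row interval

    numa_dense_matrix(size_t nrow, size_t ncol, size_t interval_rows)
            : nrow(nrow), ncol(ncol), interval_rows(interval_rows) {
        int nnodes = numa_num_configured_nodes();
        for (size_t r = 0; r < nrow; r += interval_rows) {
            size_t rows = std::min(interval_rows, nrow - r);
            int node = (r / interval_rows) % nnodes;   // round-robin striping
            parts.push_back((double *) numa_alloc_onnode(
                    rows * ncol * sizeof(double), node));
        }
    }

    // Row-major access within an interval; a tile multiplication touches rows
    // from a single interval, hence a single NUMA node.
    double *row(size_t i) {
        return parts[i / interval_rows] + (i % interval_rows) * ncol;
    }
};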
Parallel Execution
We parallelize sparse matrix multiplication and deploy only computation optimizations with a small memory footprint. Memory is a precious resource in semi-external memory because it should be used to keep more columns of the input dense matrix in memory and thereby reduce I/O from SSDs. Algorithm 1 and Figure 4 illustrate the execution of sparse matrix multiplication in semi-external memory.
Semi-external memory favors horizontal partitioning on a sparse matrix for parallelization because this partitioning scheme minimizes writes to remote memory and SSDs. Horizontal partitioning requires only one thread to allocate local memory buffers for computation on a tile row. All intermediate computation results on tiles are merged into the local memory buffers. As such, we write the output matrix at most once to SSDs and there are no remote memory writes. In contrast, vertical partitioning and 2D partitioning [20] would require each thread to maintain a local memory buffer for the same tile rows to aggregate writes to SSDs and remote memory.

Algorithm 1: Parallel SpMM in semi-external memory.
Input: spm, an n × n sparse matrix on SSDs
Input: inM, an n × p dense matrix in memory
Output: outM, an n × p dense matrix on SSDs
  trQ ← get_tile_row_ids(spm)
  outM ← zeros_SSD(n, p)
  parfor thread ∈ threads do
      ProcessTileRows(spm, trQ, inM, outM)
  end parfor

procedure ProcessTileRows(spm, trQ, inM, outM)
Input: spm, an n × n sparse matrix on SSDs
Input: trQ, a queue that contains all tile row ids
Input: inM, an n × p input dense matrix in memory
In-Out: outM, an n × p output dense matrix on SSDs
  t ← get_tile_size(spm)
  numTRs ← CPU_cache_size / (2 × p × t)
  while |trQ| > 0 do
      if |trQ| ≤ #threads then
          numTRs ← 1
      ids ← get_tile_rows(trQ, numTRs)
      trs ← read_tilerow_async(spm, ids)
      res ← MulTRows(trs, inM) when trs is ready
      write_rows_async(outM, res)

procedure MulTRows(trs, inM)
Input: trs, an s × n submatrix of the sparse matrix
Input: inM, an n × p dense matrix
Output: outBuf, an s × p dense matrix
  outBuf ← zeros(s, p)
  t ← get_tile_size(trs)
  for k = 0; k < n/t; k = k + s/t do
      for each t × t tile in the s × s block starting at column k × t do
          lInMat ← rows from inM for tile
          lOutMat ← rows from outBuf for tile
          lOutMat += tile × lInMat
We deploy a fine-grain dynamic load balancing scheme for parallel computation with a global task queue trQ (Algorithm 1). Each thread gets one computation task (a set of tile rows) at a time from trQ with get_tile_rows and performs the computation (MulTRows). At the beginning, a thread gets larger computation tasks that contain multiple tile rows; as the computation approaches completion, a thread gets smaller tasks that contain only one tile row. This design reduces concurrent access to the global data structure while realizing good load balance.
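The sketch below illustrates this scheduling policy: a shared queue hands out batches of contiguous tile rows and shrinks the batch size to a single tile row when the remaining work drops below the number of threads. The class and member names are illustrative, not the FlashX implementation.

// A minimal sketch of the fine-grain dynamic scheduler described above.
#include <algorithm>
#include <mutex>
#include <vector>
#include <cstddef>

class tile_row_queue {
    std::mutex lock;
    size_t next = 0;           // next unassigned tile row id
    size_t num_tile_rows;      // total number of tile rows in the sparse matrix
    size_t num_threads;
public:
    tile_row_queue(size_t num_tile_rows, size_t num_threads)
        : num_tile_rows(num_tile_rows), num_threads(num_threads) {}

    // Return the ids of the next batch of contiguous tile rows; an empty
    // vector means all tile rows have been dispatched.
    std::vector<size_t> get_tile_rows(size_t batch_size) {
        std::lock_guard<std::mutex> guard(lock);
        size_t remaining = num_tile_rows - next;
        // Near the end of the computation, hand out one tile row at a time so
        // that no thread is stuck with a large final chunk.
        if (remaining <= num_threads)
            batch_size = 1;
        batch_size = std::min(batch_size, remaining);
        std::vector<size_t> ids;
        for (size_t i = 0; i < batch_size; i++)
            ids.push_back(next++);
        return ids;
    }
};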
The algorithm performs asynchronous I/O and merges writes from multiple threads into larger ones. Large writes ensure sustainable write throughput to SSDs and extend their lifetime. To assist in I/O merging, we use get_tile_rows() to control a global execution order that ensures that all threads are processing contiguous tile rows and that the results from those tile rows are located close together on SSDs.
To better utilize the CPU cache, MulTRows reorganizes t × t tiles from multiple contiguous tile rows into s × s blocks, where s = CPU_cache_size / (2 × p) (Figure 4). We process all tiles in an s × s block before moving to the next block. The tile size of a sparse matrix is relatively small. As such, this execution order helps to reuse data in the CPU cache from the computation on the previous tile. MulTRows also maintains a local memory buffer outBuf to store the computation results, which minimizes remote memory access. Once the computation in MulTRows is complete, outBuf contains complete results.
We vectorize the computation on rows of dense matrices. In SpMM, we multiply a non-zero entry from a tile with all elements in a row of the input dense matrix and add the results to the corresponding row of the output dense matrix. We perform these operations with vector CPU instructions, such as AVX [37] , to enable more efficient memory access and computation.
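The inner kernel that this vectorization targets is small: each non-zero entry of a tile scales one row of the in-memory input dense matrix and accumulates into the corresponding row of the output buffer. The paper uses AVX instructions; the sketch below instead relies on OpenMP SIMD to let the compiler vectorize the loop, and the function name is illustrative.

// A hedged sketch of the vectorized row kernel described above.
#include <cstddef>

// in_row:  one row of the in-memory input dense matrix (p columns)
// out_row: the corresponding row of the thread-local output buffer
inline void scale_and_add_row(double nz_val, const double *in_row,
                              double *out_row, size_t p) {
    #pragma omp simd
    for (size_t j = 0; j < p; j++)
        out_row[j] += nz_val * in_row[j];
}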
I/O optimizations
Semi-external memory sparse matrix multiplication streams a sparse matrix from SSDs. When we access fast SSDs sequentially, the overhead of operating systems, such as thread context switches and memory allocation, becomes noticeable. We tackle these obstacles to maximize I/O throughput.
We issue asynchronous I/O and poll for I/O to avoid thread context switches. I/O polling prevents a thread from being switched out after it completes all available computation. Without polling, when a thread issues an I/O request and waits for I/O completion, the operating system switches the thread out; the operating system reschedules the thread for execution once I/O is complete. However, there is latency for thread rescheduling and the latency from frequent rescheduling can cause noticeable performance degradation on a high-speed SSD array.
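The paper does not spell out the I/O interface used; the sketch below shows the issue-then-poll pattern with Linux libaio, where a zero timeout and min_nr = 0 make io_getevents poll for completions instead of putting the thread to sleep. The function name and the error handling are simplified for illustration.

// A minimal sketch of an asynchronous read followed by polling, assuming
// Linux libaio (link with -laio); the context ctx comes from io_setup().
// Real code would use O_DIRECT files and aligned buffers.
#include <libaio.h>
#include <sys/types.h>
#include <cstddef>
#include <ctime>

long read_and_poll(io_context_t ctx, int fd, void *buf, size_t len, off_t off) {
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, len, off);     // prepare the asynchronous read
    if (io_submit(ctx, 1, cbs) != 1)
        return -1;

    struct io_event ev;
    struct timespec zero = {0, 0};             // do not block in the kernel
    for (;;) {
        int n = io_getevents(ctx, 0, 1, &ev, &zero);   // poll: min_nr = 0
        if (n == 1)
            return (long) ev.res;              // bytes read
        // n == 0: not complete yet; keep polling or overlap other computation
    }
}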
When accessing a sparse matrix or a dense matrix from SSDs, we maintain a set of memory buffers for I/O access to reduce the overhead of memory allocation. We use large I/O to access matrices on SSDs to increase I/O throughput. Large memory allocation is expensive because the operating system usually allocates a large memory buffer with mmap() and populates the buffer with physical pages. Therefore, we keep a set of memory buffers allocated previously and reuse them for new I/O requests. For accessing a sparse matrix, tile rows usually have different sizes, so we resize a previously allocated memory buffer if it is too small for a new I/O request.
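The sketch below illustrates this buffer-reuse strategy: a per-thread pool hands back previously allocated buffers and grows a recycled buffer when a new tile row needs more space. Alignment for direct I/O and NUMA-local allocation are omitted, and the names are illustrative.

// A hedged sketch of a per-thread I/O buffer pool, as described above.
#include <cstdlib>
#include <cstddef>
#include <vector>

class io_buffer_pool {
    struct buf_t { void *ptr; size_t size; };
    std::vector<buf_t> free_bufs;       // buffers returned by earlier requests
public:
    // Get a buffer of at least `size` bytes, reusing an old buffer if possible.
    void *get(size_t size) {
        if (free_bufs.empty())
            return malloc(size);
        buf_t b = free_bufs.back();
        free_bufs.pop_back();
        // Tile rows vary in size: grow the recycled buffer if it is too small.
        return b.size >= size ? b.ptr : realloc(b.ptr, size);
    }
    // Return a buffer of `size` bytes to the pool for later reuse.
    void put(void *ptr, size_t size) { free_bufs.push_back({ptr, size}); }
};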
The impact of the memory size on I/O
More memory reduces I/O in semi-external memory. The minimum memory requirement for SEM-SpMM is nc + tb, where n is the number of rows of the input dense matrix, c is the element size in bytes, t is the number of threads processing the sparse matrix and b is the per-thread buffer size for the sparse matrix and the output dense matrix. When a machine does not have sufficient memory to keep the entire input dense matrix in memory, we need multiple passes over the sparse matrix to complete the computation. Reducing memory consumption is essential to achieve performance in semi-external memory. By keeping more columns of the input dense matrix in memory, we reduce the number of I/O passes.
Although we can cache part of a sparse matrix, keeping more columns of the input dense matrix in memory saves more I/O than using the same amount of memory for the sparse matrix. Assume the input dense matrix has n rows and p columns. Again, c is the element size in bytes. The storage size of the sparse matrix is E bytes and the memory size is M bytes. We further assume we use M' bytes to keep some columns of the dense matrices in memory (M' < M, ncp mod M' ≡ 0) and the remaining memory (M − M') to cache the sparse matrix. The amount of data in the sparse matrix read from SSDs is

IO_in = (ncp / M') × (E − (M − M')).

Because E > M in semi-external memory, we minimize IO_in by maximizing M', i.e., using more memory for the input dense matrix.
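For illustration with hypothetical numbers (not measurements from the paper): suppose E = 200 GB, M = 64 GB and the full input dense matrix occupies ncp = 128 GB. Using all memory for dense-matrix columns (M' = 64 GB) requires two passes that each read 200 GB from SSDs, so IO_in = 400 GB. Splitting the memory evenly (M' = 32 GB plus 32 GB of sparse-matrix cache) requires four passes of 168 GB each, so IO_in = 672 GB. Dedicating memory to the input dense matrix therefore reads less data.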
As the number of columns in memory from the input dense matrix increases, the bottleneck of the system may switch. When we keep only one column of the input dense matrix in memory, the system is usually I/O bound; when we keep more columns of the dense matrix in memory, the system will become CPU bound and the I/O complexity does not affect its performance.
APPLICATIONS
We apply sparse matrix multiplication to three important applications widely used in graph analysis and machine learning: PageRank [3] , eigensolver [16] and non-negative matrix factorization [2] . Each application demonstrates a different strategy of using memory for SpMM.
PageRank
PageRank is an algorithm that ranks Web pages by using the hyperlinks between them. It was first used by Google and is identified as one of the top 10 data mining algorithms [38]. PageRank is representative of a set of graph algorithms that can be expressed with SpMM or generalized SpMM, in which the dense matrices have only one column. Other important examples are label propagation [39] and belief propagation [40]. The algorithm runs iteratively and its update rule for each Web page u in an iteration is

PR(u) = (1 − d)/N + d × Σ_{v ∈ B(u)} PR(v)/L(v),

where B(u) denotes the neighbor list of vertex u, L(v) denotes the out-degree of vertex v, d is the damping factor and N is the number of vertices.
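The sketch below shows one PageRank iteration written on top of an SpMV routine, matching the update rule above. The spmv callback stands in for the (semi-external memory) sparse matrix vector multiplication of Section 3 and is assumed to compute y[u] = Σ_{v ∈ B(u)} pr[v]/L(v); the function names are illustrative, not the FlashX API.

// A hedged sketch of one PageRank iteration expressed with SpMV.
#include <cstddef>
#include <functional>
#include <vector>

std::vector<double> pagerank_iteration(
        const std::vector<double> &pr,      // current PageRank values (in memory)
        double d, size_t N,                 // damping factor and number of vertices
        const std::function<void(const std::vector<double> &,
                                 std::vector<double> &)> &spmv) {
    std::vector<double> y(N);
    spmv(pr, y);                 // y[u] = sum over in-neighbors v of pr[v]/L(v)
    for (size_t u = 0; u < N; u++)
        y[u] = (1.0 - d) / N + d * y[u];    // apply the damped update rule
    return y;
}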
Eigensolver
An eigensolver is another commonly used application that requires sparse matrix multiplication. Many algorithms [41] , [42] , [43] and frameworks [16] , [44] , [45] have been developed to solve a large eigenvalue problem.
We take advantage of the Anasazi eigensolver framework [16] and replace its original matrix operations with our SEM-SpMM. To compute eigenvalues of an n × n sparse matrix, many eigenvalue algorithms construct a vector subspace with a sequence of sparse matrix multiplications, and each vector in the subspace has length n. Due to the sparsity of real-world graphs, the vector subspace is large and we keep the vectors on SSDs. The Anasazi eigensolvers have block extensions to update multiple vectors in the subspace simultaneously, which results in sparse matrix dense matrix multiplication. The most efficient Anasazi eigensolver on sparse graphs is the Krylov-Schur eigensolver [43], which updates a small number of vectors (1-4) in the subspace simultaneously. Zheng et al. [46] provide the details of extending the Anasazi eigensolver.
Non-negative matrix factorization
Non-negative matrix factorization (NMF) [2] finds two non-negative low-rank matrices W and H to approximate a matrix, A ≈ W H. NMF has many applications in machine learning and data mining. A well-known example is collaborative filtering [47] in recommender systems. NMF can also be applied to graphs to find communities [48], [49].
Many algorithms are designed to solve NMF and here we describe an algorithm [2] that requires a sequence of sparse matrix dense matrix multiplications. The algorithm uses multiplicative update rules and updates matrices W and H alternately. In each iteration, the algorithm first fixes W to update H and then fixes H to update W .
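For reference, the multiplicative update rules of this algorithm [2] can be written as follows; this is the standard statement of the rules and is consistent with the W^T A and A H^T products mentioned below, though the exact form in the original paper may differ slightly:

H ← H ∘ (W^T A) / (W^T W H),    W ← W ∘ (A H^T) / (W H H^T),

where ∘ denotes element-wise multiplication and the division is element-wise. The products W^T A and A H^T involve the sparse matrix A and are the SpMM operations we perform in semi-external memory.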
We apply SEM-SpMM to NMF differently based on the memory size and the number of columns in W and H. Due to the sparsity of a graph, W and H may require storage as large as the sparse matrix and can no longer fit in memory. Therefore, we partition W and H vertically and run multiple sparse matrix multiplications to compute W T A and AH T , if the memory is not large enough.
EXPERIMENTAL EVALUATION
We evaluate the performance of our SEM-SpMM on multiple real-world billion-scale graphs, including a web-page graph with 3.4 billion vertices. We first measure the performance of SEM-SpMM and compare it with our in-memory implementation (IM-SpMM), which is simply the SEM-SpMM implementation with the sparse matrix in memory. We also compare SEM-SpMM with state-of-the-art in-memory implementations in Intel MKL (mkl_dcsrmm) and Trilinos Tpetra. We use Intel MKL 2015 and Trilinos v12.0.1 for the experiments. We demonstrate the effectiveness of CPU and I/O optimizations on SEM-SpMM and evaluate the overall performance of the applications in Section 4.
We conduct experiments on a NUMA machine with four Intel Xeon E7-4860 processors, clocked at 2.6 GHz, and 1TB of DDR3-1600 memory. Each processor has 12 cores. The machine has three LSI SAS 9300-8e host bus adapters (HBAs) connected to a SuperMicro storage chassis, in which 24 OCZ Intrepid 3000 SSDs are installed. Together, the 24 SSDs are capable of delivering 12 GB/s for reads and 10 GB/s for writes at maximum. The machine runs Linux kernel v3.13.0. We use 48 threads for all implementations. We use the adjacency matrices of the graphs in Table 1 for performance evaluation. The smallest graph we use has 42 million vertices and 1.5 billion edges. The largest graph is the Page graph with 3.4 billion vertices and 129 billion edges. We generate two synthetic graphs with R-Mat [52] to fill the size gap between the smallest and largest graphs. We construct a directed and an undirected version of each synthetic graph because some applications in Section 4 run on directed graphs and others run on undirected graphs. The real-world datasets are publicly available and the synthetic datasets are generated with the boost RMAT generator. We always use the undirected version of the synthetic graphs for the performance evaluation of sparse matrix multiplication.
SEM-SpMM vs. IM-SpMM
We compare the performance of SEM-SpMM with IMSpMM to investigate the performance penalty of scaling SpMM with SSDs. In this case, the dense matrices have a small number of columns and are stored in memory.
There is only a small performance penalty for semi-external memory (Figure 5a). The performance gap between IM-SpMM and SEM-SpMM is affected by the randomness of vertex connections: the gap is smaller if the vertex connections in a graph are more random. The Page graph is relatively well clustered, so SpMM on this graph is less CPU-bound than on the others. Even for the Page graph, SEM-SpMM achieves 65% of the performance of IM-SpMM. The other factor affecting the performance gap is the number of columns in the dense matrices. The gap gets smaller as the number of columns in the dense matrices increases.
We further measure the average I/O throughput generated by SEM-SpMM to identify the bottleneck of the system (Figure 5b). These experiments vary the number of columns in the dense matrix from 1 (sparse matrix vector multiplication, SpMV) to 8. SpMV on the Page graph saturates the I/O bandwidth of the SSDs and is clearly bottlenecked by I/O. SpMV on the other graphs (except Twitter) also generates high I/O throughput, which consumes memory bandwidth and potentially interferes with random memory access in SpMV. SEM-SpMV on the Twitter graph takes only about 0.5 seconds to complete and, thus, startup overhead has a significant impact on the average I/O throughput. As the number of columns in the dense matrix increases, SEM-SpMM becomes more CPU-bound and requires only a small number of columns to achieve close to 100% of the performance of IM-SpMM.
Multiple factors affect CPU cache hits in sparse matrix multiplication and the performance gap between IM-SpMV and SEM-SpMV (Figure 6). We illustrate some of the factors with the stochastic block model (SBM) [53], a random graph generation model with well-defined clusters that is widely used in the graph community. In this experiment, we generate SBM graphs with 100 million vertices and 3 billion edges. If the vertices in a graph are randomly ordered, SpMV on the graph generates many random memory accesses. Thus, the system is bottlenecked by memory bandwidth and the performance gap between IM-SpMV and SEM-SpMV is small. On a graph with vertices ordered by cluster structure, the number of clusters and the ratio of edges inside and outside clusters have a significant impact on the performance gap. As the number of clusters increases, the clusters get smaller and there are fewer random memory accesses inside clusters. Similarly, more edges inside clusters also lead to fewer random memory accesses. In both cases, the performance gap grows.
Comparison with other in-memory SpMM
In this section, we compare our SpMM implementation with Intel MKL and Trilinos Tpetra. Intel MKL runs on shared-memory machines. Trilinos Tpetra runs in both shared memory and distributed memory, so we measure its performance on our 48-core NUMA machine and on an EC2 cluster. We run Tpetra in the largest EC2 instances, r3.8xlarge, with 16 physical CPU cores and 244GB of RAM each. The EC2 instances are connected with a 10Gbps network in the same placement group. MKL and Tpetra cannot run on the Page graph on our NUMA machine because their memory consumption on this graph exceeds its memory capacity. Our IM-SpMM and SEM-SpMM significantly outperform Intel MKL and Trilinos Tpetra on the natural graphs on our NUMA machine for both SpMV and SpMM (Figure 7). Even though the Tpetra implementation is optimized for SpMV, our SEM-SpMM consistently outperforms Tpetra by a factor of 2-3 even in SpMV. MKL has better optimizations for SpMM than Tpetra. Our SEM-SpMM is still almost twice as fast as MKL in SpMM with a dense matrix of eight columns. Our SpMM implementation benefits from all of the in-memory optimizations to achieve high computation efficiency and has substantial I/O optimizations to reduce the performance gap between IM-SpMM and SEM-SpMM (Section 5.4). Specifically, the fine-grain dynamic load balancer achieves better load balancing than MKL and Tpetra, and cache blocking significantly reduces CPU cache misses, whereas MKL and Tpetra store a sparse matrix in CSC or CSR format and have more CPU cache misses. Figure 9 reports the performance of SEM-SpMM on our 48-core machine (SEM) and of Trilinos Tpetra on EC2 clusters (2xEC2, 4xEC2, 8xEC2 and 16xEC2), normalized to IM-SpMM on our 48-core machine for the same graphs, together with IM-SpMM on one of the EC2 instances (IM-EC2) where Tpetra runs.
SEM-SpMM consumes a small fraction of the memory used by IM-SpMM and the other SpMM implementations (Figure 8). SEM-SpMM consumes memory for the input dense matrix and for per-thread memory buffers for the sparse matrix and the output dense matrix. The memory used by the buffers in each thread is significant but relatively constant across graph sizes. We show the memory consumption on one of the large graphs, RMAT-160 (Figure 8). SEM-SpMM uses about one tenth of the memory used by IM-SpMM, while IM-SpMM consumes much less memory than MKL and Tpetra owing to its compact sparse matrix format.
Our SEM-SpMM on a large parallel machine achieves comparable performance to, and in many cases outperforms, Trilinos Tpetra using five times as much processing power, especially on real-world graphs (Figure 9). In this experiment, we run our SpMM implementation on both our NUMA machine with 48 CPU cores and one of the EC2 machines with 16 CPU cores. Owing to the compact sparse matrix format, our IM-SpMM runs on all of the graphs on a single EC2 instance and achieves around half of the performance of our IM-SpMM on our NUMA machine. When Tpetra runs on 16 EC2 instances, it has 5 times as many CPU cores as our NUMA machine. Tpetra cannot run SpMV on RMAT-160 on two EC2 nodes. Tpetra uses many more computation resources and still barely reaches the performance of our IM-SpMM and SEM-SpMM on our NUMA machine. One of the main reasons that our SpMM implementation performs much better on real-world graphs is that these graphs are more likely to cause load imbalance; the dynamic scheduling of our SpMM implementation balances load much better than Tpetra.
SEM-SpMM with a large input dense matrix
We further measure the performance of SEM-SpMM with a large input dense matrix, in which neither the sparse matrix nor the dense matrices can fit in memory. In this experiment, we measure the performance of multiplying a sparse matrix with a vertically partitioned dense matrix and the input and output dense matrices are stored on SSDs. We study the impact of memory size on the performance of SEM-SpMM by artificially varying the number of columns that can fit in memory. SEM-SpMM accesses the sparse matrix with direct I/O and, thus, varying the number of columns stored in memory does not affect I/O access to the sparse matrix. IM-SpMM does not need to partition the dense matrices vertically. We do not show the result on the Page graph because the dense matrix with 32 columns for this graph cannot fit in memory.
As more columns of the input dense matrix can fit in memory, the performance of SEM-SpMM increases steadily (Figure 10). When the memory can fit over four columns of the input dense matrix, SEM-SpMM gets over 50% of the performance of IM-SpMM. Even when only one column of the input dense matrix can fit in memory, SEM-SpMM still gets 25% of the in-memory performance. When the entire input dense matrix can fit in memory, we get about 80% of the in-memory performance.
Two main factors lead to performance loss in SEM-SpMM when the input dense matrix cannot fit in memory. We illustrate the contribution of four potential overheads in SEM-SpMM on the Friendster graph (Figure 11). The main performance loss comes from the loss of data locality in SpMM caused by vertical partitioning of the input dense matrix (Vert-part). Partitioning the dense matrix into one-column matrices contributes 60% of the performance loss, and this loss drops quickly as the vertical partition size increases. Keeping the sparse matrix on SSDs (SpM-EM) also contributes some performance loss when the dense matrix is partitioned into small matrices; the overhead almost goes away when more than four columns of the dense matrix can fit in memory. The overhead of streaming the output dense matrix to SSDs (Out-EM) and reading the input dense matrix into memory (In-EM) is less significant and remains the same for different memory sizes.
Optimizations on sparse matrix multiplication
Accelerating SEM-SpMM requires both computation and I/O optimizations. We first evaluate the effectiveness of the computation optimizations by deploying them on IM-SpMM. We then show the effectiveness of the I/O optimizations on SEM-SpMM with all computation optimizations enabled. The number of columns in the dense matrix affects the effectiveness of the optimizations. As such, we illustrate their effectiveness on SpMM for dense matrices with one column (SpMV) up to eight columns.
Here we illustrate the most significant computation optimizations from Section 3. We start with an in-memory implementation that performs sparse matrix multiplication on a sparse matrix in the CSR format and apply the optimizations incrementally in the following order:
• dispatch partitions of a sparse matrix to threads dynamically to balance load (Load balance),
• partition dense matrices for NUMA (NUMA),
• organize the non-zero entries in a sparse matrix into tiles to increase CPU cache hits (Cache blocking),
• use CPU vectorization instructions to accelerate arithmetic computation (Vec).
All of these optimizations have positive effects on sparse matrix multiplication and together they speed up SpMM by 3-5 times (Figure 12). The degree of effectiveness varies between graphs and between different numbers of columns in the dense matrices. The largest performance boost comes from cache blocking, especially for SpMV. This is expected because vertices in these graphs can connect to arbitrary vertices, which leads to many random memory accesses in sparse matrix multiplication. Cache blocking significantly increases CPU cache hits to reduce random memory access. CPU vectorization is only effective for SpMM because it optimizes computation on a row of the dense matrix. With all optimizations, we have a fast in-memory implementation for both sparse matrix vector multiplication and sparse matrix dense matrix multiplication.
We evaluate I/O optimizations on SEM-SpMV against a base implementation that has all of the computation optimizations and uses the doubly compressed sparse row format (DCSR) to store tiles of a sparse matrix. We illustrate their effectiveness on the Friendster graph and the Page graph. The first represents a graph that is not well clustered; the other is clustered by domain names. We apply the I/O optimizations in the following order:
• use SCSR to reduce data read from SSDs (SCSR),
• reduce memory allocation overhead for I/O with per-thread buffer pools (buf-pool),
• reduce the number of thread context switches for I/O accesses with I/O polling (IO-poll).
The I/O optimizations lead to substantial speedup over the base implementation, but behave very differently on these two graphs (Figure 13). On the unclustered graph (Friendster), SCSR requires a much smaller storage size than DCSR (Figure 2) and thus achieves significant speedup. The Page graph, on the other hand, is well clustered and DCSR already achieves a small storage size; SCSR further reduces the storage size, but the effect is less significant. SpMV on the Page graph has less random memory access and is I/O-bound even on a large SSD array. Buf-pool and IO-poll increase I/O throughput and, thus, improve performance.
Format conversion from CSR to SCSR has low cost with linear time complexity and is bottlenecked by SSDs on our largest graphs (Table 2). It reads the CSR image once and writes the SCSR image back once, both sequentially. This results in the minimum amount of I/O, and I/O bounds the computation. Format conversion generates only one-time overhead. Most of the applications, such as PageRank and eigensolving, require many iterations of sparse matrix multiplication to compute accurate results. As such, the format conversion overhead is amortized.
Application performance
We evaluate the performance of our implementations of the applications in Section 4. We show the effectiveness of additional memory for these applications and compare their performance with state-of-the-art implementations.
PageRank
We evaluate the performance of our SpMM-based PageRank implementation (SpMM-PageRank). This implementation requires the input vector to be in memory, but it is optional to keep the output vector and the degree vector in memory. PageRank is a benchmarking graph algorithm implemented by many graph processing frameworks. We compare the performance of SpMM-PageRank with state-of-the-art implementations in FlashGraph [12], a semi-external memory graph engine, and GraphLab Create, the next generation of PowerGraph [54]. The PageRank implementation in FlashGraph computes approximate PageRank values while SpMM-PageRank and GraphLab Create compute exact PageRank values. We run GraphLab Create completely in memory and FlashGraph in semi-external memory. GraphLab Create is not able to compute PageRank on the Page graph. We use FlashGraph v0.3 and a trial version of GraphLab Create v1.9.
SpMM-PageRank in memory and in semi-external memory both significantly outperform the implementations in FlashGraph and GraphLab Create (Figure 14). Our SpMM is highly optimized for both CPU and I/O. Even though SpMM-PageRank performs more computation than FlashGraph, it performs the computation much more efficiently and reads less data from SSDs than FlashGraph. SpMM-PageRank and the implementation in GraphLab Create perform the same computation, but SpMM-PageRank performs it much more efficiently.
The experimental results also show that keeping more vectors in memory yields only a modest performance improvement for SpMM-PageRank. As such, SpMM-PageRank only needs to keep one vector in memory, which results in very small memory consumption.
Eigensolver
We evaluate the performance of our SEM Krylov-Schur eigensolver and compare it with our in-memory eigensolver and the Trilinos Krylov-Schur eigensolver. Usually, spectral analysis only requires a very small number of eigenvalues, so we compute eight eigenvalues in this experiment. We run the eigensolvers on the smaller undirected graphs in Table 1. To evaluate the scalability of the SEM eigensolver, we compute singular value decomposition (SVD) on the Page graph. Among all of the eigensolvers, only our SEM eigensolver is able to compute eigenvalues on the Page graph. For computing eight eigenvalues, our SEM eigensolver achieves performance comparable to our in-memory eigensolver and the Trilinos eigensolver and can scale to very large graphs (Figure 15). Unlike PageRank, an eigensolver has many more vector or dense matrix operations. As such, the memory size has a noticeable impact on performance. With the setting that has the minimum memory consumption, the SEM eigensolver achieves at least 45% of the performance of our in-memory eigensolver; when keeping the entire subspace in memory, it has almost the same performance as our in-memory eigensolver.
NMF
We evaluate the performance of our NMF implementation (SEM-NMF) on the directed graphs in Table 1. The dense matrices for NMF can be as large as the sparse matrix. As such, we study the effect of the memory size on the performance of SEM-NMF by varying the number of columns of the dense matrices kept in memory. We also compare the performance of SEM-NMF with a high-performance NMF implementation, SmallK [55], built on top of the numeric library Elemental [56]. We factorize each of the graphs into two n × k non-negative dense matrices and use k = 16 because 16 is the largest k that SmallK supports for the graphs in Table 1. We use SmallK v1.6 and Elemental v0.85. We significantly improve the performance of SEM-NMF by keeping more columns of the input dense matrix in memory (Figure 16). The performance improvement is more significant when the number of columns that fit in memory is small. When we keep eight columns of the input dense matrix in memory, SEM-NMF achieves over 60% of the performance of the in-memory implementation.
SEM-NMF significantly outperforms other NMF implementations. SmallK is the closest competitor. We run the same NMF algorithm in SmallK and SEM-NMF outperforms SmallK by a large factor on all graphs (Figure 16 ).
CONCLUSIONS
We present an alternative solution for scaling sparse matrix dense matrix multiplication (SpMM) to large sparse matrices by utilizing commodity SSDs in a large parallel machine. We perform this operation in semi-external memory (SEM), in which we keep the sparse matrix on SSDs and the dense matrices in memory. Semi-external memory increases scalability in proportion to the ratio of non-zero entries to rows or columns in a sparse matrix, while achieving in-memory performance. We demonstrate that our approach provides a promising alternative to distributed computation for large-scale data analysis.
With substantial memory and I/O optimizations, our SEM-SpMM achieves high efficiency while scaling to large graphs with billions of vertices and hundreds of billions of edges. It significantly outperforms the Intel MKL and Trilinos implementations. We run our implementation in memory (IM-SpMM) to quantify the overhead of keeping data on SSDs. SEM-SpMM achieves almost 100% of the performance of IM-SpMM on some graphs when the dense matrices have more than four columns, and achieves at least 65% of the performance of IM-SpMM on all graphs even when the dense matrix has only one column.
For a machine with insufficient memory to keep the entire input dense matrix in memory, we partition the dense matrix vertically and run SEM-SpMM multiple times. In this case, the main overhead of SEM-SpMM comes from the loss of data locality caused by vertical partitioning on the dense matrix. However, given sufficient memory to keep a small number of columns of the input dense matrix, we achieve performance comparable to IM-SpMM.
We apply our sparse matrix multiplication to three important applications: PageRank, eigendecomposition and non-negative matrix factorization. We demonstrate how additional memory should be used in semi-external memory in each application. We further demonstrate that each of our implementations significantly outperforms the state of the art and scales to large graphs.
Our SSD-based solution also achieves high energy efficiency, even though we have not measured energy consumption explicitly. SSDs are energy-efficient storage media [57] compared with RAM and hard drives. When processing large datasets, our solution uses only a single machine and requires a relatively small amount of memory. In contrast, a distributed solution requires many more machines and much more aggregate memory to process datasets of the same size. As such, our solution introduces an energy-efficient architecture for large-scale data analysis tasks.
