Owing to random memory access patterns, sparse matrix multiplication is traditionally performed in memory and scales to large matrices using the distributed memory of multiple nodes. In contrast, we scale sparse matrix multiplication by utilizing commodity SSDs. We implement sparse matrix dense matrix multiplication (SpMM) in a semi-external memory (SEM) fashion, i.e., we keep the sparse matrix on SSDs and dense matrices in memory. Our SEM SpMM can incorporate many in-memory optimizations for large powerlaw graphs with near-random vertex connection. Coupled with many I/O optimizations, our SEM SpMM achieves performance comparable to our in-memory implementation on a large parallel machine and outperforms the implementations in Trilinos and Intel MKL. Our experiments show that the SEM SpMM achieves almost 100% performance of the in-memory implementation on graphs when the dense matrix has more than four columns; it achieves at least 65% performance of the in-memory implementation for all of our graphs when the dense matrix has only one column. We apply our SpMM to three important data analysis applications and show that our SSD-based implementations can significantly outperform state of the art of these applications and scale to billion-node graphs.
INTRODUCTION
Sparse matrix multiplication is very important computation with a wide variety of applications in scientific computing, machine learning and data mining. For example, matrix factorization algorithms on a sparse matrix such as singular value decomposition (SVD) [11] and non-negative matrix factorization (NMF) [18] requires sparse matrix multiplica-ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235 tion. Graph analysis algorithms such as PageRank [8] can be formulated as sparse matrix multiplication or generalized sparse matrix multiplication [24] . Some of the algorithms, such as PageRank and SVD, require sparse matrix vector multiplication. Others, such as NMF, require sparse matrix dense matrix multiplication.
The largest sparse matrices arise from graph datasets in which one performs sparse matrix multiplication for graph analysis such as community detection with NMF and spectral analysis with SVD. These matrices inherit structure from natural graphs. Specifically, these matrices are typically very sparse and have near-random distribution for nonzero entries. They also have a power law distribution that governs the number of non-zero entries per row and column.
It is challenging to have an efficient implementation of sparse matrix multiplication, especially for sparse matrices that encode real-world graphs such as social networks and Web graphs. Sparse matrix multiplication is notorious for achieving only a small fraction of the peak performance of a modern processor [34] . It becomes even more challenging to perform this operation on graphs due to random memory access caused by near-random connection and load imbalancing caused by the power-law distribution in vertex degree. Many real-world graphs are enormous. For example, Facebook's social network has billions of vertices and today's web graphs are much larger. Furthermore, graphs cannot be clustered or partitioned effectively [20] to localize access. Therefore, sparse matrix multiplication on graphs is frequently the bottleneck in an application.
Current research focuses on sparse matrix vector multiplication (SpMV) in memory for small matrices and scaling to a large sparse matrix in a large cluster, where the aggregate memory is sufficient to store the sparse matrix [34, 40, 6] . The distributed solution for sparse matrix multiplication leads to significant network communication and network bandwidth is usually the bottleneck. As such, this operation requires a fast network to achieve performance. A supercomputer or a large cluster with a fast network is inaccessible or too expensive for many users.
On the other hand, a current trend for hardware design is to scale up a single machine for high performance computing. These machines typically have multiple processors with many CPU cores and a large amount of memory. They are also equipped with fast flash memory such as solid-state drives (SSDs) to further extend memory capacity. This conforms to the node design for supercomputers [3] .
We explore a solution that scales sparse matrix dense matrix multiplication (SpMM) on a multi-core machine with commodity solid-state drives (SSDs) and perform SpMM in semi-external memory (SEM). The concept of semi-external memory arose as a functional computing approach for graphs [1] in which the vertex state of a graph is stored in memory and the edges accessed from external memory. We introduce a similar construct for SpMM in which one or more columns of a dense matrix are kept in memory and the sparse matrix is accessed from external memory. In semi-external memory, we assume that the memory of a machine is sufficient to keep at least one column of the input dense matrix but is insufficient to hold the sparse matrix and the dense matrices in memory. Even though SpMM could be implemented with SpMV, such an implementation would fail to explore data locality in SpMM and result in higher I/O access in semi-external memory. We optimize SpMM directly to overcome these problems. Given fast SSDs, we demonstrate that the SEM solution uses the resources of a multi-core machine well and achieves performance comparable to state-of-art inmemory implementations for sparse matrix multiplication while increasing the scalability in proportion to the ratio of non-zero entries to rows or columns in a sparse matrix.
Although SSDs can deliver high IOPS and high sequential I/O throughput, we have to overcome many technical challenges to construct a sparse matrix multiplication implementation to achieve performance comparable to in-memory counterparts. Even though SSDs have high IOPS, their sequential I/O throughput is still significantly higher than random I/O [41] . SSDs are still an order of magnitude slower than DRAM in throughput. Furthermore, random writes are harmful to SSDs [25] . They increase write amplification, decrease I/O throughput, increase latency and shorten the lives of SSDs.
Semi-external memory provides a scalable SpMM solution that incorporates well with I/O access to SSDs, parallelization and in-memory optimizations. It streams the sparse matrix from SSDs for computation, which results in maximal I/O throughput from SSDs. It streams the output matrix to SSDs only once if the memory is insufficient to store the output matrix, resulting in the minimum amount of data written to SSDs and maximizing I/O throughput. We compress the sparse matrix to further accelerate retrieving the sparse matrix from SSDs. In the parallel setting, each thread streams its own partitions of the sparse matrix to memory and performs computation. We deploy multiple in-memory optimizations specifically designed for power-law graphs. For example, we assign partitions of the sparse matrix dynamically to threads for load balancing, deploy cache blocking to increase CPU cache hits, distribute the dense matrix to NUMA nodes to fully utilize the memory bandwidth of a NUMA machine, and organize the dense matrix in the row-major order to explore data locality in SpMM.
Our semi-external memory solution adapts to machines with different memory capacities. When the dense matrix is larger than memory, we split the dense matrix vertically into multiple partitions so that each partition can fit in memory. As such, the minimum memory requirement of our solution is O(n), where n is the number of rows in the dense matrix. By keeping more columns in the dense matrix in memory, we reduce I/O from SSDs in SpMM. When the number of columns in a dense matrix increases, SEM SpMM becomes CPU bound, instead of I/O bound on fast SSDs.
We develop three important applications in scientific computing and data mining with our SEM SpMM: PageRank [8] , eigensolver [5] and non-negative matrix factorization [18] . Each of them requires SpMM with different numbers of columns in dense matrices, resulting in different strategies of placing data in memory. With the three applications, we demonstrate data placement choices for different memory capacities in a machine and the impact of the memory size on the performance of the applications.
Our result shows that for real-world sparse graphs, our SEM sparse matrix multiplication can achieve almost 100% performance of our in-memory implementation on a large parallel machine with 48 CPU cores when the dense matrix has more than four columns. Even for sparse matrix vector multiplication, our SEM implementation achieves at least 65% performance of our in-memory implementation and significantly outperforms Trilinos [14] and MKL [26] . The applications implemented with our SpMM significantly outperform the state-of-art implementations of these applications. As such, we conclude that semi-external memory coupled with SSDs delivers an efficient solution for largescale sparse matrix multiplication. It can also serve as a building block and offers new design possibilities for largescale data analysis, replacing memory with larger, cheaper, more energy-efficient SSDs and processing bigger problems on fewer machines.
RELATED WORK
Recent sparse matrix multiplication studies focus on inmemory optimizations for sparse matrix vector multiplication (SpMV). Williams et al. [34] describe optimizations for SpMV in multicore architecture. Yoo et al. [40] and Boman et al. [6] optimize distributed SpMV for large scale-free graphs with 2D edge partitioning to reduce communication between machines. In contrast, Sparse matrix dense matrix multiplication (SpMM) receives less attention from the highperformance computing community. Even though SpMM can be implemented with SpMV, SpMV fails to explore data locality in SpMM. Aktulga et al. [2] optimize SpMM with cache blocking. We advance SpMM with a focus on optimizations for semi-external memory.
Abello et al. [1] introduced the semi-external memory algorithmic framework for graphs. Pearce et al. [27] implement several semi-external memory graph traversal algorithms for SSDs. FlashGraph [43] adopted the concept and performs graph algorithms with vertex state in memory and edge lists on SSDs. This work extends the semi-external memory concept to matrix operations.
Zhou et al. [44] implemented an LOBPCG [4] eigensolver in an SSD cluster. Their implementation targets nuclear many-body Hamiltonian matrices, which are much denser and have smaller dimensions than many sparse graphs. Therefore, their solution stores the sparse matrix on SSDs and keep the entire vector subspace in RAM. They focus on optimizations in the distributed environment. In contrast, our eigensolver based on our SEM SpMM stores both the sparse matrix and the vector subspace on SSDs due to the large number of vertices in our target graphs. We focus on external-memory optimizations in a single machine.
Anasazi [5] is an eigensolver framework in the Trilinos project [14] . This framework implements block extension of multiple eigensolver algorithms such as Block Krylov-Schur [29] , Block Davidson [4] and LOBPCG [4] . This is a very flexible framework that allows users to redefine sparse matrix dense matrix multiplication and dense matrix operations. By default, Anasazi uses the matrix implementations in Trilinos that runs in distributed memory.
Intel Math Kernel Library [26] is an efficient and parallel linear algebra library with matrix operations specifically optimized for Intel platforms. It provides an efficient sparse matrix multiplication optimized for regular sparse matrices. In contrast, our sparse matrix multiplication optimizes for power-law graphs with near-random vertex connection.
SPARSE MATRIX MULTIPLICATION
Sparse matrix multiplication leads to many random memory accesses and its performance is usually limited by random memory performance of DRAM. We perform sparse matrix multiplication in semi-external memory (SEM) to scale to a sparse matrix with billions of rows and columns. This strategy enables nearly in-memory performance while achieving scalability in proportion to the ratio of non-zero entries to rows or columns in the sparse matrix.
Semi-external memory
Our definition of semi-external memory for sparse matrix multiplication keeps the sparse matrix on SSDs and the input dense matrix or some columns of the input dense matrix in memory. During the computation, we stream data in the sparse matrix from SSDs to maximize I/O throughput.
There are two options for keeping the output dense matrix. In applications such as PageRank and many other graph algorithms, dense matrices have only a few columns, so we can keep the output dense matrix in memory for many machines. If a machine has insufficient memory to keep the output dense matrix, we can stream the output dense matrix to SSDs or to the subsequent computation to reduce memory consumption and potentially I/O as well.
In some applications such as non-negative matrix factorization (Section 4), even the input dense matrix cannot fit in memory. In this case, we partition the input dense matrix vertically so that each partition has complete columns of the original input dense matrix and can fit in memory. Each vertical partition stores elements in the row-major order to increase data locality. For each partition, we perform sparse matrix multiplication in semi-external memory as before and stream the output matrix to SSDs.
Sparse matrix format
To support efficient sparse matrix multiplication in semiexternal memory, we need to use an alternative format for sparse matrices to increase CPU cache hits and reduce the amount of data read from SSDs. The state-of-art numeric libraries store a sparse matrix in compressed row storage (CSR) or compressed column storage (CSC) format. However, these formats incur many CPU cache misses in sparse matrix multiplication on real-world graphs due to nearly random vertex connection. They also require a relatively large storage size. For a sparse matrix with billions of nonzero entries, we have to use eight bytes to store the row and column indices. For SEM sparse matrix multiplication, SSDs may become the bottleneck if a sparse matrix has a large storage size.
To increase CPU cache hits, we deploy cache blocking [15] and store non-zero entries of a sparse matrix in tiles ( Figure  1 ). When a tile is small, the rows from the input and out- put dense matrices involved in the multiplication with the tile are always kept in the CPU cache during the computation. The optimal tile size should fill the CPU cache with the rows of the dense matrices involved in the multiplication with the tile and is affected by the number of columns of the dense matrices, specified by applications. Instead of generating a sparse matrix with different tile sizes optimized for different numbers of columns in the dense matrices, we use a relatively small tile size and rely on the runtime system to optimize for different numbers of columns (Section 3.4). However, a small tile size potentially increases the storage size of the sparse matrix. In semi-external memory, the dense matrices usually have a very small number of columns in sparse matrix multiplication. Therefore, we use the tile size of 16K × 16K by default to balance the matrix storage size and the adaptibility to different numbers of columns.
To reduce the overall storage size of a sparse matrix, we use a compact format to store non-zero entries in a tile. In very sparse matrices many rows in a tile do not have any non-zero entries. The CSR (CSC) format requires an entry for each row (column) in the row (column) index. Therefore, the CSR or CSC format wastes space when storing elements in a tile. Instead, we only keep data for rows with nonzero entries in a tile shown in Figure 2 and refer to this format as SCSR (Super Compressed Row Storage). This format maintains a row header for each non-empty row. A row header has an identifier to indicate the row number, followed by column indices. The most significant bit of the identifier is always set to 1, while the most significant bit of a column index entry is always set to 0. As such, we can easily distinguish a row identifier from a column index entry and determine the end of a row. Thanks to the small size of a tile, we use two bytes to further store a row number and a column index entry to reduce the storage size. Since the most significant bit is used to indicate the beginning of a row, this format allows a maximum tile size of 32K × 32K.
Inside each cache tile of the SCSR, we use the coordinate format (COO) for the rows that have only a single nonzero entry. For the adjacency matrices of real-world graphs, many rows in a cache tile have only one non-zero entry, owing to the sparsity of the graphs and nearly random vertex connection. This design avoids many conditional jumps because iterating over single-entry rows in the SCSR format requires to test the end of a row for every non-zero entry, which leads to many conditional jumps. In contrast, COO is more suitable for storing these single-entry rows. It does not increase the storage size but significantly reduces the number of conditional jump instructions. As a result, we combine SCSR with COO and store non-zero entries in the COO format behind the row headers of SCSR ( Figure 2 ). All non-zero entries are stored together at the end of a tile.
Dense matrices
In many applications, the dense matrices in SpMM are tall-and-skinny matrices with millions or even billions of rows but only a small number of columns. The number of columns is determined by applications. In semi-external memory, we keep the input dense matrix in memory, so its size governs memory consumption of sparse matrix multiplication. To increase data locality in SpMM, the elements in the dense matrices are stored in row-major order.
For a non-uniform memory architecture (NUMA), we partition the input dense matrix horizontally and store partitions evenly across NUMA nodes. The NUMA architecture is prevalent in today's multi-processor servers, where each processor connects to its own memory banks. Therefore, keeping partitions evenly across all NUMA nodes helps to fully utilize the bandwidth of memory and inter-processor links. For horizontal partitioning, we assign multiple contiguous rows in a row interval to a partition, which is assigned to a NUMA node. A row interval in a partition always has 2 i rows for efficiently locating a row with bit operations. The row interval size is multiple of the tile size of a sparse matrix so that multiplication on a tile only needs to access rows from a single row interval.
Runtime CPU optimizations
We perform runtime optimizations to speed up sparse matrix dense matrix multiplication. Runtime optimizations further increase CPU cache hits and allows more efficient computation and memory access.
To better utilize CPU cache, we process tiles of a partition in super tiles (Figure 1 ). The tile size of a sparse matrix is specified when the sparse matrix image is created and is relatively small to handle different numbers of columns in the dense matrices. A super tile is composed of tiles from multiple tile rows and its size is determined at runtime by three factors: the number of columns in the dense matrices, the CPU cache size and the number of threads that share the CPU cache. An optimal size for a super tile fills the CPU cache with the rows from the dense matrices involved in the computation with the super tile.
In spite of nearly random edge connection in a real-world graph, we can explore regularity in SpMM with row-major dense matrices to improve performance because in this case we multiply a non-zero entry with all elements in a row of the input dense matrix and add the results to the corresponding row of the output dense matrix. These operations can be accomplished by the vector CPU instructions, such as AVX [23] to enable more efficient memory access and computation. The current implementation relies on GCC's auto-vectorization to translate the C code to the vector CPU instructions by predefining the matrix width in the code.
I/O optimizations
Semi-external memory sparse matrix multiplication results in sequential I/O. For accessing data in fast SSDs sequentially, the overhead of operating systems such as thread context switch and memory allocation becomes noticeable. We need to tackle these obstacles in order to achieve the maximal I/O throughput of fast SSDs.
An application thread issues asynchronous I/O and polls for I/O to avoid thread context switches because the latency of a context switch can undermine the sequential I/O throughput of a high-speed SSD array. When a thread issues I/O and waits for I/O completion, the operating system switches the thread out; the operating system reschedules the thread for execution once I/O is complete. However, there is latency for thread rescheduling and the latency from frequent rescheduling can cause noticeable performance degradation on a high-speed SSD array. As such, we use I/O polling to avoid a thread from being switched out after the thread completes all computation available to it.
When accessing a sparse matrix or a dense matrix from SSDs, we maintain a set of memory buffers for I/O access to reduce the overhead of memory allocation. We use large I/O to access matrices on SSDs to increase I/O throughput. The operating system usually allocates a large memory buffer with mmap() and populates the buffer with physical pages when it is used. It is computationally expensive to populate large memory buffers frequently. When accessing high-throughput I/O devices, such overhead can cause substantial performance loss. Therefore, we keep a set of memory buffers allocated previously and reuse them for new I/O requests. For accessing a sparse matrix, tile rows usually have different sizes, so we resize a previously allocated memory buffer if it is too small for a new I/O request.
Parallelization
When parallelizing sparse matrix multiplication, we take into account semi-external memory and the power-law distribution of non-zero entries in each row of the sparse matrix.
We partition a sparse matrix horizontally for parallelization ( Figure 1 ) to reduce thread synchronization and memory consumption. With horizontal partitioning, a thread only needs to read tile rows assigned to it from SSDs and processes them independently. Once the tile rows are ready in memory, the thread multiplies the tile rows with the input dense matrix. Horizontal partitioning also significantly reduces memory consumption. For example, horizontal partitioning ensures that only the thread assigned tile rows need to allocate local memory buffers to store the intermediate results when going through all the tiles in a partition. This is essential to reduce remote memory access in a NUMA machine. Horizontal partitioning also allows a small matrix index to locate non-zero entries in the sparse matrix quickly, because we only need to locate tile rows. As such, the matrix index is tiny even for a sparse matrix with billions of rows and we can keep the entire matrix index in memory.
We maintain a global task queue for sparse matrix multiplication to achieve better load blancing and large I/O writes to SSDs. A task may indicate the computation on a super tile row or a single tile row. A thread gets one task at a time from the queue. To achieve good load balancing, threads get larger tasks at the beginning and smaller tasks when the computation gets close to the end. The global task queue maintains the global execution order, which becomes essential when the output dense matrix needs to be written to SSDs. When a thread gets a task, the computation result from the task may be small. Instead of writing the computation result immediately, we delay writes in order to merge computation results from multiple threads and write them with a single I/O. The global task queue to ensures that all threads are processing contiguous tile rows to help I/O merging.
The impact of the memory size on I/O
More memory reduces I/O in semi-external memory. The minimum memory requirement for semi-external memory sparse matrix multiplication is nc + t , where n is the number of rows of the input dense matrix, c is the element size in bytes, t is the number of threads processing the sparse matrix and is the buffer size for the sparse matrix and the output dense matrix. When a machine does not have sufficient memory to keep the entire input dense matrix in memory, we need multiple passes on the sparse matrix to perform sparse matrix multiplication. Reducing memory consumption is essential to achieve performance in semi-external memory. By keeping more columns of the input dense matrix in memory, we reduce the number of I/O passes.
When a machine does not have sufficient memory to keep the entire input dense matrix, we use the existing memory to keep as many columns in the input dense matrix in memory as possible. Although we can use some memory to cache part of the sparse matrix, keeping more columns of the input dense matrix in memory saves more I/O than using the same amount of memory to cache the sparse matrix. Assume the input dense matrix has n rows and k columns. Again, c is the element size in bytes. The storage size of the sparse matrix is E bytes and the memory size is M bytes. We further assume we use M bytes to keep some columns of the dense matrices in memory (M < M , nck mod M ≡ 0) and the remaining memory (M − M ) to cache the sparse matrix. The amount of data in the sparse matrix read from SSDs is
Because E > M in semi-external memory, we minimize IOin by maximizing M . Therefore, using memory for the input dense matrix always results in a smaller amount of I/O than using memory for caching the sparse matrix.
As the number of columns in memory from the input dense matrix increases, the bottleneck of the system may switch. When we keep only one column of the input dense matrix in memory, the system is usually I/O bound; when we can keep more columns of the dense matrix in memory, the system will become CPU bound. Once sparse matrix multiplication becomes CPU bound, the I/O complexity does not affect its performance.
I/O complexity
The semi-external memory (SEM) solution for sparse matrix multiplication leads to no more I/O than the externalmemory (EM) solution for many real-world graphs.
When a machine has sufficient memory to keep the entire input dense matrix in memory, the SEM solution only needs to read the sparse matrix and the input dense matrix once and write the output dense matrix once. This is the minimum amount of I/O.
When a machine has insufficient memory to keep the input dense matrix, the SEM solution still leads to less I/O than the EM solution when E < nckt. In the analysis, we assume a square sparse matrix. The same analysis applies to a rectangular sparse matrix as well. In this case, the SEM solution scans the sparse matrix multiple times.
To minimize writes, the EM solution scans the sparse matrix once but reads the input dense matrix multiple times. Due to near random vertex connection in real-world graphs, the EM solution needs to read the entire input dense matrix each time. In the parallel setting, the EM solution requires each thread to keep local memory buffers for portions of the input and output dense matrices. Assume the EM solution keeps j rows from the input dense matrix and i rows from the output dense matrix in memory in each thread.
As such, the SEM solution in practice causes less I/O in many natural graphs. For the natural graphs that we have seen, such as Twitter [16] , the Page graph [33] and Friendster [36] , the number of edges is of 10 − 100× the number of vertices. Essentially, natural graphs have sparse edge matrices. We target multi-core machines with 10s to 100s of threads. For most of our applications, k is of size 1-30. For very small k, the SEM solution can keep the entire input dense matrix in memory and leads to the minimum I/O. For a relatively larger k, E < nckt holds for most of natural graphs when the graphs are processed in a large parallel machine. Therefore, the SEM solution usually performs less I/O than EM.
APPLICATIONS
We apply sparse matrix multiplication to three important applications widely used in data mining and machine learning: PageRank [8] , eigensolver [5] and non-negative matrix factorization [18] . Each application demonstrates a different strategy of using memory for sparse matrix multiplication.
PageRank
PageRank is an algorithm to rank the Web pages by using hyperlinks between Web pages. It was first used by Google and is identified as one of the top 10 data mining algorithms [35] . PageRank is a representative of a set of graph algorithms that can be expressed with sparse matrix multiplication or generalized sparse matrix multiplication. Other important examples are label propagation [45] and belief propagation [38] . The algorithm runs iteratively and its update rule for each Web pages in an iteration is
where B(u) denotes the neighbor list of vertex u and L(v) denotes the out-degree of vertex v. We implement the PageRank algorithm with sparse matrix multiplication ( Figure 3 ).
w h i l e ( L 1 >= e p s i l o n && n i t e r s < max . n i t e r s ) { p r 2 <− d / N+(1−d ) * ( A % * % ( p r 1 / o ut . deg ) ) L 1 <− sum ( abs ( pr1−p r 2 ) ) p r 1 <− p r 2 n i t e r s <− n i t e r s + 1 } Figure 3 : The implementation of PageRank [8] with SpMM.
Eigensolver
An eigensolver is another commonly used application that requires sparse matrix multiplication. Many algorithms [17, 9, 29] and frameworks [19, 5, 13] have been developed to solve a large eigenvalue problem.
We take advantage of the Anasazi eigensolver framework [5] and replace its original matrix operations with our SEM sparse matrix multiplication and external-memory dense matrix operations. To compute eigenvalues of a n × n matrix, many eigenvalue algorithms for a large sparse matrix require to construct a vector subspace with a sequence of sparse matrix multiplication and each vector in the subspace has the length of n. Due to the sparsity of real-world graphs, the vector subspace is large and we keep vectors in the subspace on SSDs. In addition to sparse matrix multiplication, eigensolvers perform some dense matrix operations on the subspace. For example, eigensolvers need to orthogonalize the vectors in the subspace with dense matrix multiplication. The Anasazi eigensolvers have block extension to update multiple vectors in the subspace simultaneously and require sparse matrix dense matrix multiplication. The most efficient Anasazi eigensolver on sparse graphs is the KrylovSchur eigensolver [29] , which updates a small number of vectors (1-4) in the subspace simultaneously. Zheng et al. [42] provides the details of extending the Anasazi eigensolver with external-memory matrix operations.
Non-negative matrix factorization
Non-negative matrix factorization (NMF) [18] finds two non-negative low-rank matrices W and H to approximate a matrix A ≈ W H. NMF is typically used to find factorization on sparse matrices. NMF has many applications in machine learning and data mining. A well-known example is collaborative filtering [30] in recommender systems. NMF can also be applied to graphs to find communities [37, 32] .
Many algorithms are designed to solve NMF and here we describe an algorithm [18] that requires a sequence of sparse matrix multiplication. The algorithm uses multiplicative update rules and updates matrices W and H alternately. In each iteration, the algorithm first fixes W to update H and then fixes H to update W .
We apply SEM sparse matrix multiplication to NMF differently based on the memory size and the number of columns in W and H. Due to the sparsity of a graph, it is possible that the non-negative matrices W and H may require the storage as large as the sparse matrix and can no longer fit in memory. Therefore, we need to partition W and H vertically and run multiple sparse matrix multiplications to compute W T A and AH T , if the memory is not large enough to keep W and H.
Graph datasets # Vertices # Edges Directed
Twitter [16 
EXPERIMENTAL EVALUATION
We evaluate the performance of the semi-external memory sparse matrix multiplication on multiple real-world billionscale graphs including a web-page graph with 3.4 billion vertices. We first measure the performance of our semi-external memory implementation and compare it with multiple inmemory implementations: (i) our in-memory implementation, (ii) MKL and (iii) Trilinos. We also demonstrate the effectiveness of CPU and I/O optimizations on sparse matrix multiplication. We then evaluate the overall performance of the applications in Section 4 and demonstrate the impact of the memory size on the applications.
We conduct experiments on a non-uniform memory architecture machine with four Intel Xeon E7-4860 processors, clocked at 2.6 GHz, and 1TB memory of DDR3-1600. Each processor has 12 cores. The machine has three LSI SAS 9300-8e host bus adapters (HBA) connected to a SuperMicro storage chassis, in which 24 OCZ Intrepid 3000 SSDs are installed. The 24 SSDs together are capable of delivering 12 GB/s for read and 10 GB/s for write at maximum. The machine runs Linux kernel v3.13.0. We use 48 threads for our in-memory and semi-external implementation.
We use the adjacency matrices of the graphs in Table 1 for performance evaluation. The smallest graph we use has 42 million vertices and 1.5 billion edges. The largest graph is the Page graph with 3.4 billion vertices and 129 billion edges, which is two orders of magnitude larger than the smallest graphs. We generate two synthetic graphs with R-Mat [10] to fill the size gap between the smallest and largest graph. We construct a directed and undirected version for each of the synthetic graphs because some applications in Section 4 run on directed graphs and some run on undirected graphs. We always use the undirected version for performance evaluation of sparse matrix multiplication. The Page graph is clustered by domain, generating good CPU cache hit rates in sparse matrix multiplication.
The performance of sparse matrix multiplication
We evaluate the performance of our semi-external memory sparse matrix multiplication (SEM-SpMM) and compare its performance with our in-memory sparse matrix multiplication (IM-SpMM) and other state-of-art in-memory sparse matrix multiplication implementations including the ones in Intel MKL and Trilinos on the graphs in Table 1 . SEMSpMM always accesses the sparse matrix from SSDs while IM-SpMM accesses all data in memory. The MKL and Trilinos implementations cannot run on the Page graph because its size exceeds memory capacity of our NUMA machine.
SEM-SpMM vs. IM-SpMM
We first compare the performance of SEM-SpMM against IM-SpMM on all graphs with the input and output dense matrices stored in memory. In this case, the dense matrices There is only a small performance penalty for semi-external memory ( Figure 4 ). The performance gap between IMSpMM and SEM-SpMM is affected by randomness of vertex connection. The gap is smaller if vertex connection in a graph is more random. The Page graph is relatively well clustered, so SpMM on this graph is less CPU-bound than others. Even for the Page graph, SEM-SpMM gets 65% performance of IM-SpMM. The other factor of affecting the performance gap is the number of columns in the dense matrices. The gap gets smaller as the number of columns in the dense matrices increases. For all graphs, SEM-SpMM requires a very small number of columns to become CPUbound and achieve 100% performance of IM-SpMM.
SEM-SpMM vs. other in-memory SpMM
In this section, we compare SEM-SpMM with the Intel MKL and Trilinos implementations. Intel MKL runs on shared-memory machines. Trilinos can run in both shared memory and distributed memory, so we measure its performance in our 48-core NUMA machine as well as an EC2 cluster. We run Trilinos in the largest EC2 instances r3.8xlarge, where each has 16 physical CPU cores and 244GB of RAM and is optimized for memory-intensive applications. The EC2 instances are connected with 10Gbps network.
Our SEM-SpMM significantly outperforms Intel MKL and Trilinos on the natural graphs on our NUMA machine (Figure 5) . In this case, we compare performance of our SEMSpMM with Intel MKL and Trilinos for both sparse matrix vector multiplication (SpMV) and sparse matrix dense matrix multiplication (SpMM). The Trilinos implementation is optimized for SpMV. Our SEM-SpMM still constantly outperforms Trilinos by a factor of 2 − 3 even for SpMV. The MKL implementation has better optimizations for SpMM, but our SEM-SpMM can still almost twice as fast as MKL when the dense matrix has eight columns.
SEM-SpMM only consumes a small fraction of memory compared with IM-SpMM and other SpMM implementations ( Figure 6 ). SEM-SpMM consumes memory for the input dense matrix as well as per-thread local memory buffers for the sparse matrix and the output dense matrix. When we use 48 threads for SpMM, the memory used by local memory buffers in each thread is significant but is relatively constant for different graph sizes. As such, we only show the memory consumption on the largest graph RMAT-160 in Figure  5 . Despite considerable memory consumed by local memory buffers for SEM-SpMM, SEM-SpMM uses about one tenth of the memory used by IM-SpMM. We also observe that IMSpMM consumes much less memory than MKL and Trilinos owing to its compact format for sparse matrices.
Our SEM-SpMM also outperforms Trilinos that runs in the Amazon cloud by a large margin for SpMV, especially on real-world graphs (Figure 7 ). When Trilinos runs on 8 EC2 instances, it has 2.5 times as many CPU cores as our SEM-SpMM on the NUMA machine. We do not compare SEM-SpMM with Trilinos for SpMM because Trilinos is not optimized for SpMM as shown above. Trilinos is not able to run SpMV on RMAT-160 on two EC2 nodes. One of the main reasons that our SEM-SpMM performs much better on real-world graphs is that these graphs are more likely to cause load imbalance and our SEM-SpMM is able to balance load much better than distributed implementations that partition data.
SEM-SpMM with a large input dense matrix
We further measure the performance of SEM-SpMM with a large input dense matrix, in which neither the sparse matrix nor the dense matrices can fit in memory. In this experiment, we measure the performance of multiplying a sparse matrix with a dense matrix of 32 columns and the input dense matrix is stored on SSDs initially. We study the impact of memory size on the performance of SEM-SpMM by artificially varying the number of columns that can fit in memory. In each run, we need to load the input dense matrix from SSDs and stream the output dense matrix to SSDs. We do not show the result on the Page graph because the dense matrix with 32 columns for the Page graph cannot fit in memory.
As more columns in the input dense matrix can fit in memory, the performance of SEM-SpMM constantly increases (Figure 8 gets 25% of the in-memory performance. When the entire input dense matrix can fit in memory, we get about 80% of the in-memory performance. Two main factors lead to performance loss in SEM-SpMM when the input dense matrix cannot fit in memory. We illustrate the contribution of four potential overheads in SEM-SpMM on the Friendster graph ( Figure 9 ). The main performance loss comes from the loss of data locality in SpMM caused by vertical partitioning of the input dense matrix (Vert-part). Partitioning the dense matrix into onecolumn matrices contributes 60% of performance loss. It drops quickly when the vertical partition size increases. Keeping the sparse matrix on SSDs (SpM-EM) also contributes some performance loss when the dense matrix is partitioned into small matrices. The overhead almost goes away when more than four columns of the dense matrix can fit in memory. The overhead of streaming the output dense matrix to SSDs (Out-EM) and reading the input dense matrix to memory (In-EM) is less significant but remains the same for different memory sizes.
Optimizations on sparse matrix multiplication
Accelerating SEM-SpMM requires both computation and I/O optimizations. Due to the limit of space, we only illustrate the effectiveness of computation optimizations.
Here we illustrate the most significant optimizations from Section 3. We start with an in-memory implementation that performs sparse matrix multiplication on a sparse matrix in the CSR format and apply the optimizations incrementally in the following order:
• dispatch partitions of a sparse matrix to threads dynamically to balance load (Load balance), • partition dense matrices for NUMA (NUMA), • organize the non-zero entries in a sparse matrix into tiles to increase CPU cache hits (Cache blocking), of columns in the dense matrices. The largest performance boost is from cache blocking, especially for SpMV. This is expected because the main overhead of SpMV comes from random memory access and cache blocking significantly increases CPU cache hits to reduce random memory access. CPU vectorization is only effective on SpMM because it optimizes computation on a row of the dense matrix. With all optimizations, we have a fast in-memory implementation for both sparse matrix vector multiplication and sparse matrix dense matrix multiplication.
Performance of the applications
We evaluate the performance of the implementations of the applications in Section 4. We show the effectiveness of additional memory for these applications in our implementation and compare their performance with the state-of-art implementations on smaller graphs.
PageRank
We evaluate the performance of our SpMM-based PageRank implementation (SpMM-PageRank). This implementation requires the input vector to be in memory, but it is optional to keep the output vector and the degree vector in memory. PageRank is a benchmarking graph algorithm implemented by many graph processing frameworks. We compare the performance of SpMM-PageRank with the stateof-art implementations in FlashGraph [43] , a semi-external memory graph engine, and GraphLab create, the next generation of PowerGraph [12] . The PageRank implementation in FlashGraph computes approximate PageRank values while SpMM-PageRank and GraphLab Create computes exact PageRank values. We run GraphLab Create completely in memory and run FlashGraph in semi-external memory. GraphLab Create is not able to compute PageRank on the Page graph.
SpMM-PageRank in memory and in semi-external memory both significantly outperforms the implementations in FlashGraph and GraphLab Create (Figure 11 The experiment results also show that keeping more vectors in memory has modest performance improvement for SpMM-PageRank. As such, SpMM-PageRank only needs to keep one vector in memory, which results in very small memory consumption.
Eigensolver
We evaluate the performance of our SEM KrylovSchur eigensolver and compare its performance with our in-memory eigensolver and the Trilinos KrylovSchur eigensolver. Usually, spectral analysis only requires a very small number of eigenvalues, so we compute eight eigenvalues in this experiment. We run the eigensolvers on the smaller undirected graphs in Table 1 . To evaluate the scalability of the SEM eigensolver, we compute singular value decomposition (SVD) on the Page graph. Among all of the eigensolvers, only our SEM eigensolver is able to compute eigenvalues on the Page graph.
For computing 8 eigenvalues, our SEM eigensolver achieves performance comparable to our in-memory eigensolver and the Trilinos eigensolver and can scale to very large graphs ( Figure 12 ). Unlike PageRank, an eigensolver has many more vector or dense matrix operations. As such, the memory size has noticeable impact on performance. For the setting with the minimum memory consumption, it has at least 45% performance of our in-memory eigensolver; when keeping the entire subspace in memory, it has almost the same performance as our in-memory eigensolver.
NMF
We evaluate the performance of our NMF implementation (SEM-NMF) on the directed graphs in Table 1 . The dense matrices for NMF can be as large as the sparse matrix. As such, we experiment the effect of the memory size on the performance of SEM-NMF by varying the number of columns in memory from the dense matrices. We also compare the performance of SEM-NMF with a high-performance NMF implementation SmallK [7] , built on top of the numeric library Elemental [28] . We factorize each of the graphs into two n × k non-negative dense matrices and we use k = 16 because 16 is the largest k that SmallK supports for the graphs in Table 1 .
We significantly improve the performance of SEM-NMF by keeping more columns of the input dense matrix in memory ( Figure 13 ). The performance improvement is more significant when the number of columns that fit in memory is small. When we keep eight columns of the input dense matrix in memory, SEM-NMF achieves over 60% of the performance of the in-memory implementation.
SEM-NMF significantly outperforms other NMF implementations in the literature. SmallK is the closest competitor in the literature. We run the same NMF algorithm in SmallK. As shown in Figure 13 , SEM-NMF outperforms SmallK by a large factor on all graphs. There are many MapReduce implementations in the literature [21, 39, 22] . They run on sparse matrices with tens of millions of nonzero entries but generally take one or two orders of magnitude more time than our SEM-NMF on the sparse matrices with billions or even tens of billions of non-zero entries.
CONCLUSIONS
We present an alternative solution for scaling sparse matrix multiplication to large sparse matrices by utilizing commodity SSDs. We perform sparse matrix multiplication in semi-external memory (SEM), in which we keep the sparse matrix on SSDs and the dense matrices in memory. Semiexternal memory increases scalability in proportion to the ratio of non-zero entries to rows or columns in a sparse matrix. It incorporates well with many memory optimizations for sparse matrix multiplication such as cache blocking and NUMA organization. We also deploy a set of I/O optimizations for high-speed SSDs such as I/O polling and preallocating memory buffers for I/O.
Our SEM sparse matrix multiplication achieves performance comparable to our highly optimized in-memory implementation while significantly outperforming the Intel MKL and Trilinos implementations. Our SEM implementation achieves almost 100% performance of the in-memory implementation on some graphs when the dense matrices can fit in memory and have more than four columns. Even when the dense matrix has only one column, it achieves at least 65% performance of its in-memory counterpart on different graphs. Our SEM sparse matrix multiplication also scales to very large graphs with billions of vertices and hundreds of billions of edges.
For a machine with insufficient memory to keep the entire input dense matrix in memory, we partition the dense matrix vertically and run SEM SpMM multiple times. In this case, the main overhead of SEM SpMM comes from the loss of data locality caused by vertical partitioning on the dense matrix. However, given sufficient memory to keep a small number of columns of the input dense matrix, we can achieve performance comparable to the in-memory counterpart.
We apply our sparse matrix multiplication to three important applications: PageRank, eigendecomposition and nonnegative matrix factorization. We demonstrate how additional memory should be used in semi-external memory in each application. We further demonstrate that each of our implementations significantly outperforms the state of art and can scale to very large graphs.
Through the thorough evaluation, we demonstrate that semi-external memory coupled with fast SSDs achieves performance very close to highly optimized in-memory implementations and scales to massive datasets. As such, our approach provides a very promising alternative to distributed computation for large-scale data analysis.
Our SSD-based solution also achieves very high energy efficiency even though we have not measured energy consumption explicitly. SSDs are energy-efficient storage media [31] compared with RAM and hard drives. When processing large datasets, our solution only uses a single machine and requires a relatively small amount of memory. In contrast, a distributed solution requires many more machines and much more aggregate memory in order to process datasets of the same size. As such, our solution introduces an energyefficient architecture for large-scale data analysis tasks.
ACKNOWLEDGMENTS
This work is partially supported by NSF ACI-1261715, DARPA GRAPHS N66001-14-1-4028 and DARPA SIMPLEX program through SPAWAR contract N66001-15-C-4041.
