CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
Sparse matrix-vector multiplication (SpMV) is a fundamental building block
for numerous applications. In this paper, we propose CSR5 (Compressed Sparse
Row 5), a new storage format, which offers high-throughput SpMV on various
platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is
insensitive to the sparsity structure of the input matrix. Thus the single
format can support an SpMV algorithm that is efficient both for regular
matrices and for irregular matrices. Furthermore, we show that the overhead of
the format conversion from the CSR to the CSR5 can be as low as the cost of a
few SpMV operations. We compare the CSR5-based SpMV algorithm with 11
state-of-the-art formats and algorithms on four mainstream processors using 14
regular and 10 irregular matrices as a benchmark suite. For the 14 regular
matrices in the suite, we achieve performance comparable to or better than
the previous work. For the 10 irregular matrices, CSR5 obtains average
performance improvements of 17.6%, 28.5%, 173.0% and 293.3% (up to 213.3%,
153.6%, 405.1% and 943.3%) over the best existing work on dual-socket Intel
CPUs, an NVIDIA GPU, an AMD GPU and an Intel Xeon Phi, respectively. For
real-world applications such as a solver with only tens of iterations, the CSR5
format can be more practical because of its low overhead for format conversion.
The source code of this work is downloadable at
https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5
Comment: 12 pages, 10 figures, In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15)
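For reference, the CSR baseline that CSR5 is converted from can be sketched in a few lines. This is plain CSR SpMV, not the CSR5 tile/segmented-sum algorithm described in the paper, and the function name is purely illustrative:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """Baseline CSR sparse matrix-vector product y = A @ x.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i. CSR5
    reorganizes exactly this data into 2D tiles so that work per
    thread stays balanced regardless of individual row lengths.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

# 3x3 example matrix: [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

The row-oriented loop above is exactly what makes plain CSR sensitive to sparsity structure: one very long row serializes onto one thread, which is the imbalance CSR5's tiling removes.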
Optimizing Sparse Matrix-Vector Multiplications on an ARMv8-based Many-Core Architecture
Sparse matrix–vector multiplications (SpMV) are common in scientific and HPC applications but are hard to optimize. While ARMv8-based processor IP is emerging as an alternative to traditional x64 HPC processor designs, there is little study of SpMV performance on such new many-cores. To design efficient HPC software and hardware, we need to understand how well SpMV performs. This work develops a quantitative approach to characterize SpMV performance on a recent ARMv8-based many-core architecture, Phytium FT-2000 Plus (FTP). We perform extensive experiments involving over 9,500 distinct profiling runs on 956 sparse datasets and five mainstream sparse matrix storage formats, and compare FTP against the Intel Knights Landing many-core. We show experimentally that picking the optimal sparse matrix storage format and parameters is non-trivial, as the correct decision requires expert knowledge of both the input matrix and the hardware. We address this problem by proposing a machine learning based model that predicts the best storage format and parameters from input matrix features. The model automatically specializes to the many-core architectures we considered. The experimental results show that our approach achieves on average 93% of the best-available performance without incurring runtime profiling overhead.
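The kind of decision the learned model automates can be illustrated with a hand-written heuristic over simple row-length features. The feature set, thresholds, and the CSR-vs-ELL choice below are hypothetical stand-ins for the trained predictor, not the features or formats used in the paper:

```python
def matrix_features(row_ptr):
    """Per-row nonzero statistics -- typical inputs to a format predictor."""
    nnz_per_row = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    mean = sum(nnz_per_row) / len(nnz_per_row)
    var = sum((n - mean) ** 2 for n in nnz_per_row) / len(nnz_per_row)
    return mean, var, max(nnz_per_row)

def pick_format(row_ptr):
    """Hypothetical heuristic: regular rows favor ELL, skewed rows favor CSR."""
    mean, var, longest = matrix_features(row_ptr)
    if var < 1.0 and longest <= 2 * mean:
        return "ELL"   # uniform row lengths: padded format wastes little space
    return "CSR"       # irregular rows: compact row pointers win

print(pick_format([0, 2, 4, 6]))   # uniform rows -> "ELL"
print(pick_format([0, 1, 2, 50]))  # one dense row -> "CSR"
```

A learned model replaces the hand-picked thresholds with splits fitted to profiling data, which is what lets it specialize automatically to each target architecture.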
Learning Compact Compositional Embeddings via Regularized Pruning for Recommendation
Latent factor models are the dominant backbones of contemporary recommender
systems (RSs) given their performance advantages, where a unique vector
embedding with a fixed dimensionality (e.g., 128) is required to represent each
entity (commonly a user/item). Due to the large number of users and items on
e-commerce sites, the embedding table is arguably the least memory-efficient
component of RSs. For any lightweight recommender that aims to efficiently
scale with the growing size of users/items or to remain applicable in
resource-constrained settings, existing solutions either reduce the number of
embeddings needed via hashing, or sparsify the full embedding table to switch
off selected embedding dimensions. However, as hash collisions arise or
embeddings become overly sparse, especially when adapting to a tighter memory
budget, those lightweight recommenders inevitably have to compromise their
accuracy. To this end, we propose a novel compact embedding framework for RSs,
namely Compositional Embedding with Regularized Pruning (CERP). Specifically,
CERP represents each entity by combining a pair of embeddings from two
independent, substantially smaller meta-embedding tables, which are then
jointly pruned via a learnable element-wise threshold. In addition, we
design a regularized pruning mechanism in CERP, such that the two
sparsified meta-embedding tables are encouraged to encode information that is
mutually complementary. Given its compatibility with arbitrary latent factor
models, we pair CERP with two popular recommendation models for extensive
experiments, where results on two real-world datasets under different memory
budgets demonstrate its superiority against state-of-the-art baselines. The
codebase of CERP is available at https://github.com/xurong-liang/CERP.
Comment: Accepted by ICDM'2
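The compositional idea can be sketched as follows. The quotient/remainder indexing, the additive combination, and the hard magnitude threshold are illustrative simplifications of CERP's learnable element-wise threshold, and all names and sizes here are hypothetical:

```python
import numpy as np

def compose_embedding(entity_id, Q, P, threshold):
    """Sketch of a compositional embedding with element-wise pruning.

    Q and P are two small, independent meta-embedding tables. An entity
    is mapped to one row of each (here via quotient/remainder indexing),
    the rows are pruned element-wise, and the results are combined.
    """
    q = Q[entity_id // P.shape[0]]
    p = P[entity_id % P.shape[0]]
    # magnitude pruning: zero out entries below the (learned) threshold
    q = np.where(np.abs(q) > threshold, q, 0.0)
    p = np.where(np.abs(p) > threshold, p, 0.0)
    return q + p  # combine the two pruned meta-embeddings

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 8))  # 10 + 10 rows can cover 100 entities,
P = rng.normal(size=(10, 8))  # versus 100 rows for a full embedding table
e = compose_embedding(42, Q, P, threshold=0.5)
print(e.shape)  # (8,)
```

The memory saving comes from the composition: two tables of 10 rows each address 100 distinct entities, and pruning shrinks the footprint further under a tight budget.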
Characterizing Scalability of Sparse Matrix–Vector Multiplications on Phytium FT-2000+
Understanding the scalability of parallel programs is crucial for software optimization and hardware architecture design. As HPC hardware moves towards many-core designs, it becomes increasingly difficult for a parallel program to make effective use of all available processor cores, which makes scalability analysis increasingly important. This paper presents a quantitative study characterizing the scalability of sparse matrix–vector multiplications (SpMV) on Phytium FT-2000+, an ARM-based HPC many-core architecture. We choose SpMV because it is a common operation in scientific and HPC applications. Due to the newness of ARM-based many-core architectures, there is little work on understanding SpMV scalability on such hardware designs. To close this gap, we carry out a large-scale empirical evaluation involving over 1,000 representative SpMV datasets. We show that, while many computation-intensive SpMV applications contain extensive parallelism, achieving a linear speedup is non-trivial on Phytium FT-2000+. To better understand which software and hardware parameters are most important in determining the scalability of a given SpMV kernel, we develop an analytical performance model based on regression trees. We show that our model is highly effective in characterizing SpMV scalability, offering useful insights to help application developers better optimize SpMV on this emerging HPC architecture.
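The basic quantities such a scalability study measures can be sketched as follows; the timing numbers are invented for illustration, not taken from the paper:

```python
def scaling_profile(times):
    """Speedup and parallel efficiency from per-core-count run times.

    `times` maps core count -> measured wall time; the single-core
    entry is the baseline. Sub-linear efficiency at high core counts
    is the behavior a scalability model tries to explain from kernel
    and hardware features.
    """
    t1 = times[1]
    return {p: (t1 / t, (t1 / t) / p) for p, t in sorted(times.items())}

# hypothetical measurements (seconds) for one SpMV kernel
profile = scaling_profile({1: 64.0, 8: 9.0, 32: 3.2, 64: 2.5})
for cores, (speedup, eff) in profile.items():
    print(f"{cores:3d} cores: speedup {speedup:5.1f}, efficiency {eff:.2f}")
```

With these numbers the speedup at 64 cores is 25.6x, i.e. 40% efficiency: plenty of parallelism, but far from the linear speedup the paper notes is hard to reach in practice.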