
    swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

    The flourishing of deep learning frameworks and hardware platforms has created demand for an efficient compiler that can shield the diversity of both software and hardware and thereby provide application portability. Among existing deep learning compilers, TVM is well known for its efficient code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor has emerged as a competitive candidate thanks to its attractive computational power in both scientific and deep learning applications. This paper combines these two trends. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfers, and local device memory for data locality, in order to generate efficient code for deep learning applications on Sunway. The experimental results show that swTVM can automatically generate code for various deep neural network models on Sunway. The code automatically generated by swTVM for AlexNet and VGG-19 achieves average speedups of 6.71x and 2.45x over hand-optimized OpenACC implementations on convolution and fully connected layers, respectively. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and high-performance architectures with both productivity and efficiency in mind. We would like to open-source the implementation so that more people can embrace the power of deep learning compilers and the Sunway many-core processor.
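    For readers unfamiliar with the workflow, the sketch below shows what ahead-of-time cross-compilation looks like in the mainline TVM Relay Python API. The swTVM Sunway backend described above is not part of mainline TVM, so the target triple and cross-compiler path here are illustrative stand-ins, not the paper's actual backend.

```python
# Minimal sketch: cross-compiling a tiny Relay model with mainline TVM.
# The Sunway backend is not in mainline TVM; the aarch64 triple and the
# cross-compiler path below are hypothetical placeholders.
import tvm
from tvm import relay
from tvm.contrib import cc

# Build a toy one-layer network in Relay.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(64, 3, 7, 7), dtype="float32")
conv = relay.nn.conv2d(data, weight, strides=(2, 2), padding=(3, 3))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Ahead-of-time code generation happens on the host; the shared library
# is then produced by an external cross-toolchain.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-gnu")  # stand-in triple
lib = relay.build(mod, target=target)
lib.export_library("net.so",
                   fcompile=cc.cross_compiler("aarch64-linux-gnu-gcc"))
```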

    MG3MConv: Multi-Grained Matrix-Multiplication-Mapping Convolution Algorithm toward the SW26010 Processor

    As the core of artificial intelligence applications, convolution has become a hot research topic in high-performance computing. With the rapid adoption of the emerging SW26010 processor for artificial intelligence, there is an urgent need for high-performance convolution algorithms on this processor. However, current support for convolution on the SW26010 remains rudimentary: existing studies deliver sufficient peak runtime performance but lack adaptability to diverse convolution scenarios. To improve convolution algorithms on the SW26010, we propose a multi-grained matrix-multiplication-mapping convolution algorithm called MG3MConv, which targets the architectural features of the SW26010. MG3MConv supports diversified mapping schemes for convolution tasks based on the concept of the thread block proposed in this paper. All the architecture-oriented optimization methods are carefully designed at four levels to fully exploit the hardware efficiency of the SW26010. Experiments show that the hardware efficiency of MG3MConv reaches up to 84.78%, 1.75 times that of cuDNN on an NVIDIA K80m GPU. Moreover, MG3MConv outperforms cuDNN in most convolution scenarios. We also use six representative CNNs as real-world cases; the hardware efficiency of MG3MConv reaches up to 67.04% on the VGG network model, 1.37 times and 1.96 times that of cuDNN and swDNN, respectively.
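    The core idea behind any matrix-multiplication-mapping convolution is the classic im2col transformation, which rewrites a convolution as one GEMM. A minimal NumPy sketch of that mapping, independent of the SW26010-specific tiling and DMA choreography described above:

```python
# Minimal NumPy sketch of im2col-based convolution: each receptive field is
# unfolded into a column so the whole convolution becomes one matrix
# multiplication (SW26010-specific multi-grained tiling is omitted).
import numpy as np

def conv2d_im2col(x, w, stride=1):
    """x: (C, H, W) input, w: (K, C, R, S) filters -> (K, Ho, Wo) output."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = (H - R) // stride + 1, (W - S) // stride + 1
    # Unfold every receptive field into a column: (C*R*S, Ho*Wo).
    cols = np.empty((C * R * S, Ho * Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i*stride:i*stride+R, j*stride:j*stride+S]
            cols[:, i * Wo + j] = patch.ravel()
    # One GEMM: (K, C*R*S) @ (C*R*S, Ho*Wo) -> (K, Ho*Wo).
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, Ho, Wo)

x = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv2d_im2col(x, w).shape)  # (4, 6, 6)
```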

    NUMA-Aware Strategies for the Heterogeneous Execution of SPMV on Modern Supercomputers

    The sparse matrix-vector product is a widespread operation in the scientific computing community. It represents the dominant computational cost in many large-scale simulations relying on iterative methods, and its performance is sensitive to the sparsity pattern, the storage format, the kernel implementation, and the target computing architecture. In this work, we address the efficient execution of the sparse matrix-vector product on (potentially hybrid) modern supercomputers with non-uniform memory access configurations. A hierarchical parallel implementation is proposed to minimize the number of processes participating in distributed-memory parallelization. As a result, a single process per computing node is enough to engage all of its hardware and to ensure efficient memory access on manycore platforms. The benefits of this approach have been demonstrated on up to 9,600 cores of the MareNostrum 4 supercomputer at the Barcelona Supercomputing Center.

    The work of A. Gorobets has been funded by the Russian Science Foundation, project 19-11-00299. The work of X. Álvarez-Farré, F. X. Trias, and A. Oliva has been financially supported by the ANUMESOL project (ENE2017-88697-R) of the Spanish Research Agency (Ministerio de Economía y Competitividad, Secretaría de Estado de Investigación, Desarrollo e Innovación) and the FusionCAT project (001-P-001722) of the Government of Catalonia (RIS3CAT FEDER). The studies of this work have been carried out using the MareNostrum 4 supercomputer of the Barcelona Supercomputing Center (projects IM-2020-2-0029 and IM-2020-3-0030); the TSUBAME3.0 supercomputer of the Global Scientific Information and Computing Center at Tokyo Institute of Technology; the Lomonosov-2 supercomputer of the shared research facilities of HPC computing resources at Lomonosov Moscow State University; and the K-60 hybrid cluster of the collective use center of the Keldysh Institute of Applied Mathematics. The authors thankfully acknowledge these institutions for the compute time and technical support.
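    For reference, the kernel at the heart of this work is the CSR sparse matrix-vector product. A minimal serial sketch follows; the paper's contribution is the NUMA-aware, one-process-per-node orchestration around this kernel, which is not shown here.

```python
# Minimal serial sketch of the CSR sparse matrix-vector product y = A @ x.
# The hierarchical MPI + threads parallelization discussed in the paper
# wraps around this inner kernel and is omitted.
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """CSR arrays (indptr, indices, data) times dense vector x."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        # Accumulate the nonzeros of this row.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
indptr  = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data    = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
print(spmv_csr(indptr, indices, data, np.ones(3)))  # [5. 2. 8.]
```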

    A Systematic Survey of General Sparse Matrix-Matrix Multiplication

    General sparse matrix-matrix multiplication (SpGEMM) has attracted much attention from researchers in the fields of multigrid methods and graph analysis. Many optimization techniques have been developed for particular application fields and computing architectures over the decades. The objective of this paper is to provide a structured and comprehensive overview of the research on SpGEMM. Existing optimization techniques are grouped into categories based on their target problems and architectures. Covered topics include SpGEMM applications, size prediction of the result matrix, matrix partitioning and load balancing, result accumulation, and architecture-oriented optimization. The rationale of the algorithms in each category is analyzed, and a wide range of SpGEMM algorithms are summarized. This survey traces the progress and status of SpGEMM optimization research from 1977 to 2019. In addition, an experimental comparative study of existing implementations on CPU and GPU is presented. Based on our findings, we highlight future research directions and show how future studies can leverage our findings to encourage better design and implementation.
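    Many of the surveyed implementations derive from Gustavson's row-wise formulation, in which row i of C is accumulated from rows of B scaled by the nonzeros of row i of A. A minimal dictionary-based sketch of that loop (the surveyed size-prediction, load-balancing, and accumulator techniques all refine it):

```python
# Minimal sketch of Gustavson's row-wise SpGEMM with a hash-map accumulator.
def spgemm_rowwise(A, B):
    """A, B: sparse matrices as {row: {col: value}} dicts of dicts."""
    C = {}
    for i, row_a in A.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in row_a.items():
            # Scale row k of B by A[i][k] and merge into the accumulator.
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}}
print(spgemm_rowwise(A, B))  # {0: {1: 4.0, 0: 10.0}, 1: {0: 15.0}}
```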

    Improving Performance of Iterative Methods by Lossy Checkpointing

    Iterative methods are commonly used to solve large, sparse linear systems, which are fundamental operations in many modern scientific simulations. When large-scale iterative methods run with many ranks in parallel, they must checkpoint their dynamic variables periodically to survive unavoidable fail-stop errors, which requires fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound on the number of extra iterations caused by the distortion of data in lossy checkpoints, in order to guarantee a performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the number of extra iterations caused by lossy checkpoint files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals in a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault-tolerance overhead of iterative methods, by 23%-70% compared with traditional checkpointing and by 20%-58% compared with lossless-compressed checkpointing, in the presence of system failures.
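    A minimal sketch of the scheme's control flow for an iterative solver is given below. Uniform quantization stands in for a real lossy compressor such as SZ, and the solver, interval, and tolerance are illustrative assumptions, not the paper's PETSc-based setup.

```python
# Minimal sketch of lossy checkpointing for an iterative solver: the dynamic
# state (the iterate x) is quantized before being stored, trading a bounded
# perturbation (and possibly a few extra iterations on restart) for much
# smaller, faster checkpoints. Quantization stands in for a real compressor
# such as SZ; all parameter values here are illustrative.
import numpy as np

def compress_lossy(x, err_bound=1e-4):
    """Uniform scalar quantization with a point-wise absolute error bound."""
    step = 2 * err_bound
    return np.round(x / step).astype(np.int64), step  # compressible ints

def decompress_lossy(q, step):
    return q.astype(np.float64) * step

def solve_with_lossy_checkpoints(A, b, interval=50, tol=1e-8, max_iter=10_000):
    x = np.zeros_like(b)
    checkpoint = None
    for it in range(max_iter):
        x = x + 0.1 * (b - A @ x)          # stand-in Richardson iteration
        if np.linalg.norm(b - A @ x) < tol:
            return x, it
        if it % interval == 0:             # periodic lossy checkpoint
            checkpoint = compress_lossy(x)
        # On a fail-stop error, the solver would restart from
        # decompress_lossy(*checkpoint); the paper bounds how many extra
        # iterations this distortion can cost.
    return x, max_iter
```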

    Fuzzy interacting multiple model H∞ particle filter algorithm based on current statistical model

    In this paper, fuzzy theory and the interacting multiple model (IMM) approach are introduced into an H∞ filter-based particle filter to propose a new fuzzy interacting multiple model H∞ particle filter based on the current statistical model. Each model is filtered with the H∞ particle filter algorithm, in which the current statistical model describes the maneuvering of the target accurately and the H∞ filter handles nonlinear systems effectively. To address the heavy probability computations incurred by the combinatorial calculation method in the interacting multiple model, our approach computes each model's matching probability through fuzzy theory, which not only reduces the amount of computation but also improves the state estimation accuracy to some extent. Simulation results show that the proposed algorithm is more accurate and robust when tracking a maneuvering target.
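    A heavily hedged sketch of the fuzzy model-matching idea: each model's probability is taken from a fuzzy membership of how well that model's innovation (measurement residual) fits, rather than from the usual combinatorial Bayesian mixing. The Gaussian-shaped membership function and its widths are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: fuzzy model-matching probabilities for an IMM-style filter.
# The membership shape and widths are assumptions for illustration only.
import numpy as np

def fuzzy_model_probs(innovations, widths):
    """innovations: per-model residual norms; widths: fuzzy set widths."""
    # Gaussian-shaped membership: small residual -> high degree of match.
    mu = np.exp(-(np.asarray(innovations) / np.asarray(widths)) ** 2)
    return mu / mu.sum()  # normalize memberships into model probabilities

# Three motion models; the second explains the measurement best.
print(fuzzy_model_probs([2.5, 0.4, 1.8], widths=[1.0, 1.0, 1.0]))
```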

    Supercomputing Frontiers

    This open access book constitutes the refereed proceedings of the 6th Asian Supercomputing Conference, SCFA 2020, which was planned for February 2020 but whose physical meeting was unfortunately cancelled due to the COVID-19 pandemic. The 8 full papers presented in this book were carefully reviewed and selected from 22 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platforms, container image configuration workflows, large-scale applications, and scheduling.

    Accelerating Dynamical Density Response Code on Summit and Its Application for Computing the Density Response Function of Vanadium Sesquioxide

    This thesis details the process of porting the Eguiluz group's dynamical density response computational platform to the hybrid CPU+GPU environment of the Summit supercomputer at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility. The baseline CPU-only version is a Gordon Bell-winning platform within the formally exact time-dependent density functional theory (TD-DFT) framework using the linearly augmented plane wave (LAPW) basis set. The code is accelerated using a combination of the OpenACC programming model and GPU libraries, namely the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, as well as by exploiting the sparsity pattern of the matrices involved in the matrix-matrix multiplication. Benchmarks show a 12.3x speedup compared with the CPU-only version. This performance boost should accelerate discovery in materials and condensed matter physics through computational means. After sufficient optimization, the hybrid CPU+GPU code is used to study the dynamical density response function of vanadium sesquioxide, and the results are compared with spectroscopic data from non-resonant inelastic X-ray scattering (NIXS) experiments.
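    As an illustration of the sparsity trick mentioned above, the sketch below prunes all-zero slices of one operand before a dense matrix multiplication, so the multiplication only touches the nonzero structure. This is a generic NumPy illustration of the idea, not the thesis's OpenACC/MAGMA implementation.

```python
# Hedged sketch: exploit a known sparsity pattern by skipping inner-dimension
# slices that are entirely zero before the dense GEMM; the result is
# unchanged because the skipped terms contribute nothing to the sum.
import numpy as np

def gemm_skip_zero_cols(A, B):
    """Compute A @ B, skipping columns of A (rows of B) that are all zero."""
    keep = ~np.all(A == 0.0, axis=0)       # inner-dimension entries to keep
    return A[:, keep] @ B[keep, :]          # smaller GEMM, identical result

A = np.random.rand(4, 6)
A[:, [1, 4]] = 0.0                          # two structurally-zero columns
B = np.random.rand(6, 3)
assert np.allclose(gemm_skip_zero_cols(A, B), A @ B)
```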