254 research outputs found

    Optimized Vectorization Implementation of CRYSTALS-Dilithium

    Full text link
    CRYSTALS-Dilithium is a lattice-based signature scheme to be standardized by NIST as the primary post-quantum signature algorithm. In this work, we make a thorough study of optimizing the implementations of Dilithium by utilizing the Advanced Vector Extension (AVX) instructions, specifically AVX2 and the latest AVX512. We first present an improved parallel small polynomial multiplication with tailored early evaluation (PSPM-TEE) to further speed up the signing procedure, which results in a speedup of 5\%-6\% compared with the original PSPM Dilithium implementation. We then present a tailored reduction method that is simpler and faster than Montgomery reduction. Our optimized AVX2 implementation exhibits a speedup of 3\%-8\% compared with the state-of-the-art of Dilithium AVX2 software. Finally, for the first time, we propose a fully and highly vectorized implementation of Dilithium using AVX-512. This is achieved by carefully vectorizing most of Dilithium functions with the AVX512 instructions in order to improve efficiency both for time and for space simultaneously. With all the optimization efforts, our AVX-512 implementation improves the performance by 37.3\%/50.7\%/39.7\% in key generation, 34.1\%/37.1\%/42.7\% in signing, and 38.1\%/38.7\%/40.7\% in verification for the parameter sets of Dilithium2/3/5 respectively. To the best of our knowledge, our AVX512 implementation has the best performance for Dilithium on the Intel x64 CPU platform to date.Comment: 13 pages, 5 figure

    Using AVX2 Instruction Set to Increase Performance of High Performance Computing Code

    Get PDF
    In this paper we discuss new Intel instruction extensions - Intel Advance Vector Extensions 2 (AVX2) and what these bring to high performance computing (HPC). To illustrate this new systems utilizing AVX2 are evaluated to demonstrate how to effectively exploit AVX2 for HPC types of the code and expose the situation when AVX2 might not be the most effective way to increase performance

    Time series analysis acceleration with advanced vectorization extensions

    Get PDF
    Time series analysis is an important research topic and a key step in monitoring and predicting events in many felds. Recently, the Matrix Profle method, and particularly two of its Euclidean-distance-based implementations—SCRIMP and SCAMP—have become the state-of-the-art approaches in this feld. Those algorithms bring the possibility of obtaining exact motifs and discords from a time series, which can be used to infer events, predict outcomes, detect anomalies and more. While matrix profle is embarrassingly parallelizable, we fnd that auto-vectorization techniques fail to fully exploit the SIMD capabilities of modern CPU architectures. In this paper, we develop custom-vectorized SCRIMP and SCAMP implementations based on AVX2 and AVX-512 extensions, which we combine with multithreading techniques aimed at exploiting the potential of the underneath architectures. Our experimental evaluation, conducted using real data, shows a performance improvement of more than 4× with respect to the auto-vectorization.Funding for open access publishing: Universidad Málaga/CBU

    Time series analysis acceleration with advanced vectorization extensions

    Get PDF
    Time series analysis is an important research topic and a key step in monitoring and predicting events in many fields. Recently, the Matrix Profile method, and particularly two of its Euclidean-distance-based implementations—SCRIMP and SCAMP—have become the state-of-the-art approaches in this field. Those algorithms bring the possibility of obtaining exact motifs and discords from a time series, which can be used to infer events, predict outcomes, detect anomalies and more. While matrix profile is embarrassingly parallelizable, we find that auto-vectorization techniques fail to fully exploit the SIMD capabilities of modern CPU architectures. In this paper, we develop custom-vectorized SCRIMP and SCAMP implementations based on AVX2 and AVX-512 extensions, which we combine with multithreading techniques aimed at exploiting the potential of the underneath architectures. Our experimental evaluation, conducted using real data, shows a performance improvement of more than 4× with respect to the auto-vectorization.This work has been supported by the Government of Spain under project PID2019-105396RB-I00, and Junta de Andalucía under projects P18-FR-3433, and UMA18-FEDERJA-197.Peer ReviewedPostprint (published version

    Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension

    Get PDF
    As the scale of High-performance Computing (HPC) systems continues to grow, researchers are devoted themselves to achieve the best performance of running long computing jobs on these systems. My research focus on reliability and efficiency study for HPC software. First, as systems become larger, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Using multiple overlapping topologies to optimize the detection and propagation, minimizing the incurred overhead sand guaranteeing the scalability of the entire framework. Results from different machines and benchmarks compared to related works shows that my design and implementation outperforms non-HPC solutions significantly, and is competitive with specialized HPC solutions that can manage only MPI applications. Second, I endeavor to implore instruction level parallelization to achieve optimal performance. Novel processors support long vector extensions, which enables researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extension (AVX512 and AVX2) instructions for x86 Instruction Set Architecture (ISA). Arm introduced Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelisms. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. Also, I use gather and scatter feature to speed up the packing and unpacking operation in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architecture and efficient

    Faster Base64 Encoding and Decoding Using AVX2 Instructions

    Get PDF
    Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (~7x). We achieve these good results by using the single-instruction-multiple-data (SIMD) instructions available on recent Intel processors (AVX2). Our accelerated software abides by the specification and reports errors when encountering characters outside of the base64 set. It is available online as free software under a liberal license.Comment: software at https://github.com/lemire/fastbase6

    Faster Base64 Encoding and Decoding Using AVX2 Instructions

    Get PDF
    Web developers use base64 formats to include images, fonts, sounds and other resources directly inside HTML, JavaScript, JSON and XML files. We estimate that billions of base64 messages are decoded every day. We are motivated to improve the efficiency of base64 encoding and decoding. Compared to state-of-the-art implementations, we multiply the speeds of both the encoding (~10x) and the decoding (~7x). We achieve these good results by using the single-instruction-multiple-data (SIMD) instructions available on recent Intel processors (AVX2). Our accelerated software abides by the specification and reports errors when encountering characters outside of the base64 set. It is available online as free software under a liberal license.Comment: software at https://github.com/lemire/fastbase6
    • …
    corecore