
    Developing Efficient Discrete Simulations on Multicore and GPU Architectures

    In this paper we show how to efficiently implement parallel discrete simulations on multicore and GPU architectures through a real example of an application: a cellular automata model of laser dynamics. We describe the techniques employed to build and optimize the implementations using the OpenMP and CUDA frameworks. We have evaluated the performance on two different hardware platforms that represent different target market segments: a high-end platform for scientific computing, using an Intel Xeon Platinum 8259CL server with 48 cores and an NVIDIA Tesla V100 GPU, both running on the Amazon Web Services (AWS) cloud; and a consumer-oriented platform, using an Intel Core i9-9900K CPU and an NVIDIA GeForce GTX 1050 Ti GPU. Performance results were compared and analyzed in detail. We show that excellent performance and scalability can be obtained on both platforms, and we identify some important issues that cause performance degradation on them. We also found that current multicore CPUs with large core counts can deliver performance very close to that of GPUs, and even identical in some cases. This work was funded by the Ministerio de Economía, Industria y Competitividad, Gobierno de España (MINECO) and the Agencia Estatal de Investigación (AEI) of Spain, co-financed by FEDER funds (EU), under grant TIN2017-89842.
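
    The kernel below is a minimal sketch of the kind of GPU cellular-automaton step such an implementation is built around: one thread per cell, double buffering between steps. The grid size and the two-state Moore-neighbourhood rule are illustrative assumptions, not the paper's laser-dynamics model.

```cuda
// Minimal sketch of a 2D cellular-automaton step on the GPU (illustrative only;
// the update rule is a generic two-state neighbourhood rule, not laser dynamics).
#include <cstdio>
#include <cuda_runtime.h>

#define N 512  // grid side length (assumption)

__global__ void ca_step(const int *in, int *out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= N || y >= N) return;

    // Count live Moore neighbours with wrap-around boundaries.
    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + N) % N;
            int ny = (y + dy + N) % N;
            alive += in[ny * N + nx];
        }
    // Generic birth/survival rule as a stand-in for the real transition function.
    int cell = in[y * N + x];
    out[y * N + x] = (alive == 3 || (cell && alive == 2)) ? 1 : 0;
}

int main() {
    size_t bytes = N * N * sizeof(int);
    int *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_in, 0, bytes);

    dim3 block(16, 16);
    dim3 grid((N + 15) / 16, (N + 15) / 16);
    for (int step = 0; step < 100; ++step) {
        ca_step<<<grid, block>>>(d_in, d_out);
        int *tmp = d_in; d_in = d_out; d_out = tmp;  // ping-pong the buffers
    }
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```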

    Multi-GPU design and performance evaluation of homomorphic encryption on GPU clusters

    We present a multi-GPU design, implementation and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data-parallel approach and uses partitioning methods to distribute the workload of FV primitives evenly across the available GPUs. The design is intended to address the space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures and includes efficient GPU-to-GPU data exchange protocols. Moreover, it is user-friendly, as no user intervention is required for task decomposition, scheduling or load balancing. We implement our design and evaluate its performance on two NVIDIA GPU clusters, one homogeneous and one heterogeneous: K80, and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation using two homomorphic circuits as workloads: vector addition and multiplication. Moreover, we use our multi-GPU levelled FHE to implement the inference circuit of two Convolutional Neural Networks (CNNs) to perform image classification homomorphically on encrypted images from the MNIST and CIFAR-10 datasets. Our implementation provides a 1 to 3 orders of magnitude speedup over the CPU implementation on vector operations. In terms of scalability, our design shows reasonable scalability curves when the GPUs are fully connected. This work is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Programme (Award A19E3b0099).
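
    The data-parallel partitioning idea can be sketched as follows: split the workload evenly across all visible GPUs and let each device process its own slice. The kernel here is a plain element-wise addition standing in for an FHE vector primitive; the chunking scheme and sizes are illustrative assumptions, not the paper's partitioning methods.

```cuda
// Sketch of even, data-parallel partitioning of a vector workload across GPUs.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // stand-in for an FHE vector primitive
}

int main() {
    const int n = 1 << 22;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus == 0) { printf("no CUDA devices\n"); return 1; }
    int chunk = (n + num_gpus - 1) / num_gpus;   // even split across devices

    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        int off = dev * chunk;
        int len = (off + chunk <= n) ? chunk : (n - off);
        if (len <= 0) break;

        float *da, *db, *dc;
        cudaMalloc(&da, len * sizeof(float));
        cudaMalloc(&db, len * sizeof(float));
        cudaMalloc(&dc, len * sizeof(float));
        cudaMemcpy(da, a.data() + off, len * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(db, b.data() + off, len * sizeof(float), cudaMemcpyHostToDevice);

        vec_add<<<(len + 255) / 256, 256>>>(da, db, dc, len);

        cudaMemcpy(c.data() + off, dc, len * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(da); cudaFree(db); cudaFree(dc);
    }
    printf("c[0] = %f\n", c[0]);   // expect 3.0
    return 0;
}
```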

    HI-Kyber: A novel high-performance implementation scheme of Kyber based on GPU

    CRYSTALS-Kyber, as the only public key encryption (PKE) algorithm selected by the National Institute of Standards and Technology (NIST) in the third round, is considered one of the most promising post-quantum cryptography (PQC) schemes. Lattice-based cryptography uses hard computational problems on lattices to build secure encryption and decryption systems that resist attacks from quantum computers. Performance is an important bottleneck affecting the adoption of post-quantum cryptography. In this paper, we present a High-performance Implementation of Kyber (named HI-Kyber) on NVIDIA GPUs, which increases the key-exchange performance of Kyber to millions of operations per second. Firstly, we propose a lattice-based PQC implementation architecture based on kernel fusion, which avoids redundant global-memory access operations. Secondly, we optimize and implement the core operations of CRYSTALS-Kyber, including the Number Theoretic Transform (NTT), inverse NTT (INTT), and pointwise multiplication. Especially for the NTT operation, the main computational bottleneck, three novel methods are proposed to push performance further: sliced layer merging (SLM), sliced depth-first search (SDFS-NTT) and entire depth-first search (EDFS-NTT), which achieve speedups of 7.5%, 28.5%, and 41.6% over the native implementation. Thirdly, we conduct comprehensive performance experiments with different parallel dimensions based on the above optimizations. Finally, our key-exchange performance reaches 1,664 kops/s. Specifically, on the same platform, our HI-Kyber is 3.52× faster than the GPU implementation based on the same instruction set and 1.78× faster than the state-of-the-art one based on AI-accelerated Tensor Cores.
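
    The kernel-fusion idea named above can be illustrated with a toy pipeline: two element-wise steps executed as separate kernels force the intermediate result through global memory, while a fused kernel keeps it in registers. The specific steps (modular add followed by modular multiply) and the use of Kyber's modulus are illustrative assumptions, not HI-Kyber's actual fused kernels.

```cuda
// Kernel fusion sketch: unfused pipeline vs. one fused kernel.
#include <cstdio>
#include <cuda_runtime.h>

#define Q 3329  // Kyber's modulus, used here only for flavour

// Unfused pipeline: two launches; the intermediate result round-trips through DRAM.
__global__ void step_add(const int *a, const int *b, int *tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = (a[i] + b[i]) % Q;
}
__global__ void step_mul(const int *tmp, const int *c, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (tmp[i] * c[i]) % Q;
}

// Fused kernel: one launch; the intermediate value never leaves registers.
__global__ void fused_add_mul(const int *a, const int *b, const int *c,
                              int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int t = (a[i] + b[i]) % Q;   // stays in a register
        out[i] = (t * c[i]) % Q;
    }
}

int main() {
    const int n = 1 << 16;
    int *a, *b, *c, *tmp, *out;
    cudaMalloc(&a, n * sizeof(int));   cudaMalloc(&b, n * sizeof(int));
    cudaMalloc(&c, n * sizeof(int));   cudaMalloc(&tmp, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(int));
    cudaMemset(a, 0, n * sizeof(int)); cudaMemset(b, 0, n * sizeof(int));
    cudaMemset(c, 0, n * sizeof(int));

    int threads = 256, blocks = (n + threads - 1) / threads;
    step_add<<<blocks, threads>>>(a, b, tmp, n);          // unfused: 2 kernels
    step_mul<<<blocks, threads>>>(tmp, c, out, n);
    fused_add_mul<<<blocks, threads>>>(a, b, c, out, n);  // fused: 1 kernel
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(tmp); cudaFree(out);
    return 0;
}
```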

    Super Calculator using Compute Unified Device Architecture (CUDA)

    Scientific computation requires a great amount of computing power, especially for floating-point operations, but high-end multi-core processors are currently limited in terms of floating-point performance and parallelization. Recent technological advances have made parallel computing technically and financially feasible through the Compute Unified Device Architecture (CUDA) developed by NVIDIA. This research focuses on measuring the performance of CUDA and applying CUDA to scientific computation, involving porting source code from the CPU to the GPU using a direct integration technique. The ported source code is then optimized by managing resources to achieve a performance gain over the CPU. It is found that CUDA is able to boost system performance by up to 69 times in the Parboil Benchmark Suite. Successful ports of the Serpent encryption algorithm and the Lattice Boltzmann Method provided up to 7 times throughput gain and up to 10 times execution-time gain, respectively, over the CPU. A direct integration guideline for porting source code is then produced based on the two implementations.
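
    A minimal sketch of what such a direct-integration port looks like, assuming a simple SAXPY loop as the workload (the actual workloads ported in the paper are Parboil kernels, Serpent and the Lattice Boltzmann Method): the CPU loop body becomes the kernel body, and the loop index becomes the global thread index.

```cuda
// "Direct integration" style port: CPU loop -> one thread per loop iteration.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Original CPU version of a simple floating-point workload (SAXPY as a stand-in).
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// Ported GPU version: the loop body is unchanged, the index comes from the thread.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
    cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```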

    Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

    Homomorphic encryption (HE) draws huge attention as it provides a way of performing privacy-preserving computations on encrypted messages. The Number Theoretic Transform (NTT), a specialized form of the Discrete Fourier Transform (DFT) over a finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated the NTT and its inverse transformation on a popular parallel processing platform, the GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT, or only consider small HE parameters that tightly constrain the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from a severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle this issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve a 4.2x speedup on a modern GPU.
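
    The trade-off behind on-the-fly twiddling can be sketched as follows: instead of streaming precomputed powers of a root from global memory, each thread regenerates the power it needs with a short modular exponentiation, trading arithmetic (which GPUs have in abundance) for DRAM bandwidth (the bottleneck identified above). The operation shown, multiplying element i by w^i mod q, is a simplified stand-in for twiddle use inside an NTT butterfly; q, w and the indexing are toy assumptions, not the paper's parameters.

```cuda
// On-the-fly twiddle generation vs. a precomputed twiddle table (simplified).
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int modpow(unsigned int base, unsigned int exp, unsigned int q) {
    unsigned long long r = 1, b = base % q;
    while (exp) {
        if (exp & 1) r = r * b % q;
        b = b * b % q;
        exp >>= 1;
    }
    return (unsigned int)r;
}

// Baseline: twiddles w^i are streamed from a precomputed table in global memory.
__global__ void twiddle_from_table(unsigned int *a, const unsigned int *table,
                                   int n, unsigned int q) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = (unsigned long long)a[i] * table[i] % q;
}

// On-the-fly: each thread recomputes w^i in registers, saving DRAM traffic.
__global__ void twiddle_on_the_fly(unsigned int *a, unsigned int w,
                                   int n, unsigned int q) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = (unsigned long long)a[i] * modpow(w, i, q) % q;
}

int main() {
    const int n = 1024;
    const unsigned int q = 12289, w = 11;   // toy NTT-friendly prime and root
    unsigned int *d_a;
    cudaMalloc(&d_a, n * sizeof(unsigned int));
    cudaMemset(d_a, 0, n * sizeof(unsigned int));
    twiddle_on_the_fly<<<(n + 255) / 256, 256>>>(d_a, w, n, q);
    cudaDeviceSynchronize();
    printf("launched on-the-fly twiddle kernel\n");
    cudaFree(d_a);
    return 0;
}
```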

    Weak RSA Key Discovery on GPGPU

    We address one of the weaknesses of RSA ciphering systems, i.e. the existence of private keys that are relatively easy for an attacker to compromise. The problem can be mitigated by Internet service providers, but it requires some computational effort. We propose a proof of concept of a GPGPU-accelerated system that can help detect and eliminate users' weak keys. We have proposed the algorithms and developed GPU-optimised program code that is now publicly available and substantially outperforms the tested CPU processor. The source code of the OpenSSL library was adapted for GPGPU, and the resulting code can run on both GPU and CPU processors. Additionally, we present a solution for mapping a triangular grid onto the rectangular GPU grid – a basic dilemma in many problems that involve pair-wise analysis of a set of elements. A comparison of two data caching methods on the GPGPU also leads to interesting general conclusions. We present experimental results of the performance analysis of the selected algorithms for various RSA key lengths, GPU grid configurations, and sizes of the tested key set.
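
    A minimal sketch of the pair-wise analysis involved in weak-key discovery, assuming the classic shared-factor check: compute gcd(N_i, N_j) for every pair of moduli, since a non-trivial gcd reveals a shared prime. The 64-bit toy moduli are an assumption for brevity (real RSA moduli need multi-precision arithmetic), and the lower triangle of the pair grid is simply predicated out here; the paper's triangular-to-rectangular mapping avoids launching those idle threads.

```cuda
// Pair-wise GCD scan over RSA-like moduli (toy 64-bit values for illustration).
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned long long gcd_u64(unsigned long long a, unsigned long long b) {
    while (b) { unsigned long long t = a % b; a = b; b = t; }
    return a;
}

__global__ void pairwise_gcd(const unsigned long long *mods, int n, int *weak) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n || j <= i) return;     // keep only the upper triangle
    unsigned long long g = gcd_u64(mods[i], mods[j]);
    if (g != 1) { weak[i] = 1; weak[j] = 1; }   // shared factor => both keys weak
}

int main() {
    // Two of these toy "moduli" share the prime 1000003.
    unsigned long long h_mods[4] = {
        1000003ULL * 1000033ULL, 1000037ULL * 1000039ULL,
        1000003ULL * 1000081ULL, 1000099ULL * 1000117ULL };
    const int n = 4;

    unsigned long long *d_mods; int *d_weak, h_weak[4] = {0};
    cudaMalloc(&d_mods, sizeof(h_mods));
    cudaMalloc(&d_weak, sizeof(h_weak));
    cudaMemcpy(d_mods, h_mods, sizeof(h_mods), cudaMemcpyHostToDevice);
    cudaMemcpy(d_weak, h_weak, sizeof(h_weak), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    pairwise_gcd<<<grid, block>>>(d_mods, n, d_weak);
    cudaMemcpy(h_weak, d_weak, sizeof(h_weak), cudaMemcpyDeviceToHost);

    for (int k = 0; k < n; ++k)
        printf("key %d: %s\n", k, h_weak[k] ? "WEAK (shared factor)" : "ok");
    cudaFree(d_mods); cudaFree(d_weak);
    return 0;
}
```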

    TMVP-based Polynomial Convolution for Saber and Sable on GPU using CUDA-cores and Tensor-cores

    Recently proposed lattice-based cryptography algorithms can be used to protect IoT communication against the threat from quantum computers, but they are computationally heavy. In particular, polynomial multiplication is one of the most time-consuming operations in lattice-based cryptography. To achieve an efficient implementation, the Number Theoretic Transform (NTT) algorithm is an ideal choice, but it places certain restrictions on the parameters, so not all lattice-based schemes can employ it directly. Hence, alternative techniques are proposed to accelerate polynomial multiplication in lattice-based schemes that cannot utilize the NTT directly. In this paper, we propose a parallel Toeplitz matrix-vector product (TMVP) approach to accelerate polynomial multiplication in PQC algorithms and implement it on a graphics processing unit (GPU). This is the first time a parallel TMVP version has been proposed and evaluated on different GPU cores (i.e., CUDA-cores and Tensor-cores). The effectiveness of the proposed solution is validated on the Saber (a NIST post-quantum standardization finalist) and Sable (an improved version of Saber) schemes. Experimental results show that TMVP-based polynomial convolution using CUDA-cores fails to exhibit a significant enhancement over the schoolbook CUDA-core method already proposed by Hafeez et al. 2023. However, when the TMVP technique is applied to Tensor-cores, it outperforms state-of-the-art implementations. The proposed Tensor-core approach outperforms the schoolbook Tensor-core method by up to 1.21×, and outperforms the dot-product-instructions method (Lee et al. 2022) by up to 3.63×. The proposed TMVP Tensor-core approach is also faster than the TMVP CUDA-core method by 13.76.
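
    A Toeplitz matrix is constant along its diagonals, so the n x n matrix in a TMVP is fully described by 2n-1 values with T[i][j] = t[i - j + n - 1]; each GPU thread can therefore form one output coefficient without ever materialising the matrix. The sketch below shows this plain (schoolbook-style) TMVP on CUDA cores; the paper's decomposed TMVP and its Tensor-core formulation are not reproduced here, and the Saber-like degree and power-of-two modulus are assumptions.

```cuda
// Schoolbook-style Toeplitz matrix-vector product (TMVP) on CUDA cores.
#include <cstdio>
#include <cuda_runtime.h>

#define N 256           // polynomial degree (Saber-like, assumption)
#define MOD_MASK 0x1FFF // reduction modulo 2^13 (power-of-two modulus => mask)

__global__ void tmvp(const int *t, const int *v, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int acc = 0;
    for (int j = 0; j < N; ++j)
        acc += t[i - j + N - 1] * v[j];   // T[i][j] = t[i - j + N - 1]
    out[i] = acc & MOD_MASK;
}

int main() {
    int h_t[2 * N - 1], h_v[N], h_out[N];
    for (int k = 0; k < 2 * N - 1; ++k) h_t[k] = k % 7;  // toy Toeplitz diagonals
    for (int k = 0; k < N; ++k) h_v[k] = k % 5;          // toy input vector

    int *d_t, *d_v, *d_out;
    cudaMalloc(&d_t, sizeof(h_t));
    cudaMalloc(&d_v, sizeof(h_v));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_t, h_t, sizeof(h_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_v, h_v, sizeof(h_v), cudaMemcpyHostToDevice);

    tmvp<<<(N + 127) / 128, 128>>>(d_t, d_v, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("out[0] = %d\n", h_out[0]);
    cudaFree(d_t); cudaFree(d_v); cudaFree(d_out);
    return 0;
}
```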

    High Throughput Lattice-based Signatures on GPUs: Comparing Falcon and Mitaka

    The US National Institute of Standards and Technology initiated a standardization process for post-quantum cryptography in 2017, with the aim of selecting key encapsulation mechanisms and signature schemes that can withstand the threat from emerging quantum computers. In 2022, Falcon was selected as one of the standard signature schemes, attracting efforts to optimize the implementation of Falcon on various hardware architectures for practical applications. Recently, Mitaka was proposed as an alternative to Falcon, allowing parallel execution of most of its operations. These recent advancements motivate us to develop high-throughput implementations of the Falcon and Mitaka signature schemes on Graphics Processing Units (GPUs), a massively parallel architecture widely available on cloud service platforms. In this paper, we propose the first parallel implementation of Falcon on various GPUs. An iterative version of the sampling process in Falcon, which is also the most time-consuming Falcon operation, was developed. This allows us to implement Falcon signature generation without relying on expensive recursive function calls on GPUs. In addition, we propose a parallel random sample generation approach to accelerate the performance of Mitaka on GPUs. We evaluate our implementation techniques on state-of-the-art GPU architectures (RTX 3080, A100, T4 and V100). Experimental results show that our Falcon-512 implementation achieves 58,595 signatures/second and 2,721,562 verifications/second on an A100 GPU, which is 20.03× and 29.51× faster than the highly optimized AVX2 implementation on a CPU. Our Mitaka implementation achieves 161,985 signatures/second and 1,421,046 verifications/second on the same GPU. Due to the adoption of a parallelizable sampling process, Mitaka signature generation enjoys ≈2–20× higher throughput than Falcon on various GPUs. The high-throughput signature generation and verification achieved by this work can be very useful in various emerging applications, including the Internet of Things.
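
    The recursion-elimination idea mentioned above (replacing a recursive traversal with an explicit stack, since deep recursion is expensive on GPUs) can be sketched generically. The tree below is an ordinary complete binary tree summed by a single thread; it is only a stand-in for Falcon's recursive sampling over its tree of Gram-Schmidt data, not the actual ffSampling algorithm.

```cuda
// Generic recursion-to-iteration pattern on the device: explicit stack traversal.
#include <cstdio>
#include <cuda_runtime.h>

#define LEVELS 10
#define NODES ((1 << LEVELS) - 1)

__global__ void sum_leaves_iterative(const float *tree, float *result) {
    // One thread walks the whole tree; a real batch would map one signature
    // (one tree) per thread or per block.
    if (blockIdx.x != 0 || threadIdx.x != 0) return;

    int stack[2 * LEVELS];          // explicit stack replaces call recursion
    int top = 0;
    stack[top++] = 0;               // push root
    float acc = 0.0f;
    while (top > 0) {
        int node = stack[--top];
        int left = 2 * node + 1, right = 2 * node + 2;
        if (left >= NODES) {        // leaf: do the per-leaf work
            acc += tree[node];
        } else {                    // internal: push children instead of recursing
            stack[top++] = right;
            stack[top++] = left;
        }
    }
    *result = acc;
}

int main() {
    float h_tree[NODES], h_result = 0.0f;
    for (int i = 0; i < NODES; ++i) h_tree[i] = 1.0f;

    float *d_tree, *d_result;
    cudaMalloc(&d_tree, sizeof(h_tree));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_tree, h_tree, sizeof(h_tree), cudaMemcpyHostToDevice);

    sum_leaves_iterative<<<1, 1>>>(d_tree, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    printf("leaf sum = %.0f (expected %d)\n", h_result, 1 << (LEVELS - 1));
    cudaFree(d_tree); cudaFree(d_result);
    return 0;
}
```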

    Implementation and Performance Evaluation of RNS Variants of the BFV Homomorphic Encryption Scheme

    Homomorphic encryption is an emerging form of encryption that provides the ability to compute on encrypted data without ever decrypting it. Potential applications include aggregating sensitive encrypted data in a cloud environment and computing on the data in the cloud without compromising data privacy. There have been several recent advances resulting in new homomorphic encryption schemes and optimized variants. We implement and evaluate the performance of two optimized variants, namely Bajard-Eynard-Hasan-Zucca (BEHZ) and Halevi-Polyakov-Shoup (HPS), of BFV, one of the most promising homomorphic encryption schemes, on CPU and GPU. The most interesting (and also unexpected) result of our performance evaluation is that the HPS variant in practice scales significantly better (typically by 15%-30%) with increasing multiplicative depth of the computation circuit than BEHZ, implying that the HPS variant will always outperform BEHZ for most practical applications. For a multiplicative depth of 98, our fastest GPU implementation performs a homomorphic multiplication in 51 ms for 128-bit security settings, which is two orders of magnitude faster than prior results and already practical for cloud environments supporting GPU computation. The large multiplicative depths supported by our implementations are required for applications involving deep neural networks, logistic regression learning, and other important machine learning problems.
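
    Both RNS variants rest on the same core idea: hold each large ciphertext coefficient as its residues modulo a basis of small coprime moduli, do all arithmetic componentwise, and recombine via the Chinese Remainder Theorem only when needed. The sketch below illustrates that idea with one GPU thread per residue; the tiny moduli and the host-side CRT recombination are illustrative assumptions (real BFV RNS variants use word-sized NTT-friendly primes and are designed precisely to avoid full multi-precision reconstruction).

```cuda
// Residue Number System (RNS) illustration: componentwise multiply, then CRT.
#include <cstdio>
#include <cuda_runtime.h>

#define NMOD 3
__constant__ unsigned long long c_mods[NMOD];

// Componentwise modular multiplication of two RNS-represented values.
__global__ void rns_mul(const unsigned long long *a, const unsigned long long *b,
                        unsigned long long *c) {
    int i = threadIdx.x;
    if (i < NMOD) c[i] = a[i] * b[i] % c_mods[i];
}

static unsigned long long modinv(unsigned long long a, unsigned long long m) {
    // Brute-force inverse is fine here because the toy moduli are tiny.
    for (unsigned long long x = 1; x < m; ++x)
        if (a % m * x % m == 1) return x;
    return 0;
}

int main() {
    unsigned long long mods[NMOD] = {97, 101, 103};   // pairwise coprime basis
    unsigned long long x = 1234, y = 567;             // plain values
    unsigned long long a[NMOD], b[NMOD], c[NMOD];
    for (int i = 0; i < NMOD; ++i) { a[i] = x % mods[i]; b[i] = y % mods[i]; }

    unsigned long long *da, *db, *dc;
    cudaMalloc(&da, sizeof(a)); cudaMalloc(&db, sizeof(b)); cudaMalloc(&dc, sizeof(c));
    cudaMemcpy(da, a, sizeof(a), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sizeof(b), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_mods, mods, sizeof(mods));

    rns_mul<<<1, NMOD>>>(da, db, dc);
    cudaMemcpy(c, dc, sizeof(c), cudaMemcpyDeviceToHost);

    // CRT recombination: recover x*y mod (97*101*103) from the residues.
    unsigned long long P = 1, result = 0;
    for (int i = 0; i < NMOD; ++i) P *= mods[i];
    for (int i = 0; i < NMOD; ++i) {
        unsigned long long Pi = P / mods[i];
        result = (result + c[i] * modinv(Pi % mods[i], mods[i]) % P * Pi) % P;
    }
    printf("RNS product = %llu, direct product mod P = %llu\n", result, x * y % P);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```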