101 research outputs found

    High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature

    In this work, we present a well-optimized GPU implementation of Dilithium, one of the NIST post-quantum standard digital signature algorithms. We focus on warp-level design and exploit several strategies to improve performance, including a memory pool, kernel fusing, batching, and streaming. All these efforts lead to an efficient and high-throughput solution. We profile on both desktop and server-grade GPUs, and achieve up to 57.7×, 93.0×, and 63.1× higher throughput on an RTX 3090Ti for key generation, signing, and verification, respectively, compared to a single-threaded CPU. Additionally, we study the performance in real-world applications to demonstrate the effectiveness and applicability of our solution.
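The memory-pool strategy mentioned in the abstract can be sketched in miniature (hypothetical `MemoryPool` class; the paper's actual CUDA implementation is not shown): preallocate one buffer up front and hand out slices per task, so the hot path never calls the system allocator.

```python
class MemoryPool:
    """Minimal memory-pool sketch: one upfront allocation, sliced per task.

    On a GPU this avoids repeated cudaMalloc/cudaFree calls on the hot path;
    here a bytearray stands in for device memory (illustrative only).
    """

    def __init__(self, total_bytes):
        self.buffer = bytearray(total_bytes)  # single upfront allocation
        self.offset = 0                       # bump-pointer allocator state

    def alloc(self, nbytes):
        # Hand out the next slice of the pool; no per-call allocation.
        if self.offset + nbytes > len(self.buffer):
            raise MemoryError("pool exhausted")
        view = memoryview(self.buffer)[self.offset:self.offset + nbytes]
        self.offset += nbytes
        return view

    def reset(self):
        # Reclaim everything at once between signing batches.
        self.offset = 0


pool = MemoryPool(1 << 20)   # 1 MiB pool
a = pool.alloc(4096)         # scratch space for one signing task
b = pool.alloc(8192)         # scratch space for another
pool.reset()                 # reuse the whole pool for the next batch
```

A bump-pointer design like this works because all per-batch scratch buffers have the same lifetime, so freeing them individually is unnecessary.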

    cuML-DSA: Optimized Signing Procedure and Server-Oriented GPU Design for ML-DSA

    The threat posed by quantum computing has precipitated an urgent need for post-quantum cryptography. Recently, the post-quantum digital signature draft FIPS 204 has been published, delineating the details of ML-DSA, which is derived from CRYSTALS-Dilithium. Despite these advancements, server environments, especially those equipped with GPU devices necessitating high-throughput signing, remain entrenched in classical schemes. A conspicuous void exists in the realm of GPU implementations or server-specific designs for ML-DSA. In this paper, we propose the first server-oriented GPU design tailored for the ML-DSA signing procedure in high-throughput servers. We introduce several innovative theoretical optimizations to bolster performance, including depth-prior sparse ternary polynomial multiplication, the branch elimination method, and the rejection-prioritized checking order. Furthermore, exploiting server-oriented features, we propose a comprehensive GPU hardware design, augmented by a suite of GPU implementation optimizations to further amplify performance. Additionally, we present variants for sampling sparse polynomials, thereby streamlining our design. The deployment of our implementation on both server-grade and commercial GPUs demonstrates significant speedups, ranging from 170.7× to 294.2× against the CPU baseline, and an improvement of up to 60.9% compared to related work, affirming the effectiveness and efficiency of the proposed GPU architecture for the ML-DSA signing procedure.
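Sparse ternary polynomial multiplication exploits the fact that the ML-DSA challenge polynomial has coefficients in {-1, 0, 1} with few nonzeros, so its products reduce to signed additions. A minimal plaintext sketch of this idea (illustrative function and parameter names, not the paper's CUDA code), working in Z_q[x]/(x^n + 1):

```python
def sparse_ternary_mul(c_plus, c_minus, s, n, q):
    """Multiply a sparse ternary polynomial c by s in Z_q[x]/(x^n + 1).

    c is given by the index sets of its +1 and -1 coefficients, so the
    product needs only additions and subtractions, never coefficient
    products. Illustrative reference code, not an optimized kernel.
    """
    out = [0] * n
    for sign, idxs in ((1, c_plus), (-1, c_minus)):
        for i in idxs:
            for j, sj in enumerate(s):
                k = i + j
                if k < n:
                    out[k] = (out[k] + sign * sj) % q
                else:                       # wrap with negation: x^n = -1
                    out[k - n] = (out[k - n] - sign * sj) % q
    return out
```

Because the inner loop touches only the nonzero indices of c, the work scales with the weight of c rather than with n².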

    cuXCMP: CUDA-Accelerated Private Comparison Based on Homomorphic Encryption

    Private comparison schemes constructed on homomorphic encryption offer noninteractive, output-expressive, and parallelizable features, and have advantages in communication bandwidth and performance. In this paper, we propose cuXCMP, which allows negative and float inputs, offers a fully output-expressive feature set, and is more extensible and practical compared to XCMP (AsiaCCS 2018). Meanwhile, we introduce several memory-centric optimizations of the constant term extraction kernel tailored for CUDA-enabled GPUs. Firstly, we fully utilize the shared memory and present compact GPU implementations of NTT and INTT using a single block. Secondly, we fuse multiple kernels into one AKS kernel, which conducts the automorphism and key switching operations, and reduce the grid dimension for better resource usage, data access rate, and synchronization. Thirdly, we precisely measure the IO latency and choose an appropriate number of CUDA streams to enable concurrent execution of independent operations, yielding a constant term extraction kernel with perfect latency hiding, i.e., CTX. Combining these approaches, we reduce the overall execution time to an optimal level, and the speedup ratio increases with the comparison scale. For one comparison, we speed up AKS by 23.71×, CTX by 15.58×, and the whole scheme by 1.83× (resp., 18.29×, 11.75×, and 1.42×) compared to the C (resp., AVX512) baselines. For 32 comparisons, our CTX and scheme implementations outperform the C (resp., AVX512) baselines by 112.00× and 1.99× (resp., 81.53× and 1.51×).
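The compact NTT/INTT pair mentioned above can be illustrated by a plain iterative reference version (Cooley-Tukey butterflies over Z_q with toy parameters; the cuXCMP single-block CUDA kernels are not shown):

```python
def ntt(a, q, w):
    """In-place iterative NTT over Z_q (cyclic convolution form).

    a: list of length n (a power of two); w: a primitive n-th root of
    unity mod q. Reference code only, mirroring the butterfly structure
    a GPU kernel would compute in shared memory.
    """
    n = len(a)
    j = 0                                  # bit-reversal permutation
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2                             # butterfly passes
    while length <= n:
        wlen = pow(w, n // length, q)
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * wn % q
                a[k] = (u + v) % q
                a[k + length // 2] = (u - v) % q
                wn = wn * wlen % q
        length <<= 1
    return a


def intt(a, q, w):
    """Inverse NTT: forward transform with the inverse root, then scale."""
    n = len(a)
    winv = pow(w, q - 2, q)                # w^{-1} mod q (q prime)
    ninv = pow(n, q - 2, q)                # n^{-1} mod q
    ntt(a, q, winv)
    return [x * ninv % q for x in a]
```

For example, with q = 17 and w = 2 (a primitive 8th root of unity mod 17), `intt(ntt(a, 17, 2), 17, 2)` recovers the original vector.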

    CUDA-Accelerated RNS Multiplication in Word-Wise Homomorphic Encryption Schemes

    Homomorphic encryption (HE), which allows computation over encrypted data, has often been used to preserve privacy. However, the computationally heavy nature and complexity of network topologies make the deployment of HE schemes in the Internet of Things (IoT) scenario difficult. In this work, we propose CARM, the first optimized GPU implementation that covers BGV, BFV, and CKKS, targeting the acceleration of homomorphic multiplication on GPUs in heterogeneous IoT systems. We offer constant-time low-level arithmetic with minimal instructions and memory usage, together with performance-prior and memory-prior configurations. We exploit a parametric and generic design that offers various trade-offs between resources and efficiency, yielding a solution suitable for accelerating RNS homomorphic multiplication on both high-performance and embedded GPUs. Through this, we can offer more real-time evaluation results and relieve the computational pressure on cloud devices. We deploy our implementations on two GPUs and achieve up to 378.4×, 234.5×, and 287.2× speedup for the homomorphic multiplication of BGV, BFV, and CKKS on a Tesla V100S, and 8.8×, 9.2×, and 10.3× on a Jetson AGX Xavier, respectively.
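The RNS representation that makes this multiplication GPU-friendly splits one large-integer product into independent channels, one per modulus, so each channel can map to its own thread or limb. A minimal sketch with illustrative moduli (not CARM's parameter sets):

```python
from math import prod


def to_rns(x, moduli):
    """Represent x by its residues modulo pairwise-coprime moduli."""
    return [x % m for m in moduli]


def rns_mul(xr, yr, moduli):
    """Multiply in RNS: every residue channel is independent, which is
    exactly what makes the representation parallelize well on a GPU."""
    return [a * b % m for a, b, m in zip(xr, yr, moduli)]


def from_rns(residues, moduli):
    """CRT reconstruction back to Z_M, where M is the moduli product."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)       # pow(., -1, m): modular inverse
    return x % M
```

Real word-wise HE implementations keep ciphertexts in RNS form throughout and use base-conversion routines instead of full CRT reconstruction, but the channel-wise independence shown here is the core of the speedup.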

    Exploration of Programmed Cell Death-Associated Characteristics and Immune Infiltration in Neonatal Sepsis: New Insights from Bioinformatics Analysis and Machine Learning

    BACKGROUND: Neonatal sepsis, a perilous medical condition typified by organ dysfunction, is the primary cause of neonatal mortality. Nevertheless, the mechanisms underlying neonatal sepsis remain ambiguous. Programmed cell death (PCD) is connected with numerous infectious illnesses and plays a significant role in neonatal sepsis, potentially serving as a marker for diagnosing the condition. METHODS: From the GEO public repository, we selected two datasets, which we refer to as the training and validation sets, for our analysis of neonatal sepsis. We obtained genes related to 12 PCD patterns from databases and published literature. We first obtained differentially expressed genes (DEGs) between neonatal sepsis cases and controls. Three advanced machine learning techniques, namely LASSO, SVM-RFE, and RF, were employed to identify potential genes connected to PCD. To further validate the results, PPI networks were constructed, and artificial neural networks and consensus clustering were applied. Subsequently, a neonatal sepsis diagnostic prediction model was developed and evaluated. We conducted an analysis of immune cell infiltration to examine immune cell dysregulation in neonatal sepsis, and we established a ceRNA network based on the identified marker genes. RESULTS: A total of 49 genes lay in the intersection of the DEGs and PCD-associated genes. Utilizing the three machine learning techniques, six genes were identified as common to both sets. A diagnostic model was constructed by integrating the differential expression profiles and validated with artificial neural networks and consensus clustering. Receiver operating characteristic (ROC) curves were employed to assess the diagnostic merit of the model, which yielded promising results. The immune infiltration analysis revealed notable disparities in patients diagnosed with neonatal sepsis. Furthermore, based on the identified marker genes, the ceRNA network revealed an intricate regulatory interplay. CONCLUSION: In our investigation, we methodically identified six marker genes (AP3B2, STAT3, TSPO, S100A9, GNS, and CX3CR1). An effective diagnostic prediction model emerged from an exhaustive analysis within the training group (AUC 0.930, 95% CI 0.887-0.965) and the validation group (AUC 0.977, 95% CI 0.935-1.000).

    XNET: A Real-Time Unified Secure Inference Framework Using Homomorphic Encryption

    Homomorphic Encryption (HE) presents a promising solution to securing neural networks for Machine Learning as a Service (MLaaS). Despite its potential, the real-time applicability of current HE-based solutions remains a challenge, and the diversity in network structures often results in inefficient implementations and maintenance. To address these issues, we introduce a unified and compact network structure for real-time inference in convolutional neural networks based on HE. We further propose several optimization strategies, including an innovative compression and encoding technique and a rearrangement of the pixel encoding sequence, enabling highly efficient batched computation and reducing the demand for time-consuming HE operations. To further expedite computation, we propose a GPU acceleration engine that leverages massive thread-level parallelism to speed up computations. We test our framework with the MNIST, Fashion-MNIST, and CIFAR-10 datasets, demonstrating accuracies of 99.14%, 90.8%, and 61.09%, respectively. Furthermore, our framework maintains a steady processing speed of 0.46 seconds on a single-threaded CPU, and a brisk 31.862 milliseconds on an A100 GPU for all datasets. This represents a speed improvement of more than 3000× compared to previous work, paving the way for future explorations in the realm of secure and real-time machine learning applications.
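The rearranged pixel encoding for batched computation rests on a standard packing idea: slot p collects pixel p of every image in the batch, so one vector (ciphertext) operation processes that pixel position for the whole batch. A sketch of the plaintext layout only, with hypothetical function names (XNET's actual encoding is more involved):

```python
def batch_encode(images):
    """Rearrange a batch of flattened images into per-pixel slot vectors.

    Slot vector p holds pixel p of every image, so a single operation on
    that vector (in HE, on one ciphertext) touches the whole batch.
    """
    batch = len(images)
    npix = len(images[0])
    return [[images[i][p] for i in range(batch)] for p in range(npix)]


def batch_decode(slots):
    """Invert the packing: recover one flattened image per batch index."""
    batch = len(slots[0])
    return [[slots[p][i] for p in range(len(slots))] for i in range(batch)]
```

With this layout, the amortized per-image cost of an HE operation drops by roughly the batch size, which is why batched encodings dominate HE inference pipelines.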

    Implementing and Benchmarking Word-Wise Homomorphic Encryption Schemes on GPU

    Homomorphic encryption (HE) is one of the most promising techniques for privacy-preserving computation, especially the word-wise HE schemes that allow batched computations over ciphertexts. However, the high computational overhead hinders the deployment of HE in real-world applications. GPUs are often used to accelerate execution in such scenarios, yet a comparison of the performance of different HE schemes on the same GPU platform is still absent. In this work, we implement three word-wise HE schemes, BGV, BFV, and CKKS, on GPU, with both theoretical and engineering optimizations. We optimize the hybrid key-switching technique, reducing the computational and memory overhead of this procedure. We explore several kernel fusing strategies to reuse data, which reduces memory accesses and IO latency, and improves the overall performance. By comparing with state-of-the-art works, we demonstrate the effectiveness of our implementation. Meanwhile, we present a framework that finely integrates our implementations of the three schemes, covering almost all scheme functions and homomorphic operations. We optimize the management of pre-computation, RNS bases, and memory in the framework, to provide efficient and low-latency data access and transfer. Based on this framework, we provide a thorough benchmark of the three schemes, which can serve as a reference for scheme selection and implementation in constructing privacy-preserving applications.
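The kernel-fusing strategy described above can be reduced to its essence in plain code: two elementwise passes that each traverse memory are merged into one pass with no intermediate array. A toy sketch (the actual CUDA kernels are not shown):

```python
def axpy_then_mod(a, x, y, q):
    """Unfused version: two passes, materializing an intermediate vector."""
    t = [a * xi + yi for xi, yi in zip(x, y)]   # pass 1: writes t
    return [ti % q for ti in t]                 # pass 2: rereads t


def fused_axpy_mod(a, x, y, q):
    """Fused version: one pass, no intermediate vector.

    On a GPU this is the essence of kernel fusing: one launch instead of
    two, and one round trip through global memory instead of two.
    """
    return [(a * xi + yi) % q for xi, yi in zip(x, y)]
```

Both functions compute the same result; the fused form simply halves the number of full traversals, which is where memory-bound HE kernels spend most of their time.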

    Leveraging GPU in Homomorphic Encryption: Framework Design and Analysis of BFV Variants

    Homomorphic Encryption (HE) enhances data security by facilitating computations on encrypted data, opening new paths for privacy-focused computations. The Brakerski-Fan-Vercauteren (BFV) scheme, a promising HE scheme, poses considerable performance challenges. Graphics Processing Units (GPUs), with considerable parallel processing abilities, have emerged as an effective solution. In this work, we present an in-depth study focused on accelerating and comparing BFV variants on GPUs, including Bajard-Eynard-Hasan-Zucca (BEHZ), Halevi-Polyakov-Shoup (HPS), and other recent variants. We introduce a universal framework accommodating all variants, propose an optimized BEHZ implementation, and are the first to support HPS variants with large parameter sets on GPUs. Moreover, we devise several optimizations for both low-level arithmetic and high-level operations, including minimizing instructions for modular operations, enhancing hardware utilization for base conversion, implementing efficient reuse strategies, and introducing intra-arithmetic and inner-conversion fusion methods, thus decreasing the overall computational and memory consumption. Leveraging our framework, we offer comprehensive comparative analyses. Our performance evaluation showcases a marked speed improvement, achieving 31.9× over OpenFHE running on a multi-threaded CPU, and 39.7% and 29.9% improvement, respectively, over the state-of-the-art GPU BEHZ implementation. Our implementation of the leveled HPS variant records up to 4× speedup over other variants, positioning it as a highly promising alternative for specific applications.
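Instruction-minimized modular operations of the kind mentioned above are commonly realized with Barrett-style reduction, which replaces a hardware division by a multiply, a shift, and a cheap correction. A minimal sketch of the generic technique (not necessarily the paper's exact routine):

```python
def barrett_setup(q, k=64):
    """Precompute mu = floor(2^k / q), done once per modulus."""
    return (1 << k) // q


def barrett_reduce(x, q, mu, k=64):
    """Reduce x mod q without a division on the hot path.

    Valid for 0 <= x < q**2 with q**2 <= 2**k; the approximate quotient
    undershoots by at most a small constant, so a short correction loop
    (at most two subtractions) finishes the reduction.
    """
    t = (x * mu) >> k          # approximate quotient via multiply + shift
    r = x - t * q
    while r >= q:              # bounded correction, branch-friendly on GPU
        r -= q
    return r
```

With the ML-DSA/Dilithium prime q = 8380417, for instance, `barrett_reduce(x, q, barrett_setup(q))` matches `x % q` for any x below q².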

    A note on the security of KHL scheme

    Agency for Science, Technology and Research (A*STAR)