22 research outputs found

    NTRU Modular Lattice Signature Scheme on CUDA GPUs

    Get PDF
    In this work we show how to use Graphics Processing Units (GPUs) with Compute Unified Device Architecture (CUDA) to accelerate a lattice based signature scheme, namely, the NTRU modular lattice signature (NTRU-MLS) scheme. Lattice based schemes require operations on large vectors that are perfect candidates for GPU implementations. In addition, similar to most lattice based signature schemes, NTRU-MLS provides transcript security with a rejection sampling technique. With a GPU implementation, we are able to generate many candidates simultaneously, and hence mitigate the performance slowdown from rejection sampling. Our implementation results show that for the original NTRU-MLS parameter sets, we obtain a 2x improvement in the signing speed; for the revised parameter sets, where acceptance rate of rejection sampling is down to around 1%, our implementation can be as much as 47x faster than a CPU implementation

    TMVP-based Polynomial Convolution for Saber and Sable on GPU using CUDA-cores and Tensor-cores

    Get PDF
    Recently proposed lattice-based cryptography algorithms can be used to protect the IoT communication against the threat from quantum computers, but they are computationally heavy. In particular, polynomial multiplication is one of the most time-consuming operations in lattice-based cryptography. To achieve efficient implementation, the Number Theoretic Transform (NTT) algorithm is an ideal choice, but it has certain limitations on the parameters, which not all lattice-based schemes can employ directly. Hence, alternative techniques are proposed to accelerate polynomial multiplication on lattice-based schemes that cannot utilize the NTT directly. In this paper, we propose a parallel Toeplitz matrix-vector product (TMVP) version to accelerate the polynomial multiplication in PQC algorithms implemented it on a graphics processing unit (GPU). This is the first time a TMVP parallel version has been proposed and experimented on different GPU cores (i.e., CUDA-cores and Tensor-cores). The effectiveness of the proposed solution is validated on Saber (the NIST post-quantum standardization finalist) and Sable (an improved version of Saber) schemes. Experimental results show that TMVP-based polynomial convolution using CUDA-cores fails to exhibit a significant enhancement compared to the schoolbook CUDA-core method already proposed by Hafeez et al. 2023. However, when the TMVP technique is applied to Tensor-cores, it outperformed state-of-the-art implementations. The proposed Tensor-core approach outperformed the schoolbook Tensor-core method by up to 1.21×, and outperformed the dot-product-instructions method (Lee et al. 2022) by up to 3.63×. The proposed TMVP Tensor-cores is also faster than the TMVP CUDA-cores method by 13.76

    Cryptanalysis and Secure Implementation of Modern Cryptographic Algorithms

    Get PDF
    Cryptanalytic attacks can be divided into two classes: pure mathematical attacks and Side Channel Attacks (SCAs). Pure mathematical attacks are traditional cryptanalytic techniques that rely on known or chosen input-output pairs of the cryptographic function and exploit the inner structure of the cipher to reveal the secret key information. On the other hand, in SCAs, it is assumed that attackers have some access to the cryptographic device and can gain some information from its physical implementation. Cold-boot attack is a SCA which exploits the data remanence property of Random Access Memory (RAM) to retrieve its content which remains readable shortly after its power has been removed. Fault analysis is another example of SCAs in which the attacker is assumed to be able to induce faults in the cryptographic device and observe the faulty output. Then, by careful inspection of faulty outputs, the attacker recovers the secret information, such as secret inner state or secret key. Scan-based Design-For-Test (DFT) is a widely deployed technique for testing hardware chips. Scan-based SCAs exploit the information obtained by analyzing the scanned data in order to retrieve secret information from cryptographic hardware devices that are designed with this testability feature. In the first part of this work, we investigate the use of an off-the-shelf SAT solver, CryptoMinSat, to improve the key recovery of the Advance Encryption Standard (AES-128) key schedules from its corresponding decayed memory images which can be obtained using cold-boot attacks. We also present a fault analysis on both NTRUEncrypt and NTRUSign cryptosystems. For this specific original instantiation of the NTRU encryption system with parameters (N,p,q)(N,p,q), our attack succeeds with probability ≈1−1p\approx 1-\frac{1}{p} and when the number of faulted coefficients is upper bounded by tt, it requires O((pN)t)O((pN)^t) polynomial inversions in Z/pZ[x]/(xN−1)\mathbb Z/p\mathbb Z[x]/(x^{N}-1). We also investigate several techniques to strengthen hardware implementations of NTRUEncrypt against this class of attacks. For NTRUSign with parameters (NN, q=plq=p^l, B\mathcal{B}, \emph{standard}, N\mathcal{N}), when the attacker is able to skip the norm-bound signature checking step, our attack needs one fault to succeed with probability ≈1−1p\approx 1-\frac{1}{p} and requires O((qN)t)O((qN)^t) steps when the number of faulted polynomial coefficients is upper bounded by tt. The attack is also applicable to NTRUSign utilizing the \emph{transpose} NTRU lattice but it requires double the number of fault injections. Different countermeasures against the proposed attack are also investigated. Furthermore, we present a scan-based SCA on NTRUEncrypt hardware implementations that employ scan-based DFT techniques. Our attack determines the scan chain structure of the polynomial multiplication circuits used in the decryption algorithm which allows the cryptanalyst to efficiently retrieve the secret key. Several key agreement schemes based on matrices were recently proposed. For example, \'{A}lvarez \emph{et al.} proposed a scheme in which the secret key is obtained by multiplying powers of block upper triangular matrices whose elements are defined over Zp\mathbb{Z}_p. Climent \emph{et al.} identified the elements of the endomorphisms ring End(Zp×Zp2)End(\mathbb{Z}_p \times \mathbb{Z}_{p^2}) with elements in a set, EpE_p, of matrices of size 2×22\times 2, whose elements in the first row belong to Zp\mathbb{Z}_{p} and the elements in the second row belong to Zp2\mathbb{Z}_{p^2}. Keith Salvin presented a key exchange protocol using matrices in the general linear group, GL(r,Zn)GL(r,\mathbb{Z}_n), where nn is the product of two distinct large primes. The system is fully specified in the US patent number 7346162 issued in 2008. In the second part of this work, we present mathematical cryptanalytic attacks against these three schemes and show that they can be easily broken for all practical choices of their security parameters

    HI-Kyber: A novel high-performance implementation scheme of Kyber based on GPU

    Get PDF
    CRYSTALS-Kyber, as the only public key encryption (PKE) algorithm selected by the National Institute of Standards and Technology (NIST) in the third round, is considered one of the most promising post-quantum cryptography (PQC) schemes. Lattice-based cryptography uses complex discrete alogarithm problems on lattices to build secure encryption and decryption systems to resist attacks from quantum computing. Performance is an important bottleneck affecting the promotion of post quantum cryptography. In this paper, we present a High-performance Implementation of Kyber (named HI-Kyber) on the NVIDIA GPUs, which can increase the key-exchange performance of Kyber to the million-level. Firstly, we propose a lattice-based PQC implementation architecture based on kernel fusion, which can avoid redundant global-memory access operations. Secondly, We optimize and implement the core operations of CRYSTALS-Kyber, including Number Theoretic Transform (NTT), inverse NTT (INTT), pointwise multiplication, etc. Especially for the calculation bottleneck NTT operation, three novel methods are proposed to explore extreme performance: the sliced layer merging (SLM), the sliced depth-first search (SDFS-NTT) and the entire depth-first search (EDFS-NTT), which achieve a speedup of 7.5%, 28.5%, and 41.6% compared to the native implementation. Thirdly, we conduct comprehensive performance experiments with different parallel dimensions based on the above optimization. Finally, our key exchange performance reaches 1,664 kops/s. Specifically, based on the same platform, our HI-Kyber is 3.52×\times that of the GPU implementation based on the same instruction set and 1.78×\times that of the state-of-the-art one based on AI-accelerated tensor core

    Transcript secure signatures based on modular lattices

    Get PDF
    We introduce a class of lattice-based digital signature schemes based on modular properties of the coordinates of lattice vectors. We also suggest a method of making such schemes transcript secure via a rejection sampling technique of Lyubashevsky (2009). A particular instantiation of this approach is given, using NTRU lattices. Although the scheme is not supported by a formal security reduction, we present arguments for its security and derive concrete parameters (first version) based on the performance of state-of-the-art lattice reduction and enumeration tech- niques. In the revision, we re-evaluate the security of first version of the parameter sets, under the hybrid approach of lattice reduction attack the meet-in-the-middle attack. We present new sets of parameters that are robust against this attack, as well as all previous known attacks

    Homomorphic Encryption for Machine Learning in Medicine and Bioinformatics

    Get PDF
    Machine learning techniques are an excellent tool for the medical community to analyzing large amounts of medical and genomic data. On the other hand, ethical concerns and privacy regulations prevent the free sharing of this data. Encryption methods such as fully homomorphic encryption (FHE) provide a method evaluate over encrypted data. Using FHE, machine learning models such as deep learning, decision trees, and naive Bayes have been implemented for private prediction using medical data. FHE has also been shown to enable secure genomic algorithms, such as paternity testing, and secure application of genome-wide association studies. This survey provides an overview of fully homomorphic encryption and its applications in medicine and bioinformatics. The high-level concepts behind FHE and its history are introduced. Details on current open-source implementations are provided, as is the state of FHE for privacy-preserving techniques in machine learning and bioinformatics and future growth opportunities for FHE

    A Novel High-performance Implementation of CRYSTALS-Kyber with AI Accelerator

    Get PDF
    Public-key cryptography, including conventional cryptosystems and post-quantum cryptography, involves computation-intensive workloads. With noticing the extraordinary computing power of AI accelerators, in this paper, we further explore the feasibility to introduce AI accelerators into high-performance cryptographic computing. Since AI accelerators are dedicated to machine learning or neural networks, the biggest challenge is how to transform cryptographic workloads into their operations, while ensuring the correctness of the results and bringing convincing performance gains. After investigating and analysing the workload of NVIDIA AI accelerator, Tensor Core, we choose to utilize it to accelerate the polynomial multiplication, usually the most time-consuming part in lattice-based cryptography. We take measures to accommodate the matrix-multiply-and-add mode of Tensor Core and make a trade-off between precision and performance, to leverage it as a high-performance NTT box performing NTT/INTT through CUDA C++ WMMA APIs. Meanwhile, we take CRYSTALS-Kyber, the candidate to be standardized by NIST, as a case study on RTX 3080 with the Ampere Tensor Core. The empirical results show that the customized NTT of polynomial vector (n=256,k=4n=256,k=4) with our NTT box obtains a speedup around 6.47x that of the state-of-the-art implementation on the same GPU platform. Compared with the AVX2 implementation submitted to NIST, our Kyber-1024 can achieve a speedup of 26x, 36x, and 35x for each phase

    TPU as Cryptographic Accelerator

    Full text link
    Polynomials defined on specific rings are heavily involved in various cryptographic schemes, and the corresponding operations are usually the computation bottleneck of the whole scheme. We propose to utilize TPU, an emerging hardware designed for AI applications, to speed up polynomial operations and convert TPU to a cryptographic accelerator. We also conduct preliminary evaluation and discuss the limitations of current work and future plan
    corecore