    Analysis of Parallel Montgomery Multiplication in CUDA

    For a given level of security, elliptic curve cryptography (ECC) offers improved efficiency over classic public key implementations. Point multiplication is the most common operation in ECC and, consequently, any significant improvement in performance will likely require accelerating point multiplication. In ECC, the Montgomery algorithm is widely used for point multiplication. The primary purpose of this project is to implement and analyze a parallel implementation of the Montgomery algorithm as it is used in ECC. Specifically, the performance of CPU-based Montgomery multiplication and a GPU-based implementation in CUDA are compared.

    Low-Latency Elliptic Curve Scalar Multiplication

    This paper presents a low-latency algorithm designed for parallel computer architectures to compute the scalar multiplication of elliptic curve points based on approaches from cryptographic side-channel analysis. A graphics processing unit implementation using a standardized elliptic curve over a 224-bit prime field, complying with the new 112-bit security level, computes the scalar multiplication in 1.9 ms on the NVIDIA GTX 500 architecture family. The presented methods and implementation considerations can be applied to any parallel 32-bit architecture.

    Supporting Preemptive Task Executions and Memory Copies in GPGPUs

    GPGPUs (General-Purpose Graphics Processing Units) provide massive computational power. However, applying GPGPU technology to real-time computing is challenging due to the non-preemptive nature of GPGPUs. In particular, a job running on a GPGPU or a data copy between a GPGPU and the CPU is non-preemptive. As a result, a high-priority job arriving in the middle of a low-priority job execution or memory copy suffers from priority inversion. To address the problem, we present a new lightweight approach to supporting preemptive memory copies and job executions in GPGPUs. Moreover, in our approach, a GPGPU job and a memory copy between the GPGPU and the hosting CPU run concurrently to enhance responsiveness. To show the feasibility of our approach, we have implemented a prototype system for preemptive job executions and data copies in a GPGPU. The experimental results show that our approach can bound the response times in a reliable manner. In addition, the response times of our approach are significantly shorter than those of the unmodified GPGPU runtime system, which supports no preemption, and of an advanced GPGPU model designed to support prioritization and performance isolation via preemptive data copies.

    Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

    Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture by porting the integer-based algorithms from general-purpose CPUs to GPUs to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of the more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical use of the proposed algorithm, a new method converts the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption outperforms the existing fastest integer-based result by 23%.

    Selected RNS Bases for Modular Multiplication

    A coprocessor for secure and high speed modular arithmetic

    We present a coprocessor design for fast arithmetic over large numbers of cryptographic sizes. Our design provides an efficient way to prevent side-channel analysis as well as fault analysis targeting modular arithmetic with large prime or composite numbers. These two countermeasures are therefore suitable both for Elliptic Curve Cryptography over prime fields and for RSA, with or without CRT. To do so, we use the residue number system (RNS) in an efficient manner to protect against leakage and faults, while keeping its ability to execute modular arithmetic with large numbers quickly. We illustrate our countermeasures with a fully protected RSA-CRT implementation using our architecture, and show that it is possible to execute a secure 1024-bit RSA-CRT in less than 0.7 ms on an FPGA.
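
    As a rough illustration of the residue number system (RNS) arithmetic mentioned in this abstract, the sketch below splits a multiplication into independent small-modulus channels and recombines the result with the Chinese Remainder Theorem. It is not the coprocessor design or its countermeasures; the function names and the example moduli are assumptions chosen for illustration only.

```python
from itertools import combinations
from math import gcd, prod


def check_base(moduli):
    """Raise if the moduli are not pairwise coprime (required for RNS)."""
    for a, b in combinations(moduli, 2):
        if gcd(a, b) != 1:
            raise ValueError(f"moduli {a} and {b} share a factor")


def to_rns(x, moduli):
    """Represent x by its residues modulo each channel modulus."""
    return [x % m for m in moduli]


def rns_mul(xs, ys, moduli):
    """Multiply channel by channel; no carries cross between channels."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]


def from_rns(residues, moduli):
    """Recombine residues with the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        # pow(m_i, -1, m) is the modular inverse (Python 3.8+).
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


# Illustrative base: 251, 253 = 11*23, 255 = 3*5*17, 256 = 2**8.
MODULI = [251, 253, 255, 256]
check_base(MODULI)

x, y = 12345, 54321
product = from_rns(rns_mul(to_rns(x, MODULI), to_rns(y, MODULI), MODULI), MODULI)
```

    The result is exact only while the true product stays below the product of the moduli, which is why practical RNS designs choose a base wide enough for the operand size.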

    Faster Modular Arithmetic For Isogeny Based Crypto on Embedded Devices

    We show how to implement the Montgomery reduction algorithm for isogeny-based cryptography such that it can utilize the unsigned multiply accumulate accumulate long instruction present on modern ARM architectures. This results in a practical speed-up by a factor of 1.34 compared to the approach used by SIKE, the supersingular isogeny-based submission to the ongoing post-quantum standardization effort. Moreover, motivated by the recent work of Costello and Hisil (ASIACRYPT 2017), which shows that there is only a moderate degradation in performance when evaluating large odd-degree isogenies, we search for more general supersingular isogeny-friendly moduli. Using graphics processing units to accelerate this search, we find many such moduli which allow for faster implementations on embedded devices. By combining these two approaches we manage to make the modular reduction 1.5 times as fast on a 32-bit ARM platform.
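
    For readers unfamiliar with Montgomery reduction, the following is a minimal textbook sketch using Python big integers. It does not reproduce the paper's ARM-specific word-level implementation or its special moduli; the helper names and the 32-bit word size default are assumptions made for illustration.

```python
def montgomery_setup(n, word_bits=32):
    """Precompute k and n_prime for an odd modulus n, with r = 2**k > n."""
    if n % 2 == 0:
        raise ValueError("Montgomery reduction needs an odd modulus")
    # Round the bit length of n up to a whole number of machine words.
    k = -(-n.bit_length() // word_bits) * word_bits
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # n * n_prime == -1 (mod r)
    return k, n_prime


def redc(t, n, k, n_prime):
    """Return t * r^-1 mod n for 0 <= t < r * n, where r = 2**k."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by r
    u = (t + m * n) >> k               # exact division by r
    return u - n if u >= n else u


def to_mont(a, n, k):
    """Map a into the Montgomery domain: a * r mod n."""
    return (a << k) % n


def mont_mul(a_bar, b_bar, n, k, n_prime):
    """Multiply two Montgomery-domain values, staying in the domain."""
    return redc(a_bar * b_bar, n, k, n_prime)


# Example with an arbitrary odd modulus (hypothetical, not from the paper).
N = (1 << 89) - 1
K, N_PRIME = montgomery_setup(N)
a, b = 123456789123456789, 987654321987654321
a_bar, b_bar = to_mont(a, N, K), to_mont(b, N, K)
result = redc(mont_mul(a_bar, b_bar, N, K, N_PRIME), N, K, N_PRIME)
```

    The only division in `redc` is a shift by `k` bits, which is the property that makes the technique attractive on hardware without a fast divider.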

    Cofactorization on Graphics Processing Units

    We show how the cofactorization step, a compute-intensive part of the relation collection phase of the number field sieve (NFS), can be farmed out to a graphics processing unit. Our implementation on a GTX 580 GPU, which is integrated with a state-of-the-art NFS implementation, can serve as a cryptanalytic co-processor for several Intel i7-3770K quad-core CPUs simultaneously. This allows those processors to focus on the memory-intensive sieving and results in more useful NFS relations found in less time.