1,915 research outputs found

    Analysis of Parallel Montgomery Multiplication in CUDA

    Get PDF
    For a given level of security, elliptic curve cryptography (ECC) offers improved efficiency over classic public key implementations. Point multiplication is the most common operation in ECC and, consequently, any significant improvement in perfor- mance will likely require accelerating point multiplication. In ECC, the Montgomery algorithm is widely used for point multiplication. The primary purpose of this project is to implement and analyze a parallel implementation of the Montgomery algorithm as it is used in ECC. Specifically, the performance of CPU-based Montgomery multiplication and a GPU-based implementation in CUDA are compared

    Accelerating Homomorphic Encryption in the Cloud Environment through High-Level Synthesis and Reconfigurable Resources

    Get PDF
    The recent surge in cloud services is revolutionizing the way that data is stored and processed. Everyone with an internet connection, from large corporations to small companies and private individuals, now have access to cutting-edge processing power and vast amounts of data storage. This rise in cloud computing and storage, however, has brought with it a need for a new type of security. In order to have access to cloud services, users must allow the service provider to have full access to their private, unencrypted data. Users are required to trust the integrity of the service provider and the security of its data centers. The recent development of fully homomorphic encryption schemes can offer a solution to this dilemma. These algorithms allow encrypted data to be used in computations without ever stripping the data of the protection of encryption. Unfortunately, the demanding memory requirements and computational complexity of the proposed schemes has hindered their wide-scale use. Custom hardware accelerators for homomorphic encryption could be implemented on the increasing number of reconfigurable hardware resources in the cloud, but the long development time required for these processors would lead to high production costs. This research seeks to develop a strategy for faster development of homomorphic encryption hardware accelerators using the process of High-Level Synthesis. Insights from existing number theory software libraries and custom hardware accelerators are used to develop a scalable, proof-of-concept software implementation of Karatsuba modular polynomial multiplication. This implementation was designed to be used with High-Level Synthesis to accelerate the large modular polynomial multiplication operations required by homomorphic encryption. The accelerator generated from this implementation by the High-Level Synthesis tool Vivado HLS achieved significant speedup over the implementations available in the highly-optimized FLINT software library

    Survey on Hardware Implementation of Montgomery Modular

    Get PDF
    This paper gives the information regarding different methodology for modular multiplication with the modification of Montgomery algorithm. Montgomery multiplier proved to be more efficient multiplier which replaces division by the modulus with series of shifting by a number and an adder block. For larger number of bits, Modular multiplication takes more time to compute and also takes more area of the chip. Different methods ensure more speed and less chip size of the system. The speed of the multiplier is decided by the multiplier. Here three modified Montgomery algorithm discussed with their output compared with each other. The three methods are Iterative architecture, Montgomery multiplier for faster Cryptography and Vedic multipliers used in Montgomery algorithm for multiplication.Here three boards have been used for the analysis and they are Altera DE2-70, FPGA board Virtex 6 and Kintex 7

    Survey on Hardware Implementation of Montgomery Modular

    Get PDF
    This paper gives the information regarding different methodology for modular multiplication with the modification of Montgomery algorithm. Montgomery multiplier proved to be more efficient multiplier which replaces division by the modulus with series of shifting by a number and an adder block. For larger number of bits, Modular multiplication takes more time to compute and also takes more area of the chip. Different methods ensure more speed and less chip size of the system. The speed of the multiplier is decided by the multiplier. Here three modified Montgomery algorithm discussed with their output compared with each other. The three methods are Iterative architecture, Montgomery multiplier for faster Cryptography and Vedic multipliers used in Montgomery algorithm for multiplication.Here three boards have been used for the analysis and they are Altera DE2-70, FPGA board Virtex 6 and Kintex 7

    Versatile Montgomery Multiplier Architectures

    Get PDF
    Several algorithms for Public Key Cryptography (PKC), such as RSA, Diffie-Hellman, and Elliptic Curve Cryptography, require modular multiplication of very large operands (sizes from 160 to 4096 bits) as their core arithmetic operation. To perform this operation reasonably fast, general purpose processors are not always the best choice. This is why specialized hardware, in the form of cryptographic co-processors, become more attractive. Based upon the analysis of recent publications on hardware design for modular multiplication, this M.S. thesis presents a new architecture that is scalable with respect to word size and pipelining depth. To our knowledge, this is the first time a word based algorithm for Montgomery\u27s method is realized using high-radix bit-parallel multipliers working with two different types of finite fields (unified architecture for GF(p) and GF(2n)). Previous approaches have relied mostly on bit serial multiplication in combination with massive pipelining, or Radix-8 multiplication with the limitation to a single type of finite field. Our approach is centered around the notion that the optimal delay in bit-parallel multipliers grows with logarithmic complexity with respect to the operand size n, O(log3/2 n), while the delay of bit serial implementations grows with linear complexity O(n). Our design has been implemented in VHDL, simulated and synthesized in 0.5μ CMOS technology. The synthesized net list has been verified in back-annotated timing simulations and analyzed in terms of performance and area consumption
    corecore