108 research outputs found

    Realizing arbitrary-precision modular multiplication with a fixed-precision multiplier datapath

    Get PDF
    Within the context of cryptographic hardware, the term scalability refers to the ability to process operands of any size, regardless of the precision of the underlying data path or registers. In this paper we present a simple yet effective technique for increasing the scalability of a fixed-precision Montgomery multiplier. Our idea is to extend the datapath of a Montgomery multiplier in such a way that it can also perform an ordinary multiplication of two n-bit operands (without modular reduction), yielding a 2n-bit result. This conventional (nxn->2n)-bit multiplication is then used as a “sub-routine” to realize arbitrary-precision Montgomery multiplication according to standard software algorithms such as Coarsely Integrated Operand Scanning (CIOS). We show that performing a 2n-bit modular multiplication on an n-bit multiplier can be done in 5n clock cycles, whereby we assume that the n-bit modular multiplication takes n cycles. Extending a Montgomery multiplier for this extra functionality requires just some minor modifications of the datapath and entails a slight increase in silicon area

    A high-speed integrated circuit with applications to RSA Cryptography

    Get PDF
    Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the most well known is the RSA ciphersystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable encryption rates using the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64kbit/s making it attractive for use both as a general RSA processor within the security function provider of a security system, and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, are proposed that remove the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation

    Modular Exponentiation on Reconfigurable Hardware

    Get PDF
    It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. A central tool for achieving system security are cryptographic algorithms. For performance as well as for physical security reasons, it is often advantageous to realize cryptographic algorithms in hardware. In order to overcome the well-known drawback of reduced flexibility that is associated with traditional ASIC solutions, this contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine two versions of Montgomery modular multiplication algorithm with new systolic array designs which are well suited for FPGA realizations. The first one is based on a radix of two and is capable of processing a variable number of bits per array cell leading to a low cost design. The second design uses a radix of sixteen, resulting in a speed-up of a factor three at the cost of more used resources. The designs are flexible, allowing any choice of operand and modulus. Unlike previous approaches, we systematically implement and compare several versions of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture on Xilinx XC4000 series FPGAs. As a first practical result we show that it is possible to implement modular exponentiation at secure bit lengths on a single commercially available FPGA. Secondly we present faster processing times than previously reported. The Diffie-Hellman key exchange scheme with a modulus of 1024 bits and an exponent of 160 bits is computed in 1.9 ms. Our fastest design computes a 1024 bit RSA decryption in 3.1 ms when the Chinese remainder theorem is applied. These times are more than ten times faster than any reported software implementation. They also outperform most of the hardware-implementations presented in technical literature

    Montgomery Modular Multiplication on Reconfigurable Hardware: Systolic versus Multiplexed Implementation

    Get PDF
    This paper describes a comparison of two Montgomery modular multiplication architectures: a systolic and a multiplexed. Both implementations target FPGA devices. The modular multiplication is employed in modular exponentiation processes, which are the most important operations of some public-key cryptographic algorithms, including the most popular of them, the RSA. The proposed systolic architecture presents a high-radix implementation with a one-dimensional array of Processing Elements. The multiplexed implementation is a new alternative and is composed of multiplier blocks in parallel with the new simplified Processing Elements, and it provides a pipelined operation mode. We compare the time × area efficiency for both architectures as well as an RSA application. The systolic implementation can run the 1024 bits RSA decryption process in just 3.23 ms, and the multiplexed architecture executes the same operation in 4.36 ms, but the second approach saves up to 28% of logical resources. These results are competitive with the state-of-the-art performance

    Revisiting sum of residues modular multiplication

    Get PDF
    the 1980s,when the introduction of public key cryptography spurred interest in modularmultiplication, many implementations performed modularmultiplication using a sumof residues. As the fieldmatured, sum of residues modularmultiplication lost favour to the extent that all recent surveys have either overlooked it or incorporated it within a larger class of reduction algorithms. In this paper, we present a new taxonomy of modular multiplication algorithms. We include sum of residues as one of four classes and argue why it should be considered different to the other, now more common, algorithms.We then apply techniques developed for other algorithms to reinvigorate sum of residues modular multiplication. We compare FPGA implementations of modular multiplication up to 24 bits wide. The Sum of Residues multipliers demonstrate reduced latency at nearly 50% compared to Montgomery architectures at the cost of nearly doubled circuit area. The new multipliers are useful for systems based on the Residue Number System (RNS).Yinan Kong and Braden Phillip

    Montgomery and RNS for RSA Hardware Implementation

    Get PDF
    There are many architectures for RSA hardware implementation which improve its performance. Two main methods for this purpose are Montgomery and RNS. These are fast methods to convert plaintext to ciphertext in RSA algorithm with hardware implementation. RNS is faster than Montgomery but it uses more area. The goal of this paper is to compare these two methods based on the speed and on the used area. For this purpose the architecture that has a better performance for each method is selected, and some modification is done to enhance their performance. This comparison can be used to select the proper method for hardware implementation in both FPGA and ASIC design

    Arithmetic Considerations for Isogeny Based Cryptography

    Get PDF
    In this paper we investigate various arithmetic techniques which can be used to potentially enhance the performance in the supersingular isogeny Diffie-Hellman (SIDH) key-exchange protocol which is one of the more recent contenders in the post-quantum public-key arena. Firstly, we give a systematic overview of techniques to compute efficient arithmetic modulo 2xpy±12^xp^y\pm 1. Our overview shows that in the SIDH setting, where arithmetic over a quadratic extension field is required, the approaches based on Montgomery reduction for such primes of a special shape are to be preferred. Moreover, the outcome of our investigation reveals that there exist moduli which allow even faster implementations. Secondly, we investigate if it is beneficial to use other curve models to speed-up the elliptic curve scalar multiplication. The use of twisted Edwards curves allows one to search for efficient addition-subtraction chains for fixed scalars while this is not possible with the differential addition law when using Montgomery curves. Our preliminary results show that despite the fact that we found such efficient chains, using twisted Edwards curves does not result in faster scalar multiplication arithmetic in the setting of SIDH

    Efficient and Fast Hardware Architectures for SIKE Round 2 on FPGA

    Get PDF
    New primes were proposed for Supersingular Isogeny Key Encapsulation (SIKE) in NIST standardization process of Round 2 after further cryptanalysis research showed that the security levels of the initial primes chosen were over-estimated. In this paper, we develop a highly optimized Fp\mathbb{F}_{p} Montgomery multiplication algorithm and architecture that further utilizes the special form of SIKE prime compared to previous implementations available in the literature. We then implement SIKE for all Round 2 NIST security levels (SIKEp434 for NIST security level 1, SIKEp503 for NIST security level 2, SIKEp610 for NIST security level 3, and SIKEp751 for NIST security level 5) on Xilinx Virtex 7 using the proposed multiplier. Our best implementation (NIST security level 1) runs 29\% faster and occupies 30\% less hardware resources in comparison to the leading counterpart available in the literature and implementations for other security levels achieved similar improvement
    corecore