34 research outputs found

    Fast, compact and secure implementation of rsa on dedicated hardware

    Get PDF
    RSA is the most popular Public Key Cryptosystem (PKC) and is heavily used today. PKC comes into play, when two parties, who have previously never met, want to create a secure channel between them. The core operation in RSA is modular multiplication, which requires lots of computational power especially when the operands are longer than 1024-bits. Although today’s powerful PC’s can easily handle one RSA operation in a fraction of a second, small devices such as PDA’s, cell phones, smart cards, etc. have limited computational power, thus there is a need for dedicated hardware which is specially designed to meet the demand of this heavy calculation. Additionally, web servers, which thousands of users can access at the same time, need to perform many PKC operations in a very short time and this can create a performance bottleneck. Special algorithms implemented on dedicated hardware can take advantage of true massive parallelism and high utilization of the data path resulting in high efficiency in terms of both power and execution time while keeping the chip cost low. We will use the “Montgomery Modular Multiplication” algorithm in our implementation, which is considered one of the most efficient multiplication schemes, and has many applications in PKC. In the first part of the thesis, our “2048-bit Radix-4 based Modular Multiplier” design is introduced and compared with the conventional radix-2 modular multipliers of previous works. Our implementation for 2048-bit modular multiplication features up to 82% shorter execution time with 33% increase in the area over the conventional radix-2 designs and can achieve 132 MHz on a Xilinx xc2v6000 FPGA. The proposed multiplier has one of the fastest execution times in terms of latency and performs better than (37% better) our reference radix-2 design in terms of time-area product. The results are similar in the ASIC case where we implement our design for UMC 0.18 μm technology. In the second part, a fast, efficient, and parameterized modular multiplier and a secure exponentiation circuit intended for inexpensive FPGAs are presented. The design utilizes hardwired block multipliers as the main functional unit and Block-RAM as storage unit for the operands. The adopted design methodology allows adjusting the number of multipliers, the radix used in the multipliers, and number of words to meet the system requirements such as available resources, precision and timing constraints. The deployed method is based on the Montgomery modular multiplication algorithm and the architecture utilizes a pipelining technique that allows concurrent operation of hardwired multipliers. Our design completes 1020-bit and 2040-bit modular multiplications* in 7.62 μs and 27.0 μs respectively with approximately the same device usage on Xilinx Spartan-3E 500. The multiplier uses a moderate amount of system resources while achieving the best area-time product in literature. 2040-bit modular exponentiation engine easily fits into Xilinx Spartan-3E 500; moreover the exponentiation circuit withstands known side channel attacks with an insignificant overhead in area and execution time. The upper limit on the operand precision is dictated only by the available Block-RAM to accommodate the operands within the FPGA. This design is also compared to the first one, considering the relative advantages and disadvantages of each circuit

    Studies on high-speed hardware implementation of cryptographic algorithms

    Get PDF
    Cryptographic algorithms are ubiquitous in modern communication systems where they have a central role in ensuring information security. This thesis studies efficient implementation of certain widely-used cryptographic algorithms. Cryptographic algorithms are computationally demanding and software-based implementations are often too slow or power consuming which yields a need for hardware implementation. Field Programmable Gate Arrays (FPGAs) are programmable logic devices which have proven to be highly feasible implementation platforms for cryptographic algorithms because they provide both speed and programmability. Hence, the use of FPGAs for cryptography has been intensively studied in the research community and FPGAs are also the primary implementation platforms in this thesis. This thesis presents techniques allowing faster implementations than existing ones. Such techniques are necessary in order to use high-security cryptographic algorithms in applications requiring high data rates, for example, in heavily loaded network servers. The focus is on Advanced Encryption Standard (AES), the most commonly used secret-key cryptographic algorithm, and Elliptic Curve Cryptography (ECC), public-key cryptographic algorithms which have gained popularity in the recent years and are replacing traditional public-key cryptosystems, such as RSA. Because these algorithms are well-defined and widely-used, the results of this thesis can be directly applied in practice. The contributions of this thesis include improvements to both algorithms and techniques for implementing them. Algorithms are modified in order to make them more suitable for hardware implementation, especially, focusing on increasing parallelism. Several FPGA implementations exploiting these modifications are presented in the thesis including some of the fastest implementations available in the literature. The most important contributions of this thesis relate to ECC and, specifically, to a family of elliptic curves providing faster computations called Koblitz curves. The results of this thesis can, in their part, enable increasing use of cryptographic algorithms in various practical applications where high computation speed is an issue

    Versatile Montgomery Multiplier Architectures

    Get PDF
    Several algorithms for Public Key Cryptography (PKC), such as RSA, Diffie-Hellman, and Elliptic Curve Cryptography, require modular multiplication of very large operands (sizes from 160 to 4096 bits) as their core arithmetic operation. To perform this operation reasonably fast, general purpose processors are not always the best choice. This is why specialized hardware, in the form of cryptographic co-processors, become more attractive. Based upon the analysis of recent publications on hardware design for modular multiplication, this M.S. thesis presents a new architecture that is scalable with respect to word size and pipelining depth. To our knowledge, this is the first time a word based algorithm for Montgomery\u27s method is realized using high-radix bit-parallel multipliers working with two different types of finite fields (unified architecture for GF(p) and GF(2n)). Previous approaches have relied mostly on bit serial multiplication in combination with massive pipelining, or Radix-8 multiplication with the limitation to a single type of finite field. Our approach is centered around the notion that the optimal delay in bit-parallel multipliers grows with logarithmic complexity with respect to the operand size n, O(log3/2 n), while the delay of bit serial implementations grows with linear complexity O(n). Our design has been implemented in VHDL, simulated and synthesized in 0.5μ CMOS technology. The synthesized net list has been verified in back-annotated timing simulations and analyzed in terms of performance and area consumption

    Efficient Implementation of Elliptic Curve Cryptography on FPGAs

    Get PDF
    This work presents the design strategies of an FPGA-based elliptic curve co-processor. Elliptic curve cryptography is an important topic in cryptography due to its relatively short key length and higher efficiency as compared to other well-known public key crypto-systems like RSA. The most important contributions of this work are: - Analyzing how different representations of finite fields and points on elliptic curves effect the performance of an elliptic curve co-processor and implementing a high performance co-processor. - Proposing a novel dynamic programming approach to find the optimum combination of different recursive polynomial multiplication methods. Here optimum means the method which has the smallest number of bit operations. - Designing a new normal-basis multiplier which is based on polynomial multipliers. The most important part of this multiplier is a circuit of size O(nlogn)O(n \log n) for changing the representation between polynomial and normal basis

    A high-speed integrated circuit with applications to RSA Cryptography

    Get PDF
    Merged with duplicate record 10026.1/833 on 01.02.2017 by CS (TIS)The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the most well known is the RSA ciphersystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable encryption rates using the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64kbit/s making it attractive for use both as a general RSA processor within the security function provider of a security system, and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, are proposed that remove the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation

    FourQ on FPGA: New Hardware Speed Records for Elliptic Curve Cryptography over Large Prime Characteristic Fields

    Get PDF
    We present fast and compact implementations of FourQ (ASIACRYPT 2015) on field-programmable gate arrays (FPGAs), and demonstrate, for the first time, the high efficiency of this new elliptic curve on reconfigurable hardware. By adapting FourQ\u27s algorithms to hardware, we design FPGA-tailored architectures that are significantly faster than any other ECC alternative over large prime characteristic fields. For example, we show that our single-core and multi-core implementations can compute at a rate of 6389 and 64730 scalar multiplications per second, respectively, on a Xilinx Zynq-7020 FPGA, which represent factor-2.5 and 2 speedups in comparison with the corresponding variants of the fastest Curve25519 implementation on the same device. These results show the potential of deploying FourQ on hardware for high-performance and embedded security applications. All the presented implementations exhibit regular, constant-time execution, protecting against timing and simple side-channel attacks

    Efficient implementation of elliptic curve cryptography.

    Get PDF
    Elliptic Curve Cryptosystems (ECC) were introduced in 1985 by Neal Koblitz and Victor Miller. Small key size made elliptic curve attractive for public key cryptosystem implementation. This thesis introduces solutions of efficient implementation of ECC in algorithmic level and in computation level. In algorithmic level, a fast parallel elliptic curve scalar multiplication algorithm based on a dual-processor hardware system is developed. The method has an average computation time of n3 Elliptic Curve Point Addition on an n-bit scalar. The improvement is n Elliptic Curve Point Doubling compared to conventional methods. When a proper coordinate system and binary representation for the scalar k is used the average execution time will be as low as n Elliptic Curve Point Doubling, which makes this method about two times faster than conventional single processor multipliers using the same coordinate system. In computation level, a high performance elliptic curve processor (ECP) architecture is presented. The processor uses parallelism in finite field calculation to achieve high speed execution of scalar multiplication algorithm. The architecture relies on compile-time detection rather than of run-time detection of parallelism which results in less hardware. Implemented on FPGA, the proposed processor operates at 66MHz in GF(2 167) and performs scalar multiplication in 100muSec, which is considerably faster than recent implementations.Dept. of Electrical and Computer Engineering. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .A57. Source: Masters Abstracts International, Volume: 44-03, page: 1446. Thesis (M.A.Sc.)--University of Windsor (Canada), 2005

    Bit Serial Systolic Architectures for Multiplicative Inversion and Division over GF(2<sup>m</sup>)

    Get PDF
    Systolic architectures are capable of achieving high throughput by maximizing pipelining and by eliminating global data interconnects. Recursive algorithms with regular data flows are suitable for systolization. The computation of multiplicative inversion using algorithms based on EEA (Extended Euclidean Algorithm) are particularly suitable for systolization. Implementations based on EEA present a high degree of parallelism and pipelinability at bit level which can be easily optimized to achieve local data flow and to eliminate the global interconnects which represent most important bottleneck in todays sub-micron design process. The net result is to have high clock rate and performance based on efficient systolic architectures. This thesis examines high performance but also scalable implementations of multiplicative inversion or field division over Galois fields GF(2m) in the specific case of cryptographic applications where field dimension m may be very large (greater than 400) and either m or defining irreducible polynomial may vary. For this purpose, many inversion schemes with different basis representation are studied and most importantly variants of EEA and binary (Stein's) GCD computation implementations are reviewed. A set of common as well as contrasting characteristics of these variants are discussed. As a result a generalized and optimized variant of EEA is proposed which can compute division, and multiplicative inversion as its subset, with divisor in either polynomial or triangular basis representation. Further results regarding Hankel matrix formation for double-basis inversion is provided. The validity of using the same architecture to compute field division with polynomial or triangular basis representation is proved. Next, a scalable unidirectional bit serial systolic array implementation of this proposed variant of EEA is implemented. Its complexity measures are defined and these are compared against the best known architectures. It is shown that assuming the requirements specified above, this proposed architecture may achieve a higher clock rate performance w. r. t. other designs while being more flexible, reliable and with minimum number of inter-cell interconnects. The main contribution at system level architecture is the substitution of all counter or adder/subtractor elements with a simpler distributed and free of carry propagation delays structure. Further a novel restoring mechanism for result sequences of EEA is proposed using a double delay element implementation. Finally, using this systolic architecture a CMD (Combined Multiplier Divider) datapath is designed which is used as the core of a novel systolic elliptic curve processor. This EC processor uses affine coordinates to compute scalar point multiplication which results in having a very small control unit and negligible with respect to the datapath for all practical values of m. The throughput of this EC based on this bit serial systolic architecture is comparable with designs many times larger than itself reported previously

    Design and analysis of an FPGA-based, multi-processor HW-SW system for SCC applications

    Get PDF
    The last 30 years have seen an increase in the complexity of embedded systems from a collection of simple circuits to systems consisting of multiple processors managing a wide variety of devices. This ever increasing complexity frequently requires that high assurance, fail-safe and secure design techniques be applied to protect against possible failures and breaches. To facilitate the implementation of these embedded systems in an efficient way, the FPGA industry recently created new families of devices. New features added to these devices include anti-tamper monitoring, bit stream encryption, and optimized routing architectures for physical and functional logic partition isolation. These devices have high capacities and are capable of implementing processors using their reprogrammable logic structures. This allows for an unprecedented level of hardware and software interaction within a single FPGA chip. High assurance and fail-safe systems can now be implemented within the reconfigurable hardware fabric of an FPGA, enabling these systems to maintain flexibility and achieve high performance while providing a high level of data security. The objective of this thesis was to design and analyze an FPGA-based system containing two isolated, softcore Nios processors that share data through two crypto-engines. FPGA-based single-chip cryptographic (SCC) techniques were employed to ensure proper component isolation when the design is placed on a device supporting the appropriate security primitives. Each crypto-engine is an implementation of the Advanced Encryption Standard (AES), operating in Galois/Counter Mode (GCM) for both encryption and authentication. The features of the microprocessors and architectures of the AES crypto-engines were varied with the goal of determining combinations which best target high performance, minimal hardware usage, or a combination of the two
    corecore