103 research outputs found

    Realizing arbitrary-precision modular multiplication with a fixed-precision multiplier datapath

    Get PDF
    Within the context of cryptographic hardware, the term scalability refers to the ability to process operands of any size, regardless of the precision of the underlying data path or registers. In this paper we present a simple yet effective technique for increasing the scalability of a fixed-precision Montgomery multiplier. Our idea is to extend the datapath of a Montgomery multiplier in such a way that it can also perform an ordinary multiplication of two n-bit operands (without modular reduction), yielding a 2n-bit result. This conventional (nxn->2n)-bit multiplication is then used as a “sub-routine” to realize arbitrary-precision Montgomery multiplication according to standard software algorithms such as Coarsely Integrated Operand Scanning (CIOS). We show that performing a 2n-bit modular multiplication on an n-bit multiplier can be done in 5n clock cycles, whereby we assume that the n-bit modular multiplication takes n cycles. Extending a Montgomery multiplier for this extra functionality requires just some minor modifications of the datapath and entails a slight increase in silicon area

    Versatile Montgomery Multiplier Architectures

    Get PDF
    Several algorithms for Public Key Cryptography (PKC), such as RSA, Diffie-Hellman, and Elliptic Curve Cryptography, require modular multiplication of very large operands (sizes from 160 to 4096 bits) as their core arithmetic operation. To perform this operation reasonably fast, general purpose processors are not always the best choice. This is why specialized hardware, in the form of cryptographic co-processors, become more attractive. Based upon the analysis of recent publications on hardware design for modular multiplication, this M.S. thesis presents a new architecture that is scalable with respect to word size and pipelining depth. To our knowledge, this is the first time a word based algorithm for Montgomery\u27s method is realized using high-radix bit-parallel multipliers working with two different types of finite fields (unified architecture for GF(p) and GF(2n)). Previous approaches have relied mostly on bit serial multiplication in combination with massive pipelining, or Radix-8 multiplication with the limitation to a single type of finite field. Our approach is centered around the notion that the optimal delay in bit-parallel multipliers grows with logarithmic complexity with respect to the operand size n, O(log3/2 n), while the delay of bit serial implementations grows with linear complexity O(n). Our design has been implemented in VHDL, simulated and synthesized in 0.5μ CMOS technology. The synthesized net list has been verified in back-annotated timing simulations and analyzed in terms of performance and area consumption

    Efficient Implementation on Low-Cost SoC-FPGAs of TLSv1.2 Protocol with ECC_AES Support for Secure IoT Coordinators

    Get PDF
    Security management for IoT applications is a critical research field, especially when taking into account the performance variation over the very different IoT devices. In this paper, we present high-performance client/server coordinators on low-cost SoC-FPGA devices for secure IoT data collection. Security is ensured by using the Transport Layer Security (TLS) protocol based on the TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 cipher suite. The hardware architecture of the proposed coordinators is based on SW/HW co-design, implementing within the hardware accelerator core Elliptic Curve Scalar Multiplication (ECSM), which is the core operation of Elliptic Curve Cryptosystems (ECC). Meanwhile, the control of the overall TLS scheme is performed in software by an ARM Cortex-A9 microprocessor. In fact, the implementation of the ECC accelerator core around an ARM microprocessor allows not only the improvement of ECSM execution but also the performance enhancement of the overall cryptosystem. The integration of the ARM processor enables to exploit the possibility of embedded Linux features for high system flexibility. As a result, the proposed ECC accelerator requires limited area, with only 3395 LUTs on the Zynq device used to perform high-speed, 233-bit ECSMs in 413 µs, with a 50 MHz clock. Moreover, the generation of a 384-bit TLS handshake secret key between client and server coordinators requires 67.5 ms on a low cost Zynq 7Z007S device

    Maximizing the Efficiency using Montgomery Multipliers on FPGA in RSA Cryptography for Wireless Sensor Networks

    Get PDF
    The architecture and modeling of RSA public key encryption/decryption systems are presented in this work. Two different architectures are proposed, mMMM42 (modified Montgomery Modular Multiplier 4 to 2 Carry Save Architecture) and RSACIPHER128 to check the suitability for implementation in Wireless Sensor Nodes to utilize the same in Wireless Sensor Networks. It can easily be fitting into systems that require different levels of security by changing the key size. The processing time is increased and space utilization is reduced in FPGA due to its reusability. VHDL code is synthesized and simulated using Xilinx-ISE for both the architectures. Architectures are compared in terms of area and time. It is verified that this architecture support for a key size of 128bits. The implementation of RSA encryption/decryption algorithm on FPGA using 128 bits data and key size with RSACIPHER128 gives good result with 50% less utilization of hardware. This design is also implemented for ASIC using Mentor Graphics

    Parametric, Secure and Compact Implementation of RSA on FPGA

    Get PDF
    We present a fast, efficient, and parameterized modular multiplier and a secure exponentiation circuit especially intended for FPGAs on the low end of the price range. The design utilizes dedicated block multipliers as the main functional unit and Block-RAM as storage unit for the operands. The adopted design methodology allows adjusting the number of multipliers, the radix used in the multipliers, and number of words to meet the system requirements such as available resources, precision and timing constraints. The architecture, based on the Montgomery modular multiplication algorithm, utilizes a pipelining technique that allows concurrent operation of hardwired multipliers. Our design completes 1020-bit and 2040-bit modular multiplications in 7.62 μs and 27.0 μs, respectively. The multiplier uses a moderate amount of system resources while achieving the best area-time product in literature. 2040-bit modular exponentiation engine can easily fit into Xilinx Spartan-3E 500; moreover the exponentiation circuit withstands known side channel attacks

    Efficient NTRU Implementations

    Get PDF
    In this paper, new software and hardware designs for the NTRU Public Key Cryptosystem are proposed. The first design attempts to improve NTRU\u27s polynomial multiplication through applying techniques from the Chinese Remainder Theorem (CRT) to the convolution algorithm. Although the application of CRT shows promise for the creation of the inverse polynomials in the setup procedure, it does not provide any benefits to the procedures that are critical to the performance of NTRU (public key creation, encryption, and decryption). This research has identified that this is due to the small coefficients of one of the operands, which can be a common misunderstanding. The second design focuses on improving the performance of the polynomial multiplications within NTRU\u27s key creation, encryption, and decryption procedures through hardware. This design exploits the inherent parallelism within a polynomial multiplication to make scalability possible. The advantage scalability provides is that it allows the user to customize the design for low and high power applications. In addition, the support for arbitrary precision allows the user to meet the desired security level. The third design utilizes the Montgomery Multiplication algorithm to develop an unified architecture that can perform a modular multiplication for GF(p) and GF(2^k) and a polynomial multiplication for NTRU. The unified design only requires an additional 10 gates in order for the Montgomery Multiplier core to compute the polynomial multiplication for NTRU. However, this added support for NTRU presents some restrictions on the supported lengths of the moduli and on the chosen value for the residue for the GF(p) and GF(2^k) cases. Despite these restrictions, this unified architecture is now capable of supporting public key operations for the majority of Public-Key Cryptosystems

    Hardware implementation of elliptic curve Diffie-Hellman key agreement scheme in GF(p)

    Get PDF
    With the advent of technology there are many applications that require secure communication. Elliptic Curve Public-key Cryptosystems are increasingly becoming popular due to their small key size and efficient algorithm. Elliptic curves are widely used in various key exchange techniques including Diffie-Hellman Key Agreement scheme. Modular multiplication and modular division are one of the basic operations in elliptic curve cryptography. Much effort has been made in developing efficient modular multiplication designs, however few works has been proposed for the modular division. Nevertheless, these operations are needed in various cryptographic systems. This thesis examines various scalable implementations of elliptic curve scalar multiplication employing multiplicative inverse or field division in GF(p) focussing mainly on modular divison architectures. Next, this thesis presents a new architecture for modular division based on the variant of Extended Binary GCD algorithm. The main contribution at system level architecture to the modular division unit is use of counters in place of shift registers that are basis of the algorithm and modifying the algorithm to introduce a modular correction unit for the output logic. This results in 62% increase in speed with respect to a prototype design. Finally, using the modular division architecture an Elliptic Curve ALU in GF(p) was implemented which can be used as the core arithmetic unit of an elliptic curve processor. The resulting architecture was targeted to Xilinx Vertex2v6000-bf957 FPGA device and can be implemented for different elliptic curves for almost all practical values of field p. The frequency of the ALU is 58.8 MHz for 128-bits utilizing 20% of the device at 27712 gates which is 30% faster than a prototype implementation with a 2% increase in area utilization. The ALU was tested to perform Diffie-Hellman Key Agreement Scheme and is suitable for other public-key cryptographic algorithms

    Efficient Pipelining for Modular Multiplication Architectures in Prime Fields

    Get PDF
    This paper presents a pipelined architecture of a modular Montgomery multiplier, which is suitable to be used in public key coprocessors. Starting from a baseline implementation of the Montgomery algorithm, a more compact pipelined version is derived. The design makes use of 16bit integer multiplication blocks that are available on recently manufactured FPGAs. The critical path is optimized by omitting the exact computation of intermediate results in the Montgomery algorithm using a 6-2 carry-save notation. This results in a high-speed architecture, which outperforms previously designed Montgomery multipliers. Because a very popular application of Montgomery multiplication is public key cryptography, we compare our implementation to the state-of-the-art in Montgomery multipliers on the basis of performance results for 1024-bit RSA

    Accelerating RSA Public Key Cryptography via Hardware Acceleration

    Get PDF
    A large number and a variety of sensors and actuators, also known as edge devices of the Internet of Things, belonging to various industries - health care monitoring, home automation, industrial automation, have become prevalent in today\u27s world. These edge devices need to communicate data collected to the central system occasionally and often in burst mode which is then used for monitoring and control purposes. To ensure secure connections, Asymmetric or Public Key Cryptography (PKC) schemes are used in combination with Symmetric Cryptography schemes. RSA (Rivest - Shamir- Adleman) is one of the most prevalent public key cryptosystems, and has computationally intensive operations which might have a high latency when implemented in resource constrained environments. The objective of this thesis is to design an accelerator capable of increasing the speed of execution of the RSA algorithm in such resource constrained environments. The bottleneck of the algorithm is determined by analyzing the performance of the algorithm in various platforms - Intel Linux Machine, Raspberry Pi, Nios soft core processor. In designing the accelerator to speedup bottleneck function, we realize that the accelerator architecture will need to be changed according to the resources available to the accelerator. We use high level synthesis tools to explore the design space of the accelerator by taking into consideration system level aspects like the number of ports available to transfer inputs to the accelerator, the word size of the processor, etc. We also propose a new accelerator architecture for the bottleneck function and the algorithm it implements and compare the area and latency requirements of it with other designs obtained from design space exploration. The functionality of the design proposed is verified and prototyped in Zynq SoC of Xilinx Zedboard
    corecore