64 research outputs found

    Parametric, Secure and Compact Implementation of RSA on FPGA

    Get PDF
    We present a fast, efficient, and parameterized modular multiplier and a secure exponentiation circuit especially intended for FPGAs on the low end of the price range. The design utilizes dedicated block multipliers as the main functional unit and Block-RAM as storage unit for the operands. The adopted design methodology allows adjusting the number of multipliers, the radix used in the multipliers, and number of words to meet the system requirements such as available resources, precision and timing constraints. The architecture, based on the Montgomery modular multiplication algorithm, utilizes a pipelining technique that allows concurrent operation of hardwired multipliers. Our design completes 1020-bit and 2040-bit modular multiplications in 7.62 μs and 27.0 μs, respectively. The multiplier uses a moderate amount of system resources while achieving the best area-time product in literature. 2040-bit modular exponentiation engine can easily fit into Xilinx Spartan-3E 500; moreover the exponentiation circuit withstands known side channel attacks

    Realizing arbitrary-precision modular multiplication with a fixed-precision multiplier datapath

    Get PDF
    Within the context of cryptographic hardware, the term scalability refers to the ability to process operands of any size, regardless of the precision of the underlying data path or registers. In this paper we present a simple yet effective technique for increasing the scalability of a fixed-precision Montgomery multiplier. Our idea is to extend the datapath of a Montgomery multiplier in such a way that it can also perform an ordinary multiplication of two n-bit operands (without modular reduction), yielding a 2n-bit result. This conventional (nxn->2n)-bit multiplication is then used as a “sub-routine” to realize arbitrary-precision Montgomery multiplication according to standard software algorithms such as Coarsely Integrated Operand Scanning (CIOS). We show that performing a 2n-bit modular multiplication on an n-bit multiplier can be done in 5n clock cycles, whereby we assume that the n-bit modular multiplication takes n cycles. Extending a Montgomery multiplier for this extra functionality requires just some minor modifications of the datapath and entails a slight increase in silicon area

    Comparison of Scalable Montgomery Modular Multiplication Implementations Embedded in Reconfigurable Hardware

    No full text
    International audienceThis paper presents a comparison of possible approaches for an efficient implementation of Multiple-word radix-2 Montgomery Modular Multiplication (MM) on modern Field Programmable Gate Arrays (FPGAs). The hardware implementation of MM coprocessor is fully scalable what means that it can be reused in order to generate long-precision results independently on the word length of the originally proposed coprocessor. The first of analyzed implementations uses a data path based on traditionally used redundant carry-save adders, the second one exploits, in scalable designs not yet applied, standard carry-propagate adders with fast carry chain logic. As a control unit and a platform for purely software implementation an embedded soft-core processor Altera NIOS is employed. All implementations use large embedded memory blocks available in recent FPGAs. Speed and logic requirements comparisons are performed on the optimized software and combined hardware-software designs in Altera FPGAs. The issues of targeting a design specifically for a FPGA are considered taking into account the underlying architecture imposed by the target FPGA technology. It is shown that the coprocessors based on carry-save adders and carry-propagate adders provide comparable results in constrained FPGA implementations but in case of carry-propagate logic, the solution requires less embedded memory and provides some additional implementation advantages presented in the paper

    Maximizing the Efficiency using Montgomery Multipliers on FPGA in RSA Cryptography for Wireless Sensor Networks

    Get PDF
    The architecture and modeling of RSA public key encryption/decryption systems are presented in this work. Two different architectures are proposed, mMMM42 (modified Montgomery Modular Multiplier 4 to 2 Carry Save Architecture) and RSACIPHER128 to check the suitability for implementation in Wireless Sensor Nodes to utilize the same in Wireless Sensor Networks. It can easily be fitting into systems that require different levels of security by changing the key size. The processing time is increased and space utilization is reduced in FPGA due to its reusability. VHDL code is synthesized and simulated using Xilinx-ISE for both the architectures. Architectures are compared in terms of area and time. It is verified that this architecture support for a key size of 128bits. The implementation of RSA encryption/decryption algorithm on FPGA using 128 bits data and key size with RSACIPHER128 gives good result with 50% less utilization of hardware. This design is also implemented for ASIC using Mentor Graphics

    Montgomery Modular Multiplication on Reconfigurable Hardware: Systolic versus Multiplexed Implementation

    Get PDF
    This paper describes a comparison of two Montgomery modular multiplication architectures: a systolic and a multiplexed. Both implementations target FPGA devices. The modular multiplication is employed in modular exponentiation processes, which are the most important operations of some public-key cryptographic algorithms, including the most popular of them, the RSA. The proposed systolic architecture presents a high-radix implementation with a one-dimensional array of Processing Elements. The multiplexed implementation is a new alternative and is composed of multiplier blocks in parallel with the new simplified Processing Elements, and it provides a pipelined operation mode. We compare the time × area efficiency for both architectures as well as an RSA application. The systolic implementation can run the 1024 bits RSA decryption process in just 3.23 ms, and the multiplexed architecture executes the same operation in 4.36 ms, but the second approach saves up to 28% of logical resources. These results are competitive with the state-of-the-art performance

    Cryptarray A Scalable And Reconfigurable Architecture For Cryptographic Applications

    Get PDF
    Cryptography is increasingly viewed as a critical technology to fulfill the requirements of security and authentication for information exchange between Internet applications. However, software implementations of cryptographic applications are unable to support the quality of service from a bandwidth perspective required by most Internet applications. As a result, various hardware implementations, from Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), to programmable processors, were proposed to improve this inadequate quality of service. Although these implementations provide performances that are considered better than those produced by software implementations, they still fall short of addressing the bandwidth requirements of most cryptographic applications in the context of the Internet for two major reasons: (i) The majority of these architectures sacrifice flexibility for performance in order to reach the performance level needed for cryptographic applications. This lack of flexibility can be detrimental considering that cryptographic standards and algorithms are still evolving. (ii) These architectures do not consider the consequences of technology scaling in general, and particularly interconnect related problems. As a result, this thesis proposes an architecture that attempts to address the requirements of cryptographic applications by overcoming the obstacles described in (i) and (ii). To this end, we propose a new reconfigurable, two-dimensional, scalable architecture, called CRYPTARRAY, in which bus-based communication is replaced by distributed shared memory communication. At the physical level, the length of the wires will be kept to a minimum. CRYPTARRAY is organized as a chessboard in which the dark and light squares represent Processing Elements (PE) and memory blocks respectively. The granularity and resource composition of the PEs is specifically designed to support the computing operations encountered in cryptographic algorithms in general, and symmetric algorithms in particular. Communication can occur only between neighboring PEs through locally shared memory blocks. Because of the chessboard layout, the architecture can be reconfigured to allow computation to proceed as a pipelined wave in any direction. This organization offers a high computational density in terms of datapath resources and a large number of distributed storage resources that easily support a high degree of parallelism and pipelining. Experimental prototyping a small array on FPGA chips shows that this architecture can run at 80.9 MHz producing 26,968,716 outputs every second in static reconfiguration mode and 20,226,537 outputs every second in dynamic reconfiguration mode

    8192 bit Rivest-Shamir-Adleman data encryption hardware accelerator

    Get PDF
    Rivest-Shamir-Adelman (RSA) algorithm is one of the state-of-art publickey cryptography that is efficient in terms of implementation because it uses the same general equation for encryption and decryption, that is, modular exponentiation equation. The security reliability of RSA algorithm is based on the difficulty of factoring a large number. The larger the RSA key size, the higher the security level that can be achieved. However, at the same time, the complexity of the computation increases, which results in more computation cycles. Software implementation of RSA with large key size is too slow and less effective for large amount of data encryption or decryption. Hence, the purpose of this project is to implement a hardware-based RSA coprocessor to handle RSA encryption and decryption effectively. This project implements a RSA coprocessor using radix-2 Montgomery modular multiplication that described at bit-level. This implementation uses carry-saved adders to achieve parallel processing in hardware. The hardware implementation of the RSA coprocessor is done using Verilog synthesizable Register-transfer Level (RTL) code to allow scalability. Simulation results are obtained to validate the functionality of the design. The design is synthesized using Altera Quartus software tool to evaluate the performance of the implementation. The designs are synthesized on device Stratix V 5SEEBF45I4 for key-size of 128-bit, 1024-bit and 8192-bit. The data throughput of the 8192-bit design can reach up to 3.387 kbps with LE utilization of 30% on the device used. Although the performance of the design is not the highest among the related works, but this design provides a proven working prototype for 8192-bit RSA coprocessor using Bit-level Montgomery Modular Multiplication for hardware parallel processing

    Power Efficient Fpga Implementation Of Rsa Algortihm

    Get PDF
    Tez (Yüksek Lisans) -- İstanbul Teknik Üniversitesi, Fen Bilimleri Enstitüsü, 2010Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2010Bu çalışmada Rivest, Shamir, Adleman (RSA) algoritması sahada programlanabilir kapı dizisi üzerinde gerçeklenmekte ve güç tasarruf yöntemlerinden yararlanılarak dinamik güç harcamaları azaltılmaktadır. RSA algoritması en yaygın kullanıma sahip açık anahtarlı şifreleme algoritmalarından biridir. RSA algoritmasını oluşturan matematiksel temel işlemleri iki ana başlıkta toplamak mümkündür: moduler çarpma işlemi ve moduler üs alma, exponent işlemi. RSA algoritmasında kullanılan aritmetik işlem ME mod N işlemidir. Bu işlemdeki N sayısı aralarında asal iki sayının çarpımından oluşan modulo değeri, M mesaj ya da düz metin dediğimiz bilgi, E ise açık anahtar olarak bilinen değerdir. İyi bir RSA gerçeklemesi oluşturmak istenirse; yapılması gereken en önemli şey, iyi bir modular çarpma devresi oluşturmaktır. Bu matematiksel açıklamalardan yola çıkararak anlamalıyız ki; bir RSA gerçeklemesinde en çok güç tüketen blok modular çarpma devresidir. Bu nedenle güç tüketimlerinin karşılaştırılması açısından modular çarpma devresine farklı teknikler uygulanmıştır. Daha sonra çok yaygın bir kullanıma sahip olan ardışıl ikili modular üs alma tekniği ile RSA algoritması gerçeklenmiştir. Bilgisayar benzetim programı ile yapıların test vektörü girişlerine karşılık doğru sonuçlar verdiği gösterilmiştir.In this study, dynamic power consumptions of Field Programmable Gate Array (FPGA) implementations of the Rivest, Shamir, Adleman (RSA) has been reduced by using low power design methods. RSA is one of the most popular public key cryptographic algorithms. The mathematics behind RSA algorithm, are summarized in two operations, modular multiplication and modular exponentiation. In the RSA cryptosystem, the arithmetic operation ME mod N is used, where N is a prime product of two relative prime numbers, M is the message and E is the public key. In order to create an efficient implementation of RSA, one has to design efficiently the multiplication of two modular numbers. So this mathematical background provides a good understanding that Modular Multiplication block dissipates the most of the power, dissipated in RSA. For comparison of power dissipations, different methods are used to implement Modular Multiplication block. Then RSA implemented by using Sequential Binary Modular Exponentiation which has widespread applications. Computer simulations have been used to show that the implementations of the algorithm generate correct outputs against test vectors.Yüksek LisansM.Sc

    A VLSI synthesis of a Reed-Solomon processor for digital communication systems

    Get PDF
    The Reed-Solomon codes have been widely used in digital communication systems such as computer networks, satellites, VCRs, mobile communications and high- definition television (HDTV), in order to protect digital data against erasures, random and burst errors during transmission. Since the encoding and decoding algorithms for such codes are computationally intensive, special purpose hardware implementations are often required to meet the real time requirements. -- One motivation for this thesis is to investigate and introduce reconfigurable Galois field arithmetic structures which exploit the symmetric properties of available architectures. Another is to design and implement an RS encoder/decoder ASIC which can support a wide family of RS codes. -- An m-programmable Galois field multiplier which uses the standard basis representation of the elements is first introduced. It is then demonstrated that the exponentiator can be used to implement a fast inverter which outperforms the available inverters in GF(2m). Using these basic structures, an ASIC design and synthesis of a reconfigurable Reed-Solomon encoder/decoder processor which implements a large family of RS codes is proposed. The design is parameterized in terms of the block length n, Galois field symbol size m, and error correction capability t for the various RS codes. The design has been captured using the VHDL hardware description language and mapped onto CMOS standard cells available in the 0.8-µm BiCMOS design kits for Cadence and Synopsys tools. The experimental chip contains 218,206 logic gates and supports values of the Galois field symbol size m = 3,4,5,6,7,8 and error correction capability t = 1,2,3, ..., 16. Thus, the block length n is variable from 7 to 255. Error correction t and Galois field symbol size m are pin-selectable. -- Since low design complexity and high throughput are desired in the VLSI chip, the algebraic decoding technique has been investigated instead of the time or transform domain. The encoder uses a self-reciprocal generator polynomial which structures the codewords in a systematic form. At the beginning of the decoding process, received words are initially stored in the first-in-first-out (FIFO) buffer as they enter the syndrome module. The Berlekemp-Massey algorithm is used to determine both the error locator and error evaluator polynomials. The Chien Search and Forney's algorithms operate sequentially to solve for the error locations and error values respectively. The error values are exclusive or-ed with the buffered messages in order to correct the errors, as the processed data leave the chip
    corecore