149 research outputs found

    Area- Efficient VLSI Implementation of Serial-In Parallel-Out Multiplier Using Polynomial Representation in Finite Field GF(2m)

    Full text link
    Finite field multiplier is mainly used in elliptic curve cryptography, error-correcting codes and signal processing. Finite field multiplier is regarded as the bottleneck arithmetic unit for such applications and it is the most complicated operation over finite field GF(2m) which requires a huge amount of logic resources. In this paper, a new modified serial-in parallel-out multiplication algorithm with interleaved modular reduction is suggested. The proposed method offers efficient area architecture as compared to proposed algorithms in the literature. The reduced finite field multiplier complexity is achieved by means of utilizing logic NAND gate in a particular architecture. The efficiency of the proposed architecture is evaluated based on criteria such as time (latency, critical path) and space (gate-latch number) complexity. A detailed comparative analysis indicates that, the proposed finite field multiplier based on logic NAND gate outperforms previously known resultsComment: 19 pages, 4 figure

    Novel Single and Hybrid Finite Field Multipliers over GF(2m) for Emerging Cryptographic Systems

    Get PDF
    With the rapid development of economic and technical progress, designers and users of various kinds of ICs and emerging embedded systems like body-embedded chips and wearable devices are increasingly facing security issues. All of these demands from customers push the cryptographic systems to be faster, more efficient, more reliable and safer. On the other hand, multiplier over GF(2m) as the most important part of these emerging cryptographic systems, is expected to be high-throughput, low-complexity, and low-latency. Fortunately, very large scale integration (VLSI) digital signal processing techniques offer great facilities to design efficient multipliers over GF(2m). This dissertation focuses on designing novel VLSI implementation of high-throughput low-latency and low-complexity single and hybrid finite field multipliers over GF(2m) for emerging cryptographic systems. Low-latency (latency can be chosen without any restriction) high-speed pentanomial basis multipliers are presented. For the first time, the dissertation also develops three high-throughput digit-serial multipliers based on pentanomials. Then a novel realization of digit-level implementation of multipliers based on redundant basis is introduced. Finally, single and hybrid reordered normal basis bit-level and digit-level high-throughput multipliers are presented. To the authors knowledge, this is the first time ever reported on multipliers with multiple throughput rate choices. All the proposed designs are simple and modular, therefore suitable for VLSI implementation for various emerging cryptographic systems

    Hardware Architectures for Post-Quantum Cryptography

    Get PDF
    The rapid development of quantum computers poses severe threats to many commonly-used cryptographic algorithms that are embedded in different hardware devices to ensure the security and privacy of data and communication. Seeking for new solutions that are potentially resistant against attacks from quantum computers, a new research field called Post-Quantum Cryptography (PQC) has emerged, that is, cryptosystems deployed in classical computers conjectured to be secure against attacks utilizing large-scale quantum computers. In order to secure data during storage or communication, and many other applications in the future, this dissertation focuses on the design, implementation, and evaluation of efficient PQC schemes in hardware. Four PQC algorithms, each from a different family, are studied in this dissertation. The first hardware architecture presented in this dissertation is focused on the code-based scheme Classic McEliece. The research presented in this dissertation is the first that builds the hardware architecture for the Classic McEliece cryptosystem. This research successfully demonstrated that complex code-based PQC algorithm can be run efficiently on hardware. Furthermore, this dissertation shows that implementation of this scheme on hardware can be easily tuned to different configurations by implementing support for flexible choices of security parameters as well as configurable hardware performance parameters. The successful prototype of the Classic McEliece scheme on hardware increased confidence in this scheme, and helped Classic McEliece to get recognized as one of seven finalists in the third round of the NIST PQC standardization process. While Classic McEliece serves as a ready-to-use candidate for many high-end applications, PQC solutions are also needed for low-end embedded devices. Embedded devices play an important role in our daily life. Despite their typically constrained resources, these devices require strong security measures to protect them against cyber attacks. Towards securing this type of devices, the second research presented in this dissertation focuses on the hash-based digital signature scheme XMSS. This research is the first that explores and presents practical hardware based XMSS solution for low-end embedded devices. In the design of XMSS hardware, a heterogenous software-hardware co-design approach was adopted, which combined the flexibility of the soft core with the acceleration from the hard core. The practicability and efficiency of the XMSS software-hardware co-design is further demonstrated by providing a hardware prototype on an open-source RISC-V based System-on-a-Chip (SoC) platform. The third research direction covered in this dissertation focuses on lattice-based cryptography, which represents one of the most promising and popular alternatives to today\u27s widely adopted public key solutions. Prior research has presented hardware designs targeting the computing blocks that are necessary for the implementation of lattice-based systems. However, a recurrent issue in most existing designs is that these hardware designs are not fully scalable or parameterized, hence limited to specific cryptographic primitives and security parameter sets. The research presented in this dissertation is the first that develops hardware accelerators that are designed to be fully parameterized to support different lattice-based schemes and parameters. Further, these accelerators are utilized to realize the first software-harware co-design of provably-secure instances of qTESLA, which is a lattice-based digital signature scheme. This dissertation demonstrates that even demanding, provably-secure schemes can be realized efficiently with proper use of software-hardware co-design. The final research presented in this dissertation is focused on the isogeny-based scheme SIKE, which recently made it to the final round of the PQC standardization process. This research shows that hardware accelerators can be designed to offload compute-intensive elliptic curve and isogeny computations to hardware in a versatile fashion. These hardware accelerators are designed to be fully parameterized to support different security parameter sets of SIKE as well as flexible hardware configurations targeting different user applications. This research is the first that presents versatile hardware accelerators for SIKE that can be mapped efficiently to both FPGA and ASIC platforms. Based on these accelerators, an efficient software-hardwareco-design is constructed for speeding up SIKE. In the end, this dissertation demonstrates that, despite being embedded with expensive arithmetic, the isogeny-based SIKE scheme can be run efficiently by exploiting specialized hardware. These four research directions combined demonstrate the practicability of building efficient hardware architectures for complex PQC algorithms. The exploration of efficient PQC solutions for different hardware platforms will eventually help migrate high-end servers and low-end embedded devices towards the post-quantum era

    Speeding up a scalable modular inversion hardware architecture

    Get PDF
    The modular inversion is a fundamental process in several cryptographic systems. It can be computed in software or hardware, but hardware computation proven to be faster and more secure. This research focused on improving an old scalable inversion hardware architecture proposed in 2004 for finite field GF(p). The architecture has been made of two parts, a computing unit and a memory unit. The memory unit is to hold all the data bits of computation whereas the computing unit performs all the arithmetic operations in word (digit) by word bases known as scalable method. The main objective of this project was to investigate the cost and benefit of modifying the memory unit to include parallel shifting, which was one of the tasks of the scalable computing unit. The study included remodeling the entire hardware architecture removing the shifter from the scalable computing part embedding it in the memory unit instead. This modification resulted in a speedup to the complete inversion process with an area increase due to the new memory shifting unit. Quantitative measurements of the speed area trade-off have been investigated. The results showed that the extra hardware to be added for this modification compared to the speedup gained, giving the user the complete picture to choose from depending on the application need.the British council in Saudi Arabia, KFUPM, Dr. Tatiana Kalganova at the Electrical & Computer Engineering Department of Brunel University in Uxbridg

    Synthesis Optimization on Galois-Field Based Arithmetic Operators for Rijndael Cipher

    Get PDF
    A  series  of  experiments  has  been  conducted  to  show  that  FPGA synthesis  of  Galois-Field  (GF)  based  arithmetic  operators  can  be  optimized automatically  to  improve  Rijndael  Cipher  throughput.  Moreover,  it  has  been demonstrated  that  efficiency  improvement  in  GF  operators  does  not  directly correspond to the system performance at application level. The experiments were motivated by so many research works that focused on improving performance of GF  operators.  Each  of  the  variants  has  the  most  efficient  form  in  either  time (fastest) or space  (smallest occupied area) when implemented in FPGA chips. In fact,  GF  operators are not utilized  individually, but  rather integrated one to the others to  implement algorithms.  Contribution  of  this  paper  is  to  raise  issue  on GF-based  application  performance  and  suggest  alternative  aspects  that potentially  affect  it.  Instead  of  focusing  on  GF  operator  efficiency,  system characteristics are worth considered in optimizing application performance

    Multiple bit error correcting architectures over finite fields

    Get PDF
    This thesis proposes techniques to mitigate multiple bit errors in GF arithmetic circuits. As GF arithmetic circuits such as multipliers constitute the complex and important functional unit of a crypto-processor, making them fault tolerant will improve the reliability of circuits that are employed in safety applications and the errors may cause catastrophe if not mitigated. Firstly, a thorough literature review has been carried out. The merits of efficient schemes are carefully analyzed to study the space for improvement in error correction, area and power consumption. Proposed error correction schemes include bit parallel ones using optimized BCH codes that are useful in applications where power and area are not prime concerns. The scheme is also extended to dynamically correcting scheme to reduce decoder delay. Other method that suits low power and area applications such as RFIDs and smart cards using cross parity codes is also proposed. The experimental evaluation shows that the proposed techniques can mitigate single and multiple bit errors with wider error coverage compared to existing methods with lesser area and power consumption. The proposed scheme is used to mask the errors appearing at the output of the circuit irrespective of their cause. This thesis also investigates the error mitigation schemes in emerging technologies (QCA, CNTFET) to compare area, power and delay with existing CMOS equivalent. Though the proposed novel multiple error correcting techniques can not ensure 100% error mitigation, inclusion of these techniques to actual design can improve the reliability of the circuits or increase the difficulty in hacking crypto-devices. Proposed schemes can also be extended to non GF digital circuits

    Modular Exponentiation on Reconfigurable Hardware

    Get PDF
    It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. A central tool for achieving system security are cryptographic algorithms. For performance as well as for physical security reasons, it is often advantageous to realize cryptographic algorithms in hardware. In order to overcome the well-known drawback of reduced flexibility that is associated with traditional ASIC solutions, this contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine two versions of Montgomery modular multiplication algorithm with new systolic array designs which are well suited for FPGA realizations. The first one is based on a radix of two and is capable of processing a variable number of bits per array cell leading to a low cost design. The second design uses a radix of sixteen, resulting in a speed-up of a factor three at the cost of more used resources. The designs are flexible, allowing any choice of operand and modulus. Unlike previous approaches, we systematically implement and compare several versions of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture on Xilinx XC4000 series FPGAs. As a first practical result we show that it is possible to implement modular exponentiation at secure bit lengths on a single commercially available FPGA. Secondly we present faster processing times than previously reported. The Diffie-Hellman key exchange scheme with a modulus of 1024 bits and an exponent of 160 bits is computed in 1.9 ms. Our fastest design computes a 1024 bit RSA decryption in 3.1 ms when the Chinese remainder theorem is applied. These times are more than ten times faster than any reported software implementation. They also outperform most of the hardware-implementations presented in technical literature

    A VLSI synthesis of a Reed-Solomon processor for digital communication systems

    Get PDF
    The Reed-Solomon codes have been widely used in digital communication systems such as computer networks, satellites, VCRs, mobile communications and high- definition television (HDTV), in order to protect digital data against erasures, random and burst errors during transmission. Since the encoding and decoding algorithms for such codes are computationally intensive, special purpose hardware implementations are often required to meet the real time requirements. -- One motivation for this thesis is to investigate and introduce reconfigurable Galois field arithmetic structures which exploit the symmetric properties of available architectures. Another is to design and implement an RS encoder/decoder ASIC which can support a wide family of RS codes. -- An m-programmable Galois field multiplier which uses the standard basis representation of the elements is first introduced. It is then demonstrated that the exponentiator can be used to implement a fast inverter which outperforms the available inverters in GF(2m). Using these basic structures, an ASIC design and synthesis of a reconfigurable Reed-Solomon encoder/decoder processor which implements a large family of RS codes is proposed. The design is parameterized in terms of the block length n, Galois field symbol size m, and error correction capability t for the various RS codes. The design has been captured using the VHDL hardware description language and mapped onto CMOS standard cells available in the 0.8-µm BiCMOS design kits for Cadence and Synopsys tools. The experimental chip contains 218,206 logic gates and supports values of the Galois field symbol size m = 3,4,5,6,7,8 and error correction capability t = 1,2,3, ..., 16. Thus, the block length n is variable from 7 to 255. Error correction t and Galois field symbol size m are pin-selectable. -- Since low design complexity and high throughput are desired in the VLSI chip, the algebraic decoding technique has been investigated instead of the time or transform domain. The encoder uses a self-reciprocal generator polynomial which structures the codewords in a systematic form. At the beginning of the decoding process, received words are initially stored in the first-in-first-out (FIFO) buffer as they enter the syndrome module. The Berlekemp-Massey algorithm is used to determine both the error locator and error evaluator polynomials. The Chien Search and Forney's algorithms operate sequentially to solve for the error locations and error values respectively. The error values are exclusive or-ed with the buffered messages in order to correct the errors, as the processed data leave the chip

    High Speed Hardware Architecture to Compute GF(p) Montgomery Inversion with Scalability Features

    Get PDF
    Modular inversion is a fundamental process in several cryptographic systems. It can be computed in software or hardware, but hardware computation has been proven to be faster and more secure. This research focused on improving an old scalable inversion hardware architecture proposed in 2004 for finite field GF(p). The architecture comprises two parts, a computing unit and a memory unit. The memory unit holds all the data bits of computation whereas the computing unit performs all the arithmetic operations in word (digit) by word bases such that the design is scalable. The main objective of this paper is to show the cost and benefit of modifying the memory unit to include shifting, which was previously one of the tasks of the scalable computing unit. The study included remodeling the entire hardware architecture removing the shifter from the scalable computing part and embedding it in the non-scalable memory unit instead. This modification resulted in a speedup to the complete inversion process with an area increase due to the new memory shifting unit. Several design schemes have been compared giving the user the complete picture to choose from depending on the application need
    corecore