32 research outputs found

    Bit Serial Systolic Architectures for Multiplicative Inversion and Division over GF(2<sup>m</sup>)

    Get PDF
    Systolic architectures are capable of achieving high throughput by maximizing pipelining and by eliminating global data interconnects. Recursive algorithms with regular data flows are suitable for systolization. The computation of multiplicative inversion using algorithms based on EEA (Extended Euclidean Algorithm) are particularly suitable for systolization. Implementations based on EEA present a high degree of parallelism and pipelinability at bit level which can be easily optimized to achieve local data flow and to eliminate the global interconnects which represent most important bottleneck in todays sub-micron design process. The net result is to have high clock rate and performance based on efficient systolic architectures. This thesis examines high performance but also scalable implementations of multiplicative inversion or field division over Galois fields GF(2m) in the specific case of cryptographic applications where field dimension m may be very large (greater than 400) and either m or defining irreducible polynomial may vary. For this purpose, many inversion schemes with different basis representation are studied and most importantly variants of EEA and binary (Stein's) GCD computation implementations are reviewed. A set of common as well as contrasting characteristics of these variants are discussed. As a result a generalized and optimized variant of EEA is proposed which can compute division, and multiplicative inversion as its subset, with divisor in either polynomial or triangular basis representation. Further results regarding Hankel matrix formation for double-basis inversion is provided. The validity of using the same architecture to compute field division with polynomial or triangular basis representation is proved. Next, a scalable unidirectional bit serial systolic array implementation of this proposed variant of EEA is implemented. Its complexity measures are defined and these are compared against the best known architectures. It is shown that assuming the requirements specified above, this proposed architecture may achieve a higher clock rate performance w. r. t. other designs while being more flexible, reliable and with minimum number of inter-cell interconnects. The main contribution at system level architecture is the substitution of all counter or adder/subtractor elements with a simpler distributed and free of carry propagation delays structure. Further a novel restoring mechanism for result sequences of EEA is proposed using a double delay element implementation. Finally, using this systolic architecture a CMD (Combined Multiplier Divider) datapath is designed which is used as the core of a novel systolic elliptic curve processor. This EC processor uses affine coordinates to compute scalar point multiplication which results in having a very small control unit and negligible with respect to the datapath for all practical values of m. The throughput of this EC based on this bit serial systolic architecture is comparable with designs many times larger than itself reported previously

    A VLSI synthesis of a Reed-Solomon processor for digital communication systems

    Get PDF
    The Reed-Solomon codes have been widely used in digital communication systems such as computer networks, satellites, VCRs, mobile communications and high- definition television (HDTV), in order to protect digital data against erasures, random and burst errors during transmission. Since the encoding and decoding algorithms for such codes are computationally intensive, special purpose hardware implementations are often required to meet the real time requirements. -- One motivation for this thesis is to investigate and introduce reconfigurable Galois field arithmetic structures which exploit the symmetric properties of available architectures. Another is to design and implement an RS encoder/decoder ASIC which can support a wide family of RS codes. -- An m-programmable Galois field multiplier which uses the standard basis representation of the elements is first introduced. It is then demonstrated that the exponentiator can be used to implement a fast inverter which outperforms the available inverters in GF(2m). Using these basic structures, an ASIC design and synthesis of a reconfigurable Reed-Solomon encoder/decoder processor which implements a large family of RS codes is proposed. The design is parameterized in terms of the block length n, Galois field symbol size m, and error correction capability t for the various RS codes. The design has been captured using the VHDL hardware description language and mapped onto CMOS standard cells available in the 0.8-µm BiCMOS design kits for Cadence and Synopsys tools. The experimental chip contains 218,206 logic gates and supports values of the Galois field symbol size m = 3,4,5,6,7,8 and error correction capability t = 1,2,3, ..., 16. Thus, the block length n is variable from 7 to 255. Error correction t and Galois field symbol size m are pin-selectable. -- Since low design complexity and high throughput are desired in the VLSI chip, the algebraic decoding technique has been investigated instead of the time or transform domain. The encoder uses a self-reciprocal generator polynomial which structures the codewords in a systematic form. At the beginning of the decoding process, received words are initially stored in the first-in-first-out (FIFO) buffer as they enter the syndrome module. The Berlekemp-Massey algorithm is used to determine both the error locator and error evaluator polynomials. The Chien Search and Forney's algorithms operate sequentially to solve for the error locations and error values respectively. The error values are exclusive or-ed with the buffered messages in order to correct the errors, as the processed data leave the chip

    Hardware Implementations for Symmetric Key Cryptosystems

    Get PDF
    The utilization of global communications network for supporting new electronic applications is growing. Many applications provided over the global communications network involve exchange of security-sensitive information between different entities. Often, communicating entities are located at different locations around the globe. This demands deployment of certain mechanisms for providing secure communications channels between these entities. For this purpose, cryptographic algorithms are used by many of today\u27s electronic applications to maintain security. Cryptographic algorithms provide set of primitives for achieving different security goals such as: confidentiality, data integrity, authenticity, and non-repudiation. In general, two main categories of cryptographic algorithms can be used to accomplish any of these security goals, namely, asymmetric key algorithms and symmetric key algorithms. The security of asymmetric key algorithms is based on the hardness of the underlying computational problems, which usually require large overhead of space and time complexities. On the other hand, the security of symmetric key algorithms is based on non-linear transformations and permutations, which provide efficient implementations compared to the asymmetric key ones. Therefore, it is common to use asymmetric key algorithms for key exchange, while symmetric key counterparts are deployed in securing the communications sessions. This thesis focuses on finding efficient hardware implementations for symmetric key cryptosystems targeting mobile communications and resource constrained applications. First, efficient lightweight hardware implementations of two members of the Welch-Gong (WG) family of stream ciphers, the WG(29,11)\left(29,11\right) and WG-1616, are considered for the mobile communications domain. Optimizations in the WG(29,11)\left(29,11\right) stream cipher are considered when the GF(229)GF\left(2^{29}\right) elements are represented in either the Optimal normal basis type-II (ONB-II) or the Polynomial basis (PB). For WG-1616, optimizations are considered only for PB representations of the GF(216)GF\left(2^{16}\right) elements. In this regard, optimizations for both ciphers are accomplished mainly at the arithmetic level through reducing the number of field multipliers, based on novel trace properties. In addition, other optimization techniques such as serialization and pipelining, are also considered. After this, the thesis explores efficient hardware implementations for digit-level multiplication over binary extension fields GF(2m)GF\left(2^{m}\right). Efficient digit-level GF(2m)GF\left(2^{m}\right) multiplications are advantageous for ultra-lightweight implementations, not only in symmetric key algorithms, but also in asymmetric key algorithms. The thesis introduces new architectures for digit-level GF(2m)GF\left(2^{m}\right) multipliers considering the Gaussian normal basis (GNB) and PB representations of the field elements. The new digit-level GF(2m)GF\left(2^{m}\right) single multipliers do not require loading of the two input field elements in advance to computations. This feature results in high throughput fast multiplication in resource constrained applications with limited capacity of input data-paths. The new digit-level GF(2m)GF\left(2^{m}\right) single multipliers are considered for both the GNB and PB. In addition, for the GNB representation, new architectures for digit-level GF(2m)GF\left(2^{m}\right) hybrid-double and hybrid-triple multipliers are introduced. The new digit-level GF(2m)GF\left(2^{m}\right) hybrid-double and hybrid-triple GNB multipliers, respectively, accomplish the multiplication of three and four field elements using the latency required for multiplying two field elements. Furthermore, a new hardware architecture for the eight-ary exponentiation scheme is proposed by utilizing the new digit-level GF(2m)GF\left(2^{m}\right) hybrid-triple GNB multipliers

    Efficient Arithmetic for the Implementation of Elliptic Curve Cryptography

    Get PDF
    The technology of elliptic curve cryptography is now an important branch in public-key based crypto-system. Cryptographic mechanisms based on elliptic curves depend on the arithmetic of points on the curve. The most important arithmetic is multiplying a point on the curve by an integer. This operation is known as elliptic curve scalar (or point) multiplication operation. A cryptographic device is supposed to perform this operation efficiently and securely. The elliptic curve scalar multiplication operation is performed by combining the elliptic curve point routines that are defined in terms of the underlying finite field arithmetic operations. This thesis focuses on hardware architecture designs of elliptic curve operations. In the first part, we aim at finding new architectures to implement the finite field arithmetic multiplication operation more efficiently. In this regard, we propose novel schemes for the serial-out bit-level (SOBL) arithmetic multiplication operation in the polynomial basis over F_2^m. We show that the smallest SOBL scheme presented here can provide about 26-30\% reduction in area-complexity cost and about 22-24\% reduction in power consumptions for F_2^{163} compared to the current state-of-the-art bit-level multiplier schemes. Then, we employ the proposed SOBL schemes to present new hybrid-double multiplication architectures that perform two multiplications with latency comparable to the latency of a single multiplication. Then, in the second part of this thesis, we investigate the different algorithms for the implementation of elliptic curve scalar multiplication operation. We focus our interest in three aspects, namely, the finite field arithmetic cost, the critical path delay, and the protection strength from side-channel attacks (SCAs) based on simple power analysis. In this regard, we propose a novel scheme for the scalar multiplication operation that is based on processing three bits of the scalar in the exact same sequence of five point arithmetic operations. We analyse the security of our scheme and show that its security holds against both SCAs and safe-error fault attacks. In addition, we show how the properties of the proposed elliptic curve scalar multiplication scheme yields an efficient hardware design for the implementation of a single scalar multiplication on a prime extended twisted Edwards curve incorporating 8 parallel multiplication operations. Our comparison results show that the proposed hardware architecture for the twisted Edwards curve model implemented using the proposed scalar multiplication scheme is the fastest secure SCA protected scalar multiplication scheme over prime field reported in the literature

    Concurrent Error Detection in Finite Field Arithmetic Operations

    Get PDF
    With significant advances in wired and wireless technologies and also increased shrinking in the size of VLSI circuits, many devices have become very large because they need to contain several large units. This large number of gates and in turn large number of transistors causes the devices to be more prone to faults. These faults specially in sensitive and critical applications may cause serious failures and hence should be avoided. On the other hand, some critical applications such as cryptosystems may also be prone to deliberately injected faults by malicious attackers. Some of these faults can produce erroneous results that can reveal some important secret information of the cryptosystems. Furthermore, yield factor improvement is always an important issue in VLSI design and fabrication processes. Digital systems such as cryptosystems and digital signal processors usually contain finite field operations. Therefore, error detection and correction of such operations have become an important issue recently. In most of the work reported so far, error detection and correction are applied using redundancies in space (hardware), time, and/or information (coding theory). In this work, schemes based on these redundancies are presented to detect errors in important finite field arithmetic operations resulting from hardware faults. Finite fields are used in a number of practical cryptosystems and channel encoders/decoders. The schemes presented here can detect errors in arithmetic operations of finite fields represented in different bases, including polynomial, dual and/or normal basis, and implemented in various architectures, including bit-serial, bit-parallel and/or systolic arrays

    New Hardware Algorithms and Designs for Montgomery Modular Inverse Computation in Galois Fields GF(p) and GF(2n)

    Get PDF
    The computation of the inverse of a number in finite fields, namely Galois Fields GF(p) or GF(2n), is one of the most complex arithmetic operations in cryptographic applications. In this work, we investigate the GF(p) inversion and present several phases in the design of efficient hardware implementations to compute the Montgomery modular inverse. We suggest a new correction phase for a previously proposed almost Montgomery inverse algorithm to calculate the inversion in hardware. It is also presented how to obtain a fast hardware algorithm to compute the inverse by multi-bit shifting method. The proposed designs have the hardware scalability feature, which means that the design can fit on constrained areas and still handle operands of any size. In order to have long-precision calculations, the module works on small precision words. The word-size, on which the module operates, can be selected based on the area and performance requirements. The upper limit on the operand precision is dictated only by the available memory to store the operands and internal results. The scalable module is in principle capable of performing infinite-precision Montgomery inverse computation of an integer, modulo a prime number. We also propose a scalable and unified architecture for a Montgomery inverse hardware that operates in both GF(p) and GF(2n) fields. We adjust and modify a GF(2n) Montgomery inverse algorithm to benefit from multi-bit shifting hardware features making it very similar to the proposed best design of GF(p) inversion hardware. We compare all scalable designs with fully parallel ones based on the same basic inversion algorithm. All scalable designs consumed less area and in general showed better performance than the fully parallel ones, which makes the scalable design a very efficient solution for computing the long precision Montgomery inverse

    New Hardware Algorithms and Designs for Montgomery Modular Inverse Computation in Galois Fields GF(p) and GF(2n)

    Get PDF
    The computation of the inverse of a number in finite fields, namely Galois Fields GF(p) or GF(2n), is one of the most complex arithmetic operations in cryptographic applications. In this work, we investigate the GF(p) inversion and present several phases in the design of efficient hardware implementations to compute the Montgomery modular inverse. We suggest a new correction phase for a previously proposed almost Montgomery inverse algorithm to calculate the inversion in hardware. It is also presented how to obtain a fast hardware algorithm to compute the inverse by multi-bit shifting method. The proposed designs have the hardware scalability feature, which means that the design can fit on constrained areas and still handle operands of any size. In order to have long-precision calculations, the module works on small precision words. The word-size, on which the module operates, can be selected based on the area and performance requirements. The upper limit on the operand precision is dictated only by the available memory to store the operands and internal results. The scalable module is in principle capable of performing infinite-precision Montgomery inverse computation of an integer, modulo a prime number. We also propose a scalable and unified architecture for a Montgomery inverse hardware that operates in both GF(p) and GF(2n) fields. We adjust and modify a GF(2n) Montgomery inverse algorithm to benefit from multi-bit shifting hardware features making it very similar to the proposed best design of GF(p) inversion hardware. We compare all scalable designs with fully parallel ones based on the same basic inversion algorithm. All scalable designs consumed less area and in general showed better performance than the fully parallel ones, which makes the scalable design a very efficient solution for computing the long precision Montgomery inverse

    New Hardware Algorithms and Designs for Montgomery Modular Inverse Computation in Galois Fields GF(p) and GF(2n)

    Get PDF
    The computation of the inverse of a number in finite fields, namely Galois Fields GF(p) or GF(2n), is one of the most complex arithmetic operations in cryptographic applications. In this work, we investigate the GF(p) inversion and present several phases in the design of efficient hardware implementations to compute the Montgomery modular inverse. We suggest a new correction phase for a previously proposed almost Montgomery inverse algorithm to calculate the inversion in hardware. It is also presented how to obtain a fast hardware algorithm to compute the inverse by multi-bit shifting method. The proposed designs have the hardware scalability feature, which means that the design can fit on constrained areas and still handle operands of any size. In order to have long-precision calculations, the module works on small precision words. The word-size, on which the module operates, can be selected based on the area and performance requirements. The upper limit on the operand precision is dictated only by the available memory to store the operands and internal results. The scalable module is in principle capable of performing infinite-precision Montgomery inverse computation of an integer, modulo a prime number. We also propose a scalable and unified architecture for a Montgomery inverse hardware that operates in both GF(p) and GF(2n) fields. We adjust and modify a GF(2n) Montgomery inverse algorithm to benefit from multi-bit shifting hardware features making it very similar to the proposed best design of GF(p) inversion hardware. We compare all scalable designs with fully parallel ones based on the same basic inversion algorithm. All scalable designs consumed less area and in general showed better performance than the fully parallel ones, which makes the scalable design a very efficient solution for computing the long precision Montgomery inverse
    corecore