5 research outputs found

    Area Efficient Modular Reduction in Hardware for Arbitrary Static Moduli

    Modular reduction is a crucial operation in many post-quantum cryptographic schemes, including the Kyber key exchange method and the Dilithium signature scheme. However, it can be computationally expensive and pose a performance bottleneck in hardware implementations. To address this issue, we propose a novel approach for computing modular reduction efficiently in hardware for arbitrary static moduli. Unlike other commonly used methods such as Barrett or Montgomery reduction, the method does not require any multiplications. It does not depend on properties of any particular choice of modulus for good performance and low area consumption. Its major strength lies in its low area consumption, which for Kyber and Dilithium was reduced by 60% relative to optimized Barrett implementations and by up to 90% relative to generic ones. Additionally, it is well suited for parallelization and pipelining, and its hardware resource consumption scales linearly with increasing operation width. All operations can be performed in the bit-width of the modulus, rather than the size of the number being reduced. This shortens carry chains and allows for faster clocking. Moreover, our method can be executed in constant time, which is essential for cryptographic applications where timing attacks can be used to obtain information about the secret key.
    Comment: 7 pages, 2 figures
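For context on the baseline this abstract compares against, the following is a minimal sketch of textbook Barrett reduction, not the paper's proposed method. The names `barrett_setup` and `barrett_reduce` are hypothetical, and the shift width `k = 24` is an assumption chosen to cover products of two values below the Kyber modulus 3329. It shows the two multiplications per reduction that a multiplier-free design avoids.

```python
def barrett_setup(q: int, k: int) -> int:
    """Precompute mu = floor(2^k / q) once for a static modulus q."""
    return (1 << k) // q


def barrett_reduce(x: int, q: int, mu: int, k: int) -> int:
    """Reduce 0 <= x < 2^k modulo q with two multiplications and one conditional subtract."""
    if not 0 <= x < (1 << k):
        raise ValueError("x must satisfy 0 <= x < 2^k")
    q_est = (x * mu) >> k      # first multiplication: quotient estimate, low by at most 1
    r = x - q_est * q          # second multiplication: remainder candidate in [0, 2q)
    if r >= q:                 # single correction step
        r -= q
    return r


Q = 3329                       # Kyber modulus
K = 24                         # assumed width: Q*Q < 2^24
MU = barrett_setup(Q, K)
```

With these assumed parameters, `barrett_reduce(a * b, Q, MU, K)` reduces the product of two already-reduced coefficients `a` and `b`.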

    NEUROSim: Naturally Extensible, Unique RISC Operation Simulator

    The NEUROSim framework consists of a compiler, assembler, and cycle-accurate processor simulator to facilitate computer architecture research. This framework provides a core instruction set common to many applications and a simulated datapath capable of executing these instructions. However, the core contribution of NEUROSim is its flexible and extensible design allowing for the addition of instructions and architecture changes which target a specific application. The NEUROSim framework is presented through the analysis of many system design decisions including execution forwarding, control change detection, FPU configuration, loop unrolling, recursive functions, self-modifying code, branch predictors, and cache architectures. To demonstrate its flexible nature, the NEUROSim framework is applied to specific domains including a modulo instruction intended for use in encryption applications, a multiply accumulate instruction analyzed in the context of digital signal processing, Taylor series expansion and lookup table instructions applied to mathematical expression approximation, and an atomic compare and swap instruction used for sorting.

    Extension and implementation of the mod without mod algorithm to efficiently compute the modulus of a number in hardware

    This thesis discusses a hardware implementation of modulo that does not require a multiplication. This implementation is based on the algorithm proposed in Mark A. Will's "Mod without mod", in which an algorithm is presented to calculate the modulus of large values using shifting and adding. This allows our implementation to be comparable in clock cycles to other implementations without the need for a multiplier's delay. This algorithm is compared with others, such as Barrett reduction, Montgomery reduction, and fast modular reduction. Our implementation of this modulo algorithm is shown to be faster in many cases. This thesis presents both a hardware implementation of this algorithm and synthesis results in soi12s0 45nm IBM Multi-threshold CMOS (MTCMOS) technology and ARM-based standard cells.
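To make the idea of multiplier-free reduction concrete, here is a minimal sketch of the classic bit-serial shift-and-subtract scheme. It is a generic illustration, not the algorithm from the thesis or from "Mod without mod", and the function name `mod_shift_subtract` is hypothetical. Each step uses only a shift, a comparison, and at most one subtraction.

```python
def mod_shift_subtract(x: int, m: int) -> int:
    """Reduce non-negative x modulo m using only shifts, compares, and subtractions."""
    if m <= 0:
        raise ValueError("modulus must be positive")
    if x < 0:
        raise ValueError("x must be non-negative")
    r = 0
    # Consume x from its most significant bit down to bit 0.
    for i in range(x.bit_length() - 1, -1, -1):
        r = (r << 1) | ((x >> i) & 1)  # r < m before the shift, so r < 2m after it
        if r >= m:                     # one conditional subtract restores r < m
            r -= m
    return r
```

Because the running remainder never exceeds twice the modulus, the working registers only need about one bit more than the modulus width, regardless of how wide `x` is.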

    Computing Mod Without Mod

    Abstract. Encryption algorithms are designed to be difficult to break without knowledge of the secrets or keys. To achieve this, the algorithms require the keys to be large, with some algorithms having a recommended size of 2048 bits or more. However, most modern processors only support computation on 64 bits at a time. Therefore, standard operations with large numbers are more complicated to implement. One operation that is particularly challenging to implement efficiently is modular reduction. In this paper we propose a highly efficient algorithm for solving large modulo operations; it has several advantages over current approaches as it supports the use of a variable-sized lookup table, has good spatial and temporal locality allowing data to be streamed, and only requires basic processor instructions. Our proposed algorithm is compared theoretically to widely used modular algorithms, and then compared in practice against the state-of-the-art GNU Multiple Precision (GMP) large number library.
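The role of a lookup table in this kind of reduction can be sketched as follows. This is a generic chunked-table scheme under stated assumptions, not the algorithm proposed in the paper: `build_table` and `mod_with_table` are hypothetical names, the modulus is assumed fixed in advance, and the table is precomputed with the ordinary `%` operator. At reduction time only table lookups, additions, comparisons, and subtractions are used, and the chunk width trades table size against the number of steps.

```python
def build_table(m: int, chunk_bits: int, num_chunks: int) -> list[list[int]]:
    """Precompute table[i][v] = (v * 2^(chunk_bits * i)) mod m for a fixed modulus m."""
    return [
        [(v << (chunk_bits * i)) % m for v in range(1 << chunk_bits)]
        for i in range(num_chunks)
    ]


def mod_with_table(x: int, m: int, table: list[list[int]], chunk_bits: int) -> int:
    """Reduce non-negative x modulo m by summing precomputed per-chunk residues."""
    if x < 0:
        raise ValueError("x must be non-negative")
    mask = (1 << chunk_bits) - 1
    acc = 0
    for row in table:            # one row per chunk, least significant chunk first
        acc += row[x & mask]     # both operands are below m, so acc < 2m
        if acc >= m:             # one conditional subtract keeps acc below m
            acc -= m
        x >>= chunk_bits
    if x:
        raise ValueError("x is wider than the table supports")
    return acc
```

For example, `build_table(3329, 4, 6)` covers inputs of up to 24 bits with six rows of 16 entries each; widening `chunk_bits` to 8 halves the number of steps at the cost of 256 entries per row.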