We describe the hardware implementation of a novel algorithm for computing the discrete logarithm modulo $2^k$. The circuit has a total latency of less than $k$ 
Introduction
The hardware support of floating point arithmetic for high precision scientific computation has greatly advanced since the introduction of the IEEE floating point standard in the early 1980s. However, the hardware capabilities for integer arithmetic have been enhanced primarily at low precision, such as where 64 and/or 128 bit words may be partitioned to support parallel addition/multiplication of four 16/32-bit precision operands. The focus of this paper is to explore hardware implementation of the discrete logarithm modulo $2^k$. Following [4] we employ modular function notation, using $|n|_m = j$ to denote the congruence relation $n \equiv j \pmod{m}$ with the further restriction that $j$ is the standard residue for modulus $m \ge 2$ satisfying $0 \le j \le m - 1$. Herein we restrict our investigation exclusively to moduli $m = 2^k$ for $k \ge 3$, with particular interest in time and space feasible algorithms for hardware implementation where $k$ = 64, 128, 256, 512 or 1024. We utilize the value 3 as the logarithmic base for our discrete logarithm as it has useful properties in combination with the modulus $2^k$. The discrete logarithm problem asks for determination of the minimum exponent $e$ (when it exists) for given $0 \le j \le 2^k - 1$ such that $|3^e|_{2^k} = j$.

Extending the hardware support to include the discrete logarithm modulo $2^k$ will be beneficial for number systems such as the one introduced by Benschop in [3]. Here, integer numbers $N$ with fixed precision $k$ are represented by a factorization $(a, b)$. An algorithmic conversion from the standard binary representation to the number system presented in [3] is needed. The operations corresponding to the conversions to and from that number system are the discrete logarithm and the exponential residue operation.
Known algorithms for determination of the discrete logarithm for an arbitrary modulus are very complex. Existing algorithms such as Pollard's rho and lambda algorithms, the Pohlig-Hellman algorithm, the index calculus method, and Shanks' baby-step giant-step algorithm have super-polynomial running times that are at best sub-exponential. Only Shanks' baby-step giant-step algorithm is deterministic. In the recent papers [1] [2], two discrete logarithm algorithms for the special case of modulo $2^k$ were introduced. The algorithm described in [1] uses binary arithmetic with 3 as the logarithmic base and has a critical path containing one modulo $2^k$ multiplication operation for each of its $k$ iterations. Extensions of the algorithm to other logarithmic bases and computations using digits in a higher radix $2^r$ are also described. The algorithm in [1] [2] is well suited for implementation in special purpose hardware.
The paper is organized as follows. In Section 2, we introduce a new one-to-one bit string encoding between Benschop's modular factors and standard binary. To support the determination of the encoding, we review background material, which is also a condensed version of the algorithm described in [2]. We summarize it here to make this paper self-contained. Section 3 describes our implementation. Section 4 describes experimental results and cell library information, and conclusions are presented in Section 5.
Preliminaries and Algorithm
It is readily shown (see [1]) that the set of values [...] representation system that provides a one-to-one mapping between k-bit strings and standard binary k-bit strings. Our encoding further contains a technical detail to ensure that the one-to-one mapping also satisfies an "inheritance" property. More information on the encoding and the practical value of the inheritance property in reducing the size of associated lookup tables is given in [7] and [8]. Herein we simply utilize an example table to illustrate the encoding and some of its significant properties.
The one-to-one mapping between 5-bit strings and binary strings is given by the conversion table illustrated in Table 1. The string is partitioned as follows to determine the three exponents s, p and e. Consider the line in the table for string 10110, which yields binary 01110 = 14 (decimal).
The parsing begins from the right-hand side, determining the variable-length field identifying [...] Conversion from this discrete logarithm number system to a standard binary representation is possible by this procedure using the integer multiplier to evaluate the product $(-1)^s 2^p 3^e$. As discovered by Benschop, the discrete logarithm format allows integer multiplications to be handled by addition (and left shifts to accumulate the factors of 2). In [1] and [2] the authors described an efficient algorithm for determining the discrete logarithm of an odd integer, which then supports the necessary conversion from binary into the discrete logarithm number system.
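The factorization and the "multiplication by addition" property can be illustrated with a small Python sketch. It is only a brute-force reference model under stated assumptions: the function names `to_dl`, `from_dl` and `dl_multiply` are hypothetical, the exponent e is searched modulo the full $2^k$, and the bit-string encoding of Table 1 is not modeled.

```python
def to_dl(n: int, k: int):
    """Brute-force (s, p, e) with n = (-1)^s * 2^p * 3^e (mod 2^k), for 0 < n < 2^k."""
    mod = 1 << k
    if k < 3 or not 0 < n < mod:
        raise ValueError("need k >= 3 and 0 < n < 2**k")
    p = 0
    while n % 2 == 0:          # strip the factors of two
        n //= 2
        p += 1
    power = 1                  # runs through 3^0, 3^1, ... modulo 2^k
    for e in range(1 << (k - 2)):
        if power == n:
            return 0, p, e
        if (mod - power) % mod == n:
            return 1, p, e
        power = (3 * power) % mod
    raise AssertionError("every odd residue is +/- a power of 3 when k >= 3")


def from_dl(s: int, p: int, e: int, k: int) -> int:
    """Evaluate the product (-1)^s * 2^p * 3^e modulo 2^k."""
    mod = 1 << k
    return ((-1) ** s * (1 << p) * pow(3, e, mod)) % mod


def dl_multiply(x, y, k: int):
    """Multiply two factored numbers using only additions of the exponents."""
    (s1, p1, e1), (s2, p2, e2) = x, y
    return s1 ^ s2, p1 + p2, (e1 + e2) % (1 << (k - 2))


if __name__ == "__main__":
    print(to_dl(14, 5))                                   # (1, 1, 6)
    print(from_dl(*dl_multiply(to_dl(14, 5), to_dl(3, 5), 5), 5))  # 10 = 42 mod 32
```

The value 14 with k = 5 is the binary value from the table example above; the triple found here satisfies the product formula, but it need not coincide with the field values stored in the table's bit string.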
In this section we point out the essential mathematical properties that make the algorithm we implemented feasible. Let dlg($A$) and [...] denote the discrete logarithm modulo $2^k$ of $A$ with logarithmic base 3 and, respectively, the discrete logarithm modulo $M$ with logarithmic base [...] of $A$. Let [...]
In [1] it is shown that:
The correct selection for [...]. That is: [...] in line L6 to be 1, along with the corresponding update of $P$ (which is equivalent to $P_2 = 3 \times P$ in L7).
The second stage contains the main iteration step and is represented by lines L9 through L14, where both $P$ and the exponent e′ are updated. $P$ is conceptually updated as $P_{i+1} = P_i$ [...] The final result is computed in line L15 as the sign $s$ and the exponent
. This is because e′ really
, and, as a direct consequence of (a.), we have that dlg((−1)
The updating of e′ and P in lines L11 and L12 can be performed concurrently. As can be seen by inspection of the algorithm, the time complexity is essentially $k$ dependent shift-and-add modulo $2^k$ operations.
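The two-stage shift-and-add structure described above can be modeled in software. The following Python sketch is a reconstruction under stated assumptions, not the algorithm listing from [1] or [2]: the names `build_dlg_table` and `dlg` are hypothetical, the multiplying factors are assumed to be $2^i + 1$ for $i = 3, \ldots, k-1$, and the table of their discrete logarithms is filled by brute force, which is practical only for small $k$.

```python
def build_dlg_table(k: int) -> dict:
    """Return {i: dlg(2**i + 1)} for i = 3 .. k-1 (base 3, modulo 2**k).

    Brute force stands in for precomputed table contents (an assumption).
    """
    mod = 1 << k
    exponent_of = {}
    power = 1
    for e in range(1 << (k - 2)):   # 3 has multiplicative order 2**(k-2) modulo 2**k
        exponent_of[power] = e
        power = (3 * power) % mod
    return {i: exponent_of[(1 << i) + 1] for i in range(3, k)}


def dlg(a: int, k: int, table=None):
    """Return (s, e) with a = (-1)**s * 3**e (mod 2**k) for odd a."""
    if k < 3 or a % 2 == 0:
        raise ValueError("need k >= 3 and odd a")
    if table is None:
        table = build_dlg_table(k)
    p_mask = (1 << k) - 1            # P is kept to k bits
    e_mask = (1 << (k - 2)) - 1      # the exponent is kept to k-2 bits
    p = a & p_mask

    # Stage 1: the third LSB selects the sign; then force P = 1 (mod 8).
    s = (p >> 2) & 1
    if s:
        p = (-p) & p_mask
    e_acc = 0
    if p & 0b111 == 0b011:
        p = (p + (p << 1)) & p_mask  # P <- 3 * P
        e_acc = 1

    # Stage 2: clear bits 3 .. k-1 of P, one conditional shift-and-add per bit.
    for i in range(3, k):
        if (p >> i) & 1:
            p = (p + (p << i)) & p_mask          # P <- P * (2**i + 1)
            e_acc = (e_acc + table[i]) & e_mask  # accumulate dlg(2**i + 1)

    assert p == 1
    # e_acc is the logarithm of the inverse of +/- a, so negate it.
    return s, (-e_acc) & e_mask


if __name__ == "__main__":
    print(dlg(7, 5))   # (1, 6), since -(3**6) = -729 = 7 (mod 32)
```

In this sketch the two assignments inside the loop are independent of each other, which mirrors the remark that P and e′ can be updated concurrently.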
Hardware Implementation
The state diagram of the hardware implementation is given in Figure 1. There are four states: wait, init, loop and ready. The wait state is also the reset state. It accepts input when the load signal is one and the busy signal is zero. The init state corresponds to the initial checking part of the algorithm above. It sets the initial values for the loop according to the third least significant bit (LSB) and the three LSBs. The loop state repeatedly updates P and e′, and therefore determines the throughput of the whole system. The loop count goes from 3 to k, for at most k − 3 iterations. The ready state outputs the result. The circuit automatically returns to the wait state after the ready state.
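The state sequence can be traced with a small behavioral model. This Python sketch makes several assumptions that the text does not confirm: each state visit takes one clock cycle, the loop always runs its maximal k − 3 iterations, busy is asserted in every state except wait, and the names `State`, `ControllerModel` and `run_once` are hypothetical.

```python
from enum import Enum


class State(Enum):
    WAIT = "wait"
    INIT = "init"
    LOOP = "loop"
    READY = "ready"


class ControllerModel:
    """Cycle-level sketch of the four-state controller (assumed timing)."""

    def __init__(self, k: int):
        self.k = k
        self.state = State.WAIT      # wait doubles as the reset state
        self.count = 3

    @property
    def busy(self) -> bool:
        return self.state is not State.WAIT

    def step(self, load: bool = False) -> State:
        """Advance one clock cycle and return the new state."""
        if self.state is State.WAIT:
            if load:                 # input is accepted only when not busy
                self.state = State.INIT
        elif self.state is State.INIT:
            self.count = 3
            self.state = State.LOOP if self.k > 3 else State.READY
        elif self.state is State.LOOP:
            self.count += 1
            if self.count == self.k:
                self.state = State.READY
        else:                        # READY: output the result, then idle again
            self.state = State.WAIT
        return self.state


def run_once(k: int) -> list:
    """Return the states entered during one computation, ending back in wait."""
    ctl = ControllerModel(k)
    visited = [ctl.step(load=True)]
    while ctl.state is not State.WAIT:
        visited.append(ctl.step())
    return visited


if __name__ == "__main__":
    print([s.value for s in run_once(8)])
```

For k = 8 the trace contains init once, loop five times (k − 3), ready once, and then wait.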
There are three major components of the circuit: a controller, a ROM and a datapath, as shown in Figure 2.
The controller consists of a counter and a state control block (FSM). The FSM starts and stops the counting procedure. The output of the counter, count, is used for four purposes: as an address for the ROM, as an index for the bit-checker, as a control input for the shifter, and as feedback to the FSM for state transitions. The ROM is used as a lookup table for the dlg(τ) values. The major components in the datapath are two adders, one shifter and a unit called the bit-checker, which checks whether a certain bit is true or false. The output of the bit-checker controls the operation of the adders and shifter. If the output is false, no operation is performed; otherwise, the e′ value is updated by one adder and the P value is updated by the shifter and the other adder. The modulo operation in the algorithm is handled by limiting the sizes of P and e′: the size of P is set to k bits while the size of e′ is set to k − 2 bits. Thus, while updating P and e′, the result values may be longer than the specified size (overflow). We can simply ignore the overflowed bits since the computation is performed modulo $2^k$.
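That dropping overflow bits is equivalent to modular reduction can be checked with a short Python sketch. The function names `update_p` and `update_e` are hypothetical, and the sketch assumes the P update has the shift-and-add form P + (P << i); the register widths k and k − 2 are those given above.

```python
def update_p(p: int, i: int, k: int) -> int:
    """Shifter plus adder: P + (P << i), keeping only the low k bits."""
    return (p + (p << i)) & ((1 << k) - 1)


def update_e(e_acc: int, rom_value: int, k: int) -> int:
    """Adder for the exponent: keep only the low k-2 bits of the sum."""
    return (e_acc + rom_value) & ((1 << (k - 2)) - 1)


if __name__ == "__main__":
    k = 8
    for p in range(1 << k):
        for i in range(3, k):
            # Truncation to k bits gives the same value as reducing modulo 2**k.
            assert update_p(p, i, k) == (p * ((1 << i) + 1)) % (1 << k)
    print("truncation matches reduction modulo 2**k for k =", k)
```

Because the moduli are powers of two, no dedicated reduction hardware is needed in this model; fixing the register width performs the reduction.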
Experimental Results
We implemented the circuit using the Synopsys tool set based on a standard cell library from the Synopsys tutorial [5]. Table 3 shows the cell delay and transition delay calculated based on the output net total capacitance (cap.) for three types of registers in the library. Figure 3 shows the schematic view of the circuit after synthesis. Figure 4(a) shows the physical layout of the design with the standard cell library. Figure 4(b) shows the critical path of the design in the layout for k = 8. We implemented four designs corresponding to k = 8, 16, 32 and 64. Table 4 compares the layout results of the four designs. From the table, it is seen that as k increases, the speed decreases, since the output net total capacitance and the cell delay both increase. The area and the numbers of cells and nets also increase, since more cells are needed for a larger k.
From Table 3, it is seen that the cell library is not a fast technology library. Even with this library, we have reached an estimated speed of 500 MHz for k = 8. Also, the latency is linear with respect to k. Based on the above data, it is observed that the algorithm provides a very efficient way of calculating the discrete logarithm of a value modulo $2^k$.
Conclusion and future work
We presented a standard cell hardware implementation of a novel algorithm for computing the discrete logarithm modulo $2^k$. The algorithm has a critical path of less than $k$ shift-and-add modulo $2^k$ operations. We compared the physical standard cell implementations of the algorithm for k = 8, 16, 32 and 64. The experimental results confirm that the algorithm is an effective way of calculating the discrete logarithm of a value modulo $2^k$. The exponentiation modulo $2^k$ operation can also be implemented with a critical path consisting of $k$ lookup-table-determined shift-and-add modulo $2^k$ operations [6]. We are currently investigating a hardware extension where components could be shared between the discrete logarithm modulo $2^k$ and exponentiation modulo $2^k$. Based on these two operations, we can implement the multiplication circuit in [3], and provide support for more applications of modular arithmetic in the "hardware friendly" family of binary power moduli $2^k$.
