Abstract. This paper describes new methods for producing optimal binary signed-digit representations, which can be useful in the fast computation of exponentiations. Contrary to existing algorithms, the digits are scanned from left to right (i.e., from the most significant position to the least significant position). This may lead to better performance in both hardware and software.
Introduction
Methods using signed-digit (sd) representations, also called redundant number representations, for fast parallel arithmetic were considered in the late 1950s by Avizienis [1]. More recently, algorithms using signed-digit representations with the digit set {−1, 0, 1} (in this paper, we call it the sd2 representation), together with their applications to efficient methods for addition, multiplication, division, and their VLSI chip designs, were presented in [2] [3] [4] [5] [6]. In general, after the computations in the sd2 domain, a converter is required to transform a number from its sd2 representation into the conventional binary representation. Such converters may be found in [7, Section 1.5], [8].
For many cryptosystems, (modular) exponentiation is one of the most time-consuming operations. Therefore, efficient algorithms to perform this operation are crucial to the performance of the resulting cryptographic protocols. Basically, when computing $\alpha^r$, two types of exponentiations may be distinguished.
- The first type involves exponentiations with a fixed exponent $r$, such as in the RSA cryptosystem [9, 10]. The goal is then to quickly compute $\alpha^r$ for randomly chosen $\alpha$. This is usually achieved thanks to addition chains [11, Section 4.6.3]. The problem of finding the shortest addition chain was shown to be NP-hard by Downey, Leong and Sethi [12], but good heuristics are known [13].
- In the second type of exponentiation, the base $\alpha$ is fixed and the exponent $r$ varies. Examples include the ElGamal cryptosystem [14] and its numerous variations [15, Section 11.5]. In that case, good performance is obtained by the basic square-and-multiply technique (see Section 2). If a larger amount of storage is available, this can be further improved via precomputations [16].
In this paper, we are mainly concerned with the second type of exponentiation. However, we note that our methods may lead to some advantages in the first type, too. Another application is when inverses can be virtually computed for free, as for elliptic curves [17] . The basic idea is to recode the exponent in a representation which has fewer nonzero digits, namely the sd2 representation. Already in 1951, this was successfully exploited by Booth to efficiently multiply two numbers [18] . An optimal version (in terms of the number of zero digits of the recoding) was later given by Reitwiesner [19] . In [20] , Jedwab and Mitchell rediscovered Reitwiesner's algorithm and slightly generalized it by taking as input any sd2 representation of the exponent (instead of the binary representation). Our algorithms also produce optimal outputs but scan the digits of the exponent from left to right, i.e., from the most significant digit (msd) to the least significant digit (lsd). This brings some advantages, especially in the hardware realization or for memory-constrained environments like smart-cards.
The rest of this paper is organized as follows. In Section 2, we review the square-and-multiply methods for fast exponentiation and extend them to exponents given in the sd2 representation. Section 3 presents Reitwiesner's algorithm. New exponent recoding algorithms are proposed in Section 4. Their hardware implementation is given in Section 5. Section 6 discusses further advantages of the proposed methods. Finally, we conclude in Section 7.
Notations
If $r = \sum_{i=0}^{m-1} r_i 2^i$ denotes the binary expansion of $r$, then we represent $r$ as the vector $(r_{m-1}, \ldots, r_0)_2$. The bit-length of $r$ is denoted by $|r|$. By abuse of notation, we do not distinguish between the value of $r$ and its representation and write $r = (r_{m-1}, \ldots, r_0)_2$. For signed-digit systems, we sometimes write $\bar{1}$ for $-1$. Moreover, if $r = \sum_{i=0}^{m} r'_i 2^i$ denotes the binary signed expansion of $r$ (that is, $r'_i \in \{\bar{1}, 0, 1\}$), then we also abuse the notation and write $r = (r'_m, \ldots, r'_0)_{\mathrm{sd2}}$. Let $S$ be a string. Then $S^k$ means $S, S, \ldots, S$ ($k$ times); for example, $(\langle 0, 1 \rangle^2, \bar{1})_{\mathrm{sd2}}$ represents $(0, 1, 0, 1, \bar{1})_{\mathrm{sd2}}$. If $t$ is a real number, then $\lfloor t \rfloor$ is the largest integer $\le t$ and $\lceil t \rceil$ is the least integer $\ge t$.
Throughout this paper, the multiplicative notation is used. However, the described techniques also apply to additively written sets (e.g., additive group of integers, points on an elliptic curve over a field). Exponentiation must then be understood as multiplication.
Binary Algorithms for Fast Exponentiation
The most commonly used algorithms for computing $\alpha^r$ are the binary methods [11, Section 4.6.3]. The binary methods (also called square-and-multiply methods) scan the bits of exponent $r$ either from right to left or from left to right (Fig. 1). At each step, a squaring is performed and, if the scanned bit is equal to 1, a multiplication is also performed. Let $r = \sum_{i=0}^{m-1} r_i 2^i$ (with $r_{m-1} = 1$) be the binary expansion of $r$. The right-to-left (rl) algorithm is based on the observation that $\alpha^r = \prod_{i=0}^{m-1} \bigl(\alpha^{2^i}\bigr)^{r_i}$.
[Fig. 1. Binary exponentiation algorithms: (a) right-to-left (rl); (b) left-to-right (lr). Input: $\alpha$, $r = (r_{m-1}, \ldots, r_0)_2$.]
We remark that the lr algorithm (Fig. 1 (b)) requires 2 registers (for $\alpha$ and for $M$) and that the rl algorithm (Fig. 1 (a)) requires one more register (for $S$). However, we note that $S$ can be used in place of $\alpha$ if the value of $\alpha$ is not needed thereafter. The rl algorithm has the advantage of being parallelizable: one multiplier performs the multiplications $M \leftarrow M \cdot S$ and another one performs the squarings $S \leftarrow S^2$. However, if only one multiplier is available, the lr algorithm may be preferred because the multiplications are always done by the fixed value $\alpha$, $M \leftarrow M \cdot \alpha$. So, if $\alpha$ has a special structure, these multiplications may be easier than multiplying two arbitrary numbers (see [15, Note 14.81] or [21, pp. 9-10] for examples of application).
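The two scanning orders described above can be sketched as follows. This is a minimal illustration, assuming arithmetic modulo an integer n > 1; the modulus `n` and the function names `exp_right_to_left` and `exp_left_to_right` are illustrative choices and are not taken from Fig. 1.

```python
def exp_right_to_left(alpha, r, n):
    """Right-to-left (rl) square-and-multiply: alpha**r mod n (n > 1 assumed)."""
    M, S = 1, alpha % n          # M accumulates the result, S holds alpha^(2^i)
    while r > 0:
        if r & 1:                # scanned bit r_i equals 1
            M = (M * S) % n      # M <- M * S
        S = (S * S) % n          # S <- S^2
        r >>= 1
    return M


def exp_left_to_right(alpha, r, n):
    """Left-to-right (lr) square-and-multiply: alpha**r mod n (n > 1 assumed)."""
    M = 1
    for bit in bin(r)[2:]:       # bits from most to least significant
        M = (M * M) % n          # M <- M^2
        if bit == "1":
            M = (M * alpha) % n  # M <- M * alpha (always the fixed base)
    return M
```

In the lr variant only `M` and `alpha` are live, whereas the rl variant also keeps `S`, which mirrors the register count discussed in the text.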
Let $\omega(r)$ denote the Hamming weight of $r$ (that is, the number of 1's in the binary representation of $r$). Both algorithms require $\omega(r) - 1$ multiplications and $m - 1$ squarings (we do not count multiplications by 1, nor $1 \cdot 1$, nor the last squaring in Fig. 1 (a)) to compute $\alpha^r$. It is well known that $m - 1$ is a lower bound for the number of squarings. However, the number of subsequent multiplications can be further reduced by using a recoding algorithm [20, 22] […], which requires 4 squarings and 1 multiplication.
If we allow the digits of the exponent to be in $\{\bar{1}, 0, 1\}$, then the binary methods to compute $\alpha^r$ are easily modified as depicted in Fig. 2. Note that the sd2 representation of $r$ may require an extra digit, $r'_m$; we refer to Section 3 for an explanation. We see that the modified right-to-left (mrl) algorithm (Fig. 2 (a)) now works differently and less efficiently. Indeed, when $r'_i = \bar{1}$, the inverse of $S$ has to be computed (note that the value of $S$ varies at each step). Most commonly used cryptosystems work in the multiplicative group of a finite field or ring. Inversion is then usually achieved via the extended Euclidean algorithm (see [23] for a specialized implementation). This is a rather costly operation; therefore the benefits resulting from the reduced number of multiplications may be cancelled out. The modified left-to-right (mlr) algorithm (Fig. 2 (b)) only needs the fixed value of $\alpha^{-1}$, which can be precomputed. Consequently, in the sequel, we will only consider the mlr algorithm. We note, however, that the mrl algorithm may be useful when the inverse is available at no cost, as for elliptic curves or for the additive group of integers.
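A minimal sketch of the mlr idea, assuming arithmetic modulo n with the base invertible modulo n. The function name `exp_mlr` and the list-of-digits input format are illustrative assumptions; Fig. 2 itself is not reproduced here.

```python
def exp_mlr(alpha, digits, n):
    """Modified left-to-right exponentiation with signed digits.

    digits: sd2 digits of the exponent, most significant first,
            each in {-1, 0, 1} (hypothetical input format).
    """
    alpha_inv = pow(alpha, -1, n)      # fixed inverse, computed once (Python 3.8+)
    M = 1
    for d in digits:
        M = (M * M) % n                # one squaring per digit
        if d == 1:
            M = (M * alpha) % n        # multiply by the fixed base
        elif d == -1:
            M = (M * alpha_inv) % n    # multiply by the fixed inverse
    return M
```

For instance, 15 = 16 − 1 can be written with the digits `[1, 0, 0, 0, -1]`, so `exp_mlr(3, [1, 0, 0, 0, -1], 1009)` performs, apart from operations on the value 1, four squarings and a single multiplication by the inverse.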
With the sd2 representation, the (minimal) Hamming weight $\omega(r)$ of $r$ (i.e., the number of nonzero digits in the sd2 expansion of $r$) is equal to $(m + 1)/3$, on average [24]. The computation of $\alpha^r$ can thus be performed with […] (see Fig. 1), on average. Assuming that a squaring is approximately as costly as a multiplication, we can roughly expect a gain of […].
The next section presents an algorithm to convert the exponent $r$ from its binary representation into its sd2 representation. This algorithm, due to Reitwiesner, is optimal in the sense that it gives an sd2 output with minimal Hamming weight. Unfortunately, Reitwiesner's algorithm scans the bits of the exponent from right to left, while we have seen that only the modified left-to-right algorithm may bring some advantages. These heterogeneous modes of operation require first recoding the exponent $r$ into its sd2 representation $(r'_m, \ldots, r'_0)_{\mathrm{sd2}}$ and temporarily storing it for later use in the mlr exponentiation algorithm. Since each digit in the sd2 representation is encoded with 2 bits, twice the memory space taken by $r$ is needed to store its sd2 representation. These shortcomings are alleviated in Section 4, where optimal left-to-right recoding algorithms are presented.
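As a rough order-of-magnitude check of the expected gain, one can count operations asymptotically. The counts used here are assumptions made for illustration: about $m$ squarings in both cases, about $m/2$ multiplications on average for the plain binary methods, and about $m/3$ for a minimum-weight sd2 recoding, with a squaring costing the same as a multiplication.

```latex
\[
  \text{binary: } m + \tfrac{m}{2} = \tfrac{3}{2}\,m,
  \qquad
  \text{sd2: } m + \tfrac{m}{3} = \tfrac{4}{3}\,m,
  \qquad
  \frac{\tfrac{3}{2}\,m - \tfrac{4}{3}\,m}{\tfrac{3}{2}\,m} = \frac{1}{9} \approx 11\%.
\]
```

Under these assumptions the saving is modest when squarings dominate; it grows when squarings are cheap relative to multiplications.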
Reitwiesner's Method
In binary signed-digit notation (i.e., using the digits $\{\bar{1}, 0, 1\}$), a number is not uniquely represented. Two representations $(a_\ell, a_{\ell-1}, \ldots, a_0)_{\mathrm{sd2}}$ and $(b_\ell, b_{\ell-1}, \ldots, b_0)_{\mathrm{sd2}}$ of the same number are equivalent if they have both the same length and the same Hamming weight; this equivalence will be denoted $(a_\ell, \ldots, a_0)_{\mathrm{sd2}} \equiv (b_\ell, \ldots, b_0)_{\mathrm{sd2}}$.
A binary signed-digit representation is said to be canonical (or sparse) if no two adjacent digits are nonzero. For that reason, some authors sometimes call it the nonadjacent form (naf) of a number [25] .
The canonical recoding was studied by Reitwiesner [19]. He proved that this representation is unique (if the binary representation is viewed as padded with an initial 0). Following Hwang [7, pp. 150-151], Reitwiesner's method to convert a number $r = (r_{m-1}, \ldots, r_0)_2$ with $r_i \in \{0, 1\}$ into its canonical form $r = (r'_m, \ldots, r'_0)_{\mathrm{sd2}}$ with $r'_i \in \{\bar{1}, 0, 1\}$ is given by the algorithm depicted in Fig. 3. This is also known as Booth canonical recoding algorithm; however, Booth's method does not present the naf property (see Footnote 2). Reitwiesner's algorithm is very efficient. It can be carried out using a look-up table ('X' stands for 0 or 1, that is, the output is independent of this value).
At first glance, it is not obvious that this algorithm yields the canonical representation of a number. However, if we closely observe how it works, we see that it comes down to subtracting $r$ from $3r$ (with the additional rule $0 - 1 = \bar{1}$) and then discarding the last (i.e., least significant) 0 [26]. If $\sum_{i=0}^{m} s_i 2^{i+1} + r_0$ denotes the binary expansion of $3r$, then the conventional pencil-and-paper method to add nonnegative integers [11, p. 251] gives $s_i = (c_i + r_i + r_{i+1}) \bmod 2 = c_i + r_i + r_{i+1} - 2\lfloor (c_i + r_i + r_{i+1})/2 \rfloor$, where $c_i$ is the carry-in. Moreover, since the carry-out $c_{i+1}$ is equal to 1 if and only if there are two or three 1's among $c_i$, $r_i$ and $r_{i+1}$, we can write $c_{i+1} = \lfloor (c_i + r_i + r_{i+1})/2 \rfloor$. Hence, $s_i = c_i + r_i + r_{i+1} - 2c_{i+1}$ and thus $r'_i = s_i - r_{i+1} = c_i + r_i - 2c_{i+1}$.
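The carry recurrence derived above translates directly into a right-to-left recoding routine. The following is a sketch, assuming $c_0 = 0$ and $r_i = 0$ for $i \ge m$; it uses the arithmetic form of the recurrence rather than the look-up table, and the names `naf_reitwiesner` and `sd2_value` are illustrative.

```python
def sd2_value(digits):
    """Integer value of a signed-digit vector given most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def naf_reitwiesner(r):
    """Right-to-left canonical recoding of r >= 0; returns (r'_m, ..., r'_0)."""
    m = max(r.bit_length(), 1)

    def bit(i):
        return (r >> i) & 1                 # r_i, equal to 0 for i >= m

    c = 0                                   # carry-in c_0 (assumed 0)
    digits = []
    for i in range(m + 1):                  # i = 0, 1, ..., m
        c_next = (c + bit(i) + bit(i + 1)) // 2
        digits.append(c + bit(i) - 2 * c_next)   # r'_i = c_i + r_i - 2 c_{i+1}
        c = c_next
    return digits[::-1]                     # most significant digit first
```

For example, `naf_reitwiesner(11)` yields the digits `[1, 0, -1, 0, -1]`, that is, 16 − 4 − 1.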
To see that the output is sparse, it suffices to remark that the algorithm scans the bits from right to left and replaces a consecutive block of several 1's by a block of 0's and $\bar{1}$ according to […]. Furthermore, if two blocks of 1's are separated by an isolated 0, the algorithm implicitly uses the fact that […]. A more formal (but less intuitive) proof of the sparseness property is given in the next lemma.
The main advantage of Reitwiesner's algorithm is that, in some sense, it is optimal. Indeed, Reitwiesner [19] proved the following.
Proposition 1. Among the sd2 representations, the canonical representation has minimal Hamming weight. □
However, this does not rule out the existence of other minimal representations. For example, $(1, 0, 1, 1)_{\mathrm{sd2}}$ and $(1, 0, \bar{1}, 0, \bar{1})_{\mathrm{sd2}}$ are both minimal representations for 11.
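Proposition 1 and the example for 11 can be checked by exhaustive search on small values. The sketch below is only a sanity check, not a proof: it restricts the search to digit vectors of length $m + 1$, and it uses the standard "mod 4" formulation of the naf, which is a substitute for the algorithm of Fig. 3. All function names are illustrative.

```python
from itertools import product


def sd2_value(digits):
    """Integer value of a signed-digit vector, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def weight(digits):
    """Hamming weight: number of nonzero digits."""
    return sum(1 for d in digits if d != 0)


def naf(r):
    """Canonical (nonadjacent) form of r >= 0, most significant digit first."""
    digits = []
    while r > 0:
        d = 2 - (r % 4) if r % 2 else 0    # odd r: pick +1 or -1 so that r - d is 0 mod 4
        digits.append(d)
        r = (r - d) // 2
    return digits[::-1]


def min_weight(r, length):
    """Smallest weight among all sd2 vectors of the given length with value r."""
    return min(weight(ds)
               for ds in product((-1, 0, 1), repeat=length)
               if sd2_value(ds) == r)
```

With these helpers, `min_weight(11, 5)` equals 3, which is the weight of both `(1, 0, 1, 1)` and `(1, 0, -1, 0, -1)`.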
The general case was later addressed by Clark and Liang [26]. They present a minimal representation for any signed radix $b$. In that case, Arno and Wheeler [24] proved that the average proportion of nonzero digits is equal to $(b-1)/(b+1)$. This has to be compared with the average proportion $(b-1)/b$ of nonzero digits in the standard radix-$b$ representation. So, we see that exponent recoding is mostly interesting for the binary signed-digit representation ($b = 2$) because the savings decrease rapidly as $b$ grows.
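To see how quickly the savings shrink with the radix, the two proportions quoted above can be compared numerically. This is a small illustration; the function names are hypothetical and the only inputs are the two formulas stated in the text.

```python
from fractions import Fraction


def density_standard(b):
    """Average proportion of nonzero digits in the standard radix-b representation."""
    return Fraction(b - 1, b)


def density_signed(b):
    """Average proportion of nonzero digits in the minimal signed radix-b representation."""
    return Fraction(b - 1, b + 1)


def relative_saving(b):
    """Fraction of nonzero digits saved by recoding; simplifies to 1 / (b + 1)."""
    return 1 - density_signed(b) / density_standard(b)


if __name__ == "__main__":
    for b in (2, 3, 4, 8, 16):
        print(b, density_standard(b), density_signed(b), relative_saving(b))
```

The relative saving is 1/3 for b = 2 but only 1/5 for b = 4 and 1/17 for b = 16.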
Proposed Methods
In Section 2, we pointed out that a left-to-right recoding algorithm might be desirable for fast exponentiation. Designing such an algorithm is not as straightforward as it appears and is even considered a hard problem by some authors [27].
Our first algorithm is an adaptation of Reitwiesner's algorithm. Like the right-to-left algorithm, it has the naf property. Unfortunately, look-up tables cannot be used. Our second algorithm does not have the naf property but is as efficient as Reitwiesner's algorithm. Moreover, it enables the use of a look-up table.
A Simple Left-to-right Recoding Algorithm
The interpretation of the naf given in the previous section suggests a simple way to construct a left-to-right recoding algorithm. Let $r = (r_{m-1}, \ldots, r_0)_2$ and $3r = (s_m, \ldots, s_0, r_0)_2$; then the naf for $r$ is $(r'_m, \ldots, r'_0)_{\mathrm{sd2}}$ with $r'_i = s_i - r_{i+1}$. […] must be available.
On the other hand, we remark that at step $i = I$, variable $j$ is only decremented when $r_{I+1} = r_I$. So, if $J$ denotes the value of $j$ before entering the while loop, we see that this loop is only executed after a block $(r_{J+1}, r_J, \ldots, r_{I+1}, r_I)$ either of the form […]. We therefore introduce an additional variable $b$ to distinguish between the two cases. Because of the alternation of 0 and 1 inside a block, we do not have to know the value of the $r_{j+1}$'s inside the while loop; only the value of the first two consecutive equal bits (i.e., $r_{J+1}$ and $r_J$) is necessary: we use the variable $b$ to keep track of this value, that is, before entering the loop, $b$ contains the value of $r_J$ ($b = 0$ in Case (B1) and $b = 1$ in Case (B2)). Next, inside the while loop, we alternately subtract 0 or 1, starting with 0 or 1 depending on the value of $b$.
Putting everything together, we finally obtain our first algorithm (Fig. 6). While this algorithm yields the canonical representation, it is not fully satisfying. It looks quite cumbersome compared to the original Reitwiesner's algorithm (see Fig. 3). The next paragraph considers another minimal representation which leads to a very elegant left-to-right recoding algorithm.
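The "subtract r from 3r" interpretation used in this paragraph can be illustrated in a few lines. This sketch is not the algorithm of Fig. 6: it forms the whole product 3r first, which is precisely what a true left-to-right method must avoid because of carry propagation. The function names are illustrative.

```python
def sd2_value(digits):
    """Integer value of a signed-digit vector, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def naf_from_3r(r):
    """naf of r >= 0 obtained digit-wise as r'_i = s_i - r_{i+1}, most significant first.

    s_i is bit i+1 of 3r, following 3r = (s_m, ..., s_0, r_0)_2.
    """
    m = max(r.bit_length(), 1)
    t = 3 * r
    return [((t >> (i + 1)) & 1) - ((r >> (i + 1)) & 1)
            for i in range(m, -1, -1)]
```

Once the bits of 3r are known, each output digit depends only on one bit of 3r and one bit of r, so the digits can be emitted in any order, including from the most significant end.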
Minimum-weight Left-to-right Recoding Algorithm
The main difficulty in the previous algorithm comes from the fact that it can only "decide" what the output will be after two consecutive equal bits. In other words, when entering an alternating block $\langle 0, 1 \rangle^k$ (or $\langle 1, 0 \rangle^k$), the algorithm must know a priori whether this block will end with two consecutive bits equal to 0 or to 1. The next lemma makes it possible to produce a minimal (albeit not sparse) output whatever the ending bits of an alternating block. The while loops in the canonical algorithm (Fig. 6) can therefore be removed.
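Lemma 2. In sd2 representation, we have the following equivalences:
$(\langle 0, 1 \rangle^k, 1)_{\mathrm{sd2}} \equiv (1, \langle 0, \bar{1} \rangle^k)_{\mathrm{sd2}}$,
$(\langle 0, \bar{1} \rangle^k, \bar{1})_{\mathrm{sd2}} \equiv (\bar{1}, \langle 0, 1 \rangle^k)_{\mathrm{sd2}}$.
Proof. […] it may also be represented as $(1, \langle 0, \bar{1} \rangle^k)_{\mathrm{sd2}}$. Noting that $(\langle 0, \bar{1} \rangle^k, \bar{1})_{\mathrm{sd2}}$ and $(\bar{1}, \langle 0, 1 \rangle^k)_{\mathrm{sd2}}$ are both representations of $-N$, the second equivalence follows from the first one. □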
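The equivalences used in the proof can be verified mechanically for small k. The check below assumes the reading in which an alternating block ⟨0, 1⟩ repeated k times and followed by 1 is rewritten as 1 followed by ⟨0, −1⟩ repeated k times, together with its negated counterpart; the function names are illustrative.

```python
def sd2_value(digits):
    """Integer value of a signed-digit vector, most significant digit first."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def weight(digits):
    """Hamming weight: number of nonzero digits."""
    return sum(1 for d in digits if d != 0)


def equivalent(a, b):
    """Same value, same length and same Hamming weight."""
    return (sd2_value(a) == sd2_value(b)
            and len(a) == len(b)
            and weight(a) == weight(b))


def lemma2_holds(k):
    """Check both equivalences for a given repetition count k >= 0."""
    first = equivalent([0, 1] * k + [1], [1] + [0, -1] * k)
    second = equivalent([0, -1] * k + [-1], [-1] + [0, 1] * k)
    return first and second
```

For k = 2 the first equivalence is the pair `(0, 1, 0, 1, 1)` and `(1, 0, -1, 0, -1)`, both of value 11 and weight 3.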
Elementary Blocks [This paragraph explains how the recoding algorithm was found. The reader interested only in the algorithm itself may skip it.] The exponent $r$ to be recoded may be viewed as a succession of elementary blocks $(r_{J+1}, r_J, \ldots, r_{I+1}, r_I)$ with $J > I$, $r_{J+1} = r_J$ and $r_{I+1} = r_I$, and whose form is given by one of the following four possibilities. […] is a valid expression for $r'_i$.
Our Second Algorithm From the simple formulations for $b_i$ and $r'_i$ (see Eqs. (1) and (2)), we obtain an elegant and efficient left-to-right exponent recoding algorithm. As with Reitwiesner's algorithm, our second algorithm can be performed still more efficiently thanks to table look-up. The corresponding look-up table is given in Table 2 (as before, 'X' stands for 0 or 1).
Note that entries $(b_i, r_i, r_{i-1}, r_{i-2}) = (0, 1, 1, X)$ and $(1, 0, 0, X)$ are missing; see the discussion before Lemma 3 (next paragraph) for an explanation.
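Since Eqs. (1) and (2) and the table are not reproduced here, the following sketch uses a recurrence that is an assumption: $b_{i-1} = \lfloor (b_i + r_{i-1} + r_{i-2})/2 \rfloor$ and $r'_i = -2b_i + r_i + b_{i-1}$, with $b_m = r_m = r_{-1} = r_{-2} = 0$. It is consistent with the missing table entries mentioned above, but it should be read as one possible left-to-right recoding rather than as the exact content of Fig. 7. The names `recode_left_to_right` and `naf_weight` are illustrative.

```python
def recode_left_to_right(r):
    """Left-to-right signed-digit recoding of r >= 0 (assumed recurrence, see lead-in).

    Returns (r'_m, ..., r'_0), most significant digit first.
    """
    m = max(r.bit_length(), 1)

    def bit(i):
        return (r >> i) & 1 if i >= 0 else 0    # r_i, with r_{-1} = r_{-2} = 0

    b = 0                                        # b_m = 0
    digits = []
    for i in range(m, -1, -1):                   # i = m, m-1, ..., 0
        b_next = (b + bit(i - 1) + bit(i - 2)) // 2   # b_{i-1}
        digits.append(-2 * b + bit(i) + b_next)       # r'_i
        b = b_next
    return digits


def naf_weight(r):
    """Weight of the canonical (naf) recoding, used as the reference minimum."""
    w = 0
    while r > 0:
        d = 2 - (r % 4) if r % 2 else 0
        w += d != 0
        r = (r - d) // 2
    return w
```

For example, `recode_left_to_right(11)` gives `[0, 1, 1, 0, -1]` (8 + 4 − 1), which has the same weight as the naf of 11 but is not sparse. Each output digit depends only on the current state `b` and three input bits, which is what makes a table-driven implementation possible.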
Main Theorem From its construction, the proposed algorithm (Fig. 7) produces an sd2 representation with minimal Hamming weight. However, for completeness, we now give a formal proof of this assertion.
We first consider three auxiliary lemmas. Lemma 3 implies that the cases $(b_i, r_i, r_{i-1}, r_{i-2}) = (0, 1, 1, X)$ and $(b_i, r_i, r_{i-1}, r_{i-2}) = (1, 0, 0, X)$ never occur (see Table 2). Lemma 4 shows that if $b_i = r_i$ then the output is sparse. Finally, Lemma 5 proves that our algorithm effectively yields an sd2 representation.
[Table 2. Left-to-right sd exponent recoding.]
Proof. Suppose first that $r_{i-1}$ […] This concludes the proof. □
Hardware Implementation
Evidently, the proposed minimum-weight signed-digit converter (Fig. 7) is much more regular and thus simpler to implement in hardware. To implement the transformation algorithm, we encode $r'_i$ […] and we denote […] Table 2 were optimally assigned to (0, 1) and (1,1), respectively.
A hardware realization of these equations is given in Figure 8. Initially, the shift-left registers $\{r_i, r_{i-1}, r_{i-2}\}$ are loaded with $\{0, r_{m-1}, r_{m-2}\}$ and the latch is reset to logic "0". At the end of each iteration, the output […] is fed into the latch, the lower bit $r_{i-3}$ is prepared to be fed into the shift-left registers, and the registers shift one bit to the left.
Based on this minimum-weight signed-digit transformer, a hardware architecture for fast exponentiation, $\alpha^r$, is sketched in Figure 9. In that exponentiation hardware, the outputs of the signed-digit transformer, […] to determine which of $\alpha$ or $\alpha^{-1}$ has to be selected by the multiplexer MUX-1, and signal […]
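A software analogue of this datapath can make the data flow concrete: a generator plays the role of the signed-digit transformer and emits digits from the most significant end, while the exponentiation loop consumes each digit immediately and selects between the base and its inverse. This is only a sketch under assumptions: it is not the circuit of Figs. 8 and 9, the recurrence inside `sd2_digits_msd_first` is an assumed left-to-right recoding rule, arithmetic is modulo n with the base invertible, and all names are illustrative.

```python
def sd2_digits_msd_first(r):
    """Yield signed digits of r >= 0, most significant first (assumed recurrence)."""
    m = max(r.bit_length(), 1)

    def bit(i):
        return (r >> i) & 1 if i >= 0 else 0

    b = 0                                         # plays the role of the latch
    for i in range(m, -1, -1):
        b_next = (b + bit(i - 1) + bit(i - 2)) // 2
        yield -2 * b + bit(i) + b_next            # digit in {-1, 0, 1}
        b = b_next


def exp_streaming(alpha, r, n):
    """Compute alpha**r mod n without storing the recoded exponent."""
    alpha_inv = pow(alpha, -1, n)                 # fixed inverse (Python 3.8+)
    M = 1
    for d in sd2_digits_msd_first(r):
        M = (M * M) % n                           # squaring at every step
        if d != 0:
            operand = alpha if d == 1 else alpha_inv   # selection, as a multiplexer would do
            M = (M * operand) % n
    return M
```

Only the one-bit state `b` and a three-bit window of the exponent are needed at any time, so the recoded exponent never has to be stored.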
Other Considerations
For some special cases, the proposed algorithms may bring further advantages.
1. One of the main concerns of the proposed algorithms was to reduce the Hamming weight of exponent $r$ for the computation of $\alpha^r$. While the number of squarings still remains the same (i.e., $|r|$), the average number of multiplications was reduced from $|r|/2$ to $|r|/3$. Over GF($2^k$), squaring operations can be achieved via a simple circular shift if the elements are represented in the so-called normal basis [28]. Their computational costs can therefore be ignored compared to multiplications. In this case, the real expected gain over the binary methods is then $(|r|/2 - |r|/3)/(|r|/2) \approx 33.3\%$.
2. […] $\lfloor (c_i + r_i + r_{i+1})/2 \rfloor$, the parallel recoding method proposed by Koç [29] readily extends to our minimum-weight left-to-right recoding algorithm.
Conclusions
To improve the performance of the square-and-multiply exponentiation for the evaluation of $\alpha^r$, the Hamming weight of exponent $r$ should be reduced. This can be achieved by adopting a signed-digit representation for exponent $r$. Assuming that $\alpha^{-1}$ is provided along with $\alpha$ (or can be cheaply evaluated, as for elliptic curves), a minimum-weight signed-digit representation for $r$ can considerably speed up the computation of $\alpha^r$. However, in order to produce the minimum-weight signed-digit representation, existing solutions must initiate the recoding process from right to left while only the modified left-to-right (mlr) exponentiation algorithm is suitable. This heterogeneous processing brings some time and memory inefficiencies for hardware realization. Indeed, the recoding has first to be performed, and the resulting representation for $r$ must be saved (which requires twice as much memory space as its binary representation). Then and only then, the exponentiation process may start in order to obtain $\alpha^r$.
In this paper, new and homogeneous approaches for minimum-weight signed-digit representation are presented. Our methods are homogeneous in the sense that both the proposed recoding algorithms and the mlr exponentiation algorithm start from the most significant position of the exponent. The signed-digit representation of exponent $r$ is consequently obtained in real time and does not need to be stored. This better memory usage is especially useful for small devices like smart-cards. Our first algorithm gives the canonical representation for the exponent. This algorithm is then modified into another minimum-weight recoding algorithm so that table look-up is possible. A hardware implementation of this second converter and the corresponding architecture for exponentiation are also presented.
