We 
Introduction
Most public key cryptosystems require resourceintensive arithmetic calculations in certain mathematical structures such as finite fields, groups and rings. The efficient realizations of the these operations including modular multiplication, inversion, and exponentiation are at the center of research activities in cryptographic engineering. In this study, new techniques of performing modular multiplication and exponentiation based on a new reduction operation are proposed (Section 2). These methods work completely in the frequency domain (spectrum) with some exceptions such as the initial, final and some partial transformations calculations.
Spectral techniques for integer multiplication have been known for over a quarter of a century. Using the spectral integer multiplication of Schönhage and Strassen [9] , large to extremely large sizes of numbers can efficiently be multiplied. Such computations are needed when computing π to millions of digits of precision, factoring and also big prime search projects.
A naive way of utilizing the spectral techniques for modular multiplication is first computing the multiplication using possibly Schönhage-Strassen and then performing the reduction in the time domain. This approach is preferable if the input length is large enough to meet the asymptotic crossover of Schönhage-Strassen, assuming the reduction has a constant cost. Additionally, if the naive method is used for operations involving consecutive multiplications, because of the costly forward and backward transformation computations, the asymptotic crossover of these operations would be similar to what a single modular multiplication has. Unfortunately these crossovers are larger than the key sizes of the most public-key cryptosystems; thus, in practice the naive way is hardly used.
We propose a modular multiplication that can be performed on the Fourier representations of integers. In such a representation, multiplications are readily available by the convolution property, and thus, operations involving several modular multiplications can be efficiently computed.
After giving some preliminary notation and a formal definition of Discrete Fourier Transform (DFT) over Z q (i.e., the ring of integers with multiplication and addition modulo a positive integer q), we describe our main idea of spectral modular arithmetic including spectral modular multiplication (SMM) and spectral modular exponentiation (SME) in the next section.
In Section 3, we describe methodologies for selecting the parameters for SME in order to apply the algorithm to public key cryptography. In Section 4, we turn our attention to the architectural issues and performance analysis: we show that spectral algorithms yield efficient and highly parallel architectures for hardware implementations of public-key cryptosystems. We conclude our work with some final comments in the final section.
Spectral Modular Arithmetic

Discrete Fourier Transform
Spectral techniques are widely accepted and used in the field of digital signal processing, hence most existing notation and concepts come from this theory. For many reasons, the signals and admissible operations on these signals of such a theory are quite different than of a theory of computer arithmetic. For instance, when we multiply integers using FFT (or convolutions), first we break the integers into words and then process on these partitions. Note that any small perturbation in these components would totally change represented integer. On the other hand, approximations on the signal components without changing the main characteristics of the original signal are allowable in digital signal processing.
Therefore, we believe we need to develop a more suitable notation that would permit us to have a better understanding of employing the spectral techniques for the computer arithmetic related problems. While doing this, we follow a polynomial representation instead of the standard sequence representation of digital signal processing. Such a presentation is necessary for our purposes and moreover, it states the different nature of number representing signals from a classical signal processing analysis. We start with a formal definition of DFT. 
Definition 1 Let
with the inverse
. We say x(t) and X(t) are transform pairs, x(t) is called a time polynomial and sometimes X(t) is named as the spectrum of x(t).
In the literature, DFT over a finite ring spectrum (1) is also known as the Number Theoretical Transform (NTT). Moreover, if q has some special form such as a Mersenne or a Fermat number, the transform named after this form; i.e., Mersenne Number Transform (MNT) or Fermat Number Transform (FNT) .
Note that, unlike the DFT over the complex numbers, the existence of DFT over finite rings is not trivial. In fact, Pollard [7] Proof. See Chapter 6 of Blahut [2] or Chapter 8 of Naussbaumer [6] .
Spectral Modular Reduction (SMR)
In order to compute the modular reduction, a Montgomery type [5] method can be employed. Instead of attacking to compute "x mod n" directly, it is possible to derive the reduction after performing a related computation
where gcd(n, r) = 1. At first glance, this seems computationally pointless because of inversion involved but the selection of r changes this first impression drastically. In Algorithm 1, we present such a reduction methodology after presenting a related notation. We gently refer the reader to [8] for a proof. 
Notation 1 Let
Y (t) = DF T (y(t)) where y ≡ x2 −db mod n and y = y(b), 1: Y (t) := X(t) 2: α := 0 3: for i = 0 to d − 1 4: y 0 := d −1 · (Y 0 + Y 1 + . . . + Y d−1 ) mod q 5: β := −(y 0 + α) mod b 6: α := (y 0 + α + β) div b 7: Y (t) := Y (t) + β · N (t) mod q 8: Y (t) := Y (t) − (y 0 + β)(t) mod q 9: Y (t) := Y (t) Γ(t)
Spectral Modular Multiplication
Convolution and the SMR algorithm can easily be combined to harvest a spectral modular multiplication algorithm in a finite ring spectrum. In order to have a clear presentation we divide our presentation into 3 sub-procedures as seen in Figure 1 . Note that the initial and final stages consist of some data arrangements where the Spectral Modular Product (SMP) procedure consists of the actual multiplication and reduction steps (i.e., convolution and spectral reduction). Later, while presenting the spectral exponentiation algorithm, SMP is going to be the basic building block again. SMP procedure and SMM are given as follows: 
Algorithm 2 Spectral Modular Product Suppose that there exist a d-point DFT map for some principal root of unity ω in Z q , and X(t), Y (t) and N (t) are transform pairs of x(t), y(t) and n(t) respectively. We assume that x(b) = x, y(b) = y and n(b)
is a multiple of modulus n with n 0 = 1 for some integers x, y < n and b > 0.
Input: X(t), Y (t) and N (t); spectral polynomials
Z(t) := Z(t) Γ(t) mod q 10: end for
11: Z(t) := Z(t) + A(t) 12: return Z(t)
Algorithm 3 (Spectral Modular Multiplication) Suppose that there exist a d-point DFT map and n(b) is a multiple of modulus n where n
Input: A modulus n > 0 and x, y < n Output: z ≡ xy mod n.
Since we operate in a finite ring spectrum, one has to deal with overflows that might occur during the computations. In fact, Algorithm 3 gives a correct result if the intermediate values stay bounded. The following theorem states the condition when overflows do not occur. The reader is referred to [8] for a proof.
Theorem 1 Algorithm 3 computes z ≡ xy mod n, if the parameters b, q and s satisfies 2sb
2 < q.
Spectral Modular Exponentiation
In general, a single classical modular multiplication is faster than a single SMM; however, spectral methods are very effective when several modular multiplications with respect to the same modulus are needed. An example is the case when one needs to compute a modular exponentiation, i.e., the computation of m e mod n, where m, e and n are positive integers. Such a setup needs a consecutive use of SMM; also means a consecutive use of DFT and IDFT operations (obviously redundant computations as seen Figure 2) . Therefore, if the data is kept in the Fourier domain at all times, the backward and forward transforms are bypassed. Consequently, this approach decreases the asymptotic crossovers of the spectral methods to cryptographic sizes while computing the modular exponentiation. There are many methods for carrying general exponentiation. Mostly efficiency comes from two resources; one is to decrease the time to multiply; the other is to reduce the number of multiplications. In practice one does both. Notice that, until now our objective was reducing the modular multiplication which is categorized in the first category. For the rest of this study we keep this goal and simply consider using the binary method (see [4] ) for the rest of our presentation.
The binary method scans the bits of the exponent either from left to right or from right to left. A squaring is performed at each step, and depending on the scanned bit value, a subsequent multiplication is performed. We describe the spectral modular exponentiation algorithm by using a left-to-right binary method below.
Algorithm 4 (Spectral Modular Exponentiation) Suppose that there exist a d-point DFT map and n(b) is a multiple of modulus n where n
0 = 1, deg(n(t)) = s − 1, s = d/2 , gcd(b, n) = 1 and b > 0. Input: A modulus n > 0 and m, e < n Output: c ≡ m e mod n 1: Compute λ(t) such that λ = b 2d mod n. 2: N (t) := DFT(n(t)) 3: Λ(t) := DFT(λ(t)) 4: M (t) := DFT(m(t)) 5: M (t) := SMP(M (t), Λ(t), N(t)) 6: C(t) := SMP(1(t), Λ(t), N(t)) 7: for i = j − 2 downto 0 8: C(t) := SMP(C(t), C(t), N(t)) 9: if e i = 1 then C(t) := SMP(C(t), M (t), N(t)) 10: C(t) := SMP(C(t), 1(t))
11: c(t) := IDFT(C(t)) 12: return c(b)
Once again, we need to guarantee that overflows do not occur, in other words the coefficients of intermediate or final results should not be winding over q. 
Theorem 2 Algorithm 4 computes an almost modular reduction, c ≡ m e mod n if the parameters b, q and s satisfies the following inequality
(b 2 + b) 2 B(s) + b 2 s < q(2)
Applications and Further Improvements
Modular exponentiation is one of the most important arithmetic operation in modern cryptography. For example, the RSA algorithm requires exponentiation in Z n for some positive integer n, whereas Diffie-Hellman key agreement and the ElGamal scheme use exponentiation in some large prime fields (see [3] ). In this section, we describe the methodologies of selecting the parameters for SME in order to use the method in public-key cryptography. In particular the Inequality (2) presents a solid basis for the relation between the parameters b, q and s. This bound can be improved in many ways which we also investigate throughout this section.
Parameter Selection for RSA
In this section we tabulate some example parameters for modular exponentiations using the SME method. Once the underlying ring, the DFT length and the principal root of unity are selected, the maximum modulus size used in the SME method is computed by finding the base b = 2 u . The relation between these parameters is computed after determining the maximum b satisfying the Inequality (2). Table 1 . SMP Parameter selection for SME.
In Table 1 , some sample rings with DFT parameters are given. We give an example to show how we get these figures. We first select a ring, for instance, lets take q = 2 79 −1. This comes with the principal root of unity ω = −2, the length d = 158 and s = d/2 = 79. Plugging these values into the Inequality (2) and then by inspection b = 2 15 ⇒ u = 15 is found. Therefore, we may perform an exponentiation of maximum operand size equals to k = s · u = 79 · 15 = 1185 using SME with the specified parameters.
Modified Spectral Modular Product
If the SMP (i.e., Algorithm 2) is considered, the boundary analysis depends heavily on β · N (t) multiplication of Step 7. It is possible to replace this multiplication by a multi-operand addition at a cost of some pre-computation and extra memory. This replacement gives a reasonable amount of radius shrinkage. Additionally, replacing the multiplication by an array adder improves the computational complexity.
Let b = 2 u and n i (t) be the polynomial representation of an integer multiple of n such that the zeroth coefficient of n i (t) satisfies (n i ) 0 = 2 i−1 for i = 1, 2, . . . , u (note that n(t) = n 1 (t)). We can now write β · N (t) as
where β i is a binary digit of β and
Plugging the Equation (3) into the Algorithm 2 gives us the modified Spectral modular product algorithm; N 2 (t) , . . . , N u (t)} be the set of special polynomials as described above;
Algorithm 5 Modified Spectral Modular Product (MSMP) Suppose that there exist a d-point DFT map for some principal root of unity ω in Z q , and X(t) and Y (t) are transform pairs of x(t) and y(t) respectively where x(b) = x and
y(b) = y. Let N = {N 1 (t),
Input: X(t), Y (t) and a basis set N Output: Z(t) = DF T (z(t)) where z ≡ xy2
−db mod n and z(b) = z,
1: Z(t) := X(t) Y (t)
2: α := 0 3:
Z(t) := Z(t) Γ(t) mod q 10: end for 11: Z(t) := Z(t) + A(t) 12: return Z(t)
Observe that the basis set N needs to be pre-computed and stored. This requirement can be seen as a handicap for certain platforms; however, in general the MSMP delivers smaller realizations. While inserting MSMP into either SME or SMM the pre-computation is done at the beginning of Algorithm 3 or 4. After computing n 1 (t) the rest of the basis set is computed by multiplying n 1 by powers of 2. We then apply the DFT function to corresponding polynomials in order to get N i (t) = DFT[n i (t)] for i = 1, 2, . . . , u.
With the adjustment (3), the time coefficients of u i=1 β i · N i (t) become less than b log(b) which gives us a better inequality
This new bound is a good improvement for a reasonable space cost and gives us the improved parameters which are tabulated in Table 2 Table 2 . MSMP parameter selection for SME.
Architectures and Performance Analysis
In this section, we give architectures and performance analysis of spectral modular algorithms. Spectral methods brings parallelism in computations by dividing the larger problems into smaller pieces. As a result, it is possible to employ some low level parallelism on the resulting partitions. In general, large size problems suffer from this kind of parallelism. For instance, a parallel realization of a multiplier has quadratic area complexity and infeasible to realize for large input sizes.
Spectral modular algorithms are dominated by multioperand additions and multiplications over some special rings. We briefly review the architectures described in [11] and [12] , in particular multiplication and multi-operand addition for Fermat and Mersenne rings. In Section 4.2, beside a generic analysis, we append a combined analysis of high and low level parallelism.
There are other efficient ways of carrying low level operations, for instance, check [10] and [1] for table-lookup methods. Further, in [1] , discussions on the use of Booth's recoding can be found.
Fermat/Mersenne Ring Arithmetic
An integer ring having q = 2 v ± 1 elements is the most suitable one for the SME computation since the modular arithmetic operations for such q are simplified. Moreover, if the principal root of unity is chosen as a power of 2, spectral coefficients are computed only using additions and circular shifts. The rings with q = 2 v − 1 are called the Mersenne rings, while the rings having q = 2 v + 1 are called the Fermat rings.
Observe that carrying arithmetic in Mersenne rings is equivalent to doing one's complement operations. Although the arithmetic in Fermat rings is more complicated than one's complement arithmetic, with certain encoding techniques this difference can be minimized [8] .
Obviously addition and multiplication are the basic operations of our interest. Recall that Partial Product Generator (PPG), partial product accumulator (PPA), and carry propa-gate adder (CPA) are the main stages of a multiplier. For our discussions PPA corresponds to a Wallace tree network having a height function, θ(x) for operands number between 14 and 95 as follows; In Table 3 , we tabulate the delay and area functions of basic operations for Fermat and Mersenne rings. Remark that we tend to keep the data in a carry-save form to avoid the horizontal carry propagation delay. If a normal form is needed the carry-save form is converted to a regular representation by using a CPA.
Software and Hardware Architectures for Spectral Modular Arithmetic
In this section we give a unit-gate analysis of our algorithms with the specified low level arithmetic architectures. In addition to this analysis we also give a generic evaluation in order to have a platform independent treatment. Specifically, we describe architectures for SME which could immediately be reduced to outline designs for the SMP or MSMP core since initialization and finalization stages are cost negligible. Recall that SMP and MSMP differ only in calculation of β · N (t). The first one performs a direct multiplication where the second one achieve this by a successive addition of some pre-computed values. At first glance the second approach seem as an acceleration attempt of modular multiplication by using look-up tables. In reality the second approach is more about increasing the size of b which immediately increases the bit size of the realizable modular arithmetic for a fixed transform size d. For simplicity, a complete analysis for MSMP is given in this text. A similar analysis for SMP can be found in [8] .
In Figure 3 we exhibit a higher level architecture for MSMP. Before zooming into the boxes and discussing the possible design practices, we like to say a few words about the data flow. In this architecture, the outputs of the convolution (i.e., Step 1) feed the Z MUXes. For the initial case, each Z MUX chooses the input from Step 1, and then the reduction loop starts. The loop runs d times: at every run, the outputs of the processing units are passed to the interpolation and also fed back to the unit itself. Notice that all processing units work in parallel. The processing engine waits until the parameters z 0 + β and β are generated from the parameter generation logic. After d runs of the loop, the processing units from 0 to d − 1 returns the coefficient of the resultant spectral polynomial Z(t).
Going back to MSMP algorithm; in Step 1, the convolution property is employed. Given the spectral polynomials X(t) and Y (t), we compute Z(t) such that
Note that these multiplications can be realized by employing v × v modulo multipliers. In case of a Mersenne arithmetic, these multipliers without a CPA has 4θ(v) + 1 delay with a 8v 2 − 14v area need. When same calculation is considered for Fermat rings we need some additional hardware and time for some corrections which is given by 9v 2 + 3v area and 4θ(v + 1) + 4 delay. Now it is time to consider the reduction steps; for simplicity we divide the loop with i into two parts;
• Parameter Generation: This corresponds to Steps 4, 5 and 6. Here, we compute the parameters z 0 , α, and β, and feed them to the main processing units.
• Processing Engine: This corresponds to Steps 7, 8, and 9 , in which we add a multiple of the modulus to the partial sum and then divide it by the base.
In Parameter Generation,
Step 4 corresponds to a partial interpolation in which a d-input multi-operand addition followed by a multiplication by d −1 . This computes the zeroth coefficient of the time polynomial.
Observe that d −1 is a constant v-bit number so we need another multi-operand addition. With some recoding techniques, it is possible to deliver this multiplication by a v/2 operand addition. Therefore, Step 2 can be realized by a d+v/2-operand addition in total having the complexity tabulated in Table 4 .
In a special Fermat ring with some other desirable conditions, it is possible to replace d −1 multiplication by a circular shift. Once d −1 multiplication changed by a circular shift, the complexity shrinks for both time and area. In Table 4 , We summarize what have been said about Step 4. Table 4 . The cost of Step 4.
Steps 5 and 6 are called the Parameter Generation Logic (PGL) in which we compute the parameters z 0 + β and β (see Figure 4 .a). Note that, the adders in Figure 4 .a are the usual v-bit adders and they do not need modular reductions since α,
If the Steps 5 through 8 of SMP are examined carefully, the second adder can be discarded by making the following modifications: Figure 4 . Parameter Generation Logic.
Therefore, pre-computing and storing N (t) − 1(t) =: DF T (n(t) − 1) instead of N (t) eliminate the second adder as seen in Figure 4 .b. A similar modification is possible for MSMP by pre-computing and storing N 2 (t) , . . . , N u (t)}. With these adjustments the cost of PGL becomes equivalent to the delay of the first adder which is a single two operand addition.
For a specific analysis; first of all the output of the partial interpolation is in carry save form. With the first adder the carry (initially 0) is added to the output of the partial interpolation. Since β is the multiplicand for the multiplication, in Step 7, we need β in a normal form. Therefore, here we have to engage a u-bit CPA on the critical path. On the contrary, the addition of most significant v − u bits is the carry to the next run which is not in the critical path.
Once the second adder has been discarded the delay of PGL becomes equivalent to a u-bit CPA delay which is 2 log u + 5. This is same in both Fermat and Mersenne arithmetic since u is small enough that addition does not produce any round around carry, hence this adder is realized as a regular integer adder having 2 log u + 5 delay for
The Processing Engine is the most resource-consuming stage and it corresponds to Steps 7, 8, and 9. In Figure  5 the architecture of a single processing unit for SMP is seen if multiplier is changed with a multi-operand addition, a schematic for MSMP can be achieved. The processing engine consists of d such units.
Note that both adders in Figure 5 are modulo q adders. The shift operation at the bottom of the figure corresponds to Step 9 of the SMP core. As we pick ω as a power of 2, the Notice that, Step 8 can be buried into the Wallace tree of the multiplication of Step 7. Therefore, we simply need a u + 1 operand Wallace tree network which needs 7vu area for 4θ(u + 1) + 4 delay in Mersenne rings and 9uv + 20v area for 4θ(u + 2) + 7 delay in Fermat rings.
After adding everything up described in previous paragraphs, in Table 5 we give the entire complexity of MSMP. A similar analysis for SMP can be found in [8] . u log u + 7u +θ(u + 2)) Table 5 . MSMP performance analysis.
Conclusions
We proposed new techniques for performing modular multiplication and exponentiation. Our initial motivation was to obtain modular arithmetic algorithms working completely in the spectrum in order to fully utilize the convolution property. For carrying out modular arithmetic operations, one has to deal with the computations of modular reduction. After defining spectral reduction and related concepts, we introduced a spectral reduction algorithm using the linearity and shifting property of DFT, and therefore, spectral modular multiplication and spectral modular exponentiation algorithms are obtained quite naturally. To avoid the round-off errors of the complex transforms, we employ finite ring spectrum in our method.
We carefully analyzed and stated the conditions for which our algorithms work without overflows. Working in the spectrum, we can exploit a vast amount of parallelism in our computations. Therefore, our algorithms yield highly parallel hardware architectures which we described.
We are currently working on creating VHDL implementations of the spectral modular exponentiation; such implementation will yield a more detailed analysis of our method and its advantages and disadvantages over algorithms and architectures which do not employ spectral techniques.
