Pipeline Architectures for Radix-2 New Mersenne Number Transform by Nibouche O et al.
1668 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 8, AUGUST 2009
Pipeline Architectures for Radix-2
New Mersenne Number Transform
Omar Nibouche, Member, IEEE, Said Boussakta, Senior Member, IEEE, and Michael Darnell, Senior Member, IEEE
Abstract—Number theoretic transforms which operate in rings
or fields of integers and use modular arithmetic operations can
perform operations of convolution and correlation very efficiently
and without round-off errors; thus, they are well matched to the
implementation of digital filters. One such transform is the new
Mersenne number transform, which relaxes the rigid relation-
ship between the length of the transform and the wordlength in
Fermat and Mersenne number transforms where the kernel is
usually equal to a power of two. In this paper, three novel pipeline
architectures that implement this transform are presented. The
proposed architectures are scalable, parameterized, and can be
easily pipelined; they are thus ideally suited to very high speed
integrated circuit hardware-description-language (VHDL) de-
scriptions. These architectures process data sequentially and have
either one or two inputs and two or four outputs. The different
input and output formats have resulted in the proposed archi-
tectures having different performances in terms of processing
time and area requirements. Furthermore, they give the designer
more choices in meeting the requirements of the application
being implemented. A field-programmable gate array (FPGA)
implementation of the proposed architectures has demonstrated
that a throughput rate of up to 6.09 Gbit/s can be achieved for a
1024-sample transform, with samples coded to 31 bits.
Index Terms—New Mersenne number transform (NMNT),
number theoretic transform (NTT), pipeline architectures.
I. INTRODUCTION
THE USE OF number theoretic transforms (NTTs) for the fast calculation of convolutions and correlations has
gained a large acceptance within the signal processing commu-
nity since the early work by Rader et al. was published [1]–[3].
The main advantage of these transforms is that they produce
exact results. They are defined over a field or ring of integers
with the arithmetic operations being carried out modulo a prime
or a composite number, respectively. Only a residue reduction
operation is required if an intermediate result exceeds the
modulus [1]–[4]. Moreover, the range of applications of NTTs
has been extended beyond digital convolutions and correlations,
as shown in [1]–[7], to include solving Toeplitz systems
of equations [8], [9], matrix and large integer multiplications
[10]–[12], coding and decoding [7], motion estimation, filtering
and compression [13]–[16], speech processing [17]–[19], and
the design of secure hashing functions [20].
Manuscript received November 30, 2006; revised April 01, 2008. First published
October 31, 2008; current version published August 14, 2009. This work
was supported by the Engineering and Physical Sciences Research Council
under Grant GR/598160/01. This paper was recommended by Associate Editor
M. Stan.
O. Nibouche was with the School of Electrical, Electronic and Computer Engineering,
Newcastle University, Newcastle upon Tyne NE1 7RU, U.K. He is
now with The Institute of Electronics, Communications and Information Technology,
Queen’s University of Belfast, BT3 9DT Belfast, U.K. (e-mail: o.nibouche@qub.ac.uk).
S. Boussakta is with the School of Electrical, Electronic and Computer Engineering,
Newcastle University, Newcastle upon Tyne NE1 7RU, U.K. (e-mail:
s.boussakta@ncl.ac.uk).
M. Darnell is with the School of Electrical, Electronic and Computer Engineering,
Newcastle University, Newcastle upon Tyne NE1 7RU, U.K. (e-mail:
mike.darnell@ncl.ac.uk).
Digital Object Identifier 10.1109/TCSI.2008.2008266
Among all NTTs, two transforms have received particular
attention: namely, the Fermat number transform (FNT) and the
Mersenne number transform (MNT) [1]–[5], [21]–[23]. Both
Fermat numbers (in the form $2^{2^t}+1$, where $t$ is an integer) and
Mersenne numbers (in the form $2^p-1$, where $p$ is a prime)
differ from a power of two by exactly one, therefore
offering an advantage in terms of speed for the calculation of
modular arithmetic operations. Furthermore, they circumvent
the rounding-off noise issue associated with transforms such as
the fast Fourier transform (FFT) or the fast Hartley transform
[1]–[5]. Both FNT and MNT kernels can be chosen to be a
power of two; thus, multiplications by the kernel are equivalent
to shift operations. However, such a choice of kernels has led to
these two transforms having an intrinsically rigid relationship
between the transform length (i.e., the number of samples) and
the wordlength (i.e., the number of bits in the modulus), which
limits their application [21]–[26]. An approach to solving
the sequence length constraint for the MNT was proposed in
[24]–[26]. At the cost of extra integer multiplications, the new
transform, termed the new MNT (NMNT), has a sequence
length equal to a power of two. It can yield a fast algorithm
with a convenient wordlength and arithmetic [24]–[26]. Fur-
thermore, it possesses the cyclic convolution property (CCP),
and the forward and inverse transforms are similar [24]–[26].
Efficient implementations of NTTs have to take account of
the fact that these transforms are highly parallel. Thus, a move
from a sequential implementation, such as that found in micro-
processors, toward a hardware implementation that involves
multiple processing units can achieve a major gain in terms of
processing time. In addition, in the context of the hardware im-
plementation of transforms, the class of pipeline architectures
presents a tradeoff between the low-throughput-rate memory
architectures and the area-consuming parallel architectures
[27]–[35]. Pipeline architectures use a 1-D array of processing
units whose length depends on the size of the transform. Each
processing unit is built using arithmetic components, registers,
and switches. They are connected to each other, and data flow
continuously from each unit to its adjacent cell. In this way, a
pipeline architecture normally uses less hardware area than its
parallel counterpart and is faster than a memory architecture
[27]–[29], [31], [33].
In this paper, three novel and efficient pipeline architectures
that implement the NMNT are presented. The proposed ar-
chitectures are parameterizable and can be easily pipelined to
increase their working frequency. Thus, they can be described
using a Hardware Description Language (HDL) such as very
high speed integrated circuit hardware description language
(VHDL).
The remainder of this paper is organized as follows. Section II
revisits the NMNT and presents its butterfly structure and flow
graph. Section III briefly examines issues and aspects of various
DSP architectures and introduces the basic bidirectional shift
registers (BSRs) used to tackle retrograde indexing which char-
acterizes the NMNT flow graph. It also introduces three new
basic butterfly structures; these butterflies are used to build the
three new transform architectures, referred to as Architecture-I,
Architecture-II, and Architecture-III. In addition to describing
the internal structure of the proposed architectures, Section III
also provides an insight into the control signals used to manage
the flow of data in the architectures, which is also illustrated by
examples. The proposed architectures have different formats of
inputs and outputs and have achieved different performances in
terms of speed and area, thus leaving it to the designer to select
what best suits the design goals in terms of speed and area usage.
A comparison between the proposed architectures is presented
in Section IV. Conclusions are then drawn in Section V.
II. NMNT: A BRIEF DEFINITION
As shown in [25], the NMNT for an integer sequence $x(n)$
of length $N$ is defined as
(1)
and the inverse NMNT is defined by
(2)
where
(3)
(4)
(5)
Here, $M_p$ is a Mersenne prime equal to $2^p-1$, $p$ is a prime,
and $\langle\cdot\rangle_{M_p}$ denotes modulo $M_p$. $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ stand for the
real and imaginary parts of the enclosed term, respectively. The
maximum transform size is . The values of and
in the aforementioned equations are given by
(6)
(7)
where .
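As an illustration of the definition, the following Python sketch evaluates the transform and its inverse directly as $O(N^2)$ sums. The kernel construction is an assumption taken from the usual presentation in the NMNT literature rather than from the text above: the Gaussian integer $2^q + j(-3)^q$ with $q = 2^{p-2}$ is assumed to have order $2^{p+1}$, it is squared until its order equals the transform length, and the kernel value is the sum of the real and imaginary parts of its powers. The function names (`cmul`, `nmnt_kernel`, `nmnt`, `inmnt`) are hypothetical, and the sketch restricts the length to at most $2^p$ so that the same kernel serves the forward and inverse transforms.

```python
def cmul(a, b, m):
    """Multiply Gaussian integers a = (re, im) and b = (re, im) modulo m."""
    return ((a[0] * b[0] - a[1] * b[1]) % m, (a[0] * b[1] + a[1] * b[0]) % m)


def nmnt_kernel(N, p):
    """Return [beta(0), ..., beta(N-1)] modulo Mp = 2**p - 1."""
    if N < 2 or N & (N - 1) or N > (1 << p):
        raise ValueError("N must be a power of two with 2 <= N <= 2**p")
    Mp = (1 << p) - 1
    q = 1 << (p - 2)
    alpha = (pow(2, q, Mp), pow(-3, q, Mp))  # assumed element of order 2**(p+1)
    order = 1 << (p + 1)
    while order > N:                          # each squaring halves the order
        alpha = cmul(alpha, alpha, Mp)
        order //= 2
    beta, power = [], (1, 0)
    for _ in range(N):
        beta.append((power[0] + power[1]) % Mp)  # beta = real part + imaginary part
        power = cmul(power, alpha, Mp)
    return beta


def nmnt(x, p):
    """Forward transform, evaluated directly from the defining sum."""
    N, Mp = len(x), (1 << p) - 1
    beta = nmnt_kernel(N, p)
    return [sum(x[n] * beta[n * k % N] for n in range(N)) % Mp for k in range(N)]


def inmnt(X, p):
    """Inverse transform: the same kernel, scaled by the inverse of N modulo Mp."""
    N, Mp = len(X), (1 << p) - 1
    n_inv = pow(N, Mp - 2, Mp)  # Mp is prime, so Fermat's little theorem applies
    return [n_inv * v % Mp for v in nmnt(X, p)]


x = [1, 2, 3, 4, 5, 6, 7, 8]
assert inmnt(nmnt(x, 7), 7) == x  # round trip modulo 2**7 - 1
```

Because the inverse only differs from the forward transform by a scale factor, a single datapath can in principle serve both directions, which is the similarity property mentioned in Section I.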
A. CCP
The NMNT has a CCP as shown in
(8)
where is point-by-point multiplication and and
stand for the even and odd parts of , respectively,
and are given by
(9)
The factor is due to the fact that . If
necessary, the operands and may be scaled or a larger
Mersenne prime may be selected, so that the convolution
result does not exceed . One suggested upper bound
is given in [2] and [3] as
(10)
It is worth mentioning that (10) has no direct dependence on
the transform length $N$; thus, the signal-to-noise ratio (SNR) of
an NMNT filter does not depend on the transform length. This is
in contrast to the FFT, for which scaling and round-off error
accumulation mean that doubling the transform length reduces
the SNR by about 3 dB [36].
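The convolution property can be checked numerically with a short Python sketch. It assumes a Hartley-like form of the property, in which the transform of the cyclic convolution is $X(k)H^{ev}(k) + X(N-k)H^{od}(k)$ with $H^{ev}$ and $H^{od}$ the even and odd parts of $H(k)$, and it implements the division by two as a multiplication by $2^{p-1}$. The kernel construction, the exact form of the property, and all function names are assumptions made for illustration; the paper's own expressions may be written differently.

```python
def nmnt(x, p):
    """Direct O(N^2) transform modulo Mp = 2**p - 1 (assumed kernel construction)."""
    N, Mp = len(x), (1 << p) - 1
    q = 1 << (p - 2)
    re, im = pow(2, q, Mp), pow(-3, q, Mp)  # assumed element of order 2**(p+1)
    order = 1 << (p + 1)
    while order > N:                         # square until the order equals N
        re, im = (re * re - im * im) % Mp, (2 * re * im) % Mp
        order //= 2
    beta, c, s = [], 1, 0
    for _ in range(N):
        beta.append((c + s) % Mp)
        c, s = (c * re - s * im) % Mp, (c * im + s * re) % Mp
    return [sum(x[n] * beta[n * k % N] for n in range(N)) % Mp for k in range(N)]


def cyclic_convolution_nmnt(x, h, p):
    """Cyclic convolution computed in the transform domain."""
    N, Mp = len(x), (1 << p) - 1
    half = 1 << (p - 1)  # 2 * 2**(p-1) = 2**p, which is 1 modulo Mp
    X, H = nmnt(x, p), nmnt(h, p)
    Z = []
    for k in range(N):
        h_even = half * (H[k] + H[-k % N]) % Mp
        h_odd = half * (H[k] - H[-k % N]) % Mp
        Z.append((X[k] * h_even + X[-k % N] * h_odd) % Mp)
    n_inv = pow(N, Mp - 2, Mp)
    return [n_inv * z % Mp for z in nmnt(Z, p)]


def cyclic_convolution_direct(x, h, m):
    """Reference cyclic convolution evaluated from its definition."""
    N = len(x)
    return [sum(x[n] * h[(k - n) % N] for n in range(N)) % m for k in range(N)]


x = [1, 2, 3, 4, 0, 0, 0, 0]
h = [1, 1, 1, 0, 0, 0, 0, 0]
assert cyclic_convolution_nmnt(x, h, 7) == cyclic_convolution_direct(x, h, 127)
```

The result is exact as long as every true convolution value stays below the modulus, which is the reason for the scaling remark above.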
B. Fast Calculation of the 1-D NMNT
Fast and efficient algorithms may be developed for this trans-
form [24]–[26], [37]. They include radix-2, radix-4, and split-
radix algorithms, where the decimation can be carried out either
in time or in frequency. For radix-2 decimation in frequency
(DIF), (1) can be rewritten as
(11)
By considering even- and odd-indexed samples of sep-
arately, one may write
(12)
(13)
The aforementioned two equations can be recognized as two
separate NMNTs, each of length $N/2$ samples. Each of these
can be further decomposed into another two $N/4$-sample transforms
which, in turn, can be decomposed further until two-point
NMNTs are obtained. For an in-place computation, four elements
are included in each radix-2 butterfly, as shown in Fig. 1.
Each butterfly requires four Mersenne modular multiplication
operations and six Mersenne modular addition operations. The
whole transform needs $\log_2 N$ stages. An example of the calculation
of the NMNT for a sequence of 32 samples is shown in
Fig. 2.
Fig. 1. NMNT radix-2 DIF butterfly.
From Figs. 1 and 2, it is clear that the algorithm processes data
in groups of four points, namely, , , ,
and . The index is in the range , with
the first four points, obtained for , being , ,
, and . The transform butterfly includes retro-
grade indexing, i.e., when the samples and
have their indexes incremented, the indexes of and
are decremented [30]. Alternatively, the butterfly in
Fig. 1 can be divided into two different stages (or subbutterflies).
The first part deals with addition/subtraction operations and in-
volves two points and , where is in the range
. The second part carries out the multiplication
operations; it involves the pairs of data points and
, where the index is in the range .
The first pair, obtained for , is and , as
shown in Fig. 2. The impact of using these two approaches, i.e.,
treating the butterfly in Fig. 1 as a single entity or dividing it into
two subbutterflies, on the proposed architectures is discussed in
Section III. The third and fourth rounds (in the general case,
the last two rounds) of the transform do not involve multiplica-
tion operations by the factors, as shown in Fig. 2. These two
rounds can be fused, leading to a four-point butterfly, or they
can be separated, as in Fig. 2, thus using two-point butterflies.
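The decimation-in-frequency decomposition and its retrograde indexing can be sketched recursively in Python. This is a software illustration under stated assumptions, not the scheduling used by the hardware: the kernel construction is assumed as in the NMNT literature, the names `beta_parts`, `nmnt_direct`, and `nmnt_radix2_dif` are hypothetical, and the odd-indexed outputs are formed from the differences `d(n)` and `d(N/2 - n)`, which is where the decremented index appears.

```python
def beta_parts(N, p):
    """Return (beta1, beta2), the real and imaginary kernel parts for length N."""
    Mp = (1 << p) - 1
    q = 1 << (p - 2)
    re, im = pow(2, q, Mp), pow(-3, q, Mp)  # assumed element of order 2**(p+1)
    order = 1 << (p + 1)
    while order > N:                         # square until the order equals N
        re, im = (re * re - im * im) % Mp, (2 * re * im) % Mp
        order //= 2
    b1, b2, c, s = [], [], 1, 0
    for _ in range(N):
        b1.append(c)
        b2.append(s)
        c, s = (c * re - s * im) % Mp, (c * im + s * re) % Mp
    return b1, b2


def nmnt_direct(x, p):
    """Reference transform evaluated directly from the defining sum."""
    N, Mp = len(x), (1 << p) - 1
    b1, b2 = beta_parts(N, p)
    return [sum(x[n] * (b1[n * k % N] + b2[n * k % N]) for n in range(N)) % Mp
            for k in range(N)]


def nmnt_radix2_dif(x, p):
    """Radix-2 DIF transform: sums feed the even outputs, rotated differences the odd ones."""
    N, Mp = len(x), (1 << p) - 1
    if N == 1:
        return [x[0] % Mp]
    half = N // 2
    b1, b2 = beta_parts(N, p)
    sums = [(x[n] + x[n + half]) % Mp for n in range(half)]
    d = [(x[n] - x[n + half]) % Mp for n in range(half)]
    # Retrograde indexing: d[n] is paired with d[half - n], whose index decreases as n grows.
    rotated = [(d[n] * b1[n] + d[(half - n) % half] * b2[n]) % Mp for n in range(half)]
    X = [0] * N
    X[0::2] = nmnt_radix2_dif(sums, p)
    X[1::2] = nmnt_radix2_dif(rotated, p)
    return X


x = list(range(1, 33))
assert nmnt_radix2_dif(x, 7) == nmnt_direct(x, 7)  # 32-point check modulo 2**7 - 1
```

For each pair of indices `n` and `N/2 - n`, the sketch uses two additions, two subtractions, four multiplications, and two further additions or subtractions, which is consistent with the operation count quoted for the butterfly of Fig. 1.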
III. ARCHITECTURES
There are several issues which have to be considered when
designing pipeline architectures for the NMNT. The proposed
architectures have to be parameterizable and scalable. Scala-
bility implies that the architecture uses a certain number of basic
building blocks. The internal structure of such basic components
should remain intact when the length of the transform varies;
here, only the length of the registers will be affected. “Parameterizable”
means that the circuits can be fully described in
terms of parameters such as the transform length $N$ and the chosen
Mersenne number $M_p$.
Another important feature to be sought in the proposed
architectures is the “pipelinability,” by which is meant the
ability to insert registers into the proposed architectures in
order to achieve high working frequencies, without altering
their behavior; in particular, feedback loops have to be avoided
in the design. In addition, the efficiency of the architecture is
another important factor that should be taken into account for
the proposed implementations. By efficiency of the architecture
is meant the percentage of time for which the different ele-
ments, such as the registers, multipliers, and adders, are used.
For instance, pipeline 1-D radix-2 FNT and FFT architectures
are 50% efficient [27], [28], [34], thus remaining idle for half
of the time. This efficiency results from storing the first half of
data fed sequentially to a single serial input before starting to
compute the transforms using two-input butterfly structures and
one sample from each of the two halves of data. The efficiency
is lower in the case of radix-4 FFT and 2-D vector radix-2 FNT
architectures [27], [28], [35]. These architectures use four-input
butterfly structures and store three quarters of data fed to a serial
input before computing the transforms; thus, the efficiency is
now equal to 25%. A more efficient FFT architecture has been
proposed in [28]. It achieves 100% efficiency in terms of the
use of the registers. However, such an architecture cannot be
pipelined beyond the adder level due to the presence of feed-
back loops [28]. Nevertheless, the efficiency of these pipeline
architectures can be increased by interleaving more than one set
of data; in this case, the core of the architecture remains intact
while extra registers are added to store the different sets of data
as shown in [27]. In addition to the aforementioned issues, one
must also overcome the challenge of retrograde indexing in
the case of the NMNT. For this, two BSRs are introduced in
Section III-A.
A. BSRs and Arithmetic Components
BSRs permit a change in the shift direction. Thus, the output
from a BSR may be in a reversed order, specified by the shift
direction control, or in the original order as at the input of the
register. In this paper, two types of BSRs are used. For the
first type, referred to as BSR-type I, the reordered sequence
is “ ,” where the original sequence is “
” and is the length of the register. The
second BSR, termed BSR-type II, has a reordered output in the
format “ ” given that the shift di-
rection control changes its value after cycles. The internal ar-
chitectures of BSR-types I and II are shown in Fig. 3(a) and (b),
respectively. As shown in these figures, a BSR-type I of length
can be seen as a BSR-type II of length , followed by a
flip-flop (FF). Both registers are built using FFs, multiplexers,
and demultiplexers. The multiplexers allow data to move either
forward or backward from one FF to its two neighboring FFs. In
addition, to achieve the bidirectional shift behavior of the BSRs,
one must be able to broadcast data to both ends of the registers.
Thus, the demultiplexer in Fig. 3(b) (Fig. 3(a), respectively) is
necessary to feed the input either to the first FF in the BSR or to
the last FF (the one before the last, respectively). In the context
of this work, the direction control signal is stable for periods of
time that are multiples of the register length. In the remainder
of this paper, a shift register (a unidirectional shift register that
only uses FFs) will be referred to as SR.
Fig. 2. Thirty-two-point NMNT flow graph.
Fig. 3. Bidirectional shift register. (a) BSR-type I. (b) BSR-type II.
The arithmetic components include Mersenne modular
adders, subtracters, and multipliers. Addition modulo a
Mersenne number is equivalent to adding the carry-out bit
to the result of adding two words. The multiplication opera-
tion modulo a Mersenne number consists of adding the most
significant word to the least significant word of the result of
multiplying two words. Depending on whether the multipli-
cation and the addition/subtraction parts are separated or not,
these components are combined together in different ways to
implement the arithmetic operations of the transform butterfly
of Fig. 1. If the addition part is separated from the multipli-
cation part, then components BF Type I (where BF refers to a
circuit that implements a butterfly), for the addition/subtraction
part in Fig. 1, and BF type II, for the multiplication part in
Fig. 1, shown in Fig. 4, are used. Otherwise, the component BF
type III is used. In this case, when no multiplication operations
are involved, such as in the last two rounds of the transform
shown in Fig. 2, then BF type IV can be used (the notation BF
here should not be confused with the butterfly architecture).
In addition to the registers and arithmetic components, the
proposed architectures use switches which are either 1-to-2
switches (demultiplexers), 2-to-1 switches (multiplexers), or
2-to-2 switches.
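The Mersenne modular arithmetic described above can be modeled in a few lines of Python. This is a behavioral sketch, not the hardware description: the function names are hypothetical, the subtraction through a bitwise complement is a common technique assumed here rather than taken from the text, and mapping the all-ones word to zero is a normalization choice of this sketch.

```python
def mersenne_add(a, b, p):
    """Add two p-bit words modulo 2**p - 1 by adding the carry-out bit back in."""
    mask = (1 << p) - 1
    total = a + b                           # at most p + 1 bits
    total = (total & mask) + (total >> p)   # end-around carry
    return 0 if total == mask else total    # the all-ones word also represents zero


def mersenne_sub(a, b, p):
    """Subtract modulo 2**p - 1 by adding the bitwise complement of b (assumed technique)."""
    mask = (1 << p) - 1
    return mersenne_add(a, b ^ mask, p)


def mersenne_mul(a, b, p):
    """Multiply modulo 2**p - 1 by adding the most and least significant words of the product."""
    mask = (1 << p) - 1
    product = a * b                         # at most 2p bits
    return mersenne_add(product & mask, product >> p, p)


print(mersenne_add(100, 100, 7), mersenne_mul(100, 100, 7))
```

No division or trial subtraction is needed, because $2^p \equiv 1$ modulo $2^p-1$, so any bits above position $p$ can simply be folded back onto the low word.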
B. Architecture-I
In Architecture-I, the operations of addition/subtraction
are separated from the multiplication part. As a whole,
this architecture has a single input and two outputs. How-
ever, its generic butterfly has two inputs and two out-
puts, as shown in Fig. 5. To the inputs in0 and in1 are
fed the sequences “ ” and “
,” respectively.
The result of the subtraction operation of BF type I in the ar-
chitecture of Fig. 5 has to be reordered before the multiplication
operations can take place. This is carried out by the switch sw0
and the BSR-type I reg0. The control signal for these two com-
ponents is periodically low for cycles and then high for an-
other cycles. Hence, at the inputs of BF type II, one can ex-
pect the two sequences “ ”
and “ .”
Fig. 4. Arithmetic components.
Fig. 5. Butterfly structure of Architecture-I.
The two outputs of BF type II are converted into a single serial output using reg1 and
sw1, whose control signal is similar to that of sw0. It remains
to rearrange the data into a suitable format for the next stage of
the transform. For this, a switch, sw2, and a BSR-type I, reg2,
are used.
Fig. 6. Flow of data in Architecture-I for a 32-point transform.
By taking into account the fact that the upper output
of BF type I is ahead of the output of sw1 by cycles, the
control signal of sw2 is similar to that of sw0 and sw1. Only
the first half of the output of sw1 needs to be reversed using
reg2; thus, the control of reg2 is periodically low for cy-
cles and then high for another cycles. Therefore, the out-
puts of the butterfly architecture of Fig. 5, out0 and out1, are
the sequences “
” and “
,” respectively. In this
order, the operations of addition/subtraction for the next stage
can take place.
The whole transform requires butterfly structures.
This is shown in the example in Fig. 6 for a transform length of
32 samples. It is worth noting that there are no multiplication
operations involved in the last two stages of the transform.
Thus, the multiplication components BF type II have been
trimmed from the proposed architecture at points 11 and 13
in Fig. 6. The figure also shows the flow of data through the
different components of the architecture. To shuffle data into a
suitable format for the first butterfly stage, the latter butterfly
stage has to be preceded by an SR of length , where the
first half of data can be stored, and
a switch whose control signal is low for cycles and then
high for another cycles. In this way, two samples, one
from the first half and the other from the second half of data,
can be fed to the inputs of the first butterfly structure. Another
consequence of storing the first half of data before starting the
computation of the transform is that the efficiency of this archi-
Authorized licensed use limited to: Newcastle University. Downloaded on March 08,2010 at 08:40:33 EST from IEEE Xplore.  Restrictions apply. 
1674 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 8, AUGUST 2009
Fig. 7. Butterfly structure of Architecture-II.
tecture is 50%. This is similar to the efficiency attained by 1-D
pipeline FFT architectures built using two-input/two-output
butterfly structures [27], [28].
C. Architecture-II
Architecture-II is the second proposed architecture that im-
plements the NMNT. It uses a four-input/four-output butterfly
structure as a basic building block; this is shown in Fig. 7, while
Architecture-II, shown in Fig. 8, is a one-input/four-output
structure. Its butterfly structure can be considered as the most
straightforward of the three proposed butterflies since all the
arithmetic operations are combined together using a BF type
III component. In addition, this butterfly structure includes
two SRs (reg0 and reg1), two BSR-type I (reg2 and reg3),
all of length , and three switches (sw0, sw1, and sw2).
The four sequences of data at the inputs in0, in1, in2, and
in3 of the butterfly in Fig. 7 are ,
,
, and ,
respectively. Therefore, with four points being fed to the inputs
of the butterfly structure at every clock cycle, the arithmetic op-
erations in Fig. 1 can be carried out in a single cycle, including
the multiplication operations; this is in contrast to the butterfly
of Architecture-I, in which the addition/subtraction operations
are carried out first and separated from the multiplication part.
To rearrange data for the next stage, the control signal of sw0,
sw1, reg2, and reg3 has to be periodically low for cycles
and then high for cycles. The control of switch sw2 has a
period of clock cycles. This signal is low for a cycle and
then high for the remaining cycles. In the last stage
of the transform, no multiplication operations are involved;
thus, a BF type IV component is used. The whole transform
requires stages. An example of Architecture-II for
a transform length of 32 samples is given in Fig. 8, which also
shows the flow of the data through the different components of
the architecture. At its input, Architecture-II receives the data
sequence , which needs to be reshuf-
fled into a suitable format for the first stage of the transform.
Fig. 8. Flow of data in Architecture-II for a 32-point transform.
Therefore, samples have to be stored before starting to
compute the transform, as shown in Fig. 1, and a reordering
block, composed of registers with an overall length of
and switches, is required to temporarily reorder data before it is
fed to the first butterfly of Architecture-II. As shown in Fig. 8,
such a reordering block uses two SRs of length 8 ( in the
general case) and one BSR-type I of length 8 ( in the gen-
eral case), together with three switches. It should be noted here
that the efficiency of such architecture is 25%, thus carrying
out a useful task for a quarter of the processing time. This is a
consequence of storing 3/4 of the data before starting to com-
pute the transform. Such efficiency also characterizes radix-4
1-D FFT architectures which are built using butterfly structures
with the same number of inputs and outputs as the butterfly
structure shown in this section [27], [28]. In Section III-D, it is
shown that the efficiency can be increased to 50% while still
using the same arithmetic components as in Architecture-II.
However, this is not achieved through multiplexing different
sets of data to use the same architecture as in [24] but, rather,
by feeding two samples per clock cycle to the proposed Architecture-III.
Fig. 9. Butterfly structure of Architecture-III.
D. Architecture-III
Architecture-III is the third proposed architecture that imple-
ments the 1-D NMNT. This structure uses the same arithmetic
components as employed in Architecture-II. However, the
arithmetic operations are scheduled in a different way which
separates odd-indexed samples from even-indexed samples. As
shown in Fig. 2, samples with even indexes and samples with
odd indexes are not processed together until the last stage of
the transform. This fact is exploited to increase the efficiency of
the proposed architecture, without having to interleave multiple
sets of data. Architecture-III is a two-input (an even-indexed
sample input and an odd-indexed sample input) and four-output
structure. Its basic butterfly is a four-input/four-output struc-
ture, as shown in Fig. 9. At its inputs are the four sequences
“ ,”
“
,” “
,” and
“
.” In these four sequences, data with even
indexes are processed first, followed by data with odd indexes.
Once the arithmetic operations are carried out, data have to be
rearranged for the next stage of the transform. For this purpose,
five switches and six registers are used. Registers reg0 and reg1
in Fig. 9 are shift registers of length ; while reg2 and reg3
are BSR-type I registers of length . The registers reg4
and reg5 are BSR-type II of length . The control signal
of sw0, sw1, reg2, reg3, reg4, and reg5 is periodically low for
cycles and then high for another cycles. The control
signal of sw2 and sw3 is periodic, with a period of cycles
and a duty cycle of 50%. Finally, the control signal of sw4 is
low for one cycle and then high for cycles.
The transform architecture requires stages. As
with Architecture-II, no multiplication operations are required
in the last stage of the transform; thus, a BF type IV component
is used.
TABLE I
PERFORMANCE COMPARISON OF THE PROPOSED ARCHITECTURES FOR
An example of Architecture-III for a transform length
of 32 samples and the flow of data through the different components
of the architecture are shown in Fig. 10. In this figure,
a reordering block is used to arrange data in a suitable format
for the first stage of the transform. As shown in Fig. 11, the re-
ordering block comprises seven registers of length 4 ( in
the general case). It should be pointed out that the move from a
single-input structure, as in Architecture-I and Architecture-II,
to a two-input architecture, as in Architecture-III, has not been
achieved through the use of the unfolding technique [38]. Had
such a technique been used, one would have had to double the
number of processing elements, such as adders and multipliers,
while the total number of registers would remain the same. It is
clear that this is not the case since the same arithmetic compo-
nents are used. Rather, a different way of scheduling the NMNT
arithmetic operations is employed in order to double the effi-
ciency of Architecture-II, without doubling the number of pro-
cessing units. This leads to the design of Architecture-III which
is more efficient than a two-input unfolded version of Architec-
ture-I or Architecture-II.
IV. IMPLEMENTATION AND RESULTS
The proposed architectures are compared in terms of their
area usage, expressed by the number of components they use,
including adders, multipliers, registers, and multiplexers (used
in BSRs), and also in terms of their computation time, given by
the number of processing cycles required to compute the trans-
form. The performances of the proposed architectures are shown
in Table I where Architecture-II is used as a reference architec-
ture.
Since Architecture-I uses fewer adders than Architecture-II,
one may expect that it is more efficient in hardware terms than
the latter structure. However, the reduction in the number of
adders used by Architecture-I, which is only 66% of that of Ar-
chitecture-II, comes with an increase of 14% and 33% in reg-
isters and multiplexer requirements, respectively, when com-
pared with Architecture-II. Both architectures need cycles
to carry out the computation of the transform. A further signif-
icant point is that Architecture-III only requires an extra 7% in
terms of registers when compared with the area usage of Archi-
tecture-II. Both architectures have the same number of BSRs,
although used in two different ways. However, Architecture-III
is the most efficient of the three architectures since it needs only
cycles to compute the transform.
Furthermore, as shown in Table I, Architecture-III requires
less than half the number of registers of the early prototype in
[39]. The latter design relies on the use of ordering registers
which convert a pair of inputs of length given in the
order
to the sequence
.
Fig. 10. Flow of data in Architecture-III for a 32-point transform.
The results in Table I clearly indicate that the use of the BSRs
(both type I and type II) in Architecture-III allows for a better
arithmetic operation schedule in the structure BFs. Thus, al-
though the resulting architecture shares similar requirements
in terms of adders and multipliers with the design in [39], it is
more efficient in terms of register usage.
The proposed architectures have been tested and implemented
on a field-programmable gate array (FPGA) from the
Virtex-II family of devices. The results of this implementation
are shown in Tables II–IV. To illustrate the fact that the
proposed architectures can be parameterized in order to meet
the requirements of the application being processed, the prime
moduli $2^7-1$, $2^{17}-1$, and $2^{31}-1$ have been selected, for
transform lengths of up to 1024 samples whenever possible (the
maximum possible transform length for modulus $2^7-1$ is
128 points).
The arithmetic elements in the three architectures have
been pipelined in accordance with the vertical cut-set lines
of Figs. 5, 7, and 9. The addition and subtraction operations
modulo a Mersenne number have been implemented using the
dedicated fast carry logic chains within the chip [40]. Two
chains are used for each modular addition or subtraction; the
carry-out bit and the result of the first chain are accumulated
using the second chain. The second chain does not generate a
carry-out bit, and only its result is taken into consideration. The
results from both chains are registered. Therefore, the level of
pipelining of a Mersenne modular adder/subtractor is equal to
two.
TABLE II
IMPLEMENTATION RESULTS FOR
Fig. 11. Reordering block for a 32-point transform.
The multiplication operation modulo a Mersenne prime
is implemented using block multipliers embedded in the chip
[40]. Each multiplier is capable of carrying out an 18 bit by 18
bit multiplication. Therefore, four multipliers are required for
a 31 bit by 31 bit modulo multiplication. A single multiplier is
required for the multiplication of 7 bit by 7 bit and 17 bit by
17 bit words. The most significant half and the least significant
half of the multiplication operation are accumulated using a
Mersenne modular adder. The level of pipelining of a 31 bit
by 31 bit Mersenne modular multiplier is 5; it is 3 for the case
of 7 bit by 7 bit and 17 bit by 17 bit multipliers. The switches
are not registered, and no registers have been inserted in the
architectures’ data paths. The two sets of “twiddle factors” are
generated in VHDL code and stored in two lookup-table-based
ROMs. In Tables II–IV, the comparison is made in terms of
the area usage given in FPGA slices, the frequency at which
the architectures can work, the throughput rate, and processing
time, with one table for each of the three moduli.
For each transform length in these tables, Architecture-III
is taken as the reference against which the three proposed
architectures are compared.
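The arithmetic described above can be sketched in software as follows. This is a minimal behavioural model, not the VHDL used in the implementation: the function names are hypothetical, the normalisation of the all-ones word to zero and the use of the bitwise complement for subtraction are assumptions, and the 16-bit operand split in the four-multiplier case is chosen only for illustration (the text states only that four 18 bit by 18 bit block multipliers are used).

```python
def mersenne_add(a: int, b: int, p: int) -> int:
    """Return (a + b) mod (2**p - 1) for operands in [0, 2**p - 1]."""
    mask = (1 << p) - 1
    first = a + b                            # first chain: p-bit result plus a carry-out bit
    second = (first & mask) + (first >> p)   # second chain: add the carry-out to the result
    # The second pass cannot overflow p bits, so no further carry is produced.
    # Assumption: the all-ones word (equal to the modulus) is mapped to zero.
    return 0 if second == mask else second


def mersenne_sub(a: int, b: int, p: int) -> int:
    """Return (a - b) mod (2**p - 1).

    Assumption: subtraction is done by adding the bitwise complement of b,
    which equals (2**p - 1) - b.
    """
    mask = (1 << p) - 1
    return mersenne_add(a, b ^ mask, p)


def product_from_four_partials(a: int, b: int, h: int = 16) -> int:
    """Build a full product from four narrow partial products.

    Each operand is split at bit h (an illustrative choice), so every
    partial product fits a small hardware multiplier.
    """
    low_mask = (1 << h) - 1
    a0, a1 = a & low_mask, a >> h
    b0, b1 = b & low_mask, b >> h
    return (a1 * b1 << (2 * h)) + ((a1 * b0 + a0 * b1) << h) + a0 * b0


def mersenne_mul(a: int, b: int, p: int) -> int:
    """Return (a * b) mod (2**p - 1) for operands in [0, 2**p - 1]."""
    mask = (1 << p) - 1
    product = a * b
    # Since 2**p is congruent to 1, the upper and lower halves of the
    # product are simply accumulated with a Mersenne modular adder.
    return mersenne_add(product & mask, product >> p, p)
```

For the 31-bit case, `mersenne_mul(a, b, 31)` agrees with `(a * b) % (2**31 - 1)` for all operands in range, and `product_from_four_partials(a, b)` reproduces `a * b` exactly.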
It is clear from these tables that the area usage increases with
the transform length and the moduli wordlength. On the other
hand, the frequency of the proposed architectures increases
when the transform length and the moduli wordlength decrease.
This can be explained by the fact that there is less routing in the
designs and that the propagation path through the arithmetic
elements is shorter. However, the decrease in area usage due to
choosing a shorter modulus wordlength is not proportional to
the gain in terms of processing time as shown by the throughput
rate of the different architectures. In fact, the highest throughput
rate values are obtained for the modulus $2^{31}-1$, although
at the cost of being the largest designs. When comparing the
proposed architectures, Architecture-III is the most efficient in
terms of speed and throughput rate, achieving up to 7.89 Gbit/s
for a short transform length of 64 samples and 6.09 Gbit/s for
a transform length of 1024 samples. This is due to the fact
that it processes two samples per clock cycle. The other two
architectures have comparable speed performances.
In terms of area usage, Architecture-I uses less area than
Architecture-II and Architecture-III for transform lengths shorter
than 1024 points. In particular, Architecture-I is clearly more
efficient in using the hardware than the other two architectures
for transform lengths shorter than 256 points. However,
Architecture-III uses less area than the other two structures when the
transform length reaches 1024 points, as shown in Tables II and
III. This is a remarkable result: for large transform lengths,
Architecture-III not only outperforms the other two architectures
in processing time by almost a factor of two, but also occupies
less area than either Architecture-I or Architecture-II.
TABLE III
IMPLEMENTATION RESULTS
TABLE IV
IMPLEMENTATION RESULTS
It should be noted here that this is the first time the NMNT
has been implemented in hardware. The authors found no similar
work to use for comparison purposes. With regard to NTT
implementation, the authors of [34] and [35] used an older
technology to implement the FNT, which possesses a different flow
graph from that of the NMNT.
The comparison with the FFT is quite difficult as the two
transforms belong to different families of transforms; they have
different structures and involve different arithmetic operations
and architectures. However, to provide useful information on
the suitability of using the NMNT to implement convolutions or
correlations, the proposed work is compared against pipelined
FFT architectures in the literature [28], [35], [41]–[46], as
shown in Tables V and VI. It should be noted here that, whereas
the partial results are rounded off, truncated, or scaled in an
FFT-based convolution [29], [36], the modulus in the NMNT
controls the dynamic range of the convolution. If, during the
intermediate stages of the computation of the NMNT or the
convolution operation, the partial results exceed the modulus,
then the modulus will be subtracted from the intermediate
results. As the modulus is a Mersenne number, the subtraction
operation becomes equivalent to adding the carry. Provided
that the condition in (10) is satisfied, the final result of the
convolution will be correct, and no rounding or truncation
errors are involved. Hence, an error-free calculation of convo-
lutions/correlations can be carried out using the NMNT.
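The error-free property can be illustrated without the transform itself. The sketch below is not the NMNT: it computes a cyclic convolution directly, reducing every partial result modulo a Mersenne prime by adding the carry back in, and compares it with the exact integer result. The function names are hypothetical, and the simple bound used here (all inputs non-negative and every true output below the modulus) is an assumption standing in for the condition in (10).

```python
def fold(x: int, p: int) -> int:
    """Reduce a non-negative integer modulo 2**p - 1 by adding carries back in."""
    mask = (1 << p) - 1
    while x > mask:
        x = (x & mask) + (x >> p)   # subtracting the modulus == adding the carry
    return 0 if x == mask else x


def cyclic_convolution_mod(x, h, p):
    """Cyclic convolution with every partial result reduced modulo 2**p - 1."""
    n = len(x)
    y = []
    for k in range(n):
        acc = 0
        for i in range(n):
            term = fold(x[i] * h[(k - i) % n], p)
            acc = fold(acc + term, p)
        y.append(acc)
    return y


def cyclic_convolution_exact(x, h):
    """Reference cyclic convolution over the integers (no reduction)."""
    n = len(x)
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]


# With modulus 2**7 - 1 = 127, every true output below is smaller than 127,
# so the modular result equals the exact one.
small = cyclic_convolution_mod([1, 2, 3, 4], [5, 6, 7, 8], 7)
exact = cyclic_convolution_exact([1, 2, 3, 4], [5, 6, 7, 8])
print(small == exact)
```

If the inputs are scaled so that the true outputs exceed the modulus, the modular result is still the exact result reduced modulo 127, but the original values can no longer be recovered from it, which is why the dynamic-range condition matters.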
TABLE V
COMPARISON WITH FFT ARCHITECTURES
TABLE VI
COMPARISON OF FPGA-BASED IMPLEMENTATIONS
Table V presents a generic comparison between our work
and FFT architectures in the literature [28], [41]–[44]. The
arithmetic component requirements in Table V are expressed
in terms of real/integer adders and multipliers. The register
requirements take into account the fact that a real sequence is
processed and expressed in terms of the real part (or imaginary
part) of the data. The length of the transform in Table V is
equal to a power of four. However, is equal to . All ar-
chitectures in Table V exhibit an efficiency of at least 50% [28],
[41]–[44], which places these architectures into the same class.
Radix-4 FFT architectures clearly utilize fewer multipliers
and registers than the proposed NMNT architectures since the
proposed NMNT uses radix-2. However, the main advantage
of the proposed architectures is that Architecture-III halves
the number of processing cycles. Architecture-III computes a
1024-point NMNT with a wordlength of 31 bits in
less than half the time required by [45] and [46] to compute a
1024-point FFT on data coded using 16 bits, as shown
in Table VI.
V. CONCLUSION
In this paper, three new pipeline architectures, namely, Archi-
tecture-I, Architecture-II, and Architecture-III, that implement
the NMNT have been presented. The NMNT circumvents the
drawback of the rigid relationship between the transform length
and the wordlength that characterizes Fermat and Mersenne
number transforms, thus making it more suitable for digital
signal processing applications. The suggested architectures are
scalable, parameterizable, and can easily be pipelined. They
have different formats of inputs/outputs leading to different
area usage and speed performances. While Architecture-I uses
fewer adders than Architecture-II, the latter architecture uses
fewer registers and multiplexers than Architecture-I. Both
architectures require the same number of processing cycles to
compute the transform. The third architecture, Architecture-III,
requires 7% more registers than Architecture-II to achieve
twice its throughput rate. An FPGA-based implementation has
shown that Architecture-I and Architecture-II have comparable
processing time performances while Architecture-III is, by a
large margin, the fastest of the three architectures, which is
the consequence of processing two samples per clock cycle.
Furthermore, in some cases, Architecture-III uses less
area than the other two structures. The implementation
results presented in this paper offer the designer a wide range
of choices in selecting an appropriate architecture to meet the
design requirements.
REFERENCES
[1] C. M. Rader, “Discrete convolutions via Mersenne transforms,” IEEE
Trans. Comput., vol. C-21, no. 12, pp. 1269–1273, Dec. 1972.
[2] R. C. Agarwal and C. S. Burrus, “Number theoretic transforms to
implement fast digital convolution,” Proc. IEEE, vol. 63, no. 4, pp.
550–560, Apr. 1975.
[3] R. C. Agarwal and C. S. Burrus, “Fast convolution using Fermat
number transform with application to digital filtering,” IEEE Trans.
Acoust., Speech, Signal Process., vol. ASSP-22, no. 2, pp. 87–97, Apr.
1974.
[4] H. J. Nussbaumer, “Relative evaluation of various number theoretic
transforms for digital filtering applications,” IEEE Trans. Acoust.,
Speech, Signal Process., vol. ASSP-26, no. 1, pp. 88–93, Feb.
1978.
[5] R. Conway, “Modified overlap technique using Fermat and Mersenne
transforms,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 8,
pp. 632–636, Aug. 2006.
[6] V. M. Chernov, “Fast algorithm for error-free convolution computation
using Mersenne–Lucas codes,” Chaos, Solitons Fractals, vol. 29, no.
2, pp. 372–380, Jul. 2006.
[7] S. Gudvangen, “Practical applications of number theoretic transforms,”
in Proc. NORSIG, Asker, Norway, Sep. 10–11, 1999, pp. 102–107.
[8] R. Kumar, “A fast algorithm for solving a Toeplitz system of equa-
tions,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-33,
no. 1, pp. 254–267, Feb. 1985.
[9] J. J. Hsue and A. E. Yagle, “Fast algorithms for solving Toeplitz sys-
tems of equations using number theoretic transforms,” IEEE Trans.
Signal Process., vol. 44, no. 1, pp. 89–101, Jun. 1995.
[10] T. Lundy and J. Van Buskirk, “A new matrix approach to real FFTs and
convolutions of length $2^k$,” Computing, vol. 80, no. 1, pp. 23–45, May
2007.
[11] A. E. Yagle, “Fast algorithms for matrix multiplication using pseudo-
number-theoretic transforms,” IEEE Trans. Signal Process., vol. 43,
no. 1, pp. 71–76, Jan. 1995.
[12] A. Schönhage and V. Strassen, “Schnelle Multiplikation großer Zahlen
(Fast multiplication of large numbers),” Computing, vol. 7, no. 3/4, pp.
281–292, 1971.
[13] T. Toivonen and J. Heikkilä, “Video filtering with Fermat number the-
oretic transforms using residue number system,” IEEE Trans. Circuits
Syst. Video Technol., vol. 16, no. 1, pp. 92–101, Jan. 2006.
[14] R. Conway, “Efficient residue arithmetic based parallel fixed co-
efficient FIR filters,” in Proc. IEEE ISCAS, May 18–21, 2008, pp.
1484–1487.
[15] T. Toivonen, J. Heikkilä, and O. Silvén, “A new algorithm for fast full
search block motion estimation based on number theoretic transforms,”
in Proc. 9th Int. Workshop Syst., Signals, Image Process., 2002, pp.
90–94.
[16] Z. Hong, C. Zhengxing, Y. Fan, and W. Wei, “Research on lossless
image compression algorithms using Fermat number transform,” in
Proc. 8th Int. Conf. SNPD, 2007, vol. 2, pp. 487–490.
[17] G. Madre, E. H. Baghious, S. Azou, and G. Burel, “Linear predictive
speech coding using Fermat number,” in Proc. 4th EURASIP Conf.,
Zagreb, Croatia, Jul. 2–5, 2003, pp. 607–612.
[18] S. Gudvangen, “Number theoretic transforms in audio processing,” in
Proc. 2nd COST G-6 DAFx, Trondheim, Norway, Dec. 9–11, 1999, pp.
29–32.
[19] S. Xu, L. Dai, and S. C. Lee, “Autocorrelation analysis of speech sig-
nals using Fermat number transform,” IEEE Trans. Signal Process.,
vol. 40, no. 8, pp. 1910–1914, Aug. 1992.
[20] C. P. Schnorr and S. Vaudenay, “Parallel FFT-hashing,” in Fast Soft-
ware Encryption. New York: Springer-Verlag, 1994, vol. 809, Lec-
ture Notes in Computer Science, pp. 149–156.
[21] R. C. Agarwal and C. S. Burrus, “Fast one-dimensional digital
convolution by multidimensional techniques,” IEEE Trans. Acoust., Speech,
Signal Process., vol. ASSP-22, no. 1, pp. 1–10, Feb. 1974.
[22] C. Rader, “On the application of the number theoretic methods of
high-speed convolution to two-dimensional filtering,” IEEE Trans. Circuits
Syst., vol. CAS-22, no. 6, p. 575, Jun. 1975.
[23] H. Lu and S. C. Lee, “A new approach to solve the sequence-length
constraint problem in circular convolution using number theoretic
transform,” IEEE Trans. Signal Process., vol. 39, no. 6, pp. 1314–1321,
Jun. 1991.
[24] S. Boussakta and A. G. J. Holt, “New two dimensional transform,”
Electron. Lett., vol. 29, no. 11, pp. 949–950, May 1993.
[25] S. Boussakta and A. G. J. Holt, “New transform using the Mersenne
numbers,” Proc. Inst. Elect. Eng.—Vis., Image Signal Process., vol.
142, no. 6, pp. 381–388, Dec. 1995.
[26] S. Boussakta and A. G. J. Holt, “New separable transform,” Proc. Inst.
Elect. Eng.—Vis., Image Signal Process., vol. 142, no. 1, pp. 27–30,
Feb. 1995.
[27] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal
Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[28] S. He and M. Torkelson, “A new approach to pipeline FFT processor,”
in Proc. 10th IPPS, 1996, pp. 766–770.
[29] T. Lenart and V. Owall, “A 2048 complex point FFT processor using
a novel data scaling approach,” in Proc. ISCAS, May 25–28, 2003, pp.
IV-45–IV-48.
[30] T. Bortfeld and W. Dinter, “Calculation of multidimensional Hartley
transforms using one-dimensional Fourier transforms,” IEEE Trans.
Signal Process., vol. 43, no. 5, pp. 1306–1310, May 1995.
[31] B. M. Baas, “A low-power, high-performance, 1024-point FFT pro-
cessor,” IEEE J. Solid-State Circuits, vol. 34, no. 3, pp. 380–387, Mar.
1999.
[32] K. J. Jones, “Design and parallel computation of regularised fast
Hartley transform,” Proc. Inst. Elect. Eng.—Vis., Image, Signal
Process., vol. 153, no. 1, pp. 70–78, Feb. 2006.
[33] W. Han, T. Arslan, A. T. Erdogan, and M. Hasan, “Low power commu-
tator for pipelined FFT processors,” in Proc. IEEE ISCAS, May 23–26,
2005, vol. 5, pp. 5274–5277.
[34] M. Benaissa, S. S. Dlay, and A. G. J. Holt, “CMOS VLSI design of a
high-speed Fermat number transform based convolver/correlator using
three-input adders,” Proc. Inst. Elect. Eng.—Circuits, Devices Syst.,
vol. 138, no. 2, pp. 182–190, Apr. 1991.
[35] M. Benaissa, S. Dlay, and A. G. J. Holt, “VLSI implementation issues
for the 2-D Fermat number transform,” Signal Process., vol. 23, no. 3,
pp. 257–272, Jun. 1991.
[36] A. V. Oppenheim and C. J. Weinstein, “Effects of finite register length
in digital filtering and fast Fourier transform,” Proc. IEEE, vol. 60, no.
8, pp. 957–976, Aug. 1972.
[37] O. Alshibami and S. Boussakta, “Decimation-in-frequency vector radix
algorithm for fast calculation of the 2-D NMNT,” in Proc. 9th Int. Conf.
Electron., Circuits Syst., Sep. 15–18, 2002, vol. 3, pp. 939–942.
[38] K. K. Parhi, “A systematic approach for design of digit-serial signal
processing architectures,” IEEE Trans. Circuits Syst., vol. 38, no. 4,
pp. 358–375, Apr. 1991.
[39] O. Nibouche, S. Boussakta, and M. Darnell, “A new architecture
for radix-2 new Mersenne number transform,” in Proc. IEEE ICC,
Istanbul, Turkey, 2006, pp. 3219–3222.
[40] Xilinx Data Sheets [Online]. Available: www.xilinx.com
[41] G. Bi and E. V. Jones, “A pipelined FFT processor for word-sequential
data,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 12,
pp. 1982–1985, Dec. 1989.
[42] A. M. Despain, “Fourier transform computers using CORDIC itera-
tions,” IEEE Trans. Comput., vol. C-23, no. 10, pp. 993–1001, Oct.
1974.
[43] C.-H. Chan, C.-L. Wang, and Y.-T. Chang, “Efficient VLSI architec-
tures for fast computation of the discrete Fourier transform and its in-
verse,” IEEE Trans. Signal Process., vol. 48, no. 11, pp. 3206–3216,
Nov. 2000.
[44] W. C. Yeh and C. W. Jen, “High-speed and low-power split-radix FFT,”
IEEE Trans. Signal Process., vol. 51, no. 3, pp. 864–874, Mar. 2003.
[45] C. Chao, Z. Qin, X. Yingke, and H. Chengde, “Design of a high per-
formance FFT processor based on FPGA,” in Proc. Conf. Asia South
Pacific Des. Autom., Shanghai, China, Jan. 18–21, 2005, pp. 920–923.
[46] S. Sukhsawas and K. Benkrid, “A high-level implementation of a high
performance pipeline FFT on Virtex-E FPGAs,” in Proc. IEEE ISVLSI,
pp. 229–232.
Omar Nibouche (M’04) received the B.Eng. degree
in electronics from the National Polytechnic Institute
of Algiers, Algiers, Algeria, in 1998 and the Ph.D.
degree in computer science from the Queen’s Uni-
versity of Belfast, Belfast, U.K., in 2001.
In 2001, he joined the University of Ulster as a
Lecturer. Later, in 2005, he joined the University of
Leeds as a Research Fellow, a position he held until
2006, when he moved to the University of Newcastle
upon Tyne, Tyne, U.K. Since 2007, he has been a Re-
search Fellow with The Institute of Electronics, Com-
munications and Information Technology, Queen’s University of Belfast.
Said Boussakta (SM’04) received the Ingenieur
d’Etat degree in electronic engineering from the
National Polytechnic Institute of Algiers (ENPA),
Algiers, Algeria, in 1985 and the Ph.D. degree in
electrical engineering (signal and image processing)
from the University of Newcastle upon Tyne, Tyne,
U.K., in 1990.
From 1990 to 1996, he was with the University
of Newcastle upon Tyne as a Senior Research As-
sociate in digital signal and image processing. From
1996 to 2000, he was with the University of Teesside,
Teesside, U.K., as a Senior Lecturer in communication engineering. From 2000
to 2006, he was with the University of Leeds as a Reader in digital communica-
tions and signal processing. He is currently a Professor of communications and
signal processing with the School of Electrical, Electronic and Computer Engi-
neering, University of Newcastle upon Tyne, where he lectures in communica-
tions and signal processing. His research interests are in the areas of fast DSP
algorithms, digital communications, cryptography, and digital signal/image pro-
cessing. He has authored and coauthored more than 150 publications.
Prof. Boussakta is a Fellow of the Institution of Engineering and Technology
(IET) and a Senior Member of the IEEE Communications and Signal Processing
Societies. He has served as Chair for signal processing for communications in
the IEEE International Conference on Communications, ICC 2006, ICC 2007,
and ICC 2008.
Michael Darnell (SM’96) received the B.Tech. de-
gree from Loughborough University, Leicestershire,
U.K., in 1963 and the Ph.D. degree from Cambridge
University, Cambridge, U.K., in 1968.
He spent 13 years working for the U.K. Royal
Naval Scientific Service and the North Atlantic
Treaty Organization Supreme Headquarters Allied
Powers Europe Technical Centre (Netherlands) in
the fields of long-range and naval communications.
From 1980 to 2001, he was a full-time Academic
at the University of York, York, U.K., University
of Hull, Hull, U.K., and University of Leeds, Leeds, U.K., with research
interests in communications systems, signal processing, and electromagnetic
compatibility. Currently, he is the Managing Director of HW Communications
Ltd., Lancaster, U.K., a Visiting Professor at the University of Lancaster,
Lancaster, U.K. and University of Newcastle upon Tyne, Tyne, U.K., and an
Advisory Professor with Southwest Jiaotong University, Chengdu, China.
Prof. Darnell is a Fellow of The Institution of Engineering and Technology
and a Fellow of the Institute of Mathematics and its Applications.
