



The Design of Low Complexity Low Power Pipelined Short Length 
Winograd Fourier Transforms
Coskun, A., Kale, I., Morling, R.C.S., Hughes, R., Brown, S. and 
Angeletti, P.
 
This is a copy of the author’s accepted version of a paper subsequently published in the 
proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 
Melbourne VIC, Australia, 1 to 5 June 2014.
It is available online at:
https://dx.doi.org/10.1109/ISCAS.2014.6865556
© 2014 IEEE . Personal use of this material is permitted. Permission from IEEE must be 
obtained for all other uses, in any current or future media, including 
reprinting/republishing this material for advertising or promotional purposes, creating 
new collective works, for resale or redistribution to servers or lists, or reuse of any 
copyrighted component of this work in other works.
The WestminsterResearch online digital archive at the University of Westminster aims to make the 
research output of the University available to a wider audience. Copyright and Moral Rights remain 
with the authors and/or copyright owners.
Whilst further distribution of specific materials from within this archive is forbidden, you may freely 
distribute the URL of WestminsterResearch: (http://westminsterresearch.wmin.ac.uk/).
In case of abuse or copyright appearing without permission, e-mail repository@westminster.ac.uk
The Design of Low Complexity Low Power
Pipelined Short Length Winograd Fourier
Transforms
Adem Coskun, Izzet Kale, Richard C. S. Morling
Applied DSP and VLSI Research Group
University of Westminster
London, W1W 6UW, United Kingdom
adem@alptron.com, kalei@westminster.ac.uk
Robert Hughes, Stephen Brown
EADS Astrium Ltd






Abstract—In this paper a novel pipelining approach applicable
to Winograd Fourier transforms is presented. The novel approach
makes use of reconfigurable multiplier blocks to implement the
real multipliers required for the transform as well as sharing the
hardware resources among additions. The additions are realized
using modified forms of butterfly circuits. The novel approach is
tested on a 5-point Winograd Fourier transform and the circuit
area and power dissipation of the design are estimated using an
in-house power estimation tool and compared to the state-of-the-
art approaches.
I. INTRODUCTION
The Winograd Fourier Transform (WFT) algorithm [1] is highly
preferable in designs that involve the Discrete Fourier Transform
(DFT). Twiddle factor multiplication is not required for the WFT,
which in turn reduces the number of real multipliers needed.
Because multipliers are more resource demanding than other
circuit components, the WFT is regarded as a power-efficient
transform that occupies a smaller circuit area. In particular, WFTs
with blocklengths of 2, 3, 5, 7, 8, 9 and 16 are widely used
in Digital Signal Processing (DSP) applications, the details of
which can be found in [6], [7].
It is possible to pipeline several short length WFTs to
generate transforms of larger sizes [2], [3]. However, realizing
a pipelined structure for short length WFTs is difficult due
to the irregularity in the flow of the signal within these
transforms. In the next section we will give more details on
WFT and present our novel approach on a 5-point WFT.
II. WINOGRAD FOURIER TRANSFORM ALGORITHM
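The WFT algorithm [1] allows the DFT matrix
\[
\mathbf{W} =
\begin{bmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^{2} & \omega^{3} & \cdots & \omega^{N-1} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \omega^{(N-1)} & \omega^{2(N-1)} & \omega^{3(N-1)} & \cdots & \omega^{(N-1)^{2}}
\end{bmatrix}
\tag{1}
\]
of size $N \times N$ to be decomposed into three matrices as follows
\[
\mathbf{W} = \mathbf{S}_1 \mathbf{M} \mathbf{S}_2, \tag{2}
\]
where $\omega = e^{-j2\pi/N}$, and $\mathbf{S}_1$ and $\mathbf{S}_2$ are of size $N \times M$ and $M \times N$
and composed of only $-1$, $0$ and $1$. $\mathbf{M}$ is an $M \times M$ diagonal
matrix composed of either purely real or purely imaginary
numbers. For $N = 5$, which is taken as the reference
design for this paper,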
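\[
\mathbf{S}_1 =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & -1 & -1 & 0 & -1 & 1 \\
1 & -1 & 1 & -1 & 1 & 0 \\
1 & -1 & 1 & 1 & -1 & 0 \\
1 & -1 & -1 & 0 & 1 & -1
\end{bmatrix},
\quad
\mathbf{S}_2 =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 \\
0 & 1 & -1 & -1 & 1 \\
0 & 1 & 0 & 0 & -1 \\
0 & 1 & 1 & -1 & -1 \\
\cdot & \cdot & \cdot & \cdot & \cdot
\end{bmatrix}
\tag{3}
\]
\[
\mathbf{M} = \mathrm{diag}[\, m_0,\ m_1,\ m_2,\ m_3,\ m_4,\ m_5 \,]
= \mathrm{diag}[\, \ldots,\ j(\sin(u) + \sin(2u)),\ j\sin(u),\ j(\sin(u) - \sin(2u)) \,],
\tag{4}
\]
[The sixth row of $\mathbf{S}_2$ and the values of $m_0$, $m_1$ and $m_2$ are not legible in this copy.]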
where $\mathrm{diag}[m_0, \ldots, m_{M-1}]$ represents a diagonal matrix with
diagonal components $m_i$, $i = 0, \ldots, M-1$, and $u = 2\pi/5$.
Matrix $\mathbf{W}$ is used to obtain $X = \mathbf{W}x$, where
$x = [x_0, x_1, x_2, x_4, x_3]^T$ holds the data samples. Note
that neither $x$ nor $X$ is in natural order. The natural order can
easily be restored by designing the address generators to pick
the right locations in the input and output buffers of the
5-point WFT.
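As a point of reference for the definitions above, the following minimal sketch builds the DFT matrix of (1) for N = 5 directly from ω = e^{-j2π/N} and evaluates X = Wx. It does not implement the factorization in (2); it only shows the transform that the WFT has to reproduce. The helper names (`dft_matrix`, `mat_vec`, `to_wft_order`) are illustrative and not taken from the paper, and the reordering helper covers only the input ordering stated in the text.

```python
import cmath

def dft_matrix(n):
    """Build the n x n DFT matrix of (1), with omega = exp(-j*2*pi/n)."""
    omega = cmath.exp(-2j * cmath.pi / n)
    return [[omega ** (row * col) for col in range(n)] for row in range(n)]

def mat_vec(matrix, vector):
    """Plain matrix-vector product, X = W x."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Input ordering stated in the text: x = [x0, x1, x2, x4, x3]^T.
WFT_INPUT_ORDER = (0, 1, 2, 4, 3)

def to_wft_order(samples):
    """Mimic an input address generator reading the buffer in WFT order."""
    return [samples[i] for i in WFT_INPUT_ORDER]

W5 = dft_matrix(5)
# A constant input concentrates all energy in bin 0.
X_const = mat_vec(W5, [1, 1, 1, 1, 1])
# A unit impulse at index 1 yields X[k] = omega**k.
X_impulse = mat_vec(W5, [0, 1, 0, 0, 0])
```

A direct product like this costs N^2 complex multiplications, which is the baseline that the decomposition in (2) is meant to avoid.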
The top plot in Fig. 1 shows the signal flow graph for the 5-
point WFT given in (3) and (4). Stages 1 to 3, Stage 4 and
Stages 5 to 7 implement the operations required by S2,
M and S1, respectively. In the graph the components of vector x are
applied to the system in parallel. However, this is only a
representation; the samples are in fact assumed to arrive
sequentially. The black triangles (▶) indicate multiplications
by constants and the black dots (•) are the additions. As can be
seen, the 5-point WFT requires 17 complex additions and 5 real-by-
complex multiplications when x is complex. Note the irregularity in this graph,
which prevents a pipelined design from being deployed straightaway.
In [5] the structure of the 5-point WFT has been modified in
order to make its signal flow graph symmetric. However,
this modification introduces the need for complex-by-complex
multiplications in the multiplier stage (i.e. Stage 4 in Fig. 1),
which increases the complexity of the structure.
In our design, given in the following section, pipelining
is introduced to the WFT without modifying its signal flow
graph.
Fig. 1. The plot on the top is the signal flow graph for 5-point WFT and the
plot on the bottom is the representation of our pipelined design with seven
stages (Stage 1 to Stage 7).
III. PIPELINED 5-POINT WFT
Our design is a 7-stage pipelined architecture
(B1, B2, ..., B7), which is depicted in the plot at the
bottom of Fig. 1. The vertical dashed lines in Fig. 1 show
which stage of the pipelined architecture corresponds to
which part of the 5-point WFT signal flow graph.
The components of x are fed sequentially
into the pipelined circuit in the same order in which x is formed.
The novel design comprises two types of components:
the Reconfigurable Multiplier Block
(ReMB) [8] and modified butterfly circuits.
Designing the ReMB: In Fig. 1, B4 is the stage where
the multiplications take place. It therefore corresponds to
the operations required by the diagonal matrix M. As the
data samples are fed into the pipelined structure sequentially, B4
should be capable of performing one distinct multiplication
at each time instance. This can be achieved using a ReMB [9],
[10]. Here we assume each data sample is quantized using
[...] Canonical Signed Digit (CSD) representations, each multiplication
coefficient, i.e. m1, ..., m5, can be implemented using
the ReMB given in Fig. 2(b). The ReMB in Fig. 2(b) is composed of 3 adders
and 3 multiplexers, each having 5 inputs. The order of the
inputs to each multiplexer is set in accordance with the order
of the multiplication coefficients used for the 5-point WFT.
For example, if m1 is to be implemented, the first input of
each of the three multiplexers should be activated. The carry-in inputs i1
and i2 are selected depending on whether the adder is to operate
as a subtractor. As can be seen, some inputs to the
multiplexers are inverted right before multiplexing. This also
serves the need for a subtractor in generating the
multiplications. The symbol >> represents a hard-wired shift
operation, with the value following the symbol showing how
many shifts are needed. Each shift operation in Fig. 2(b)
belongs to the branch directly below it.
Fig. 2. (a) The structure of the circuit that implements operations required by
M (b) The ReMB which implements the multiplication with the coefficients
m1, ..., m5.
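The shift-and-add principle behind a CSD-based constant multiplier can be illustrated with a short sketch. It is not the ReMB of Fig. 2(b): it handles one fixed coefficient at a time and does not model the multiplexers that let the adders be shared among m1, ..., m5. The word length `FRAC_BITS = 8` is an assumption made for illustration, and sin(u) is used as the example coefficient only because it appears as a magnitude in (4). The function names are hypothetical.

```python
import math

FRAC_BITS = 8  # assumed coefficient word length, not taken from the paper

def to_csd(value):
    """Return the CSD digits (-1, 0, +1) of an integer, least significant first.

    No two adjacent digits are nonzero, which minimizes the number of
    adders/subtractors in a shift-and-add multiplier.
    """
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits

def shift_add_multiply(sample, csd_digits):
    """Multiply an integer sample by a constant using only shifts, adds and subtracts."""
    acc = 0
    for shift, digit in enumerate(csd_digits):
        if digit:
            acc += digit * (sample << shift)  # digit == -1 acts as a subtraction
    return acc

# Example coefficient: sin(u) with u = 2*pi/5, quantized to FRAC_BITS fractional bits.
coeff = round(math.sin(2 * math.pi / 5) * (1 << FRAC_BITS))
digits = to_csd(coeff)
# Integer sketch: left shifts followed by one final right shift to rescale.
product = shift_add_multiply(100, digits) >> FRAC_BITS
```

Each nonzero CSD digit beyond the first costs one adder or subtractor, which is why the digit count of the quantized coefficients drives the size of a multiplier block.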
Because the incoming signals are complex, two ReMBs are
needed, one for the real part and one for the imaginary part
of the incoming complex signal. Fig. 2(a) shows the two ReMBs; the
rest of Fig. 2(a) consists of a negator and a 2x2 switch, which are
needed due to the multiplication with j required in (4) for
m3, m4 and m5. For the other two multiplication coefficients
these two components are deactivated by the control logic,
which also provides the control signals and carry-in values
for each of the ReMBs.
Fig. 3. Butterfly circuit that implements the additions required by matrix S1
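The role of the negator and the 2x2 switch can be seen from the identity j(a + jb) = -b + ja: multiplying by a purely imaginary coefficient needs the same two real multiplications as a purely real one, followed by a swap of the two outputs and one sign change. The sketch below illustrates this behaviour only; the function name and the `imaginary` flag are hypothetical stand-ins for the control logic, and ordinary floating-point multiplication stands in for the two ReMBs.

```python
def scale_sample(re, im, magnitude, imaginary):
    """Scale the complex sample (re + j*im) by magnitude or by j*magnitude.

    Two real multiplications are always used, one per part. When
    imaginary is True the outputs are swapped and one is negated,
    which is what a negator plus a 2x2 switch would do.
    """
    scaled_re = magnitude * re  # multiplier for the real part
    scaled_im = magnitude * im  # multiplier for the imaginary part
    if imaginary:
        # j * (a + jb) = -b + ja
        return -scaled_im, scaled_re
    return scaled_re, scaled_im

# Purely real coefficient: switch and negator bypassed.
real_case = scale_sample(3.0, -2.0, 0.5, imaginary=False)
# Purely imaginary coefficient: switch and negator active.
imag_case = scale_sample(3.0, -2.0, 0.5, imaginary=True)
```

No complex-by-complex multiplier is needed in either case, which is consistent with M containing only purely real or purely imaginary entries.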
Modified butterfly circuit: The first three stages in Fig. 1
implement the operations needed by S2, for which there are 3
full and 2 half butterfly circuits (i.e. 5 butterfly operations). It
is in fact possible to accommodate only one butterfly circuit
and perform all 5 operations using this single
circuit, which saves hardware by removing
most of the adders and subtractors otherwise needed. Because it
is a 5-point Fourier transform, all 5 butterfly operations
are accomplished within the duration of the transform
without causing any delays. Fig. 3 shows the modified butterfly
circuit, composed of an adder and a subtractor along
with several multiplexers and registers that route the processed
signals inside the processing element. The square-shaped
components are registers and, where concatenated, they represent
shift registers. These registers are assumed to be enabled at
every clock cycle, whereas the latches L1 and L2
are enabled only when the control signals c5 and c6 are active.
To operate the whole system synchronously, a 3-bit counter
is required that counts from 001 to 101 in binary. The control
signals needed to operate the system in Fig. 3 are given in Table
I. In total, 6 signals are needed, and Table I shows which of
these control signals are active at which counter value.
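The idea of reusing one adder/subtractor pair for all five butterfly operations can be sketched in software. The sequence below reproduces the five rows of S2 that are visible in (3) with 3 full and 2 half butterfly operations, one per counter value. The ordering of the operations is an assumption chosen for illustration; the paper's actual schedule, registers, multiplexers and latches are not modelled, and the function names are hypothetical.

```python
# The five rows of S2 that are visible in (3).
S2_VISIBLE_ROWS = [
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
    [0, 1, -1, -1, 1],
    [0, 1, 0, 0, -1],
    [0, 1, 1, -1, -1],
]

def butterfly(a, b):
    """One shared adder/subtractor pair: returns (a + b, a - b)."""
    return a + b, a - b

def s2_with_one_butterfly(v):
    """Compute the visible rows of S2 applied to v, one butterfly call per step."""
    v0, v1, v2, v3, v4 = v
    a, t3 = butterfly(v1, v4)   # step 1, full: sum and difference both used
    c, d = butterfly(v2, v3)    # step 2, full
    t1, t2 = butterfly(a, c)    # step 3, full
    t4, _ = butterfly(t3, d)    # step 4, half: only the sum is used
    t0, _ = butterfly(v0, t1)   # step 5, half: only the sum is used
    return [t0, t1, t2, t3, t4]

def s2_reference(v):
    """Direct matrix-vector product with the visible rows, for comparison."""
    return [sum(r * x for r, x in zip(row, v)) for row in S2_VISIBLE_ROWS]

example = [1 + 2j, 3 - 1j, -2 + 0.5j, 4j, 5]
shared = s2_with_one_butterfly(example)
direct = s2_reference(example)
```

Five sequential calls fit into the five sample periods of one transform, which is the property the text relies on when it states that no extra delay is introduced.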
The solution for the modification of a butterfly circuit is
summarized in Fig. 4. The modified butterfly circuit has two
sets of registers: one that coordinates the butterfly operations
(shown as shift registers in Fig. 4) and another that guides the
samples from one stage to the next stage in the signal flow
graph (named stage registers in Fig. 4). The maximum number
of stage registers is equal to the number of stages that are
combined under one modified butterfly circuit. Due to the
irregularity of the WFT, extra multiplexers and registers/latches
are required in addition to the components depicted in Fig. 4.
The timing is governed by the same control logic as in
Fig. 2(a).
TABLE I
CONTROL SIGNALS IN THE PIPELINED 5-POINT WFT CIRCUIT
Counter Value    Control Signals that are active
001              c2, c3
010              c1, c3, c4
[...]            [...]
Fig. 4. The structure of the modified butterfly circuit that is designed to
merge several stages within a WFT. Three of these circuits are needed to
realize a 5-point WFT.
One butterfly circuit is enough to implement stages B5 and
B6, with a structure similar to that shown in Fig. 4. Note that
these two stages need 4 butterfly operations (1 full and
3 half butterfly operations), which justifies the use of only
one butterfly circuit to realize B5 and B6. We leave it to the
reader to derive the layout of this modified butterfly circuit. A
conventional butterfly circuit without any modifications is
enough to implement the last stage.
IV. COMPARATIVE STUDY
In order to quantify the savings possible with
our novel approach, in terms of both power dissipation and
circuit area, we have implemented several approaches from the
literature that realize a 5-point WFT using an in-house cost
estimation tool; the results are given in Table II. Power
dissipation of the designed circuits is estimated in mW
and circuit area is given in gates. Power dissipation
has been evaluated considering the activation rate of each
processing component in the circuit individually.
TABLE II
COMPARISON OF APPROACHES FOR IMPLEMENTING 5-POINT WFT
In Table II, WFT is the straightforward implementation of
the 5-point Winograd Fourier transform. This structure is still
in use [11], [12] and needs 5 multipliers. Kolba and Parks'
design [3] (which we name KolbaWFT), on the other hand,
needs only 4 multipliers and a shift operation and is
therefore less complex. Rather than using general-purpose multipliers,
a multiplierless approach may also be adopted, as
suggested in [13], because the multiplication coefficients
are fixed. This approach appears as MlessWFTA
in Table II. The savings of the multiplierless approach
over the conventional approaches are evident.
Pipelining the 5-point WFT using the approach in [5]
(which appears as RegWFT in Table II, named after
the regularity of its signal flow graph)
decreases the size of the circuit, as the components
are re-used by means of butterfly circuits. However,
the power dissipation of this circuit is high, as the
butterfly circuits consume considerable power along with
the many registers and multiplexers. Our approach, named
PipelinedWFT in Table II, by contrast reduces both the size
of the circuit and the power consumed, and therefore
compares favourably with the other methods.
V. CONCLUSION AND FUTURE WORK
In this paper a novel approach to pipelining the 5-
point WFT is shown. The novel approach makes use of a ReMB,
and the rest of the circuit uses only 3 complex adders and
3 complex subtractors with several registers and multiplexers
attached to the design. The ReMB itself is composed of only 3
multiplexers and 3 real adders. The savings over the
approaches taken from the open literature are evident. We have
employed an in-house design tool to estimate the cost of the
novel approach and observed that it
consumes less power and occupies a smaller circuit
area than the other solutions considered for realizing a 5-
point WFT circuit. The work presented here can
be extended to WFTs of different sizes, and a generalized
approach applicable to all WFT sizes can be proposed, which
is the aim of a later study. Although we have implemented
the circuit on an FPGA, realizing the circuit as a real-life
chip solution is also a later research objective.
ACKNOWLEDGMENT
The authors would like to acknowledge Dr. Suleyman Sirri
Demirsoy, Altera Corporation, United Kingdom, for the
constructive discussions on the reconfigurable multiplier blocks.
REFERENCES
[1] S. Winograd, “On Computing the Discrete Fourier Transform,”
Mathematics of Computation, pp. 175-199, Jan. 1978.
[2] C. S. Burrus, Fast Fourier Transforms, Rice University, Texas, 2008.
[3] D. Kolba and T. Parks, “A prime factor FFT algorithm using high-speed
convolution,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.,
vol. SP-29, pp. 281-294, Aug. 1977.
[4] W. Wyatt-Millington, S. Shepherd, and S. Barton, “A pipelined
implementation of the Winograd FFT for satellite on-board multi-carrier
demodulation,” Wireless Pers. Commun., vol. 2, no. 4, pp. 321-334, 1995.
[5] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT
processors for VLSI implementations,” IEEE Trans. Comput., vol. C-33,
no. 5, pp. 414-426, May 1984.
[6] K. R. Rao, D. N. Kim, and J. J. Hwang, Fast Fourier Transform:
Algorithms And Applications. Springer, 2010.
[7] R. E. Blahut, Fast Algorithms for Digital Signal Processing.
Addison-Wesley, 1985.
[8] S. S. Demirsoy, I. Kale, and A. G. Dempster, “Reconfigurable multiplier
blocks: Structures, algorithm and applications,” Circuits, Syst. Signal
Process., vol. 26, no. 6, pp. 793-827, Dec. 2007.
[9] S. S. Demirsoy, I. Kale, and A. G. Dempster, “Synthesis of
reconfigurable multiplier blocks: Part I fundamentals,” in Proc. IEEE Int. Symp.
Circuits Syst. (ISCAS), 2005, pp. 536-539.
[10] S. S. Demirsoy, I. Kale, and A. G. Dempster, “Synthesis of
reconfigurable multiplier blocks: Part II algorithm,” in Proc. IEEE Int. Symp.
Circuits Syst. (ISCAS), 2005, pp. 540-543.
[11] N. Aghaee and M. Eshghi, “A Pipelined Architecture for a 20-point
PFA,” in Proc. TENCON 2006, Nov. 2006, pp. 1-4.
[12] J. Chen, J. Hu, and S. Li, “High Throughput and Hardware Efficient
FFT Architecture for LTE Application,” in Proc. IEEE WCNC, Apr.
2012, pp. 826-831.
[13] M. D. Macleod, “Multiplierless Winograd and Prime Factor FFT
Implementation,” IEEE Trans. Signal Processing, vol. 11, no. 9, Sept. 2004.
