Comparison of reconfigurable structures for flexible word-length multiplication by O. A. Pfänder et al.
Adv. Radio Sci., 6, 113–118, 2008
www.adv-radio-sci.net/6/113/2008/
© Author(s) 2008. This work is distributed under
the Creative Commons Attribution 3.0 License.
Advances in
Radio Science
Comparison of reconﬁgurable structures for ﬂexible word-length
multiplication
O. A. Pfänder1, R. Nopper1, H.-J. Pfleiderer1, S. Zhou2, and A. Bermak2
1Microelectronics Department, University of Ulm, Germany
2Electrical and Computer Engineering Department, The Hong Kong University of Science and Technology, Hong Kong
S.A.R.
Abstract. Binary multiplication continues to be one of
the essential arithmetic operations in digital circuits. Even
though ﬁeld-programmable gate arrays (FPGAs) are becom-
ing more and more powerful these days, the vendors cannot
avoid implementing multiplications with high word-lengths
using embedded blocks instead of conﬁgurable logic. But
on the other hand, the circuit’s efﬁciency decreases if the
provided word-length of the hard-wired multipliers exceeds
the precision requirements of the algorithm mapped into the
FPGA. Thus it is beneficial to use multiplier blocks with configurable
word-length, optimized for area, speed and power dissipation, e.g. for
digital signal processing (DSP) applications.
In this contribution, we present different approaches and
structures for the realization of a multiplication with vari-
able precision and perform an objective comparison. This
includes one approach based on a modiﬁed Baugh and Woo-
ley algorithm and three structures using Booth’s arithmetic
operand recoding with different array structures. All mod-
ules have the option to compute signed two’s complement
fixed-point numbers either as an individual computing unit or
interconnected to a superior array. Therefore, either a high throughput
at low precision through parallelism or a high precision through
concatenation can be achieved.
1 Introduction
Reconfigurable hardware is becoming more and more popular today, thanks to the continuous evolution and improvement of FPGA devices. FPGA-based systems can be re-programmed after production by overwriting memory contents and thus adapted to new requirements or even completely different application scenarios. One beneficial effect on the economic side is the potential to save the development and/or production costs of an application-specific integrated circuit (ASIC), specifically for low and medium volume production. FPGAs achieve their high flexibility using programmable lookup tables and reconfigurable routing. However, implementing algorithms with high word-lengths and complex arithmetic on FPGAs can consume large hardware resources and achieve only comparatively low clock frequencies.

Correspondence to: O. A. Pfänder (oliver.pfaender@uni-ulm.de)

On the other hand, arithmetic operations such as ad-
dition, multiplication and combinations such as multiply-
accumulate (MAC) are required in almost every DSP al-
gorithm, e.g. digital ﬁlter designs or Fast Fourier Trans-
form (FFT) implementations (Parhami, 2000). Furthermore,
special applications make use of multi-precision arithmetic,
i.e. working at different levels of precision during their execution
time. Examples of this type of application are the iterative
approximation of non-linear functions (Lachowicz and
Pﬂeiderer, 2008), neural networks with different precision
in training phase and operation phase (Bermak and Mar-
tinez, 2003), and also adaptive ﬁlter designs with variable
input or coefﬁcient deﬁnitions. Since multiplication is cru-
cial in all these situations, optimizing this particular opera-
tion can help to save hardware cost and power dissipation.
In our approach, we utilize embedded hard-wired multiplier structures in FPGAs as dedicated hardware. In addition, we upgrade these multipliers with data exchange interfaces and a sign handling capability. Hence, not only the operand number system but also the word-length of the multiplication can be controlled directly at run-time, replacing the re-programming step by setting a number of appropriate control signals.
Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V.
Fig. 1. Realization of a periodical radix-4 Booth encoder, overlapping 3 input bits.
This paper is organized as follows: Sect. 2 describes the basic concept of concatenating several individual multiplier blocks to form a superior multiplier and covers the improvements over our previous designs; Sect. 3 compares the different approaches in terms of hardware usage and speed; Sect. 4 presents a novel structure for a block-serial multiplier and is followed by a conclusion.
2 Basic concept
In order to overcome the high area and routing expenses when realizing higher word-length multiplication in FPGAs, it is advantageous to embed multiplier blocks into the fabric and connect them to the routing network. This is already state-of-the-art in modern commercially available FPGA devices such as the Xilinx Virtex-II Pro and the Virtex-4 with its DSP48 blocks. Both examples feature 18×18-bit multipliers, while our multiplier elements are designed as fully functional n×n-bit blocks that can be combined to form a superior multiplier array. For this work, the parameter n was set to 8 because the percentage hardware overhead for reconfiguration would be too large for a lower number (Pfänder et al., 2004), and a larger number would constrain the word-length flexibility too much. By concatenating m×m uniform blocks, a superior (m·n)×(m·n)-bit multiplier can be realized. Since every element is a fully functional multiplier itself and can work separately at its basic word-length, a high degree of parallelism can be achieved.
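The concatenation principle can be illustrated in software. The following is a minimal sketch, not the hardware design itself: it assumes unsigned operands for simplicity (the paper's blocks also handle two's complement numbers), and the function name `block_multiply` is hypothetical. It composes an (m·n)×(m·n)-bit product from m×m products of n×n-bit blocks, each weighted by its position in the array.

```python
def block_multiply(a: int, b: int, m: int, n: int = 8) -> int:
    """Multiply two unsigned (m*n)-bit operands using only n x n-bit products.

    Illustrative model only: unsigned arithmetic is an assumption made here;
    sign handling as described in the paper is not modelled.
    """
    limit = 1 << (m * n)
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit into m*n bits")
    mask = (1 << n) - 1
    result = 0
    for i in range(m):                      # block row: slice i of operand a
        a_i = (a >> (i * n)) & mask
        for j in range(m):                  # block column: slice j of operand b
            b_j = (b >> (j * n)) & mask
            # Each n x n block product is weighted by 2^((i+j)*n).
            result += (a_i * b_j) << ((i + j) * n)
    return result
```

With n=8 and m=4 this models a 32×32-bit multiplication built from sixteen 8×8-bit blocks; with m=1 it reduces to a single basic element.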
There are several degrees of freedom for the multiplier elements' inner structure. Firstly, there is the choice of multiplication algorithm, i.e. parallel, serial, serial/parallel etc., and secondly, the number system must be chosen in order to handle signed numbers. We have shown in Pfänder and Pfleiderer (2005) that a modified Baugh and Wooley algorithm can be efficiently applied to a parallel and a serial/parallel multiplier that accepts two's complement signed numbers. In both cases, this leads to a regular array structure, which also permits an asymmetric operand scheme, and it shows only a small overhead compared to an unsigned, stand-alone multiplier. Reconfigurable interfaces and configuration options include the exchange of carry and sum input signals with neighboring elements and the inversion of partial products in certain array positions. In addition, introducing pipeline stages is possible with the Baugh and Wooley scheme, as shown by Di and Yuan (2003). Our previous designs have been improved by two major updates, namely the adoption of Booth's arithmetic recoding of input operands (Booth, 1951) and the use of a carry-save instead of a carry-ripple adder array. The Booth recoding is aimed specifically at reducing the number of partial products to save hardware costs while still retaining the full precision, in contrast to e.g. truncation (Kidambi et al., 1996). Both measures help to further accelerate the multiplication.

Fig. 2. Partial product generation for non-MSB bits from Booth-recoded signals.

The radix-4 Booth algorithm encodes three overlapping bits of the multiplier into one new digit (Booth, 1951). The number of multiplier digits and thus the number of partial products is reduced by 50%, which accelerates the multiplication significantly thanks to a shorter critical path. Although additional effort is needed for the encoding circuit, it only consists of a few logic gates as depicted in Fig. 1. Also, the partial products are one bit wider than the original multiplicand and some additional effort is needed for the partial product generation (PPG). But since their number is halved and the addition array is considerably smaller, the encoding effort is compensated and the overall hardware usage is reduced. Lee (2005) presents a dedicated circuit to handle different word-lengths in this context, but its scalability is limited and the method results in a large overhead. Figure 2 shows the logic circuit for our PPG.
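The recoding step can be made concrete with a small behavioral model. This is a sketch of the standard radix-4 Booth recoding, not of the gate-level encoder in Fig. 1; the function names are hypothetical. Each digit is formed from three overlapping bits as d = y[2i] + y[2i-1] - 2·y[2i+1], with y[-1] taken as 0, so an n-bit two's complement multiplier yields n/2 digits in the range -2..2.

```python
def booth_radix4_digits(y: int, n: int = 8) -> list:
    """Recode an n-bit two's complement multiplier (n even) into n/2 digits.

    Each digit lies in {-2, -1, 0, 1, 2} and is derived from three
    overlapping bits of y, starting at the least significant end.
    """
    if n % 2 != 0:
        raise ValueError("n must be even for radix-4 recoding")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit into n-bit two's complement")
    bits = y & ((1 << n) - 1)        # two's complement bit pattern of y
    digits = []
    prev = 0                         # implicit bit y[-1] = 0
    for i in range(0, n, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1                    # overlap: y[2i+1] is reused by the next digit
    return digits


def booth_multiply(x: int, y: int, n: int = 8) -> int:
    """Sum the n/2 partial products d_k * x * 4^k selected by the Booth digits."""
    return sum((d * x) << (2 * k)
               for k, d in enumerate(booth_radix4_digits(y, n)))
```

For example, the 8-bit multiplier 7 is recoded into the digits -1, 2, 0, 0 (least significant first), i.e. 7 = -1 + 2·4, so only four partial products are needed instead of eight.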
Our previous designs, e.g. the configurable parallel/parallel multipliers presented in Pfänder et al. (2004), were based on a carry-ripple adder array. This approach can be combined with a Booth recoding scheme, as we have shown in Zhou et al. (2007), which already leads to the speed improvement over a standard array multiplier shown in Table 1. An arbitrary array size is still possible, while the PPG is realized separately. Control signals manipulate the carry propagation and the sign handling functionality in certain perimeter cells at run-time. For further acceleration thanks to a shorter critical path, we have re-shaped the inner structure of the partial product addition (PPA) stage and applied a carry-save addition technique instead. This enables us to design the PPA independently from the final adder, which can either be a standard carry-ripple adder row or an accelerated carry-select adder as pictured in Fig. 3.

Fig. 3. Reconfigurable Booth multiplier with carry-save partial product generation and accelerated carry-select final adder.
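The separation between the carry-save PPA stage and the final adder can be sketched as follows. This is a generic illustration of carry-save addition on non-negative integers, not the array structure of Fig. 3; the function names are hypothetical. A 3:2 compressor turns three operands into a sum word and a carry word without propagating carries, and only the last two words need a carry-propagating adder.

```python
def carry_save_add(a: int, b: int, c: int):
    """3:2 compressor on non-negative integers.

    Returns (sum_word, carry_word) with a + b + c == sum_word + carry_word.
    No carry ripples between bit positions inside this step.
    """
    sum_word = a ^ b ^ c                              # bitwise sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, weighted by 2
    return sum_word, carry_word


def csa_reduce(operands) -> int:
    """Reduce a list of partial products with carry-save steps, then add once."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save_add(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, c]
    # Final carry-propagate addition; in hardware this could be a
    # carry-ripple row or a carry-select adder.
    return sum(ops)
```

Because the final addition is the only place where carries propagate, its implementation can be chosen independently of the reduction stage, which is the design freedom the paragraph above describes.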
3 Comparison
In order to perform a meaningful assessment in terms of hardware complexity and data throughput, the different structures for flexible word-length multiplication are compared with a selection of previous designs. We have adopted two benchmarks for comparison: the theoretical, hardware-independent approach is explained in Sect. 3.1, and the implementation results of a VHDL hardware description targeting a Xilinx Virtex-II Pro FPGA device are summarized in Sect. 3.2. Although this method does not lead to the same results as a semi-custom design flow toward an ASIC, it helps to quantify the designs and estimate the implementation costs on a common basis, i.e. the number of occupied slices in the FPGA device is assumed to scale with the area usage. From these data, we derive a combined and normalized figure of merit in order to compare the multiplier concepts, display them as vertical-bar charts and discuss the results.
Table 1. Speed improvement of the Booth recoding scheme over a parallel array multiplier and percentage overhead for reconfiguration (Zhou et al., 2007).

Word-length (bit)    Speed advantage    Overhead for reconfiguration
16                   13.8%              2.0%
24                   21.5%              1.7%
32                   22.5%              1.6%

3.1 Theoretical hardware and speed comparison
Our theoretical hardware and speed comparison takes into account the number of required transistors and the number of full adder delays for a multiplier element. The transistor numbers depicted in Fig. 4, calculated for one exemplary implementation (e.g. assuming 24 transistors for one full adder), show no significant differences in hardware usage. The multiplication times are more interesting, because they represent a hardware-independent figure for comparing the speed. Figure 5 shows the estimated delay in terms of full adder delays over the multiplier element's word-length.
Fig. 4. Theoretical hardware usage for different word-lengths, expressed in terms of transistor numbers.
Fig. 5. Theoretical multiplication delay for different word-lengths, expressed in terms of the number of full adders in the critical path.
The diagram legends in Figs. 4–7 contain the following abbreviations: US stands for an unsigned parallel/parallel array. OP is one of our previous designs from Pfänder et al. (2004) using a modified Baugh and Wooley algorithm (Baugh and Wooley, 1973) and a parallel/parallel array. SZ is our first approach to introduce a Booth recoding scheme with a carry-ripple adder array, as covered in Zhou et al. (2007). RN1 incorporates a carry-save adder array and a carry-ripple final adder, while RN2 uses a carry-select final adder instead, as pointed out in the previous section. The numbers used for the diagrams in Figs. 4–5 represent the hardware usage of a basic multiplier element scaled to different word-lengths.
3.2 FPGA synthesis hardware and speed comparison
In order to get actual numbers for the hardware usage of our multiplier elements, we use the synthesis results from mapping the VHDL structural description onto a Xilinx Virtex-II Pro FPGA using the vendor's Integrated Synthesis Environment (ISE) software. More precisely, the number of occupied slices in the FPGA device is used as a measure for area usage, and the critical path delay from the synthesis report is interpreted as a measure for multiplication time. Our experiments have shown that the synthesis results reflect the trends of the hardware-independent comparison and therefore confirm the theoretical prediction.

Fig. 6. Estimated delay for a cascaded superior multiplier assembled from 8×8-bit basic elements, without interconnect influence.
Fig. 7. Combined assessment criterion for scaled multiplier elements with different word-lengths, normalized with respect to an unsigned parallel multiplier.

The estimated multiplication time in nanoseconds for a cascaded multiplier array constructed from 8×8-bit basic elements is taken from the synthesis reports and shown in Fig. 6. The influence of the interconnect and routing network is not included in these figures; it still has to be considered, but this should be done using a semi-custom ASIC design flow instead.
3.3 Combined comparison of area and delay
In addition to the comparisons summarized in the previous subsections, we have adopted the product of area and delay as a cost function normalized to the respective values of an n-bit unsigned parallel multiplier:

$$\mathrm{Cost} = \frac{\text{No. of transistors} \times \text{No. of full adder delays}}{\text{Unsigned reference multiplier}} \tag{1}$$

Figure 7 plots the cost function over different word-lengths for scaled multiplier elements and clearly shows the significantly higher efficiency due to the Booth encoding scheme.

Fig. 8. Block-serial multiplication scheme with control circuit, registers, shifter and a configurable 32×8-bit MAC core.
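As a small illustration of how Eq. (1) can be evaluated, the sketch below assumes that the denominator is the area-delay product of the unsigned reference multiplier at the same word-length; this reading, the function name and the numbers in the usage note are assumptions for illustration and are not taken from the paper's data.

```python
def normalized_cost(transistors: float, fa_delays: float,
                    ref_transistors: float, ref_fa_delays: float) -> float:
    """Area-delay product of a design, normalized to a reference design.

    Assumption: the reference term is the product of the transistor count
    and the full adder delays of the unsigned parallel multiplier.
    A value below 1.0 means the design is more efficient than the reference.
    """
    if ref_transistors <= 0 or ref_fa_delays <= 0:
        raise ValueError("reference values must be positive")
    return (transistors * fa_delays) / (ref_transistors * ref_fa_delays)
```

With purely hypothetical placeholder values, a design that needs 20% more transistors but only half the full adder delays of the reference gets `normalized_cost(1200, 8, 1000, 16)`, i.e. a cost of 0.6.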
4 Novel block-serial approach
Assembling arbitrarily large multiplier structures through the concatenation of configurable basic blocks leads to high routing effort and a complex interface design for communicating with the circuit periphery. In order to avoid this hardware overhead a priori, we have developed an approach that provides a configurable-precision multiplier at lower expense by introducing a half-serial computation of the multiplier operand. This means that only one of the PPA rows is actually implemented in hardware; in the case of an 8-bit basic block this is equivalent to a 32×8-bit multiplier-accumulator (MAC). The other cells are replaced by buffered intermediate results that are fed back to the MAC sum input. After a certain number of computation steps, the multiplication result is valid at the output with the required precision, which can be 64 bit at maximum. With this arrangement, multiplications from 8×8-bit up to 32×32-bit can be realized, where both input operands are varied independently in steps of 8 bit.

The major attribute of this block-serial structure is the fact that, in contrast to a fully parallel multiplier, a number of clock cycles depending on the word-length n of the multiplier is required for computing the product. Since the multiplier is segmented into blocks of 8 bit, the number of computation steps including multiplication, shifting and addition is n/8. The multiplier's word-length is given by the 2-bit input LEN_V as an unsigned number in the range [0...3], where n=(LEN_V+1)·8. The basic design, composed of a control circuit (CTRL), a shifter, registers for the operands, a 32×8-bit MAC and feedback lines for the intermediate results to the sum input, is shown in Fig. 8. CTRL incorporates a finite state machine that receives and interprets LEN_V and addresses the registers and the MAC block accordingly. The outstanding feature of this design is its ability to compute at a precision that is configurable at run-time, performing (8·x)×(8·y)-bit multiplications up to 32×32-bit while using 60% less hardware than a fully parallel 32×32-bit array multiplier with sign handling capability and data exchange interfaces.
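The block-serial scheme can be modelled behaviorally as follows. This is a sketch under stated assumptions, not a description of the circuit in Fig. 8: it handles unsigned operands only, processes the multiplier from the least significant byte upwards, and feeds the upper part of each MAC result back to the sum input after a right shift by 8 bits. The function names `mac_32x8` and `block_serial_multiply` are hypothetical; only LEN_V and the relation n=(LEN_V+1)·8 come from the text.

```python
def mac_32x8(multiplicand: int, multiplier_byte: int, sum_in: int) -> int:
    """Model of a 32x8-bit multiply-accumulate core (unsigned assumption)."""
    if not (0 <= multiplicand < 1 << 32 and 0 <= multiplier_byte < 1 << 8):
        raise ValueError("operand out of range for a 32x8-bit MAC")
    return multiplicand * multiplier_byte + sum_in


def block_serial_multiply(x: int, y: int, len_v: int) -> int:
    """Multiply x (up to 32 bit) by y (n bit) in n/8 MAC steps.

    len_v is the 2-bit word-length code, n = (len_v + 1) * 8.
    """
    if not 0 <= len_v <= 3:
        raise ValueError("LEN_V must be in the range 0..3")
    n = (len_v + 1) * 8
    if not 0 <= y < 1 << n:
        raise ValueError("multiplier does not fit into n bits")
    steps = n // 8
    low_bytes = 0      # result bytes that are already final
    feedback = 0       # buffered intermediate result fed back to the sum input
    for step in range(steps):
        y_byte = (y >> (8 * step)) & 0xFF
        total = mac_32x8(x, y_byte, feedback)
        low_bytes |= (total & 0xFF) << (8 * step)   # lowest byte is final now
        feedback = total >> 8                        # shift and feed back the rest
    return low_bytes | (feedback << (8 * steps))
```

A 32×32-bit product takes four passes through the MAC (`len_v = 3`), while an 8-bit multiplier needs a single pass (`len_v = 0`), which mirrors the n/8 computation steps stated above.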
5 Conclusions
In this work, we have described and compared four different run-time reconfigurable multiplier architectures in terms of hardware usage and speed. Starting from a previous design based on a parallel/parallel carry-ripple array, we have improved the architecture by a Booth recoding scheme, a carry-save adder array and an accelerated final adder. The overhead for a reconfigurable module over a static unsigned carry-ripple multiplier varies between 24% and 37% at a module word-length of 8 bit, but the critical path can be reduced by up to 45%, therefore accelerating the design significantly. For an 8-bit basic building block, our new approach requires 50% less hardware and decreases the delay by about 60% according to our synthesis experiments targeting a Xilinx Virtex-II Pro FPGA device. Moreover, we have applied a normalized area-delay product for fair comparison and verified the significant benefit of the Booth encoder, while still retaining the possibility to change the multiplier's precision at run-time. We have also presented a novel block-serial multiplication approach which turned out to be noticeably more area-efficient than a full 32×32-bit reconfigurable multiplier, since it requires 60% less hardware. We are currently working on the implementation of our multiplier architectures in standard-cell ASIC technology in order to avoid the FPGA synthesis limitations and make further improvements, and we are also developing a detailed routing and interface design.
Acknowledgements. The work described in this paper is supported
by a grant from the Research Grant Council of Hong Kong
S.A.R. and the German DAAD (Project No. G-HK019/05).
References
Baugh, C. R. and Wooley, B. A.: A Two’s Complement Parallel
Array Multiplication Algorithm, IEEE Trans. Computers, C-22,
1045–1047, 1973.
Bermak, A. and Martinez, D.: A compact 3D VLSI classiﬁer us-
ing bagging threshold network ensembles, IEEE Trans. Neural
Networks, 14, 1097–1109, 2003.
Booth, A. D.: A Signed Binary Multiplication Technique, Quart. J.
Mechanics and Appl. Mathematics, 4, 236–240, 1951.
Di, J. and Yuan, J. S.: Run-time Reconﬁgurable Power-aware
Pipelined Signed Array Multiplier Design, in: International
Symposium on Signals, Circuits and Systems, 2, 405–408, 2003.
Kidambi, S. S., El-Guibaly, F., and Antoniou, A.: Area-efﬁcient
multipliers for digital signal processing applications, IEEE
Trans. Circuits and Systems II, 43, 90–95, 1996.
Lachowicz, S. W. and Pﬂeiderer, H.-J.: Fast Evaluation of Nonlin-
ear Functions using FPGAs, Adv. Radio Sci., 6, 2008.
Lee, H.: Power-Aware Scalable Pipelined Booth Multiplier, IEICE
Trans. Fundamentals, E88-A, 3230–3234, 2005.
Parhami, B.: Computer Arithmetic: Algorithms and Hardware De-
signs, Oxford University Press, Oxford/New York, 2000.
Pfänder, O. A. and Pfleiderer, H.-J.: Dynamische Rekonfiguration
von arithmetischen Einheiten auf Bitebene, Adv. Radio Sci., 3,
319–323, 2005,
http://www.adv-radio-sci.net/3/319/2005/.
Pfänder, O. A., Hacker, R., and Pfleiderer, H.-J.: A Multiplexer-Based
Concept for Reconfigurable Multiplier Arrays, in: International
Conference on Field-Programmable Logic and Applications,
pp. 938–942, Antwerp, Belgium, 2004.
Zhou, S., Pfänder, O. A., Pfleiderer, H.-J., and Bermak, A.: A
VLSI Architecture for a Run-Time Multi-Precision Reconfigurable
Booth Multiplier, in: International Conference on Electronics,
Circuits and Systems, Marrakech, Morocco, 2007.