Configurable multiplier modules for an adaptive computing system by O. A. Pfänder et al.
Adv. Radio Sci., 4, 231–236, 2006
www.adv-radio-sci.net/4/231/2006/
© Author(s) 2006. This work is licensed
under a Creative Commons License.
Advances in
Radio Science
Configurable multiplier modules for an adaptive computing system
O. A. Pfänder¹, H.-J. Pfleiderer¹, and S. W. Lachowicz²
¹ Microelectronics Department, University of Ulm, Germany
² School of Engineering and Mathematics, Edith Cowan University, Perth, Western Australia
Abstract. The importance of reconﬁgurable hardware is in-
creasing steadily. For example, the primary approach of
using adaptive systems based on programmable gate arrays
and conﬁgurable routing resources has gone mainstream and
high-performance programmable logic devices are rivaling
traditional application-speciﬁc hardwired integrated circuits.
Also, the idea of moving from the 2-D domain into a 3-D
design which stacks several active layers above each other is
gaining momentum in research and industry, to cope with the
demand for smaller devices with a higher scale of integration.
However, optimized arithmetic blocks in coarse-grain reconfigurable
arrays as well as field-programmable architectures
still play an important role. In countless digital systems and
signal processing applications, the multiplication is one of
the critical challenges, where in many cases a trade-off be-
tween area usage and data throughput has to be made. But
the a priori choice of word-length and number representation
can also be replaced by a dynamic choice at run-time, in or-
der to improve ﬂexibility, area efﬁciency and the level of par-
allelism in computation. In this contribution, we look at an
adaptive computing system called 3-D-SoftChip to point out
what parameters are crucial to implement ﬂexible multiplier
blocks into optimized elements for accelerated processing.
The 3-D-SoftChip architecture uses a novel approach to 3-
dimensional integration based on ﬂip-chip bonding with in-
dium bumps. The modular construction, the introduction of
interfaces to realize the exchange of intermediate data, and
the reconﬁgurable sign handling approach will be explained,
as well as a beneﬁcial way to handle and distribute the nu-
merous required control signals.
Correspondence to: O. A. Pfänder (oliver.pfaender@uni-ulm.de)

1 Introduction
Today’s ever-increasing demand for computing power in virtually
all application sectors, including the emerging photonics arena, is
placing heavy demands on current integrated systems. Crucial
scenarios are e.g. modern wired
and wireless communication over highly complex networks,
processing of content-rich multimedia, and mobile appli-
cations where area and power are primary considerations.
New ways to overcome today’s technological barriers of
silicon integrated circuits (ICs) are emerging, aiming to
reduce both non-recurring engineering costs and time-to-
market periods, and also considering production factors like
re-usage of lithography masks. Reconﬁgurable computing
has gone mainstream, with the many forms of reconﬁgurable
logic gradually rivaling traditional application-speciﬁc ICs.
Thanks to the progress in IC manufacturing techniques and
the shrinking feature sizes, it has become possible to inte-
grate a multitude of different functions into one single chip,
while still meeting restrictions like package size or footprint.
Thus, a whole system incorporating multiple special func-
tions that had to be realized with multiple specialized sepa-
rate ICs can now be implemented in a single system-on-chip
(SoC). Going one step further from SoC, the idea of systems-
in-a-package (SiP) emerges, when high-end concepts such as
ﬂip-chip or chip stacks are used to reach an even higher level
of design integration and minimization. While the IC den-
sity continues to grow, another key feature is becoming more
and more essential for many of today’s widely spread ap-
plication scenarios; that is, ﬂexibility and/or conﬁgurability.
There is a paradigm change in the world of microelectronics,
because the term hardware no longer stands for the total
opposite of software. The notion of ICs offering options
to conﬁgure them, even by the end customer, is nothing but
intriguing. As feature sizes are constantly shrinking and the
design complexity is continually increasing, design method-
ologies are forced to develop at an increasing rate and the de-
signer faces many new obstacles. Novel 3-D integration sys-
tems such as 3-D SoC (Joyner et al., 2004) or 3-D-SoftChip
(Eshraghian et al., 2003), targeted to satisfy the enormous
demand for more computation throughput by effectively ma-
nipulating the functionality of hardware primitives through
vertical integration of two 2-D chips, are becoming an attractive
solution to combat the rising requirements for interconnect
wires (Davis et al., 2005).

Fig. 1. The starting point is a 2-dimensional coarse-grain soft
reconfigurable array of processing elements (PEs) and intelligent
configurable switches (ICS). Each PE in the figure contains an ALU and
a register block (REG).

Published by Copernicus GmbH on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V.
This paper is organized as follows: Sect. 2 reviews the 3-
D-SoftChip adaptive computing system and its approach to
architectural mapping and vertical integration, Sect. 3 con-
centrates on the design of embedded multiplier blocks into
the 3-D-SoftChip’s processing elements, and Sect. 4 con-
cludes the paper and gives suggestions for further investiga-
tion.
2 3-D-SoftChip adaptive computing system
The concept of the 3-D-SoftChip adaptive computing system
lies at the intersection of three major fields of research:
modern very large scale integration (VLSI) in deep sub-micron
IC design, reconfigurable hardware, and 3-D integration. This
section first highlights the basic method of transforming a
2-dimensional coarse-grain reconfigurable array into a
3-dimensional integrated system, then presents the vertical
integration method using indium bumps, and finally describes
the architectural mapping approach.
2.1 2-D to 3-D transformation
The starting point for the transformation into the 3-D-SoftChip
system is a 2-dimensional coarse-grain soft reconfigurable array
of processing elements as depicted in Fig. 1. Each processing
element (PE) contains arithmetic and logical functional blocks
combined in an arithmetic logic unit (ALU) and also buffer
elements (registers, REG). The ALUs are designed for increased
flexibility in order to support variable data word-lengths for
various types of computation. The overall layout is very much
like a checkerboard, with intelligent configurable switches (ICS)
accessing the surrounding PEs through a next-neighbor interconnect
scheme. Thus, each ALU is able to communicate with all adjacent
ALUs through the ICS, which serves as a cross point switch with
embedded memory.

Fig. 2. Functional separation into a sea of PEs and an array of ICS
blocks. Each ICS connects a group of 4 PEs using vertical wiring
resources.

Looking at the checkerboard layout of Fig. 1, two func-
tional layers are noticeable: First, the array of PEs, and sec-
ond, a superordinate array of ICS blocks that covers multi-
ple management, interface and I/O functions. However, the
area usage of a complete 2-dimensional array is consider-
able, since there are as many ICS blocks as there are PEs
and the area of the respective blocks can differ to a great ex-
tent, depending on the provided features. The area usage can
be optimized if these two functional layers are separated in
space. The obvious solution would be an arrangement
in groups – one group of PEs in one part of the chip and
a group of ICS blocks in another, like separating the black
squares from the white squares on a checkerboard. But this
solution is impractical, since it would require an excessive
amount of interconnect circuitry and wire space to connect
the two groups. A better way to cope with this situation is
to think in three dimensions: By using a novel method for
flip-chip bonding, as will be explained in Sect. 2.2, the two
functional layers can be realized as two separate chips that
are integrated vertically. To share e.g. memory or outbound
routing resources, four ICSs are merged, and the resulting
blocks are placed on a second chip. On the ﬁrst chip, the
PEs can now move together and the area can be compacted
signiﬁcantly. The data exchange now happens not only inter-
nally on each of the two chips, but also between them, as will
be pointed out later. Figure 2 illustrates the basic principle of
transforming the planar 2-D architecture of Fig. 1 into two
separate functional layers.
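
The separation described above can be illustrated with a small counting sketch. It is only a model under stated assumptions, not the authors' layout procedure: the checkerboard coordinates, the convention that PEs occupy the squares with even coordinate sum, and the function names `split_checkerboard` and `compact_and_group` are all hypothetical. The array is assumed to have an even number of rows and a column count divisible by 4, so that the compacted PE array tiles cleanly into groups of four.

```python
from collections import defaultdict


def split_checkerboard(rows, cols):
    """Split a rows x cols checkerboard into PE and ICS coordinates.

    Assumption: PEs sit where (row + col) is even, ICS blocks elsewhere.
    """
    pes = [(r, c) for r in range(rows) for c in range(cols) if (r + c) % 2 == 0]
    ics = [(r, c) for r in range(rows) for c in range(cols) if (r + c) % 2 == 1]
    return pes, ics


def compact_and_group(pes):
    """Move the PEs together into a dense array and tile it in 2 x 2 groups.

    Each group stands for the four PEs served by one merged ICS block
    on the second chip.
    """
    groups = defaultdict(list)
    for r, c in pes:
        dense = (r, c // 2)  # drop the columns that used to hold ICS blocks
        groups[(dense[0] // 2, dense[1] // 2)].append(dense)
    return dict(groups)


if __name__ == "__main__":
    pes, ics = split_checkerboard(4, 8)
    groups = compact_and_group(pes)
    print(len(pes), "PEs,", len(ics), "ICS blocks in the planar array")
    print(len(groups), "merged ICS blocks, each serving",
          {len(g) for g in groups.values()}, "PEs")
```

In this model the planar array holds equally many PEs and ICS blocks, while after the split the PE chip is a dense array and the ICS chip needs one merged block per group of four PEs.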
Fig. 3. To achieve a vertical 3-dimensional integration of the upper
ICS chip and the lower CAP chip, indium bumps are used for ﬂip-
chip bonding.
2.2 Vertical integration
As described in the previous subsection, the sea of PEs is
now separated from the array of ICS blocks. The lower chip
that now contains the PEs only is called the conﬁgurable ar-
ray processor (CAP). The transformation of the planar 2-D
architecture through the approach depicted in Fig. 2 into two
vertically integrated chips, namely the upper ICS chip and
the lower CAP chip, is illustrated in Fig. 3.
Each of the two chips has an array of aluminum pads on the top
metallization layer, each with a size of 10 µm × 10 µm. The pad
pattern on the lower chip is a mirror image of the upper chip’s
pattern. When the chips are placed face-to-face on top of each
other, this creates an interface for a vertical connection between
them, and the actual connection is realized by depositing indium
bumps onto the aluminum pads on the upper and lower chip,
respectively. Creating the bumps comprises the following processing
steps: After the silicon substrate has been oxidized and the
aluminum pads have been patterned in the usual way, an additional
photoresist coating and patterning step covers and protects the die
surface surrounding the pads. Then, titanium (as a diffusion
inhibitor), gold (as a contact layer) and indium are evaporated and
thus deposited onto the aluminum pads. After the lift-off, an extra
reflow process makes use of the surface tension of the low-melting
indium material in order to transform the block-shaped deposits into
bumps with an increased height-to-width ratio. This ensures a
certain self-adjusting capability once the two chips are connected
(Pfänder et al., 2005). The bumps have a diameter of about 7.5 µm
after reflow. The upper ICS chip is flipped and bonded face-down to
the lower CAP chip; then the space between them is filled with a
curing material to ensure mechanical stability.
2.3 Architectural mapping
The upper chip’s function (ICS) is to act as a massively par-
allel cross point switch as well as a parallel interconnected
buffer memory, to allow for a very high-speed data manipu-
lation within the plane. The lower chip (CAP) is a highly par-
allel array of soft-programmable processing elements, which
is capable of carrying out complex calculation tasks directly
on data stored in the CAP plane or – using the 3-D in-
terconnect – stored in the top plane. Each of the PEs in-
cludes its own embedded register ﬁle, along with functional
ALU blocks, glue logic and instruction decoding circuitry.
Software-programmed instructions are forwarded globally to
all processors from on-chip RAM. Even transforms and other
processing tasks may be carried out according to embedded
software instructions on the highly parallel sea-of-PE array.
Two levels of hierarchy within the CAP architecture facilitate
the configuration of the ALU’s word-length: While at the
first level, four processors and one ICS are utilized, this basic
group can communicate with adjacent groups at the second
level. The interconnection between the parallel array pro-
cessors is provided by a bus architecture for rapid extraction
or insertion of data. Due to the programmable nature of the
CAP, the system is highly ﬂexible, and as a result of the ver-
tical interconnects and highly parallel conﬁgurable architec-
ture, the efﬁciency is improved compared to the 2-D planar
architecture that was the starting point. On the array level
in the system, addressing will follow a switched bus architecture.
The multi-tasking requirement of the system introduces
a particular need for formulating the architecture that is
driven by a variable and configurable number of bits.
Therefore, the identification of generic primitives such as adders
and multipliers suitable for word-length expansion becomes
an important task to realize a higher system ﬂexibility with-
out compromising performance.
3 Conﬁgurable multiplier modules
3.1 Heterogeneous processing elements in the CAP
For generic ALU functions, the 3-D-SoftChip’s standard
PE is optimized for bit-level computation. However, when
it comes to frequently used arithmetic functions – e.g. in
signal processing applications where the data path con-
tains many multiplication steps involving constantly chang-
ing ﬁlter coefﬁcients – it becomes more efﬁcient to imple-
ment ﬁxed-wired and dedicated functional blocks instead
of combining the conﬁgurable logic (Haynes et al., 1999).
This approach is now common practice in modern high-performance
FPGA designs, like the Xilinx Virtex-II Pro that uses 18 × 18 bit
multiplier hard IP blocks strategically placed on the chip
(Xilinx, 2005). Introducing a processing accelerator PE with
embedded special blocks (Pfänder et al., 2005), such as a barrel
shifter or an accumulator/subtractor unit, helps to reduce the data
transfer and increases the computation efficiency. The trade-off is
only a moderate loss of flexibility compared with a completely
homogeneous PE array.
The processing accelerator PE (PA-PE) thus provides increased
performance for specific tasks thanks to its dedicated
special-purpose functions. In the following, we concentrate on
multiplication, since it is one of the most important and widely
used mathematical operations and requires a large amount of hardware
resources when implemented in a straightforward way. Instead, we
have developed a modular multiplier design that enables the system
to compute at a configurable word-length, as explained in the
following subsections.

Fig. 4. A parallel-parallel modified Baugh-Wooley array multiplier
with multiplexer-based interfaces. Signal labels in the figure:
A(0)–A(2), B(0)–B(2), S_IN(0)–S_IN(2), PSL_IN(0)–PSL_IN(2),
C_IN(0)–C_IN(2), C_OUT(0)–C_OUT(2), M(0)–M(5), and the control
signals CTRL_V, CTRL_H, CTRL_I_1, CTRL_I_2, CTRL_C_1 and CTRL_C_2.

3.2 Basic scheme and module characteristics
With arithmetic operations that accommodate word-lengths
configurable at run-time, a digital hardware circuit can adapt to
changing accuracy requirements easily and quickly. With little
overhead, new extensions can help to improve the embedding of
multiplier architectures in the surroundings of the 3-D-SoftChip
system. As presented in Pfänder and Pfleiderer (2004), we have
designed different variants of multiplier modules that rely on a
multiplexer-based connectivity extension for an intermediate data
exchange. The general idea is to tap the flow of carry and output
bits in specific positions inside an array of basic cells. These
cells compute the partial product and provide sum and carry, and
with the help of multiplexers and corresponding control signals, the
module’s behavior can be directly influenced at run-time. The
flexible word-length architecture and its dynamic reconfigurability
provide either a higher throughput at low levels of precision or a
higher precision by grouping multiple elements together, thus
greatly increasing the design’s efficiency. Also, since every module
is a fully functional multiplier itself, a high degree of
parallelism can be achieved when all modules compute separately.
The basic modular building block is an n × n bit multiplier.
By concatenating m × m of these uniform blocks through the
use of interconnect resources, in our case also involving the
vertical inter-chip interface, a superior (m × n) × (m × n) bit
multiplier is formed that enables the computation at flexible
precision in steps of n. The input operand word-lengths of
the multiplier elements do not necessarily have to be equal; a
scheme of n1 × n2 bit with n1 ≠ n2 is also possible.
Fig. 5. A concatenation of four 3 × 3 bit multiplier modules creates
a 6 × 6 bit superior multiplier. (Control signals not shown for better
overview)
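
The arithmetic behind such a concatenation can be sketched in a few lines of Python. This is only an illustration of the underlying identity for unsigned operands, under the assumption that each operand is split into m digits of n bits. It does not model the multiplexer-based carry and sum interfaces of the actual modules, and the function names `multiply_n_bit` and `concatenated_multiply` are hypothetical.

```python
def multiply_n_bit(a, b, n):
    """Stand-in for one n x n bit unsigned multiplier module."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    return a * b


def concatenated_multiply(a, b, m, n):
    """Unsigned (m*n) x (m*n) bit product built from m*m module products."""
    mask = (1 << n) - 1
    # Split each operand into m digits of n bits, least significant first.
    a_digits = [(a >> (n * i)) & mask for i in range(m)]
    b_digits = [(b >> (n * j)) & mask for j in range(m)]
    result = 0
    for i, a_digit in enumerate(a_digits):
        for j, b_digit in enumerate(b_digits):
            # Module (i, j) contributes its product at weight 2^(n*(i+j)).
            result += multiply_n_bit(a_digit, b_digit, n) << (n * (i + j))
    return result


if __name__ == "__main__":
    # Four 3 x 3 bit modules (m = 2, n = 3) acting as one 6 x 6 bit multiplier.
    print(concatenated_multiply(45, 27, m=2, n=3))
```

With m = 2 and n = 3 this corresponds to the case of Fig. 5; when the four modules are not grouped, the same hardware would instead deliver four independent 3 × 3 bit products.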
3.3 Realization options
Starting from an unchanged parallel-parallel array multiplier
core (Hwang, 1979), there are different expansion steps re-
quired in order to achieve the desired connectivity options.
Here, since we are dealing with signed numbers, this de-
mands extra sign handling circuitry. An architecture compar-
ison in terms of hardware usage and complexity has shown
that the modiﬁed Baugh-Wooley design (Baugh and Woo-
ley, 1973) offers the least overhead of the proposed two’s
complement multipliers, since there is no need for a sign
extension (Pfänder and Pfleiderer, 2004). The modified design
incorporates the data exchange interfaces mentioned in the
previous subsection and the possibility to handle different
number representations, due to the modified and configurable
basic cells in specific array positions. It can handle the
following number systems: unsigned (when the special cells are
configured to act without negating the partial product),
signed-magnitude (when the sign bit is calculated externally,
namely using an XOR gate) and two’s complement (Pfänder et al., 2005).
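
To make the role of the configurable special cells concrete, the following Python sketch models a modified Baugh-Wooley multiplication at the bit level. It is a behavioral illustration under assumptions, not the authors' circuit: the function name `baugh_wooley_multiply` and the `signed` flag (standing in for the control signals) are hypothetical, and the correction constants 2^n and 2^(2n-1) follow the common textbook form of the modified algorithm.

```python
def baugh_wooley_multiply(a, b, n, signed=True):
    """Bit-level model of an n x n bit modified Baugh-Wooley multiplier.

    With signed=True the operands are n-bit two's complement numbers.
    With signed=False the special cells do not invert and no correction
    term is added, so the array acts as a plain unsigned multiplier.
    """
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]
    acc = 0
    for i in range(n):
        for j in range(n):
            partial = a_bits[i] & b_bits[j]
            # Special cells: exactly one of the two bits is a sign bit.
            if signed and ((i == n - 1) != (j == n - 1)):
                partial ^= 1  # configurable partial product inversion
            acc += partial << (i + j)
    if signed:
        acc += (1 << n) + (1 << (2 * n - 1))  # correcting term
    acc &= (1 << (2 * n)) - 1  # keep the 2n result bits
    if signed and acc >> (2 * n - 1):
        acc -= 1 << (2 * n)  # reinterpret as a two's complement value
    return acc


if __name__ == "__main__":
    print(baugh_wooley_multiply(-4, 3, n=3))                # two's complement
    print(baugh_wooley_multiply(5, 7, n=3, signed=False))   # unsigned
```

In the signed-magnitude case, the unsigned configuration would be applied to the two magnitudes, with the sign of the result obtained separately as the XOR of the operand sign bits.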
Figure 4 shows an n = 3 ⇒ 3 × 3 bit parallel array multiplier
as an example, and Fig. 5 represents a concatenation
of four modules to form an (m = 2; n = 3) ⇒ 6 × 6 bit multiplier.
The control signals are used to control the interfaces
and also the conﬁgurable partial product inversion step in the
special basic cells, in order to obey the mathematical require-
ments given in Baugh and Wooley (1973). The signals are
mapped according to the position in the concatenation and
the desired number system using a control decoder in order
to save I/O and interconnect resources. Thus, a part of the
ICS control logic is transferred onto the CAP plane and inte-
grated into the multiplier modules.
The identical operand scheme that builds the foundation of
the parallel-parallel array multiplier can also be implemented
in a serial-parallel fashion as depicted in Fig. 6. However, the
control signals and also the correcting term must be provided
as serial-input data words (Bermak et al., 1997).
Fig. 6. The modified Baugh-Wooley multiplier can also be realized
in a serial-parallel way. The first multiplicand and the control signal
vector are fed in serially. Labels in the figure: FA, FF, B(0)–B(3),
Cb(0)–Cb(3), A and Ca (LSB first), S_IN, S_OUT.
3.4 Comparison of multiplication schemes
Looking at the schemes shown in Figs. 4 and 6, it becomes
evident that the number of basic cells N is a function of
the word-length. For the parallel array N ∝ n² and for the
serial-parallel multiplier N ∝ n. Note that the serial-parallel
multiplier’s apparent area advantage is offset by the cost of
the shift registers needed to store one input operand and the
control vector. However, both options have in common that
enabling the connectivity entails considerable overhead, for
example 40% at n = 4 for the parallel array (Pfänder and
Pfleiderer, 2004). It is not sufficient to judge from the
schematic hardware overhead alone; for a fair comparison, the
following aspects also have to be considered:
– Area usage – depending on the available technology and
cell library etc.
– Computation time – since the serial-parallel approach
uses 2n clock periods
– Data throughput
In order to increase the data throughput, a fully pipelined
parallel array is possible, but more than 5n² extra registers
would be necessary, resulting in an even higher area penalty and
increasing the complexity of a processing element. Thus, the
decision which approach to use depends on a multitude of
parameters. The range of applications plays the dominant role
by asserting the claims for throughput and speed, and also
other restrictions in terms of area usage and power consump-
tion may apply. In general, the connectivity option requires
a considerable extra amount of circuitry in each processing
element.
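
The scaling relations quoted in this subsection can be tabulated with a short sketch. It uses only the figures stated above (N ∝ n² and N ∝ n basic cells, 2n clock periods for the serial-parallel scheme, more than 5n² registers for full pipelining). The proportionality constants are assumed to be 1 purely for illustration, and the function name `multiplier_scaling` is hypothetical.

```python
def multiplier_scaling(n):
    """Rough scaling figures for an n x n bit multiplier module.

    Assumption: proportionality constants of 1, so the numbers only show
    trends and are not area or timing estimates.
    """
    return {
        "parallel_cells": n * n,                       # N proportional to n^2
        "serial_parallel_cells": n,                    # N proportional to n
        "serial_parallel_clock_periods": 2 * n,        # per multiplication
        "pipeline_register_lower_bound": 5 * n * n,    # "more than 5 n^2"
    }


if __name__ == "__main__":
    for n in (3, 4, 8, 16):
        print(n, multiplier_scaling(n))
```

Such a table does not replace the technology-dependent comparison called for above. It merely shows how quickly the cell count of the parallel array and the register cost of pipelining grow with the word-length.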
4 Conclusion
In this paper, we have reviewed the 3-D-SoftChip adaptive
computing system and highlighted the 2-D to 3-D transfor-
mation and its architectural mapping. To achieve a high level
of integration, a novel ﬂip-chip bonding technique based on
indium bumps on aluminum pads is used to build a system-
in-a-package. The indium bumps enable a 3-dimensional
routing between the lower conﬁgurable array processor chip
and the upper intelligent conﬁgurable switch chip. Look-
ing at the heterogeneous processing elements located on the
lower CAP chip, we have presented conﬁgurable embed-
ded multiplier modules for accelerated processing. These
modules are based on a modiﬁed Baugh-Wooley multiplier
and expanded by multiplexer-based data exchange interfaces
to provide a connectivity option. Both parallel-parallel and
serial-parallel array multiplication schemes are possible, and
three different number systems – namely unsigned, signed-
magnitude and two’s complement – can be handled. By con-
necting multiple multiplier elements together via 2-D or 3-D
interconnect resources, a superior multiplier computing at an
increased word-length is formed. Thus, the multiplication
word-length can be chosen at run-time by dynamically real-
izing a concatenation of separate modules. There is a signif-
icant amount of extra hardware needed to make the dynamic
array arrangement possible. But when the word-length of an
optimized arithmetic unit is not ﬁxed but can be chosen dy-
namically at run-time, this opens the door for multi-precision
algorithms as well as a massively parallel and efﬁcient usage
of resources in the 3-D-SoftChip system.
As an outlook on future work, it will become necessary to
investigate the impact of a given choice of processing element
architecture on the actual area usage as a function of the specific
technology parameters. Then, more detailed conclusions about
hardware overhead, area efficiency and required interconnect
resources can be drawn. A refinement of the architecture concepts
as well as a design space exploration that takes into account the
high-potential 3-D integration approach using indium bumps are
currently in progress.
References
Baugh, C. R. and Wooley, B. A.: A Two’s Complement Parallel
Array Multiplication Algorithm, IEEE Trans. Computers, C-22,
1045–1047, 1973.
Bermak, A., Martinez, D., and Noullet, J.-L.: High-Density 16/8/4-
bit Conﬁgurable Multiplier, IEE Proc. Circuits Devices Systems,
144, 272–276, 1997.
Davis, W. R., Wilson, J., Mick, S., Xu, J., Hua, H., Mineo, C.,
Sule, A. M., Steer, M., and Franzon, P. D.: Demystifying 3D
ICs: The Pros and Cons of Going Vertical, IEEE Design & Test
of Computers, 22, 498–509, 2005.
Eshraghian, S., Lachowicz, S., and Eshraghian, K.: Ultra High
Bandwidth Image and Data Processing using 3-D Vertically Inte-
grated Architectures, Proceedings of the SCI 2003, Orlando, FL,
X, 189–195, 2003.
Haynes, S. D., Ferrari, A. B., and Cheung, P. Y. K.: Flexible Recon-
ﬁgurable Multiplier Blocks suitable for enhancing the Architec-
ture of FPGAs, Proceedings of the IEEE 1999 Custom Integrated
Circuits Conference, San Diego, CA, 191–194, 1999.
Hwang, K.: Computer Arithmetic – Principles, Architecture, and
Design, John Wiley and Sons, New York, 1979.
Joyner, J. W., Zarkesh-Ha, P. J., and Meindl, J. D.: Global
Interconnect Design in a Three-Dimensional System-on-chip, IEEE
Transactions on VLSI systems, 2004.
Pfänder, O. A. and Pfleiderer, H.-J.: Dynamische Rekonfiguration
von arithmetischen Einheiten auf Bitebene, Advances in Radio
Science 2004, Miltenberg, Germany, 319–323, 2004.
Pfänder, O. A., Lachowicz, S. W., and Pfleiderer, H.-J.: Flexible
Multiplier Blocks for Accelerated Processing in a 3D-SoftChip
Adaptive Computing System, Proceedings of the IFIP VLSI-SoC
2005, Perth, Western Australia, 485–491, 2005.
Xilinx: Virtex™-II Platform FPGAs and Product Specification,
Xilinx Document DS031 (v3.4), 2005.