Design and Implementation of a Radix-4 Complex Division Unit with Prescaling by Dormiani, Pouya et al.
Design and Implementation of a Radix-4 Complex
Division Unit with Prescaling
Pouya Dormiani, Milos Ercegovac, Jean-Michel Muller
To cite this version:
Pouya Dormiani, Milos Ercegovac, Jean-Michel Muller. Design and Implementation of a Radix-
4 Complex Division Unit with Prescaling. 20th IEEE International Conference on Application-
specific Systems, Architectures and Processors (ASAP’09), Jul 2009, Boston, United States.
IEEE Computer Society, 2009. <ensl-00379147v2>
HAL Id: ensl-00379147
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00379147v2
Submitted on 13 Jan 2010
Pouya Dormiani
Computer Science Department
University of California at Los Angeles
Los Angeles, CA 90024, USA
Email: pouya@cs.ucla.edu
Milosˇ D. Ercegovac
4731H Boelter Hall
Computer Science Department
University of California at Los Angeles
Los Angeles, CA 90024, USA
Email: milos@cs.ucla.edu
Jean-Michel Muller
CNRS-Laboratoire CNRS-ENSL-INRIA-UCBL LIP
Ecole Normale Supe´rieure de Lyon
46 Alle´e d’Italie
69364 Lyon Cedex 07, France
Email: Jean-Michel.Muller@ens-lyon.fr
Abstract—We present a design and implementation of a
radix-4 complex division unit with prescaling of the operands.
Specifically, we extend the treatment of the residual bound and
errors due to the use of truncated redundant representation. The
requirements for prescaling tables are simplified and a detailed
specification of the table design is given. All principal components
used in the design are described and the proposed optimizations
are explained. The target platform for implementation was an
Altera Stratix II FPGA [15] for which we report timing and area
requirements. For a precision of 36 bits, the implementation uses
1185 ALUTs, achieving a latency of 157 ns. The maximum clock
frequency is 173.49 MHz.
I. INTRODUCTION
Complex division is used in applications such as signal
processing (e.g., the complex SVD), multiantenna systems
(MIMO-type) [1], GPS [2], astronomy [3], and non-linear RF
measurement [4]. Unlike complex multiplication [10], [12],
complex division has commonly been implemented in software.
To improve its performance, a hardware implementation is
considered. With that objective, a hardware-oriented algorithm
and the corresponding theory for general radix-r complex-valued
division, based on digit recurrence, were introduced in [6].
A high-level design of a complex
divider is discussed in [7] without implementation details. In
this paper we focus on the design and implementation of a
radix-4 complex-valued division unit with the quotient-digit
set {−3, . . . , 3}. The operands and the result are in fractional
fixed-point form. We also refine some of the derivation results
from [6] to improve the implementation.
Specifically, with the dividend $z = z^R + i z^I$ and divisor
$d = d^R + i d^I$, $i = \sqrt{-1}$, the design discussed computes
$q = z/d$. A high-level description of the algorithm is

Initialization: $j = 0$
$$w[0] = z \tag{1}$$
Recurrence iterations: $j = 0, \ldots, n-1$
$$q_{j+1} = \mathrm{Sel}(4w[j], y) \tag{2}$$
$$w[j+1] = 4w[j] - q_{j+1}\, y \tag{3}$$
Result:
$$q = \frac{z}{d} = 0.q^R_1 q^R_2 q^R_3 \ldots q^R_n + i\, 0.q^I_1 q^I_2 q^I_3 \ldots q^I_n \tag{4}$$
The recurrence for complex division corresponds to the con-
ventional real-valued division discussed in [5] and similar
conditions such as the containment and continuity as well
as bounded residuals apply. The complex residual is
$w[j] = w^R[j] + i\,w^I[j]$. The quotient digits are
$q_{j+1} = q^R_{j+1} + i\,q^I_{j+1}$, with the real and imaginary
components $q^R_{j+1}, q^I_{j+1} \in \{-3, \ldots, 3\}$. These signed digits
can be converted during the iterations using on-the-fly conversion [5] to obtain
the conventional representation of the result. The complex residual
recurrence decomposes into two separate recurrences for the
real and imaginary parts, which can be computed in parallel:
$$w^R[j+1] = 4w^R[j] - q^R_{j+1} d^R + q^I_{j+1} d^I \tag{5}$$
$$w^I[j+1] = 4w^I[j] - q^R_{j+1} d^I - q^I_{j+1} d^R \tag{6}$$
where $w^R[0] = z^R$ and $w^I[0] = z^I$. The quotient-digit
selection in the complex domain is a two-dimensional problem
because both $q^R_{j+1}$ and $q^I_{j+1}$ must be selected in such a
way that the real and imaginary residuals $(w^R[j], w^I[j])$
remain bounded. This is much more difficult than the single-digit
selection used in the real case. We solve this problem by
scaling the operands by a factor $K$ such that $Kz/Kd = x/y$
where $y = Kd \approx 1$. Consequently, $y^R \approx 1$ and $y^I \approx 0$,
and the selection of $q^R_{j+1}$ and $q^I_{j+1}$ can be performed on
the real and the imaginary shifted residuals separately in a
manner similar to real-valued division selection. To determine
the prescaling factor $K$, we assume that
$$\|Kd - 1\|_\infty < s \tag{7}$$
where $\|\alpha\|_\infty = \max(|\alpha^R|, |\alpha^I|)$.
After the prescaling step the recurrences are
$$w^R[j+1] = 4w^R[j] - q^R_{j+1} y^R + q^I_{j+1} y^I \tag{8}$$
$$w^I[j+1] = 4w^I[j] - q^R_{j+1} y^I - q^I_{j+1} y^R \tag{9}$$
where $w^R[0] = x^R$ and $w^I[0] = x^I$. Because the scaling
makes $y^I \approx 0$ and $y^R \approx 1 - s$, the selection of the real part
of the quotient digit can be performed by rounding the shifted real
residual and taking its integer part. The imaginary part of the
quotient digit is selected similarly. Moreover, we
can use estimates with σ fractional positions of the shifted
residuals 4wR[j] and 4wI [j] in the selection. Consequently,
the residuals can be computed in redundant form to keep the
cycle time short. The selection functions are
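$$q^R_{j+1} = \mathrm{Sel}(\mathrm{est}(4w^R[j], \sigma)) = \mathrm{sign}(4w^R[j]) \times \left\lfloor |\mathrm{est}(4w^R[j], \sigma)| + \tfrac{1}{2} \right\rfloor \tag{10}$$
$$q^I_{j+1} = \mathrm{Sel}(\mathrm{est}(4w^I[j], \sigma)) = \mathrm{sign}(4w^I[j]) \times \left\lfloor |\mathrm{est}(4w^I[j], \sigma)| + \tfrac{1}{2} \right\rfloor \tag{11}$$
The selection function Sel satisfies
$$|\mathrm{Sel}(\mathrm{est}(x, \sigma)) - x| < \tfrac{1}{2} + 2^{-\sigma} \tag{12}$$
Here $\mathrm{est}(x, \sigma)$ is $x$ truncated to $\sigma$ fractional positions, with
an error bound
$$\mathrm{estERR}(x, \sigma) = |x - \mathrm{est}(x, \sigma)| < 2^{-\sigma}$$
If $x$ is in carry-save form $x = x_C + x_S$, then truncating the
carry and sum vectors to $\sigma + 1$ fractional bits results in the
same maximum error, i.e., $\mathrm{estERR}(x, \sigma) < 2^{-\sigma}$
and $\mathrm{estERR}(x_C, \sigma + 1) + \mathrm{estERR}(x_S, \sigma + 1) < 2^{-\sigma}$.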
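The selection by rounding in (10) to (12) can be illustrated with a small floating-point sketch. It is only a model under stated assumptions: the function names `est` and `sel` are chosen here for illustration, and truncation is modeled as a floor, which is what dropping low-order bits of a two's complement value does.

```python
import math

def est(x, sigma):
    """Truncate x to sigma fractional bits (floor, as two's complement truncation does)."""
    scale = 1 << sigma
    return math.floor(x * scale) / scale

def sel(x, sigma=4):
    """Quotient digit by rounding: sign(x) * floor(|est(x, sigma)| + 1/2)."""
    magnitude = math.floor(abs(est(x, sigma)) + 0.5)
    return magnitude if x >= 0 else -magnitude

# Sweep shifted-residual values on a fine grid and record the worst selection error.
sigma = 4
grid = [k / 1024 for k in range(-3 * 1024, 3 * 1024 + 1)]
worst = max(abs(sel(v, sigma) - v) for v in grid)
print(worst, "<", 0.5 + 2.0 ** -sigma)
```

On this grid the worst-case error stays strictly below 1/2 + 2^-4, in line with (12); the sweep is a spot check, not a proof.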
Using (10), (11) and (12), a bound on the residual is deduced
which ensures that the digit $(q^R_{j+1}, q^I_{j+1})$ selected by rounding
is in the digit set $\{-3, \ldots, 3\}$. Namely,
$$\|w[j]\|_\infty \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-\sigma}\right) \tag{13}$$
As shown in [6], assuming that the scaling error is $s$ and
$a = 3$, the residual is bounded by
$$\|w[j]\|_\infty < 2 \times 3 \times s + \frac{1}{2} + 2^{-\sigma} \tag{14}$$
Consequently,
$$6s + \frac{1}{2} + 2^{-\sigma} \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-\sigma}\right) \tag{15}$$
Satisfying this condition guarantees convergence of the digit-recurrence
algorithm and allows the choice of $s$ and $\sigma$ to
optimize the implementation characteristics.
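Condition (15) can be solved for the largest admissible scaling error s once σ is fixed. The short sketch below does this with exact rational arithmetic for radix 4 and a = 3, as in the text; the function name `max_scaling_error` is an illustrative choice.

```python
from fractions import Fraction

def max_scaling_error(sigma):
    """Largest s allowed by 6s + 1/2 + 2^-sigma <= (3 + 1/2 + 2^-sigma) / 4."""
    eps = Fraction(1, 2 ** sigma)
    selection_error = Fraction(1, 2) + eps              # right-hand side of (12)
    residual_limit = (3 + Fraction(1, 2) + eps) / 4     # right-hand side of (13)
    return (residual_limit - selection_error) / 6

for sigma in (3, 4, 5, 6):
    print(sigma, max_scaling_error(sigma))
```

For σ = 4 this gives s ≤ 7/128, so the prescaled divisor has to lie within roughly 0.055 of 1 in both components.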
II. DESIGN
The design of the complex division unit consists of several
components: the prescaling module, the recurrence modules
for the real and imaginary parts, the on-the-fly converters to
obtain conventional representations, and a simple controller.
A high level block diagram of the design is shown in Fig. 1
with the timing shown in Fig. 2. The prescaling module in
Fig. 1 performs a ROM look-up using a short estimate of
the value of the divisor d as an address, in which the ROM
stores K = 1/d. It then computes the complex product Kz,
which is used to initialize w[0] in the recurrence modules. The
prescaling module computes Kd in parallel to the initialization
of the recurrence modules, which is then used to perform the
iterations of the recurrence. The initial delay of the module
to perform prescaling can be amortized by overlapping the
prescaling of the next operation with the digit-recurrence iterations
of the current operation; this, however, has not been done
in the current implementation. The detailed design of the prescaling
is discussed in Section II-A.

Fig. 1. High-level block diagram of the complex division unit.

Fig. 2. Timing relationships between modules.
The two recurrence modules (one for the real recurrence
and one for the imaginary) perform nearly identical operations
which can be mapped to the same hardware. Detailed design
of the recurrence module is discussed in Section II-C.
A. Prescaling
Prescaling consists of several steps: obtaining the factor K
from a table based on a short-precision estimate of d, and
computing Kz and Kd.
We define a function rnd(a, b) which returns a rounded
value of a to b fractional places, such that $|a - \mathrm{rnd}(a, b)| \le \frac{1}{2} 2^{-b}$.
The factor K can be determined by using a short estimate of
d to q fractional positions, i.e., rnd(dR, q), rnd(dI , q), as an
address to a ROM which stores the corresponding values of
K with a precision of t fractional positions,
KR = rnd(1/rnd(dR, q), t)
KI = rnd(1/rnd(dI , q), t)
Error analysis for the choices of parameters q and t is performed
in [13]. These choices affect s, used in (15) to guarantee
convergence of the algorithm. Among radices 4, 8, and 16,
radix 4 with digit set {−3, . . . , 3} offers the most favorable
choice of parameters, minimizing the number of bits required
for the ROM; only radix 2 has lower memory requirements.
Over-redundant digit sets are another design choice, but we
decided to restrict our design to the maximally redundant digit
set, which allows faithful rounding [6].
TABLE I
MEMORY (ROM) REQUIREMENTS FOR DIFFERENT RADICES (r), DIGIT SETS
{−a, . . . , a}, PRECISION OF THE RESIDUAL ESTIMATE FOR SELECTION (σ),
PRECISION OF d USED TO PERFORM TABLE LOOK-UP (q), AND PRECISION
OF TABLE ENTRIES (t).

| r     | a | σ | q | t | Kbits (approx.) |
| 2     | 1 | 4 | 5 | 5 | 7.5             |
| 4     | 2 | 5 | 7 | 7 | 146             |
| 4     | 3 | 4 | 6 | 6 | 33              |
| 8, 16 | see [13]      |||| ≥ 146           |
The value of the divisor is in the usual range
$$\frac{1}{2} \le \|d\|_\infty < 1 \tag{16}$$
noting that larger values can be scaled to this range. Its
estimate rnd(d, q) can be represented as two two's complement
numbers for the real and imaginary parts
$$\mathrm{rnd}(d, q) = \mathrm{rnd}(d^R, q) + i\,\mathrm{rnd}(d^I, q) \tag{17}$$
$$\mathrm{rnd}(d^R, q) = \kappa^R_0 . \kappa^R_1 \kappa^R_2 \kappa^R_3 \ldots \kappa^R_{q-1} \kappa^R_q \tag{18}$$
$$\mathrm{rnd}(d^I, q) = \kappa^I_0 . \kappa^I_1 \kappa^I_2 \kappa^I_3 \ldots \kappa^I_{q-1} \kappa^I_q \tag{19}$$
An additional bit $\kappa_{-1}$ is required (to represent +1) as
$\|\mathrm{rnd}(d, q)\|_\infty \le 1$, which will be handled as a special case.
To reduce the number of address bits, the table can store
corresponding values for $|\mathrm{rnd}(d^R, q)|$ and $|\mathrm{rnd}(d^I, q)|$,
$$|\mathrm{rnd}(d^R, q)| = 0.\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \tag{20}$$
$$|\mathrm{rnd}(d^I, q)| = 0.\alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \tag{21}$$
which eliminates the need for the sign bits $\kappa^R_0$ and $\kappa^I_0$
when forming an address. Likewise, since $\|d\|_\infty \ge \frac{1}{2}$
we know that either $\alpha^R_1 = 1$ or $\alpha^I_1 = 1$ [6] (or both). Had an
address been formed using
$$\alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \; \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$
then the address would require 2q bits. Given that
$$\gamma(d^R, d^I) = \frac{1}{d^R + i d^I} = \frac{d^R - i d^I}{(d^R)^2 + (d^I)^2} \tag{22}$$
$$\gamma(d^I, d^R) = -\gamma(d^R, d^I) \tag{23}$$
we could check if $\alpha^R_1 = 1$; if so, the address is formed
via
$$\alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \; \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$
otherwise, it must be true that $\alpha^I_1 = 1$, so the address is formed
as
$$\alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \; \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q$$
and the results obtained from the table look-up are negated
based on (23). This reduces the number of address bits to
2q − 1 (halving the memory required) while introducing little
additional overhead.
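The address formation described above can be sketched as follows. This is a minimal illustration under assumptions: the magnitudes are passed as integers holding the q fractional bits α1..αq, and the function name `form_address` and its `swapped` flag are hypothetical. The numeric check at the end shows how the reciprocal behaves when the two parts of d are exchanged: the parts of 1/d are exchanged as well and both change sign, which is one way to read the negation step based on (23).

```python
def form_address(a_r, a_i, q=6):
    """Build the (2q-1)-bit ROM address from |rnd(dR,q)| and |rnd(dI,q)|.

    a_r and a_i hold the q fractional bits alpha_1..alpha_q as integers
    (alpha_1 is the most significant bit, weight 1/2).
    Returns (address, swapped).
    """
    top = 1 << (q - 1)          # mask for alpha_1
    low = top - 1               # mask for alpha_2..alpha_q
    if a_r & top:               # alpha^R_1 = 1: real part leads
        return ((a_r & low) << q) | a_i, False
    if a_i & top:               # otherwise alpha^I_1 must be 1
        return ((a_i & low) << q) | a_r, True
    raise ValueError("expected ||d|| >= 1/2, so one leading bit must be set")

addr, swapped = form_address(0b110000, 0b001010)
addr2, swapped2 = form_address(0b001010, 0b110000)
print(addr, swapped, addr2, swapped2)

# Numeric check: exchanging the parts of d exchanges and negates the parts of 1/d.
d = complex(0.75, 0.15625)
g = 1 / d
g_swapped = 1 / complex(d.imag, d.real)
print(g, g_swapped)
```

Both calls map to the same table entry, so only one of each mirrored pair of divisors needs to be stored.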
Extra care must be taken with the aforementioned approach.
Although the divisor is assumed to be
bounded by ‖d‖∞ < 1, it is not true that
−1 < rnd(dR, q) < 1; in fact −1 ≤ rnd(dR, q) ≤ 1 (the same
holds for rnd(dI , q)). The two's complement representation
of the rounded divisor shown in equations (18) and (19)
has range [−1, 1). Negating −1 in two's complement with
the given representation is a special case; recalling that +1
is also a special case, the input is divided into two cases:
‖rnd(d, q)‖∞ < 1 and ‖rnd(d, q)‖∞ = 1.
Another special case occurs when negating the results
obtained from the table look-up due to the swapping discussed
earlier. The real part of 1/d is positive for positive values of dR,
and the imaginary part of 1/d is negative for positive values of
dI . Since 1/2 ≤ ‖rnd(d, q)‖∞ ≤ 1 and the table only stores
values for positive dR and dI , we have 0 ≤ KR ≤ 2 and
−2 ≤ KI ≤ 0. Therefore the table should only contain the
magnitude of the value, which can be represented in 2 + t
bits. This presents no anomalies if 3 integer bits are used
for the negated values, i.e., the ROM will store 2 + t bits, but
the negated value will be 3 + t bits.
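Here we describe the operation of the table incorporating
the special cases,
$$\mathrm{rnd}(d^R, q) = \kappa^R_{-1} \kappa^R_0 . \kappa^R_1 \kappa^R_2 \kappa^R_3 \ldots \kappa^R_q$$
$$\mathrm{rnd}(d^I, q) = \kappa^I_{-1} \kappa^I_0 . \kappa^I_1 \kappa^I_2 \kappa^I_3 \ldots \kappa^I_q$$
$$A^R = |\mathrm{rnd}(d^R, q)| = \alpha^R_0 . \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q$$
$$A^I = |\mathrm{rnd}(d^I, q)| = \alpha^I_0 . \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q$$
$$A = \begin{cases} \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q \, \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q & \text{if } \alpha^R_1 = 1, \\ \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q \, \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q & \text{otherwise} \end{cases}$$
$$A_s = \begin{cases} \alpha^I_1 \alpha^I_2 \alpha^I_3 \ldots \alpha^I_q & \text{if } A^R = \pm 1, \\ \alpha^R_1 \alpha^R_2 \alpha^R_3 \ldots \alpha^R_q & \text{otherwise} \end{cases}$$
$$(U^R, U^I) = \begin{cases} (1/2, -1/2) & \text{if } A^R = 1, A^I = 1, \\ (1/2, 1/2) & \text{if } A^R = 1, A^I = -1, \\ (-1/2, -1/2) & \text{if } A^R = -1, A^I = 1, \\ (-1/2, 1/2) & \text{if } A^R = -1, A^I = -1, \\ \mathrm{ROM}_s[A_s] & \text{if } A^R = \pm 1 \text{ and } A^I \ne \pm 1, \text{ or } A^I = \pm 1 \text{ and } A^R \ne \pm 1, \\ \mathrm{ROM}[A] & \text{otherwise} \end{cases}$$
$$neg^R = \begin{cases} 1 & \text{if real and imaginary swapped,} \\ 0 & \text{otherwise} \end{cases}$$
$$neg^I = \begin{cases} 1 & \text{if real and imaginary not swapped,} \\ 0 & \text{otherwise} \end{cases}$$
$$K^R = (-1)^{neg^R} U^R$$
$$K^I = (-1)^{neg^I} U^I$$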
From Table I, we have q = 6 and t = 6 for radix r = 4 and
a = 3.
Fig. 3. Prescaling ROM. The ABS block computes the absolute value of a
two's complement number. Blocks rnd(., 6) round their argument to the sixth
fractional position. NEG blocks negate their argument, a two's complement
number.
• ROM : This ROM has 11 address bits and is 16 bits
wide, which can be mapped to 8 Altera Stratix II M4K
RAM blocks, constituting less than one percent of the
total block memory bits in an EP2S60F672C3 device.
• ROMS : This ROM has 6 address bits and is 16 bits
wide, which was mapped to logic and registers.
A schematic corresponding to the described look-up scheme
is shown in Fig. 3.
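The ROM dimensions listed above can be cross-checked against Table I with a few lines of arithmetic. The sketch assumes that each word holds two (2 + t)-bit magnitudes (one for the real and one for the imaginary part), that the main ROM has 2q − 1 address bits and the special-case ROM has q address bits, and that an M4K block offers 4096 data bits; the function name `rom_bits` is illustrative.

```python
def rom_bits(q, t):
    """Total table bits, assuming two (2 + t)-bit magnitudes per word."""
    word = 2 * (2 + t)                 # U^R and U^I magnitudes
    main_words = 2 ** (2 * q - 1)      # ROM: 2q - 1 address bits
    special_words = 2 ** q             # ROM_s: q address bits
    return (main_words + special_words) * word

for r, a, q, t in [(2, 1, 5, 5), (4, 2, 7, 7), (4, 3, 6, 6)]:
    print(r, a, rom_bits(q, t) / 1024, "Kbit")

# Main ROM for q = t = 6, in M4K blocks (4096 data bits per block is an assumption).
m4k_blocks = (2 ** 11 * 16) // 4096
print(m4k_blocks)
```

Under these assumptions the totals come out at about 7.4, 146.3 and 33 Kbit, close to the approximate figures in Table I, and the main ROM fills 8 blocks.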
The other two parts of the prescaling step involve computing
Kz and Kd, which will be used to initialize and carry out the
digit-recurrence algorithm. Once K is determined, x and y can
be computed via
$$x = (K^R + iK^I)(z^R + iz^I) = (K^R z^R - K^I z^I) + i(K^I z^R + K^R z^I)$$
$$y = (K^R + iK^I)(d^R + id^I) = (K^R d^R - K^I d^I) + i(K^I d^R + K^R d^I)$$
Since multipliers are costly in hardware, the complex valued
products will be computed one at a time. Coincidentally,
y = Kd is not required until after the residuals have been
initialized with x = Kz, which can be computed in the
previous cycles. Figure 4 shows the block diagram for the
scaling module. The module uses several signals to control the
data path: eninputs, enpres, ensc, and selmul. Control signals
enx are clock enable signals to registers to control when data
is latched. Clock enables on registers are used to facilitate
multi-cycle paths which are necessary due to the larger delay
of the prescaling logic.
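The effect of the table parameters on the scaling error can be explored numerically. The sketch below is a floating-point model under assumptions: K is taken as the complex reciprocal of the rounded divisor with each part rounded to t bits, q = t = 6 as in Table I, and the helper names are illustrative. It sweeps divisors in the range (16) and reports the largest value of ‖Kd − 1‖∞ found.

```python
def rnd(v, bits):
    """Round v to `bits` fractional positions (error at most 2^-(bits+1))."""
    scale = 1 << bits
    return round(v * scale) / scale

def prescale_factor(d, q=6, t=6):
    """K: reciprocal of the rounded divisor, each part rounded to t bits."""
    d_hat = complex(rnd(d.real, q), rnd(d.imag, q))
    k = 1 / d_hat
    return complex(rnd(k.real, t), rnd(k.imag, t))

def scaling_error(d):
    """Infinity norm of K*d - 1."""
    y = prescale_factor(d) * d
    return max(abs(y.real - 1), abs(y.imag))

# Sweep divisors with 1/2 <= ||d||_inf < 1 on a grid finer than the table input.
steps = [k / 128 for k in range(-127, 128)]
worst = max(scaling_error(complex(re, im))
            for re in steps for im in steps
            if max(abs(re), abs(im)) >= 0.5)
print(worst, "<", 7 / 128)
```

The value 7/128 is the largest s that satisfies (15) for σ = 4; the sweep is a spot check on a grid, not a replacement for the error analysis in [13].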
In Fig. 4, eninputs controls when the inputs to the complex
division unit are latched so that the values are retained
throughout the course of the operation. This arrangement is not
unique and depends on how the module is interfaced
to other logic. For example, if the external logic feeds the
arguments to the complex division unit in two cycles: sending
(zR, zI) in the first and (dR, dI) in the second, then only 2
register banks are required for the inputs as opposed to 4. The
current design reflects the assumption that the module receives
its arguments in the same cycle, i.e., as (zR, zI , dR, dI).
Signal enpres controls storing of the results of the prescaling
ROM look-up, retained throughout the course of the operation.
Signals selmul and ensc are used to share the multipliers
so that prescaling of the dividend and the divisor occurs
in separate prescaling cycles. Although the prescaled value
x = Kz is also fed through the registers controlled by ensc,
its value is not retained but over-written in the next cycle by
y = Kd. The same enable signal (ensc) is used once more to
assure that the value of y is retained in these registers which
feed the recurrence modules discussed in Section II-C.
B. Bounds of Values
It is important to characterize the bounds of the inputs to
the complex division module in addition to the bounds of the
prescaled values which predetermine the width of inputs to
the recurrence modules.
The input d is in the range $1/2 \le \|d\|_\infty < 1$,
and our convergence analysis further constrains the prescaled divisor by
$\|Kd - 1\|_\infty < s$. This implies that the prescaled value y
satisfies
$$\max(|y^R - 1|, |y^I|) < s \;\Rightarrow\; |y^R - 1| < s, \quad |y^I| < s$$
Since $|y^R| < 1 + s$, its representation in two's complement
would require 2 integer bits and n fractional bits.
Fig. 4. Prescaling module. The Prescaling ROM block is the module
shown in Fig. 3.
Likewise, the constraint (13) determines the maximum value
that the residual could possibly take. For our design point
σ = 4, which means that the residual is bounded by
$$\|w[j]\|_\infty \le \frac{1}{4}\left(3 + \frac{1}{2} + 2^{-4}\right) = 57/64 \;\Rightarrow\; |w^R[0]| = |x^R| \le 57/64, \quad |w^I[0]| = |x^I| \le 57/64$$
Therefore, the prescaled value $(x^R, x^I)$ requires only a single
integer bit and n fractional bits. We are interested in determining
a bound on z, which we can derive from the bound on w,
$$\|w[0]\|_\infty = \|Kz\|_\infty \le 2\|K\|_\infty \|z\|_\infty \le 57/64 \tag{24}$$
Since $\|K\|_\infty \le 2$, we require
$$\|z\|_\infty \le 57/256 \tag{25}$$
which needs only n − 1 fractional bits, with the most significant bit
having weight $2^{-2}$.
C. Digit-Recurrence Iterations
The digit-recurrence iterations compute the residuals (5) (6)
and perform quotient-digit selection based on a short non-
redundant estimate of the residuals as shown in Eq. (10) and
(11).
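The whole flow (prescaling, selection by rounding, and the residual recurrences) can be modeled end to end in floating point. This is a behavioral sketch only, not the carry-save datapath of the paper: K is assumed to be the complex reciprocal of the rounded divisor with q = t = 6, truncation is modeled with a floor, and all function names are illustrative.

```python
import math

def rnd(v, bits):
    """Round v to `bits` fractional positions."""
    scale = 1 << bits
    return round(v * scale) / scale

def sel(v, sigma=4):
    """Digit by rounding a truncated estimate of the shifted residual."""
    estimate = math.floor(v * (1 << sigma)) / (1 << sigma)
    magnitude = math.floor(abs(estimate) + 0.5)
    return magnitude if v >= 0 else -magnitude

def complex_divide(z, d, n=16, q=6, t=6):
    """Radix-4 digit recurrence on prescaled operands; returns (quotient, digits)."""
    d_hat = complex(rnd(d.real, q), rnd(d.imag, q))
    k = 1 / d_hat
    k = complex(rnd(k.real, t), rnd(k.imag, t))   # prescaling factor K
    x, y = k * z, k * d                           # x = Kz, y = Kd (close to 1)
    w = x                                         # w[0] = x
    quotient, digits = 0j, []
    for j in range(n):
        digit = complex(sel(4 * w.real), sel(4 * w.imag))
        w = 4 * w - digit * y                     # recurrences (8) and (9)
        quotient += digit * 4.0 ** -(j + 1)
        digits.append((int(digit.real), int(digit.imag)))
    return quotient, digits

z, d = complex(0.2, -0.1), complex(0.75, 0.3)
quotient, digits = complex_divide(z, d)
print(quotient, z / d)
print(digits[:4])
```

The inputs are chosen to respect the bounds of Section II-B (‖z‖∞ ≤ 57/256 and 1/2 ≤ ‖d‖∞ < 1). After n iterations the accumulated digits differ from z/d by about 4^-n times the final residual.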
The recurrences in (5) and (6) are structurally the same.
Namely,
w[j + 1] = 4w[j] + σ1yR + σ2yI (26)
Fig. 6. Real recurrence module. Blocks ×4 shift their argument left by
2 binary places. Blocks MG compute σ times their argument using the $\sigma^i_k$
decomposition discussed. The CPA module is a carry-propagate adder which
computes a short non-redundant estimate of the residual. The Sel module
takes this estimate as its argument and outputs the next quotient digit.

The residuals are computed in redundant form in order to
reduce the cycle time by eliminating the need for long carry
chains. In our implementation we used a carry-save form. The
operation is expressed as
$$(w_C[j+1], w_S[j+1]) = \mathrm{ADD}_{[6:2]}(4w_C[j],\, 4w_S[j],\, \sigma^1_1 y^R,\, 2\sigma^2_1 y^R,\, \sigma^1_2 y^I,\, 2\sigma^2_2 y^I) \tag{27}$$
where $\mathrm{ADD}_{[6:2]}(a, b, c, d, e, f)$ is a [6 : 2] carry-save adder
taking 6 inputs and producing a carry vector and a sum vector,
shown in Fig. 5. The digits $\sigma_1$ and $\sigma_2$ are in the digit set
{−3, . . . , 3}, so we implement this digit multiplication by
decomposing $\sigma_k = 2\sigma^2_k + \sigma^1_k$ where $\sigma^i_k \in \{-1, 0, 1\}$.
Multiplying by negative one is achieved by inverting the input and
adding a carry-in to the reduction module. A block diagram
of the structure used to compute the real recurrence is shown
in Fig. 6.
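The digit decomposition and the [6 : 2] reduction in (27) can be mimicked on integers. The sketch is an arithmetic model under assumptions: Python's signed integers stand in for the inversion-plus-carry-in mechanism used in hardware, the 6-to-2 reduction is built from four generic 3:2 steps rather than the slice structure of Fig. 5, and the names and sample values are illustrative.

```python
def decompose(digit):
    """Split a digit in {-3..3} as 2*hi + lo with hi, lo in {-1, 0, 1}."""
    sign = -1 if digit < 0 else 1
    mag = abs(digit)
    return sign * (mag >> 1), sign * (mag & 1)

def csa(a, b, c):
    """3:2 carry-save step on integers: a + b + c == sum_bits + carry_bits."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def add_6_2(a, b, c, d, e, f):
    """Reduce six operands to a (carry, sum) pair with four 3:2 steps."""
    s1, c1 = csa(a, b, c)
    s2, c2 = csa(d, e, f)
    s3, c3 = csa(s1, c1, s2)
    s4, c4 = csa(s3, c3, c2)
    return c4, s4

# One recurrence step on fixed-point integers with F fractional bits.
F = 12
y_r, y_i = round(1.003 * 2 ** F), round(0.007 * 2 ** F)
w_c, w_s = 300, -1200                   # carry and sum vectors of the residual
sigma1, sigma2 = -3, 2                  # signed digit multipliers as in (26)
hi1, lo1 = decompose(sigma1)
hi2, lo2 = decompose(sigma2)
carry, total = add_6_2(4 * w_c, 4 * w_s,
                       lo1 * y_r, 2 * hi1 * y_r, lo2 * y_i, 2 * hi2 * y_i)
print(carry + total, 4 * (w_c + w_s) + sigma1 * y_r + sigma2 * y_i)
```

The carry and sum outputs are never added inside the loop; their sum is formed here only to confirm that the redundant pair represents the expected residual.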
Fig. 5. [6 : 2] Adder module. The adder consists of three different slices: the least significant slice, which sums 6 arguments and takes 4 carry-ins, the repeat
slice which sums 6 arguments and takes 4 lateral carries and produces 4 lateral carries to the subsequent slice, and the most significant slice.

Digit selection is performed by taking a short-precision
estimate of the residual and rounding it to the nearest integer
via a small CPA and table. In the discussion that
follows we generally say residual without referring specifically
to the real or imaginary part; the analysis holds for
both residuals $w^R$ and $w^I$. In Section II-B we determined
that the residual has a single integer bit and n fractional
bits, i.e., it is of the form $w = w_0.w_1 w_2 \ldots w_n$ with
value $\sum_{i=0}^{n} w_i 2^{-i}$. In redundant form $w = w_c + w_s$ where
$(w_c, w_s) = (C_0.C_1 C_2 C_3 \ldots C_n,\; S_0.S_1 S_2 S_3 \ldots S_n)$. Recalling
that selection is performed via
$$q_{j+1} = \mathrm{Sel}(\mathrm{est}(4w[j], \sigma))$$
where σ = 4 as determined in Section II-A, we know that
$$\mathrm{est}(4w[j], 4) = w_0 w_1 w_2 . w_3 w_4 w_5 w_6 = \sum_{i=0}^{6} w_i 2^{-i+2}$$
$$\mathrm{estERR}(4w[j], 4) < 2^{-4}$$
Now, since w is in redundant form,
$$\mathrm{est}(4w_c[j], 5) = C_0 C_1 C_2 . C_3 C_4 C_5 C_6 C_7$$
$$\mathrm{est}(4w_s[j], 5) = S_0 S_1 S_2 . S_3 S_4 S_5 S_6 S_7$$
$$g = \mathrm{est}(4w_c[j], 5) + \mathrm{est}(4w_s[j], 5) \tag{28}$$
$$\mathrm{estERR}(4w_c[j], 5) + \mathrm{estERR}(4w_s[j], 5) < 2^{-4}$$
which gives us the short-precision estimate of the residual g.
It is important to realize that $g \ne \mathrm{est}(4w[j], 4)$ in general,
but that they commit the same maximum error $2^{-4}$ in their
approximation of $4w[j]$. The addition in equation (28) requires
the CPA that we have been referring to during this discussion.
$$g_{-2} g_{-1} g_0 . g_1 \ldots g_5 = \mathrm{CPA}(C_0 C_1 C_2 . C_3 \ldots C_7,\; S_0 S_1 S_2 . S_3 \ldots S_7) \tag{29}$$
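The claim that truncating the carry and sum vectors to 5 fractional bits keeps the estimate within 2^-4 of the full value can be spot-checked on random fixed-point data. The word length F = 16 and the helper name `truncate` are assumptions of this sketch, and truncation is modeled as an arithmetic right shift.

```python
import random

F = 16  # fractional bits of the full shifted residual in this model (assumed)

def truncate(v, keep):
    """Drop all but `keep` fractional bits of a fixed-point integer (floor)."""
    drop = F - keep
    return (v >> drop) << drop

random.seed(1)
worst = 0
for _ in range(10000):
    c = random.randrange(-(1 << F), 1 << F)   # carry vector of 4w
    s = random.randrange(-(1 << F), 1 << F)   # sum vector of 4w
    g = truncate(c, 5) + truncate(s, 5)       # equation (28)
    worst = max(worst, (c + s) - g)
print(worst / 2 ** F, "<", 2 ** -4)
```

Each truncation discards less than 2^-5, so the combined error stays below 2^-4, the same bound as truncating a non-redundant value to 4 fractional bits.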
To round g and take the integer part, one can use a small
table such as Table II by introducing an additional variable
$g_z = g_2 + g_3 + g_4 + g_5$ (i.e., the logical OR of bits $g_2$ through $g_5$).
This table is a function of 5 bits and produces three bits of
output (for the encoding of $q_{j+1}$) and maps efficiently to
LUTs.

TABLE II
ROUNDING TO INTEGER PART.

| g−2 | g−1 | g0 | g1 | gz | qj+1 |
| 0   | 0   | 0  | 0  | -  | 0    |
| 0   | 0   | 0  | 1  | -  | 1    |
| 0   | 0   | 1  | 0  | -  | 1    |
| 0   | 0   | 1  | 1  | -  | 2    |
| 0   | 1   | 0  | 0  | -  | 2    |
| 0   | 1   | 0  | 1  | -  | 3    |
| 0   | 1   | 1  | 0  | -  | 3    |
| 1   | 0   | 0  | 1  | 1  | -3   |
| 1   | 0   | 1  | 0  | -  | -3   |
| 1   | 0   | 1  | 1  | 0  | -3   |
| 1   | 0   | 1  | 1  | 1  | -2   |
| 1   | 1   | 0  | 0  | -  | -2   |
| 1   | 1   | 0  | 1  | 1  | -1   |
| 1   | 1   | 1  | 0  | -  | -1   |
| 1   | 1   | 1  | 1  | 1  | 0    |
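The rows of Table II can be compared against the rounding rule of (10) and (11). In the sketch below the estimate g is rebuilt from its bits, with any nonzero tail g2..g5 represented by the value 1/4 (an assumption that is harmless because every nonzero tail below 1/2 rounds the same way); the function name and the row encoding are illustrative.

```python
import math

def digit_from_bits(gm2, gm1, g0, g1, gz):
    """Round the two's complement estimate g_-2 g_-1 g_0 . g_1 ... to a digit."""
    # g_-2 carries the negative weight; a nonzero tail is modeled as 1/4.
    g = -4 * gm2 + 2 * gm1 + g0 + g1 / 2 + (0.25 if gz else 0.0)
    magnitude = math.floor(abs(g) + 0.5)
    return magnitude if g >= 0 else -magnitude

# Rows of Table II: (g_-2, g_-1, g_0, g_1, g_z or None for "don't care", digit).
TABLE_II = [
    (0, 0, 0, 0, None, 0), (0, 0, 0, 1, None, 1), (0, 0, 1, 0, None, 1),
    (0, 0, 1, 1, None, 2), (0, 1, 0, 0, None, 2), (0, 1, 0, 1, None, 3),
    (0, 1, 1, 0, None, 3), (1, 0, 0, 1, 1, -3), (1, 0, 1, 0, None, -3),
    (1, 0, 1, 1, 0, -3), (1, 0, 1, 1, 1, -2), (1, 1, 0, 0, None, -2),
    (1, 1, 0, 1, 1, -1), (1, 1, 1, 0, None, -1), (1, 1, 1, 1, 1, 0),
]

mismatches = [row for row in TABLE_II
              for gz in ((0, 1) if row[4] is None else (row[4],))
              if digit_from_bits(*row[:4], gz) != row[5]]
print(mismatches)
```

Every listed row agrees with the rounding rule. Input patterns that do not appear in the table are not evaluated here.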
D. Optimizing the Recurrence Implementation
A straightforward implementation of the recurrence is
shown in Fig. 7. There are several opportunities for its
optimization:
• Since the residual is in the range (−57/64, 57/64),
only one integer bit is required to store the value of the
residual. Based on this observation there is no need to
find the sum of bits with weight greater than $2^0 = 1$.
• The recurrence implementation can be optimized in the
most significant bits by using the non-redundant value
computed for selection in the adder instead of the redundant
form stored in the registers. Although adding
a short CPA delay to the most significant bits seems
counter-intuitive as an optimization, it turns out that for all
fitting attempts to the Stratix II architecture this path was
not the critical path; paths with routing delays dominated
the critical path (short carry chains do not exhibit routing
delays, as there are dedicated carry paths in Adaptive
Logic Modules [15]). Since using the non-redundant
portion did not introduce a new critical path and reduced
the number of input bits, it served as a pragmatic optimization
technique. The non-redundant approximation g computed
for selection can be used in the addition as opposed to
using the [6 : 2] adders. This simplifies the [6 : 2] adder
to a [5 : 2] adder requiring 4 lateral carries (as opposed
to a conventional [5 : 2] adder, which only requires 3
lateral carries), which we denote as $[5 : 2]_4$; the lateral
carries come from the previous [6 : 2] adder. The interface
between the [6 : 2] adder, the $[5 : 2]_4$ adder and the XOR
slice is shown in Fig. 9.
• The $[5 : 2]_4$ adder produces both $s^{j+1}_1$ and $c^{j+1}_0$; it is
unnecessary to produce $s^{j+1}_0$ with the same module since
we will discard $c^{j+1}_{-1}$. Bit $s^{j+1}_0$ is just the sum modulo 2
of all bits of weight 1 plus the lateral carries, which can
be computed via exclusive-ORs (XOR).
Applying all mentioned optimizations we get an improved
design shown in Fig. 8.

Fig. 7. First implementation of recurrence reduction. Each rectangular box
represents some functional block where the bits inside show the inputs to
that block and the bits beneath show the corresponding outputs. There are 5
bits produced by the [6 : 2] adder in this figure which have a shaded square
background to signify that these output bits don't drive any logic and are left
"open".
III. DESIGN METHODOLOGY AND RESULTS
A. Methodology
The proposed designs were written at the RTL level us-
ing VHDL and simulated for functional correctness with
Modelsim-Altera Edition 8.1. They were mapped to an Altera
Stratix II architecture using Quartus II 8.1 flow tools. The
Quartus Classic Timing Analyzer was used to determine the
timing characteristics of the circuit in addition to placing
constraints on ROM look-up and prescaling registers to inform
the tool of multi-cycle paths.
The multiplications performed in the prescaling module map
to the Altera DSP blocks for precisions up to 36 bits; these
blocks support up to 36 × 36 multiplication. It does not
really make sense to go beyond this precision, as the current
design choice is targeted for an architecture which supports
fast multipliers. For larger precisions it seems more sensible
to design an efficient custom rectangular multiplier.

Fig. 8. Optimized implementation of recurrence reduction. The effective
reduction is visualized: the shaded circles signify input bits that were removed
as they were deemed unnecessary.
B. Implementation Area and Delay Characteristics
The results show the number of ALUTs (Adaptive LUTs
[15]), of which there are two in every ALM (Adaptive Logic
Module), the basic building block for logic in Altera Stratix
II devices. The DSP blocks on Stratix II architectures support
either eight 9×9 multiplies, four 18×18 multiplies or one 36×36
multiply. The proposed design is limited by the availability
of multiplication units, and therefore we report
results for only two design points, one utilizing a single DSP block
with four 18 × 18 multipliers and the other using four DSP
blocks each performing a 36 × 36 multiply. The results include
on-the-fly conversion costs.
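The latency and frequency rows of Table III follow from the cycle counts and critical paths listed there. The short calculation below reproduces them; the function name `latency_ns` is illustrative, and the default cycle counts are the look-up and prescaling entries of the table.

```python
def latency_ns(iterations, critical_path_ns, lookup_cycles=3, prescale_cycles=2 * 4):
    """Total latency: (look-up + prescaling + iteration cycles) * clock period."""
    cycles = lookup_cycles + prescale_cycles + iterations
    return cycles * critical_path_ns

for bits, iters, tcp in [(16, 8, 5.685), (36, 16, 5.764)]:
    print(bits, round(latency_ns(iters, tcp)), "ns,", round(1000 / tcp, 2), "MHz")
```

With 11 prescaling cycles in both cases, the 16-bit unit takes 19 cycles and the 36-bit unit 27 cycles, which gives about 108 ns and 156 ns at the reported critical paths.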
Fig. 9. Interface of the [6 : 2] adder, $[5 : 2]_4$ adder and the XOR slice.

TABLE III
RESULTS FOR PRECISION 16 AND 36 COMPLEX DIVISION UNITS
IMPLEMENTED ON AN ALTERA STRATIX II FPGA.

| Precision [bits]              | 16     | 36     |
| ALUTs                         | 566    | 1185   |
| DSP Block (9-bit elements)    | 8      | 36     |
| Registers (FFs)               | 318    | 598    |
| M4K RAM blocks                | 8      | 8      |
| Critical path (ns)            | 5.685  | 5.764  |
| Max. frequency (MHz)          | 175.90 | 173.49 |
| Prescaling look-up (cycles)   | 3      | 3      |
| Prescaling (cycles)           | 2*4    | 2*4    |
| Total prescaling (cycles)     | 11     | 11     |
| Iterations (cycles)           | 8      | 16     |
| Total time (latency) (ns)     | 108    | 156    |

The most common scenario we foresee a designer facing
when determining the usefulness of a complex division unit
is a comparison of its performance with a software-based solution.
One such software solution, presented in [14], is based on the
following,
$$\frac{a + jb}{c + jd} = \begin{cases} \dfrac{a + b(d/c)}{c + d(d/c)} + j\,\dfrac{b - a(d/c)}{c + d(d/c)} & \text{if } |c| \ge |d| \\[2ex] \dfrac{b + a(c/d)}{d + c(c/d)} + j\,\dfrac{b(c/d) - a}{d + c(c/d)} & \text{if } |d| \ge |c| \end{cases} \tag{30}$$
which requires significantly more arithmetic operations: 4
conventional divisions and 3 multiplications. A complex divider
has been described in [16], implementing Smith's formula
with a pipelined multiplier, divider, and adder for an 8-bit
precision (+4 guard bits). The scheme uses a small number
of Xilinx Virtex-II slices and operates at 100 MHz. Another
design for a complex divider is proposed in [17]. It uses an
algorithm similar to SRT division. It also has an efficient
implementation, a latency for 15-bit precision of about
600 ns, and a throughput of 1.6 MHz. These two approaches
are not comparable to our higher-radix approach in terms of
speed. They have the advantage that there is no prescaling
and there are no tables for prescaling factors. Radix-2 complex online
arithmetic developed in [9] is not directly comparable to our
implementation.
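For reference, the software baseline of (30) can be written in a few lines. This is a plain floating-point sketch of Smith's formula with an illustrative function name; it is not the fixed-point architecture of [16].

```python
def smith_divide(a, b, c, d):
    """(a + jb) / (c + jd) by Smith's formula; returns (real, imag)."""
    if abs(c) >= abs(d):
        r = d / c
        den = c + d * r
        return (a + b * r) / den, (b - a * r) / den
    r = c / d
    den = d + c * r
    return (b + a * r) / den, (b * r - a) / den

re1, im1 = smith_divide(0.2, -0.1, 0.75, 0.3)   # |c| >= |d| branch
re2, im2 = smith_divide(0.2, -0.1, 0.3, 0.75)   # |d| > |c| branch
print(re1, im1, complex(0.2, -0.1) / complex(0.75, 0.3))
print(re2, im2, complex(0.2, -0.1) / complex(0.3, 0.75))
```

Dividing by the larger of |c| and |d| first keeps the intermediate ratio at most 1 in magnitude, which is the reason for the two cases.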
IV. CONCLUSIONS AND FUTURE WORK
We presented the design and implementation of a radix-
4 complex division unit with a single prescaling table. The
implementation on an Altera Stratix II FPGA device requires
1185 ALUTs, with a critical path of 5.764 ns, and a maximum
frequency of 173.49 MHz. The prescaling table requires 2K
words of 16 bits. To our knowledge no comparable imple-
mentation exists at the time and our results initiate a point
of reference for other hardware based designs. In future work
we plan on exploring the use of multipartite tables to reduce
the table requirements in addition to developing specialized
rectangular multipliers to enable higher radix designs.
Acknowledgments. We thank Altera Corporation for provid-
ing the tools and FPGA devices used in this research.
REFERENCES
[1] A. F. Molisch. Wireless Communications. John Wiley and Sons Ltd.,
2005.
[2] J. X., L. Guo, Y. Chen, and J. Zhang. Study of GPS Adaptive Antenna
Technology Based on Complex Number AACA, IEEE International Con-
ference on Wireless Communications, Networking and Mobile Computing,
2008, pp. 1-4.
[3] S.R. Dicker et al. CMB observations with the Jodrell Bank - IAC
interferometer at 33 GHz. Mon. Not. R. Astron. Soc., 2000, 00:1-12.
[4] G. Vandersteen et al. Comparison of arithmetic functions with respect to
Boolean circuits. In 58th ARFTG Conference Digest RF Measurements
for a Wireless World, 2001, pp. 466-470.
[5] M.D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann
Publishers, San Francisco, 2004.
[6] M.D. Ercegovac and J.-M. Muller. Complex Division with Prescaling
of Operands. IEEE International Conference on Application-Specific
Systems, Architectures and Processors, pp. 293-303, 2003.
[7] M.D. Ercegovac and J.-M. Muller, Design of a complex divider. Proc.
SPIE on Advanced Signal Processing Algorithms, Architectures, and
Implementations XII, pp. 51-59, 2004.
[8] M.D. Ercegovac and J.-M. Muller. Complex Square Root with Operand
Prescaling. IEEE International Conference on Application-Specific Sys-
tems, Architectures and Processors, pp. 293-303, 2004.
[9] R.D. McIlhenny, Complex Number On-line Arithmetic for Reconfigurable
Hardware: Algorithms, Implementations, and Applications, Ph.D. Disser-
tation, Computer Science Department, University of California, 2002.
[10] V. Oklobdzija, D. Villeger and T. Soulas, An Integrated Multiplier for
Complex Numbers. J. of VLSI Signal Processing, vol.7, no. 3, pp.213-
222, May 1994.
[11] A.F. Tenca, M.D. Ercegovac. Design of high-radix digit slices for
online computations. In SPIE Conference on High-Speed Computing,
Digital Signal Processing, and Filtering Using Reconfigurable Logic,
Bellingham, 1996.
[12] B.W.Y. Wei, H. Du, and H. Chen, A Complex-Number Multiplier Using
Radix-4 Digits. Proc. 12th IEEE Symposium on Computer Arithmetic,
pp. 84-90, 1995
[13] P. Dormiani, M.D. Ercegovac, and J-M. Muller, On the Design
and Implementation of Complex-valued Division Unit with Operands
Prescaling. Computer Science Department, UCLA, Internal Report 2009.
[14] R.L. Smith. Algorithm 116: Complex division. Communications of the
ACM, 5(8):435, 1962.
[15] http://www.altera.com/
[16] F. Edman and V. Oewall, Fixed-point Implementation of a Robust
Complex Valued Divider Architecture, Proceedings of ECCTD05, Cork,
Ireland, August 2005.
[17] J. Liu, B. Weaver and Y. Zakharov, FPGA Implementation of
Multiplication-Free Complex Division, Electronics Letters, 17th January
2008, Vol. 44, No. 2.
