Architecture design for FPGA implementation of Finite Interval CMA by Jan Schier & Phillip Regalia
ARCHITECTURE DESIGN FOR FPGA IMPLEMENTATION OF FINITE
INTERVAL CMA
Antonín Heřmánek (1, 2), Jan Schier (2), Phillip Regalia (1)
(1) Institut National des Télécommunications/GET, Dept. Communication, Images and Information Processing, Evry, France
(2) Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague, Czech Republic
ABSTRACT
In this paper, we present an architecture design of the Finite Interval Constant Modulus Algorithm (FI-CMA) for FPGA implementation. For the floating-point calculations required by the algorithm, we use a library based on the Logarithmic Number System (LNS). The design emphasizes resource reuse and minimization of the total latency.
1. INTRODUCTION
In modern digital communication systems, the estimator of transmitted symbols is one of the critical parts of the receiver. The estimator typically consists of an equalizer and a decision device. Current systems (such as GSM) use well-known methods based on training sequences, where a part of the signal is known and repeated. The equalizer matches its output to the reference signal by adapting its parameters to minimize some criterion (typically the MSE). Unfortunately, the training sequence consumes a considerable part of the overall message (approx. 25% in GSM). For this reason, much research effort has been devoted to blind deconvolution algorithms, i.e., algorithms with no training sequence. Perhaps the most popular blind algorithm is the Constant Modulus Algorithm (CMA), originally proposed by Godard [2].
FPGAs offer massive fine-grained parallelism and high data throughput. As the target device for the algorithm implementation we have chosen the XCV2000E. This device offers both enough on-chip dual-port block RAMs, which are necessary to store the intermediate results, and a sufficient amount of logic.
The paper is organised as follows: in the next section, the system model and the optimisation criterion are discussed. In Section 3, the Finite Interval CMA algorithm is briefly reviewed. In Section 4, numerical issues are considered. Section 5 presents the proposed design architecture, and Section 6 concludes the work.
2. CONSTANT MODULUS ALGORITHM
In this section the system model and the CM criterion are briefly presented. In the following text all vectors are assumed to be column vectors.
Let {s_n} be the symbol sequence to be transmitted. It is assumed to be independent and identically distributed, and to adhere to a constant modulus (CM) constellation, such as PSK. The data symbols are transmitted over a Single-Input Multiple-Output (SIMO) discrete channel with impulse response matrix H, assumed to have finite length. The received signal has the form:
$$ \mathbf{u}_n = H \mathbf{s}_n + \mathbf{b}_n \tag{1} $$
where $\mathbf{s}_n = [s_n \; s_{n-1} \; \ldots \; s_{n-M}]^T$ collects the M most recent input symbols, $\mathbf{b}_n$ is a background noise vector, H is the $P \times M_c$ channel impulse response matrix, and P is the number of antennas or the oversampling factor.
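An equalizer is viewed as a linear combiner of order M and its output can be written in the form:
$$ y_n = \sum_{k=0}^{M} \mathbf{g}_k^T \mathbf{u}_{n-k} = \mathbf{g}^T U_n \tag{2} $$
where n is the discrete baud-rate time and $\mathbf{g}$ and $U_n$ are defined as:
$$ \mathbf{g} = \begin{bmatrix} \mathbf{g}_0 \\ \mathbf{g}_1 \\ \vdots \\ \mathbf{g}_M \end{bmatrix}, \qquad U_n = \begin{bmatrix} \mathbf{u}_n \\ \mathbf{u}_{n-1} \\ \vdots \\ \mathbf{u}_{n-M} \end{bmatrix} $$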
The CMA algorithm tries to minimize a cost function defined by the constant modulus (CM) criterion, which penalizes deviations of the magnitude of the equalizer output from a fixed value. This criterion has the form:
$$ J_{CMA}(\mathbf{g}) = \frac{1}{4} E\left[ \left( |y_n|^2 - \gamma \right)^2 \right] \tag{3} $$
where $E[\cdot]$ is the expectation operator and $\gamma$ is a constant chosen as a function of the source alphabet.
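As a small illustration of (3), the following Python sketch evaluates a sample estimate of the CM criterion. The function name `cm_cost` and the replacement of the expectation by a sample mean are assumptions made for this example only.

```python
def cm_cost(y, gamma=1.0):
    """Sample estimate of (3): (1/4) * mean((|y_n|^2 - gamma)^2)."""
    return sum((abs(v) ** 2 - gamma) ** 2 for v in y) / (4.0 * len(y))


# Outputs lying on the unit circle (e.g. PSK symbols) give zero cost,
# while an amplitude error is penalized.
print(cm_cost([1, -1, 1j, -1j]))   # 0.0
print(cm_cost([2.0, -2.0]))        # 2.25
```

The criterion depends only on the magnitude of the equalizer output, so no reference symbols are needed to evaluate it.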
The main advantage of this criterion is that the resulting gradient descent algorithm is very similar to the well-known LMS algorithm. The relation between the LMS and gradient CMA algorithms has been discussed in many papers. It was shown that the CMA's cost surface can be directly related to that of LMS and that the LMS convergence rate expressions can be used to provide the limits of CMA tracking capabilities. On the other hand, the relatively slow convergence of CMA ($\approx 10^4$ iterations), as well as its dependence on the initialization and on the step-size parameter, are recognized drawbacks.
3. FINITE INTERVAL CMA
The FI-CMA is a windowed version of (3): a time-window operator is applied to the received data (i.e., the expectation operator is replaced by a summation over a finite data interval), and its cost function has the form:
$$ J(\mathbf{g}) = \sum_{n=1}^{N} \left( |y_n|^2 - 1 \right)^2 = \sum_{n=1}^{N} \left( \left| \mathbf{g}^T U_n \right|^2 - 1 \right)^2 \tag{4} $$
where the constant $\gamma$ is replaced by 1 without loss of generality because its value does not change the position of the local extrema. In [5] it was shown that the local extrema of (4) coincide with the local extrema of the function:
$$ F(\mathbf{g}) = \frac{\sum_{n=1}^{N} y_n^4}{\left( \sum_{n=1}^{N} y_n^2 \right)^2} \tag{5} $$
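The two criteria (4) and (5) can be compared numerically with a short Python sketch. It assumes real-valued equalizer outputs, and the function names are hypothetical. The example also shows a property that follows directly from the form of (5): F does not change when the output block is scaled, whereas J does.

```python
def fi_cma_cost(y):
    """J of (4): sum over the block of (|y_n|^2 - 1)^2."""
    return sum((abs(v) ** 2 - 1.0) ** 2 for v in y)


def normalized_fourth_moment(y):
    """F of (5) for real-valued y: sum(y^4) / (sum(y^2))^2."""
    s2 = sum(v * v for v in y)
    s4 = sum(v ** 4 for v in y)
    return s4 / (s2 * s2)


y = [0.9, -1.1, 1.0, -0.8]
scaled = [3.0 * v for v in y]
print(fi_cma_cost(y), fi_cma_cost(scaled))                             # J depends on the scale
print(normalized_fourth_moment(y), normalized_fourth_moment(scaled))   # F does not
```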
From equation (2), N successive equalizer outputs can be directly rewritten in matrix form as:
$$ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \underbrace{\begin{bmatrix} U_1^T \\ U_2^T \\ \vdots \\ U_N^T \end{bmatrix}}_{U} \mathbf{g} = QR\mathbf{g} = Q\mathbf{w} \tag{6} $$
where the QR-decomposition of the matrix U is used to obtain an orthonormal matrix Q.
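The optimal equalizer coefficients are reached by the following iterative procedure:
$$ \mathbf{v}_{i+1} = \mathbf{w}_i - \mu \, Q^T \mathbf{y}^3 / F_i, \qquad \mathbf{w}_{i+1} = \mathbf{v}_{i+1} / \|\mathbf{v}_{i+1}\| \tag{7} $$
where i is the iteration counter and the power $\mathbf{y}^3$ is taken element-wise. In [6], the optimal step size has been derived in the form $\mu_i = \frac{1}{3 - F_i}$. With this setting of $\mu$ and with an oversampling factor of 2, the procedure (7) typically converges after 10 iterations.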
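One pass of the procedure (7) can be sketched in Python as follows. This is a minimal illustration, assuming real-valued data, a matrix Q stored as a list of rows with orthonormal columns, and the step size read as 1/(3 − F_i); the function name `fi_cma_step` is hypothetical.

```python
import math


def fi_cma_step(Q, w):
    """One pass of (7); returns the normalized update and the criterion F."""
    n_coef = len(w)
    y = [sum(q * c for q, c in zip(row, w)) for row in Q]        # y = Q w
    F = sum(v ** 4 for v in y) / sum(v * v for v in y) ** 2      # criterion (5)
    mu = 1.0 / (3.0 - F)                                         # assumed step size
    # Q^T y^3, with the cube taken element-wise
    grad = [sum(Q[n][j] * y[n] ** 3 for n in range(len(Q))) for j in range(n_coef)]
    v = [w[j] - mu * grad[j] / F for j in range(n_coef)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v], F


# Toy example: 4 samples, 2 coefficients, orthonormal columns.
Q = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]
w_next, F0 = fi_cma_step(Q, [0.6, 0.8])
print(F0, w_next)
```

Because Q has orthonormal columns and w has unit norm, the denominator of (5) equals one, so F reduces to the sum of the fourth powers of the outputs.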
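The QR-decomposition is calculated using Givens rotations as follows. Let $u_{i,k}$ be the i-th element of the vector $U_k$. Let $R_k$ be the triangular matrix obtained by triangularization of the sub-matrix $U_k$ (the first k rows of the matrix U) and let $Q_k$ be the corresponding orthonormal matrix. Then the matrices $Q_{k+1}$ and $R_{k+1}$ can be calculated as follows:
$$ \begin{bmatrix} R_{k+1} \\ 0 \end{bmatrix} = G_N G_{N-1} \cdots G_1 \begin{bmatrix} R_k \\ u_{k+1} \end{bmatrix} \tag{8} $$
$$ \begin{bmatrix} Q_{k+1} & q_k \end{bmatrix} = \begin{bmatrix} Q_k & 0 \\ 0 & 1 \end{bmatrix} G_1^T \cdots G_{N-1}^T G_N^T \tag{9} $$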
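where the matrix $G_i$ eliminates $u_{i,k}$ and is defined as follows:
$$ G_i = \begin{bmatrix} c_i & & s_i & \\ & I & & \\ -s_i & & c_i & \\ & & & I \end{bmatrix}, \qquad I = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{bmatrix} $$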
The sine and cosine parameters are computed using the following formulae:
$$ c_i = \frac{r_{i,i}}{\sqrt{r_{i,i}^2 + u_{i,k}^2}}, \qquad s_i = \frac{u_{i,k}}{\sqrt{r_{i,i}^2 + u_{i,k}^2}} \tag{10} $$
Note: The calculation of R and Q for k = 1...N proceeds in the same way (see [3] for more details).
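The row-by-row update of R described by (8) and (10) can be sketched in Python as follows. The sketch assumes real-valued data and updates only the triangular factor; the corresponding update of Q in (9) is omitted for brevity. The function name `givens_row_update` is hypothetical.

```python
import math


def givens_row_update(R, u):
    """Fold one new data row u into the upper-triangular factor R."""
    n = len(u)
    R = [row[:] for row in R]          # work on copies
    u = list(u)
    for i in range(n):
        d = math.hypot(R[i][i], u[i])
        if d == 0.0:
            continue                   # nothing to eliminate in this column
        c, s = R[i][i] / d, u[i] / d   # rotation parameters as in (10)
        for j in range(i, n):
            r_ij, u_j = R[i][j], u[j]
            R[i][j] = c * r_ij + s * u_j    # rotated row of R
            u[j] = -s * r_ij + c * u_j      # for j == i this becomes zero
    return R


# Start from an empty factor and process the data rows one by one.
R = [[0.0, 0.0], [0.0, 0.0]]
for row in ([3.0, 4.0], [4.0, -3.0]):
    R = givens_row_update(R, row)
print(R)
```

In this toy example the two data rows are orthogonal and both have length 5, so the resulting factor is close to 5 times the identity matrix.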
4. NUMERICAL REQUIREMENTS
To summarize, the FI-CMA algorithm consists of two successive parts: the QR-decomposition of the input data matrix and the iterative procedure (7). The dynamic range of the data and intermediate results in both parts requires floating-point arithmetic. This is, however, rather costly for an FPGA implementation. In our work, we use the Logarithmic Number System (LNS) arithmetic.
4.1 Logarithmic arithmetic
This arithmetic is based on a logarithmic representation of floating-point numbers. The logarithmic equivalent of a number x, a two's complement fixed-point value equal to $\log_2 |x|$, is mapped to an integer. Current versions of the library offer 19- and 32-bit wordlengths. The number consists of an integer part, which is always 8 bits long, and of a fractional part, the size of which depends on the selected data precision.
Multiplication and division then transform to fixed-point addition and subtraction, and the square-root operation becomes a simple bit-shift.
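The following Python sketch illustrates why these operations are cheap in a logarithmic representation. It is a toy model only: the fractional width `FRAC_BITS`, the (sign, logarithm) tuple encoding and all function names are assumptions made for this example and do not reproduce the actual number format of the library.

```python
import math

FRAC_BITS = 10  # hypothetical fractional width, not the library's format


def to_lns(x):
    """Encode a nonzero real as (is_negative, fixed-point log2|x|)."""
    return (x < 0, round(math.log2(abs(x)) * (1 << FRAC_BITS)))


def from_lns(v):
    """Decode a (is_negative, fixed-point logarithm) pair back to a float."""
    negative, e = v
    magnitude = 2.0 ** (e / (1 << FRAC_BITS))
    return -magnitude if negative else magnitude


def lns_mul(a, b):
    return (a[0] != b[0], a[1] + b[1])   # multiplication: add the logarithms


def lns_div(a, b):
    return (a[0] != b[0], a[1] - b[1])   # division: subtract the logarithms


def lns_sqrt(a):
    return (False, a[1] >> 1)            # square root of a positive value: shift right by one


print(from_lns(lns_mul(to_lns(8.0), to_lns(-4.0))))   # -32.0
print(from_lns(lns_div(to_lns(8.0), to_lns(4.0))))    # 2.0
print(from_lns(lns_sqrt(to_lns(16.0))))               # 4.0
```

Integer powers follow the same pattern: raising a value to the third or fourth power amounts to multiplying its logarithm by 3 or 4.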
To implement addition and subtraction, the function $\log_2(1 \pm 2^{b-a})$ has to be evaluated. This function is evaluated using a first-order Taylor-series approximation with look-up tables. While the size of these tables often represents a substantial problem, in our solution they are kept small by using an error correction mechanism and a range-shift algorithm [1].
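The principle of table-based addition can be sketched in Python as follows. For two same-sign operands with logarithms a ≥ b, the logarithm of the sum is a + log2(1 + 2^(b−a)); the sketch tabulates this function and its derivative on a coarse grid and applies a first-order Taylor correction. The table spacing `STEP`, the cut-off `D_MIN` and the function names are hypothetical, floating-point values stand in for the fixed-point words, and the error correction and range-shift mechanisms of [1] are not modelled. Subtraction is omitted.

```python
import math

STEP = 0.125     # hypothetical table spacing
D_MIN = -16.0    # below this difference the smaller operand is ignored


def f(d):
    """Function to be tabulated: log2(1 + 2^d)."""
    return math.log2(1.0 + 2.0 ** d)


def df(d):
    """Derivative of f with respect to d."""
    return 2.0 ** d / (1.0 + 2.0 ** d)


# Look-up table holding the function value and slope at each grid point.
TABLE = [(f(D_MIN + k * STEP), df(D_MIN + k * STEP))
         for k in range(int(-D_MIN / STEP) + 1)]


def lns_add(a, b):
    """Return log2(2^a + 2^b) for same-sign operands given by their logarithms."""
    hi, lo = max(a, b), min(a, b)
    d = lo - hi                        # d <= 0
    if d < D_MIN:
        return hi
    k = int((d - D_MIN) / STEP)        # table entry at or below d
    d0 = D_MIN + k * STEP
    value, slope = TABLE[k]
    return hi + value + slope * (d - d0)


# 3 + 5 = 8, so the result should be close to log2(8) = 3.
print(lns_add(math.log2(3.0), math.log2(5.0)))
```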
The logarithmic operations were implemented in the High-Speed Logarithmic Arithmetic (HSLA) library. They are fully pipelined; addition and subtraction each have a latency of 8 clock cycles, while the other operations have a latency of 1 clock cycle. In order to utilize the look-up tables, which are implemented in dual-port Block RAMs, efficiently, the ADD/SUB unit has been implemented as a twin unit. For more details see [1, 4].
The resource utilization figures of the HSLA operations are summarized in Table 1.
Parameters (19/32-bit, XCV2000E-6)
Op.       Latency   SLICEs   BRAMs
ADD/SUB   8         8/13%    3/70%
MUL       1         1%       0%
DIV       1         1%       0%
SQRT      1         1%       0%
Table 1: Resource utilization of the HSLA cores
4.2 Performance comparison
In this section we compare the performance of our 32-bit and 19-bit LNS implementations with conventional 32-bit floating point. The comparison is based on computer simulations using double- and single-precision floating-point arithmetic and 32- and 19-bit bit-exact LNS functions.
As a reference model, we have chosen the implementation using IEEE double-precision arithmetic. As a performance measure we have used the variance of the error signal $e_i = y_i - \hat{y}_i$.
We define the signal-to-noise ratio caused by the round-off error as:
$$ \mathrm{SNR} = -10 \log_{10} \frac{\sigma^2}{\sigma_e^2} \tag{11} $$
where $\sigma^2$ is the variance of the output y and $\sigma_e^2$ is the variance of the corresponding error signal. With this sign convention, a more negative value indicates a smaller round-off error.
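The measure (11) is easy to reproduce in simulation. The Python sketch below assumes that the reference output and the output of the tested arithmetic are available as two equally long lists; the function names are hypothetical.

```python
import math


def variance(x):
    """Population variance of a list of real values."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def roundoff_snr_db(y_ref, y_test):
    """SNR of (11): -10*log10(var(y) / var(e)) with e = y_ref - y_test."""
    e = [a - b for a, b in zip(y_ref, y_test)]
    return -10.0 * math.log10(variance(y_ref) / variance(e))


# An error of amplitude 0.01 on a unit-amplitude output gives about -40 dB.
y_ref = [1.0, -1.0, 1.0, -1.0]
y_test = [1.01, -1.01, 1.01, -1.01]
print(roundoff_snr_db(y_ref, y_test))
```

In this example the error variance is four orders of magnitude below the output variance, which the sign convention of (11) reports as a value near −40 dB.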
The values were measured for N = 500, M = 4, P = 2, for the following configurations (note that the SNR increases with N due to the accumulation of errors in the QR decomposition):
• 32-bit floating point and 32-bit LNS: SNR = −90 dB;
• QR in 64 bit, iterative equalizer update in 19 bit, to test the performance degradation of the update itself: SNR = −56 dB;
• all computations in 19-bit LNS, to test the total performance degradation: SNR = −20 dB.
Although the SNR degrades in the last case, the ISI eye remains open. Based on further experiments not included in this paper, the 19-bit implementation should be sufficient.
5. DESIGN ARCHITECTURE
To implement the algorithm, the following computational units are needed:
1. The QR-processor, which consists of:
• the diagonal processor of the matrix R, computing the rotation sine and cosine values;
• the off-diagonal processor, which rotates the rows of the matrix R;
• the column processor, which rotates the columns of the matrix Q.
Note that in contrast to the usual design of the QR-update, we explicitly compute the first N columns of the Q matrix.
2. The equalizer update processor, which consists of the following parts:
• the matrix-vector multiplier for $\mathbf{y} = Q\mathbf{w}$;
• the parallel processor for the calculation of $\hat{\mathbf{w}} = Q^T \mathbf{y}^3$ and $F_i = \sum y^4$;
• the parameter update processor for $\mathbf{v} = \mathbf{w} - \mu \hat{\mathbf{w}} / F_i$ and $\mathbf{w} = \mathbf{v} / \|\mathbf{v}\|$.
For the overall implementation architecture design we have to keep in mind the following factors: the matrix dimensions, the restrictions on parallel access to the device memories (dual-ported RAMs) and the limited number of LNS adders that fit into a single device.
For the matrix dimensions, let N and M denote the equalizer order and the data block size, respectively (note that this differs from Sections 2 to 4, where M is the equalizer order and N the block length). Then $N \ll M$. Typical values are N = 10...20 and M = 250...1000. It follows that while R is a triangular (N × N) matrix, Q is an (M × N) matrix, i.e., it has many more rows than columns.
Because of the complexity of the design (the number of "processors" to be implemented), we have decided to use the 19-bit precision, so that more adders can be used while still achieving acceptable precision. We assume that the data block is stored in an external FIFO memory. During each update step, only the data vector $\mathbf{u}_n$ is read, which means that only P 19-bit values are transferred (P is the oversampling factor). The input data vector $U_n$ is composed and stored in an internal circular buffer, which is used to prepare the input for the QR processor.
5.1 QR-processor architecture
The R matrix update procedure is implemented using one diagonal and one off-diagonal processor. While the off-diagonal cell depends on the output of the diagonal cell for the same input data, the rotation of the i-th row of the matrix R can be fully pipelined and can be computed in parallel with the update of Q. Because of this data dependency, the diagonal and off-diagonal processors can share a single twin adder. The data flows in both cells are shown in Figures 1 and 2.
[Figure 1: Data flow of diagonal cell. Signal labels: $r_{i,i}^2$, $u_{i,i}$, $c_i$, $s_i$, $r'^2_{i,i}$; blocks: multiplier, adder, two square roots, two dividers.]
[Figure 2: Data flow of R off-diagonal and Q column cell. Signal labels: $u_{ij}$, $r_{ij}$, $c_i$, $s_i$, $u_{(i+1)j}$, $r'_{ij}$; blocks: four multipliers and a twin adder (one subtraction, one addition).]
Since the number of rows in Q increases with every new data vector in a single block, the computational complexity of updating this matrix grows linearly from the beginning to the end of the block. (Note that the complexity of the update of the matrix R remains constant.) The intermediate results are stored in two dual-port RAMs: one for the elements of the Q matrix, the other for the rightmost column of the composed matrix in equation (9). The data flow in the column processor is the same as in the off-diagonal processor of R (Figure 2).
The implementation probably cannot be extended to employ more processors because of the above-mentioned limitations (only two ports in each RAM and a growing number of rows).
[Figure 3: Matrix-vector multiplier. Signal labels: columns i, j of BRAM Qe and columns k, l of BRAM Qd; coefficients $w_i$, $w_j$, $w_k$, $w_l$; partial results $y_i(k)$, $y_i(k+1)$ held in BRAM y; blocks: four multipliers and adders.]
5.2 Equalizer update procedure
The equalizer output (6) is computed using the matrix-vector multiplier. When the multiplication is evaluated in the usual way, i.e., row-wise with respect to the matrix Q, the efficiency of the resource utilization is rather low. This is due to the relatively high latency of the LNS adder (and, consequently, of the LNS Multiply-and-Accumulate unit) with respect to the row length (the number of elements in the MAC operation). To improve the efficiency, we propose to compute the multiplication in a column-wise form: all elements of the i-th column are multiplied by the i-th element of the coefficient vector w. The result is stored in a temporary memory. Using the dual-port Block RAMs, we are able to address two columns of Q at the same time. Since we do not accumulate (just multiply and add), there are no dummy cycles. When partitioning the Q matrix into two Block RAMs, for example into Qe and Qd with the even and odd columns of Q, we may employ two multiply-add units in parallel. As a result, all adders that were used for the QR-decomposition are utilized. The structure of the unit is shown in Figure 3.
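The column-wise ordering can be sketched in Python as follows. The sketch only illustrates the order of operations, not the hardware; the function name is hypothetical, and the list `y` plays the role of the temporary memory.

```python
def matvec_columnwise(Q, w):
    """Compute y = Q w one column at a time (Q given as a list of rows)."""
    y = [0.0] * len(Q)                 # temporary memory for the partial results
    for i, w_i in enumerate(w):        # one pass per column of Q
        for n in range(len(Q)):        # independent multiply-adds along the column
            y[n] += Q[n][i] * w_i
    return y


print(matvec_columnwise([[1, 2], [3, 4], [5, 6]], [1, -1]))   # [-1.0, -1.0, -1.0]
```

Within one pass, each multiply-add writes to a different element of `y`, so consecutive operations do not depend on each other's results. In a row-wise ordering, each addition would have to wait for the previous partial sum of the same row, which is where the adder latency becomes visible.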
The parallel processor computes the value of the criterion function, which is simply $\sum_i y_i^4$, and the matrix-vector multiplication $Q^T \mathbf{y}^3$. Since M is much larger than the latency of the MAC unit, the MAC unit can be used efficiently here. With the Q matrix partitioned into two separate Block RAMs, we may use up to four MAC units and reuse all the available hardware. The structure of the processor is shown in Figure 4. Note that with the LNS arithmetic, the computation of the powers $y^3$ and $y^4$ requires little logic and takes a single cycle.
The structure of the parameter update processor is simple: it mainly performs the computation of the step size $\mu_i = \frac{1}{3 - F_i}$, the division and multiplication of a vector by a constant, and the calculation of a 2-norm. Again, thanks to the properties of the LNS arithmetic, multiplication and division as well as the square and fourth roots are inexpensive. The architecture of the unit is straightforward.
[Figure 4: Parallel processor architecture. Signal labels: $\hat{w}_i$, $\sum y^4$; blocks: BRAM y, BRAMs Qe and Qd, $(\cdot)^2$, $(\cdot)^4$, multiplier, adder, delay $z^{-1}$, 4× MAC.]
6. CONCLUSIONS
An architecture for an FPGA implementation of the FI-CMA algorithm has been proposed in this paper. The algorithm is a complex and computationally intensive advanced DSP algorithm, which requires floating-point arithmetic operations including higher-order power and root operations. The properties of the Logarithmic Number System have been used to advantage: multiplication and power/root operations have very low latency and low resource utilisation. The aim of the design was to reuse the addition core, which consumes the most resources of all operations (for the necessary look-up tables), as much as possible. The memory partitioning was also chosen so as to allow higher parallelization of the computations and to keep the total latency as small as possible, in order to achieve maximum performance. The resulting design will fit in a single XCV2000E device.
REFERENCES
[1] J. N. Coleman, E. I. Chester, C. I. Softley, and J. Kadlec. Arithmetic on the European logarithmic microprocessor. IEEE Transactions on Computers, 49(7):702–715, 2000.
[2] D. N. Godard. Self-recovering equalization and carrier tracking in two-dimensional data communication systems. IEEE Trans. Communications, 28:1867–1875, November 1980.
[3] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[4] R. Matoušek, M. Tichý, Z. Pohl, J. Kadlec, and C. Softley. Logarithmic number system and floating-point arithmetics on FPGA. In M. Glesner, P. Zipf, and M. Renovell, editors, Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, volume 2438 of Lecture Notes in Computer Science, pages 627–636, Berlin, 2002. Springer.
[5] P. A. Regalia. A finite interval constant modulus algorithm. In Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2002), volume III, pages 2285–2288, Orlando, FL, May 13-17 2002.
[6] P. A. Regalia and E. Kofidis. Monotonic convergence of fixed-point algorithms for ICA. IEEE Trans. Neural Networks, 14(4):943–949, July 2003.