Hardware Efficient Fixed-Point VLSI Architecture for 2D Kurtotic FastICA by Acharyya, Amit et al.
Amit Acharyya, Koushik Maharatna, Jinhong Sun, Bashir M. Al-Hashimi and Steve R. Gunn
Pervasive Systems Centre, School of Electronics and Computer Science,
University of Southampton, Southampton - SO17 1BJ, United Kingdom
Email: {aa07r, km3, js3v07, bmah, srg}@ecs.soton.ac.uk
Abstract—A fixed-point VLSI architecture for 2-dimensional
Kurtotic FastICA with reduced and optimized arithmetic units is
proposed. The reduction is achieved by removing the dividers
from the eigenvector computation and by replacing the dividers
in the Whitening block of the architecture with multipliers. In
addition, the number of multipliers required in the Whitening
block is further reduced by exploiting the datapath symmetry
present in that block. We also address the numerical error
associated with the finite-wordlength representation of fixed-point
arithmetic and propose an efficient approach for dealing with
it. The proposed architecture occupies 3.55 mm² of silicon
area and consumes 27.1 µW of power at 1.2 V and 1 MHz in a
0.13 µm standard-cell CMOS technology.
I. INTRODUCTION
Independent Component Analysis (ICA) is one of the most
commonly used algorithms for blind source separation [1].
In emerging applications such as Wireless Sensor Networks
(WSN) [2], this source-separation capability of ICA can be
utilised [3]. Although different algorithms for ICA have been
reported, the FastICA (FICA) algorithm has been shown to have
an advantage in terms of convergence speed [4]. Recently, the
first VLSI implementation of FICA, based on floating-point
arithmetic, was reported in [5]. However, it involves costly
arithmetic operations including matrix inversion, square root
evaluation, multiplication and division. Such an implementation
occupies a large silicon area and consumes significant power,
and may not be suitable for resource-constrained applications
like WSN, where the sensors are typically battery powered and
are expected to operate for a long time. In addition,
floating-point arithmetic itself adds to the silicon area, so
fixed-point arithmetic may be a better choice, although a
compromise in accuracy may be necessary.
In this paper we propose a hardware-efficient fixed-point
architecture for 2-D Kurtotic FICA. The efficiency is achieved
through (i) removal of the division operations from the
eigenvector computation, (ii) replacement of the remaining
division operations by multiplications and (iii) reduction of
the number of multipliers and adders for the whitening matrix
computation through detailed algorithmic analysis and
exploitation of the resulting architectural symmetry.
Low numerical error is achieved by (i) introducing suitable
Scaling Factors (SF) and (ii) varying the internal data-bus
width wherever necessary. The rest of the paper is organized
as follows: a brief introduction to FICA is given in Section II,
and the proposed architecture is described in Section III. The
performance analysis, validation and implementation results
are given in Section IV and the conclusions are drawn in
Section V.
II. THEORETICAL BACKGROUND
In this paper we restrict ourselves to the common case where
the number of independent sources ($n$) is equal to the number
of sensors. A mixed signal ($X$) can be defined as [5]:

$$X = AS \tag{1}$$

where $X = \{x_i\}$, $S = \{s_i\}$, $i \in (1, n)$; $A$ is a
full-rank $n \times n$ mixing matrix; $s_i = \{s_{i,j}\}$,
$x_i = \{x_{i,j}\}$ where $j \in (1, m)$ and $m$ is equal to the
frame-length. To apply FICA, $X$ first needs a preprocessing step
that converts it to a zero-mean signal (Centering process, Fig. 1)
and then transforms this zero-mean $X$ to a new vector $Z$ whose
components are uncorrelated with variances equal to unity
(Whitening process, Fig. 1) [6]. This process is carried out by
Eigen Value Decomposition (EVD) of the Covariance Matrix ($C_X$)
of $X$. Mathematically, the vector $Z$ can be described as [5]:

$$Z = PX = [D^{-1/2} E^T] X \tag{2}$$

where $Z = \{z_i\}$ and $P$ is the whitening matrix,
$D = \mathrm{diag}(d_i)$ is a diagonal matrix containing the
eigenvalues of $C_X$ and $E = \{e_i\}$ is an orthonormal matrix of
eigenvectors. The next step is to estimate the output vector
($\hat{S}$) from $Z$ by computing an unmixing matrix $B$ of
dimension $n \times n$, which can be defined as [5]:

$$\hat{S} = B^T Z \tag{3}$$

The $k$th column of $B$ represents the weight vector $w_k$
associated with the $k$th estimated independent component, where
$k \in (1, n)$. FICA computes $w_k$ by introducing a contrast
function within the basic iterative equation [6] and checks
for its convergence (see Fig. 1). To prevent different $w_k$ from
converging to the same maximum, the generated vectors are
orthonormalized after every iteration [6]. Once all the $w_k$ are
derived, $\hat{S}$ can be computed using (3).
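As a floating-point reference for the iteration described above, the following is a minimal sketch, not the architecture proposed in this paper. It assumes the standard kurtosis-based fixed-point update w <- E{z (w^T z)^3} - 3w from the FastICA literature [6], a fixed iteration count instead of a convergence test, and already whitened two-channel data. The function names `kurtotic_fastica_2d` and `estimate_sources` are hypothetical.

```python
import math


def kurtotic_fastica_2d(z1, z2, iterations=20):
    """Return the two columns (w_1, w_2) of B for whitened 2-channel data."""
    m = len(z1)
    columns = []  # weight vectors found so far
    for start in ((1.0, 0.0), (0.0, 1.0)):
        w = start
        for _ in range(iterations):
            # Kurtosis-based update: w <- E{z (w^T z)^3} - 3w
            g1 = g2 = 0.0
            for a, b in zip(z1, z2):
                y3 = (w[0] * a + w[1] * b) ** 3
                g1 += a * y3
                g2 += b * y3
            w = (g1 / m - 3.0 * w[0], g2 / m - 3.0 * w[1])
            # Orthogonalize against previously found columns
            for b_prev in columns:
                dot = b_prev[0] * w[0] + b_prev[1] * w[1]
                w = (w[0] - b_prev[0] * dot, w[1] - b_prev[1] * dot)
            # Normalize to unit length
            norm = math.hypot(w[0], w[1])
            w = (w[0] / norm, w[1] / norm)
        columns.append(w)
    return columns


def estimate_sources(columns, z1, z2):
    """Apply S_hat = B^T Z: one estimated source per column of B."""
    return [[w[0] * a + w[1] * b for a, b in zip(z1, z2)] for w in columns]
```

The estimated sources are recovered only up to sign and ordering, which is inherent to ICA rather than a property of this sketch.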
III. PROPOSED FASTICA ARCHITECTURE
The block diagram of the complete FICA architecture is
shown in Fig. 1. It is clear from Fig. 1 that a direct
implementation of the complete FICA architecture requires
complex arithmetic operations like division, square root and
multiplication, which are costly in terms of silicon area and
power dissipation. To design an area- and power-efficient
FICA architecture, special attention has to be paid to (i)
optimizing the different arithmetic units and, at the same
time, (ii) using the architectural symmetry to reduce the
number of arithmetic operations wherever possible.

978-1-4244-3896-9/09/$25.00 ©2009 IEEE

Fig. 1. Complete FastICA Architecture
The proposed architecture targets the 2-dimensional (two-source,
two-sensor) scenario and, for demonstration purposes, we
designed it for a frame-length of 512 and a wordlength of 16
bits for the incoming data samples. The Centering Unit is
implemented by accumulating the incoming data samples and then
shifting the accumulated sum to the right by nine bits
(division by the frame-length, 512 = 2^9).
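The accumulate-and-shift centering described above can be sketched as follows. This is an illustration under assumptions: the names `FRAME_LOG2` and `center_frame` are hypothetical, Python integers are unbounded (so accumulator width is not modelled), and Python's `>>` on negative sums rounds toward minus infinity, which may differ from the rounding of the actual circuit.

```python
FRAME_LOG2 = 9                 # frame-length = 2**9 = 512 samples
FRAME_LEN = 1 << FRAME_LOG2


def center_frame(samples):
    """Return (mean, centered samples) for one frame of integer samples."""
    if len(samples) != FRAME_LEN:
        raise ValueError("expected exactly one frame of 512 samples")
    acc = 0
    for x in samples:          # accumulator stage
        acc += x
    mean = acc >> FRAME_LOG2   # divide by 512 with a 9-bit right shift
    return mean, [x - mean for x in samples]
```

Because the mean is truncated by the shift, the centered frame can retain a residual offset of less than one least-significant bit.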
Fig. 2. Proposed divider-less architecture for (a) eigenvalue and (b)
eigenvector computation
A. Eigenvalue and Eigenvector Computation (Divider Removal)
Solving the characteristic polynomial of $C_X$, the
corresponding eigenvalues ($d_1$, $d_2$) can be given as:

$$d_1, d_2 = \frac{(C_{00} + C_{11}) \pm \sqrt{(C_{00} - C_{11})^2 + 4 C_{01} C_{10}}}{2} \tag{4}$$

where $C_{ij}$ are the elements of $C_X$. In this particular case
$C_{01} = C_{10}$ [5]. Fig. 2(a) shows the architectural
implementation of this unit along with the variation of
wordlengths adopted to keep the numerical error low. The term
‘$4 C_{01} C_{10}$’ is implemented using one multiplier and
shifting the result left by two bits. The non-restoring square
root algorithm [7] has been followed to implement the square
root circuit. Finally, $d_1$ and $d_2$ are obtained by shifting
the numerator of (4) right by one bit. With $C_{01} = C_{10}$,
the eigenvector matrix $E$ can be given by:

$$E = \begin{bmatrix} e_{11} & e_{21} \\ e_{12} & e_{22} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ (d_1 - C_{00})/C_{01} & (d_2 - C_{00})/C_{01} \end{bmatrix} \tag{5}$$
It is evident from (5) that computing $E$ requires only
$e_{12}$ and $e_{22}$, but these need two division operations.
To reduce the hardware complexity we use the following
property: if an eigenvector $e_i$ corresponding to the eigenvalue
$d_i$ satisfies the characteristic equation of $C_X$, then any
scalar multiple of $e_i$ also satisfies it.
From (5) we observe that the denominator $C_{01}$ of $e_{12}$ and
$e_{22}$ remains fixed for a frame and can thus be treated as a
scalar quantity. Therefore, (5) can be modified as:

$$E = \begin{bmatrix} e_{11} & e_{21} \\ e_{12} & e_{22} \end{bmatrix} = \begin{bmatrix} C_{01} & C_{01} \\ d_1 - C_{00} & d_2 - C_{00} \end{bmatrix} \tag{6}$$

Comparing (5) and (6), it can be noted that the divider circuit
can be removed completely from the eigenvector computation.
The resulting simplified divider-less architecture is shown in
Fig. 2(b).
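Equations (4) and (6) can be illustrated with integer arithmetic only. The sketch below is an assumption-laden illustration rather than the circuit of Fig. 2: `math.isqrt` stands in for the non-restoring square root unit of [7], the covariance entries are taken to be plain integers, wordlength limits are not modelled, and the function name `eigen_2x2` is hypothetical.

```python
import math


def eigen_2x2(c00, c01, c11):
    """Eigenvalues via (4) and unnormalized eigenvectors via (6).

    Assumes a symmetric covariance matrix (c10 == c01) with integer entries.
    """
    diff = c00 - c11
    # 4*C01*C10 as one multiplication followed by a 2-bit left shift
    radicand = diff * diff + ((c01 * c01) << 2)
    root = math.isqrt(radicand)      # stand-in for the square root circuit
    trace = c00 + c11
    d1 = (trace + root) >> 1         # divide by 2 with a 1-bit right shift
    d2 = (trace - root) >> 1
    # Divider-less eigenvectors: both columns of (5) scaled by C01
    e1 = (c01, d1 - c00)
    e2 = (c01, d2 - c00)
    return (d1, d2), (e1, e2)
```

For the covariance matrix [[2, 2], [2, 5]] this gives eigenvalues 6 and 1 with eigenvectors (2, 4) and (2, -1), each of which satisfies C e = d e without any division having been performed.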
B. P and Z Computation (Replacement of Division by Multiplication)
Using the normalized form of (6), the whitening matrix P
can be represented as:
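$$P = \begin{bmatrix} e_{11}/\sqrt{d_1 (e_{11}^2 + e_{12}^2)} & e_{12}/\sqrt{d_1 (e_{11}^2 + e_{12}^2)} \\ e_{21}/\sqrt{d_2 (e_{21}^2 + e_{22}^2)} & e_{22}/\sqrt{d_2 (e_{21}^2 + e_{22}^2)} \end{bmatrix} \tag{7}$$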
Using (7) in (2) the whitened data matrix Z can be deﬁned
as:
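$$Z = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} = \begin{bmatrix} (e_{11} X_1 + e_{12} X_2)/\sqrt{d_1 (e_{11}^2 + e_{12}^2)} \\ (e_{21} X_1 + e_{22} X_2)/\sqrt{d_2 (e_{21}^2 + e_{22}^2)} \end{bmatrix} \tag{8}$$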
The block diagram of the architecture for computing (8) is
shown in Fig. 3(a). Since the denominators of $Z_1$ and $Z_2$ are
constant for a frame, their values are calculated only once
at the beginning of each frame and stored in memory in
inverted form. To maintain the accuracy of this term
during the inversion, decimal 1 is represented as $2^{15}$.
The division itself is performed following the
approach presented in [8]. For the rest of the data within the
frame, the stored value is used repeatedly as a multiplication
factor (shown as the link “inverse divisor” in Fig. 3(a)),
thereby translating divisions into multiplications. The data
wordlengths adopted at different parts of this circuit are also
shown in Fig. 3(a). Observing the dataflow graph in Fig. 3(a),
the section shown within the dashed line can be optimized
further by exploiting the data-flow symmetry. The
optimized circuit is shown in Fig. 3(b), where three multipliers
have been replaced by three simple multiplexers, reducing the
hardware cost further.
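The translation of per-sample divisions into one division per frame plus per-sample multiplications can be sketched as follows. This is a sketch under stated assumptions, not the datapath of Fig. 3: the scaling factor of 2^16 is assumed to be applied to the radicand before the square root (leaving 2^8 on the root), `math.isqrt` and Python's `//` stand in for the square root circuit of [7] and the divider of [8], wordlength limits are not modelled, and the names `inverse_divisor` and `whiten_row` are hypothetical.

```python
import math

ONE_Q15 = 1 << 15        # decimal 1 represented as 2**15
SQRT_SF_BITS = 16        # assumed scaling factor applied before the square root


def inverse_divisor(d, e_a, e_b):
    """Scaled 1/sqrt(d*(e_a^2 + e_b^2)), computed once per frame."""
    radicand = d * (e_a * e_a + e_b * e_b)
    # Scaling the radicand by 2**16 leaves the root scaled by 2**8
    root_scaled = math.isqrt(radicand << SQRT_SF_BITS)
    if root_scaled == 0:
        raise ZeroDivisionError("degenerate eigenvalue or eigenvector")
    # Single division per frame; the result carries a 2**15 scale
    return (ONE_Q15 << (SQRT_SF_BITS // 2)) // root_scaled


def whiten_row(x1, x2, d, e_a, e_b):
    """One row of (8), scaled by 2**15, using only multiplications per sample."""
    inv = inverse_divisor(d, e_a, e_b)
    return [(e_a * a + e_b * b) * inv for a, b in zip(x1, x2)]
```

Without the scaling factor, the integer square root of 288 would be 16 instead of 16.97, an error of about 6% that would propagate to every sample of the frame; with it, the stored inverse is accurate to roughly four decimal digits in this example.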
Fig. 3. (a) Proposed architecture of the Whitening block with divider to
multiplier translation, (b) optimized architecture of the segment surrounded
by dashed line in (a).
C. Fixed-point wk Computation, Projection and Normalization
Fig. 4(a) shows the block diagram of the $w_k$ computation unit
within the FICA Iteration block (see Fig. 1). Computation of $w_k$
based on the Kurtosis contrast function [5] involves computing
the 4th power of the whitened data. Since the output wordlength
of the whitening unit is 32 bits, we use a 64-bit internal
wordlength for the $w_k$ computation unit. How the SF
($= 2^{16}$) introduced before square-rooting (Fig. 3(a) and
sub-section III-B) propagates through each arithmetic operation
inside the FICA block is shown in Fig. 4(a). However, from
practical wordlength considerations, when the overall data
scaling reaches $2^{24}$ (shown in sub-block (a-2) of Fig. 4(a)),
we scale it down by 16 bits. The block diagrams of the projection
and normalization units are shown in Fig. 4(b) and Fig. 4(c)
respectively, where $\bar{B} = [\bar{b}_1, \bar{b}_2]$ is the
previously determined column of $B$.
The architecture in Fig. 4(b) is derived from the following
projection equation [5], [6]:

$$w_k \leftarrow \begin{bmatrix} w_{1,k} \\ w_{2,k} \end{bmatrix} - \begin{bmatrix} \bar{b}_1 (\bar{b}_1 w_{1,k} + \bar{b}_2 w_{2,k}) \\ \bar{b}_2 (\bar{b}_1 w_{1,k} + \bar{b}_2 w_{2,k}) \end{bmatrix} \tag{9}$$
Fig. 4. Fixed-point architecture of (a) $w_k$ computation, (b) Projection and
(c) Normalization unit. ‘$w_{i,k}$ P’ and ‘$w_{i,k}$ norm’ represent “projected” and
“normalized” $w_{i,k}$ respectively.

The normalization of $w_k$ in the normalization unit raises the
same concern of losing the information contained in the
fractional part of the normalized data as discussed in
subsection III-B. To overcome this problem, as shown in Fig.
4(c), an SF $= 2^{31}$ is introduced in the design.
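The role of the 2^31 scaling factor in normalization, and its consequence for the projection of (9), can be sketched as follows. The sketch assumes that the normalized vector is stored as an integer scaled by 2^31 and that the projection removes the resulting 2^62 scale with a right shift; neither detail is taken from Fig. 4. `math.isqrt` stands in for the square root circuit, and the names `normalize_fixed` and `project_fixed` are hypothetical. Python's `//` and `>>` round toward minus infinity, which may differ from the hardware.

```python
import math

SF_BITS = 31   # scaling factor 2**31 for normalized weight vectors


def normalize_fixed(w1, w2):
    """Return w / ||w|| as integers scaled by 2**31."""
    norm = math.isqrt(w1 * w1 + w2 * w2)
    if norm == 0:
        raise ZeroDivisionError("cannot normalize a zero vector")
    # Scaling before the division preserves the fractional part
    return (w1 << SF_BITS) // norm, (w2 << SF_BITS) // norm


def project_fixed(w, b_scaled):
    """Equation (9) with the previous column b scaled by 2**31."""
    dot = b_scaled[0] * w[0] + b_scaled[1] * w[1]   # carries a 2**31 scale
    # Each product b_i * dot carries 2**62, removed by the right shift
    return (w[0] - ((b_scaled[0] * dot) >> (2 * SF_BITS)),
            w[1] - ((b_scaled[1] * dot) >> (2 * SF_BITS)))
```

Without the scaling factor, normalizing (3, 4) in integer arithmetic would give (0, 0); with it, the result represents (0.6, 0.8) to about nine decimal digits.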
IV. RESULTS AND DISCUSSION
A. Functional Validation and Performance Analysis
To validate the functionality of the proposed architecture,
we generated a C model of the FICA algorithm. The functional
output of this model is compared with that of the corresponding
Verilog model of the proposed architecture. As test vectors,
we have chosen the data samples given in [9], equivalent
to one frame-length. Due to page limitations, only two sets
of comparison results are shown in Fig. 5. The estimated
outputs from the C model and the Verilog model are shown on the
left and right sides of Fig. 5 respectively. As can be seen
from Fig. 5, there is close correlation between the waveforms,
confirming the correct functionality of the proposed fixed-point
FICA architecture.
To examine the overall effect of numerical error accumulation
on the estimated output, we have plotted the probability
of error vs. bit position in Fig. 6. It can be seen from Fig. 6
that errors occur mostly from the 8th bit position onwards, which
is acceptable from a practical implementation point of view.
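A per-bit error probability of the kind plotted in Fig. 6 could be computed along the following lines. This is only an illustrative sketch: the paper does not state how bit positions are numbered or how the two models' outputs are aligned, so the sketch assumes position 1 is the most significant bit of a fixed-width two's complement word and that both outputs are available as integers. The function name `bit_error_probability` is hypothetical.

```python
def bit_error_probability(reference, measured, width=16):
    """Fraction of samples in which each bit differs; index 0 is the MSB."""
    if len(reference) != len(measured) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    mask = (1 << width) - 1
    counts = [0] * width
    for r, m in zip(reference, measured):
        diff = (r ^ m) & mask            # 1 wherever the two words disagree
        for pos in range(width):
            if (diff >> (width - 1 - pos)) & 1:
                counts[pos] += 1
    return [c / len(reference) for c in counts]
```

For example, comparing the 4-bit words [0b1010, 0b1111] with [0b1011, 0b1111] yields [0.0, 0.0, 0.0, 0.5]: only the least significant bit ever differs, and it does so in half of the samples.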
Fig. 5. Left side - Estimated waveforms generated from the C model, right
side - estimated waveforms generated from the Verilog model of the proposed
FICA architecture. (a) SET 1 results, (b) SET 2 results.

Following the traditional approach to measuring computational
complexity, we have determined the total number of
arithmetic operations involved in the 2D Kurtotic FICA process,
as shown in Table I. It can be observed from Table I that,
for a maximum of M = 5 iterations for convergence, the proposed
architecture needs 15 more additions and 1009 more
multiplications than the direct-mapping method, but requires
1049 fewer division operations.
However, since the hardware complexity of a divider in terms
of gate-count and delay is much higher than that of a multiplier
or an adder, the proposed architecture is expected to yield a
net reduction in hardware complexity (shown as the gain in
Table I) relative to the direct-mapping approach.
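The operation counts of Table I can be evaluated for any iteration limit M with a few lines of code. The expressions below are copied from Table I; the function name `operation_counts` and the dictionary layout are hypothetical.

```python
def operation_counts(max_iterations):
    """Evaluate the Table I expressions for a given iteration limit M."""
    M = max_iterations
    direct = {
        "add": 4608 + 3074 * M,
        "subtract": 1028 + 8 * M,
        "multiply": 5640 + 6163 * M,
        "divide": 1031 + 6 * M,
        "sqrt": 3 + 2 * M,
    }
    proposed = {
        "add": 4608 + 3077 * M,
        "subtract": 1028 + 8 * M,
        "multiply": 6664 + 6160 * M,
        "divide": 2 + 2 * M,
        "sqrt": 3 + 2 * M,
    }
    # Positive values are overhead of the proposed design, negative are savings
    extra = {op: proposed[op] - direct[op] for op in direct}
    return direct, proposed, extra
```

Evaluating the table's expressions at M = 5 gives 15 extra additions, 1009 extra multiplications and 1049 fewer divisions, with subtractions and square roots unchanged.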
Fig. 6. Probability of error vs. bit position in the proposed architecture. (a)
Set-1, estimated source-1, (b) Set-1, estimated source-2, (c) Set-2, estimated
source-1, (d) Set-2, estimated source-2.
B. Implementation Results
To give insight into the area and power cost, the proposed
low-complexity architecture was coded in Verilog and
synthesized using Synopsys Design Compiler in a 0.13 µm
standard-cell CMOS technology. The synthesized area and power
consumption are 3.55 mm² and 27.1 µW at a frequency of 1 MHz
for VDD = 1.2 V. The power value was obtained by continuously
feeding 16-bit random vectors equal to one frame-length
into the synthesized netlist and applying Synopsys PrimeTime.
Since, to the best of our knowledge, no published results are
available on the area requirement and power consumption of a
fixed-point implementation, we were unable to compare these
parameters with any other work.
TABLE I
COMPARISON IN TERMS OF NUMBER OF ARITHMETIC OPERATIONS. “M” IS
THE MAXIMUM NUMBER OF ITERATIONS FOR CONVERGENCE OF wk.
POSITIVE GAIN DENOTES THE SAVINGS IN HARDWARE AND NEGATIVE
DENOTES THE OVERHEAD.
Architecture   Add            Subtract    Multiply        Divide         Sq. Root
Direct [5]     4608 + 3074M   1028 + 8M   5640 + 6163M    1031 + 6M      3 + 2M
Proposed       4608 + 3077M   1028 + 8M   6664 + 6160M    2 + 2M         3 + 2M
Gain           -3M            0           -(1024 - 3M)    +(1029 + 4M)   0
V. CONCLUSION
In this paper we proposed a hardware-efficient fixed-point
VLSI architecture for the FICA algorithm. The detailed
architectural description showed that it is possible to reduce the
hardware complexity by optimizing different arithmetic units
instead of directly mapping the algorithm one-to-one into an
architecture. In the proposed architecture, the reduction in
hardware leads to low power consumption, making it
a good candidate for WSN applications. The impact of different
algorithmic parameters, e.g. frame-length and convergence
threshold, is under investigation and
may lead to further architectural optimization.
REFERENCES
[1] H. Du, H. Qi and X. Wang, “Comparative Study of VLSI Solutions to
Independent Component Analysis”, IEEE Trans. Industrial Electronics,
vol. 54, no. 1, February, 2007.
[2] D. Estrin, D. Culler, K. Pister and G. Sukhatme, “Connecting the Physical
World with Pervasive Networks”, IEEE Pervasive Computing, vol. 1, no.
1, pp. 59-69, 2002.
[3] B. Lo, F. Deligianni and G. Z. Yang, “Source Recovery for Body Sensor
Network”, IEEE International Workshop on Wearable and Implantable
Body Sensor Networks, April, 2006.
[4] E. Oja and Z. Yuan, “The FastICA Algorithm Revisited: Convergence
Analysis”, IEEE Trans. Neural Networks, vol. 17, no. 6, November,
2006.
[5] K. K. Shyu, M. H. Lee, Y. T. Wu and P. L. Lee, “Implementation of
Pipelined FastICA on FPGA for Real-Time Blind Source Separation”,
IEEE Trans. Neural Networks, vol. 19, no. 6, pp. 958-970, June, 2008.
[6] A. Hyvärinen, “Fast and Robust Fixed-Point Algorithms for Independent
Component Analysis”, IEEE Trans. Neural Networks, vol. 10, no. 3,
May, 1999.
[7] Y. Li and W. Chu, “A New Non-Restoring Square Root Algorithm and
Its VLSI Implementation”, IEEE International Conference on Computer
Design, pp. 538-544, 1996.
[8] J. P. Deschamps, G. J. A. Bioul and G. D. Sutter, “Synthesis of
Arithmetic Circuits: FPGA, ASIC, and Embedded Systems”, John Wiley
and Sons Inc., 2006.
[9] A. Cichocki and R. Unbehauen, “Robust Neural Networks with On-Line
Learning for Blind Identification and Blind Separation of Sources”, IEEE
Trans. Circuits and Systems-I: Fundamental Theory and Applications,
vol. 43, no. 11, pp. 894-906, November, 1996.