INTRODUCTION
With non-recurring cost of cell-based ASICs at $4M for a 60 nm prototype, it is no surprise that Field Programmable Gates Arrays (FPGAs) have become the dominating DSP technology [1] [2] [3] [4] [5] [6] for real-time, high speed systems replacing COTS microprocessors and cell-based ASIC [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] designs. COTS microprocessors usually do not have the high MAC count needed in modern DSP systems. FPGAs enjoy many attributes of cell-based ASICs such as security against illegal copies, reduced development time, reduced board and test cost when compared to TTL boards. Cell-based ASICs on the other side are now only used in very high volume or extreme low power applications. A typical task is blind source separation (BSS) where we try to find the independent sources of a mixture, see Figure 1 .
The term "blind" does not mean we could not measure it, it just mean that we do not know exactly the parameters, shapes or period of the source as in ECG signals where the exact period is not known a priori.
The PCA method (MatLab code): Cxx = cov(x', 1); %%%% Calculate the covariance matrix [E, D] = eig(Cxx); %%%% Eigenvector and Eigenvalues V=D^-.5 * E'; %%%% Compute reconstruction matrix z=V*x; %%%% Apply reconstruction is discussed in a companion paper and works fine to reduce the dimension of the input data space. However, if we use the PCA to separate input signals into independent components this will not always work since the PCA captures the component with maximum power in a mean square sense. Sometime this gives the component we expect, but under certain conditions this does not work. Sometimes just a small change in the system already is sufficient for the PCA method not to produce the desired results. This happens for instance when the signal power of the sources are very close, or when the input to the mixing matrix is switched or the amplitude of the digital signal is increased to one.
One pioneering work that offers a solution to this kind of problem was introduced in 1986 by Jeanny Herault and Christian Jutten 17 . These methods are summarized under the name Independent Component Analysis (ICA). They observed that by using higher order statistics the BSS problem can be solved for many tasks that did not work with the PCA method. The higher order statistics are generated via nonlinearities. A summary of typical nonlinearities used in ICA is shown in Figure 2 . While Herault and Jutten showed some success it was soon discovered that their method had some limitations too. First it cannot be generalized to more than two channels and second it is also very gain dependent since the x k are directly combined to form the y k . Several modifications to the ICA had been published in the following years and the results have been summarized in text books 1, 18, 19 , tutorials 20 , and toolboxes. Many ICA methods are closely related. In fact is has been shown 21 that the three theories: minimization of the mutual information, maximization of non-Gaussianity, and maximization of likelihood have the same solutions. Let us briefly review the most popular ICA algorithms in the following.
Some of the ICA algorithms may need as a preprocessing step the whitening of the signals, i.e., a transformation that the resulting cross-correlation matrix C xx only has nonzero diagonal elements. The PCA method can be used to whiten the signals and the correlation matrix C zz of z will have only values zero as non-diagonal elements.
The PCA is just one option for the whitening, but, as we have seen in the companion paper, already requires substantial resources. There are several methods that seemed to be easier to implement than the PCA if just the whitening is needed. This can be done with an online learning 18, 22 that is similar to the power method, i.e., we have the two matrix equation systems
Another attractive method is the classic orthogonalization method from Gram-Schmidt 23 (p. 176) that also produces a white cross-correlation matrix. For the Gram-Schmidt in text books 23 a final normalization is usually carried out; however, if we just need orthogonal and no orthonormal signals this seemed to be unnecessary.
INDEPENDENT COMPONENT ANALYSIS ALGORITHMS
Two classic ICA algorithms are direct extensions of the PCA using non-linearities. This will lead to the nonlinear PCA algorithm that can be implemented in where I=eye(N) is the identity matrix with one in the diagonal and zero otherwise. This NGA seemed to be a reduced complexity when compared to the nonlinear PCA.
The algorithms discussed so far need a whitening operation first before the actual ICA algorithm, which may be costly. It was shown first by Cardoso and Laheld 22 in 1996 that we can combine ICA higher order moments processing via nonlinearities with the whitening. 
Note that we now use x(:,k) and not the whitened z(:,k). As nonlinearity Cardoso and Laheld 22 used the tanh() function, see Fig. 2 , which also worked best in our simulations.
Other algorithms that have been discussed in the literature include the Algebraic ICA (A-ICA) and the fast ICA. These algorithms have been analyzed 25 in term of suitability for a real-time FPGA implementations and the finding can be summarized as follows:
The A-ICA requires only one iteration to converge, and is relatively quick to execute. The A-ICA algorithm does not require pre-whitening, which makes the algorithm easy to implement. However the algorithm poorly separates the signals, and works best for two signals. The algorithm requires a large number of resources for implementation.
The Fast ICA has a relatively fast convergence and effectively separates mixed signals. The Fast ICA separated several source signals. The Fast ICA algorithm requires pre-whitening. The need for pre-whitening means the Fast ICA is a complex algorithm to implement and needs a vast supply of resources for implementation. For a microprocessor or MatLab the FastICA 24 algorithm for instance is very popular since it converges fast and is robust; however, from an implementation standpoint the required symmetric orthogonalization:
would be very expensive to build in hardware 25 .
Ch. /easi3/s1_in
p . /easi3 mu_in -Ou ts:
The EASI algorithm has relatively fast convergence and effectively separates mixed signals for multiple sources. The EASI algorithm does not need pre-whitening, and is reasonable simple to implement in hardware. The only drawback with this algorithm is a large number of resources required for real-time hardware implementation.
The Algebraic-ICA, Fast ICA, and EASI ICA were implemented in software and their performance was analyzed. The performance and the number of resources required for implementation of each algorithm was compared and contrasted. The EASI algorithm was chosen as the algorithm for further analysis and implementation on hardware. The convergence speed, simplicity of the algorithm, and effectiveness made it the best algorithm in the end.
IMPLEMENTATION OF THE EASI ICA ALGORITHM
The EASI algorithms seemed to be the most attractive ICA algorithm to be implemented in hardware 17, 25, 26, 27 and we wish to design a 3x3 EASI system to separate mixtures of a sine, a digital signal and a triangular signal. The original undisturbed input signals s k (t) are shown in Figure 3 . to produce the three observed mixed system inputs x 1 , x 2 and x 3 , i.e.,
These three mixtures x are shown in the upper part of Figure 4 and as can be seen it is very difficult to identify the three original signals from these mixtures. For the EASI algorithm we basically have according to equations (3)- (5) to update all coefficients of matrices B and H. With I 3 = [1 0 0; 0 1 0;0 0 1] we get the following equations for H:
where f k = f(y k ) is the nonlinearity applied to y k . We get the following differential equations ΔB: The nine update equations for the matrix B then become
We now have all the equations together to build the EASI system. H11:=resize(one -y1*y1,s,fixed_wrap,fixed_truncate); H12:=resize(f1*y2-y1*y2-y1*f2,s,fixed_wrap,fixed_truncate); H13:=resize(f1*y3-y1*y3-y1*f3,s,fixed_wrap,fixed_truncate); H21:= resize(f2*y1-y2*y1-y2*f1,s,fixed_wrap,fixed_truncate); H23:= resize(f2*y3-y2*y3-y2*f3,s,fixed_wrap,fixed_truncate); H22:= resize(one -y2*y2,s,fixed_wrap,fixed_truncate); H31:= resize(f3*y1-y1*y3-y3*f1,s,fixed_wrap,fixed_truncate); H32:= resize(f3*y2-y2*y3-y3*f2,s,fixed_wrap,fixed_truncate); H33:= resize(one -y3*y3,s,fixed_wrap,fixed_truncate); --update matrix Delta B DB11:=resize(B11*H11+H12*B21+H13*B31,s,fixed_wrap,fixed_truncate); DB12:=resize(B12*H11+H12*B22+H13*B32,s,fixed_wrap,fixed_truncate); DB13:=resize(B13*H11+H12*B23+H13*B33,s,fixed_wrap,fixed_truncate); x1_out <= to_slv(x1); --Redefine bits as 32 bit SLV x2_out <= to_slv(x2); x3_out <= to_slv(x3); END fpga;
After specifying the I/O ports the mixing matrix a k,i is defined as constant values. In the architecture first the input signals are redefined in SFIXED format. These reinterpretations of the bits should not need any hardware resources. The EASI algorithm is implemented with a single PROCESS statement. First registered x k are set to zero and the weight registers B = I are initialized with the unity matrix. After the rising_edge statement the mixing is applied, followed by the y k component including the tanh approximation through f(y)=y for all |y|≤ 1 type. Remember the sequential evaluation within the PRCOCESS such that registers are only inferred for the SIGNALs, but not for the VARIABLEs. Then the computation of H, ΔB, and B follows. All arithmetic is done with the fixed_wrap and fixed_truncate attributes to avoid the extra hardware cost of the default threshold operation. Without the arithmetic flag about twice as many LEs would be required. The PROCESS also includes the output assignments to y k since we like to store these in registers to measure the registered performance of the design. Finally some more internal data are assigned to output pins for monitoring. As expected, this ICA system is much more robust than the PCA discussed in the companion paper or Herault and Jutten ICA systems. We can switch the inputs, change the scale of the amplitudes, and still get good convergence. We have substitute the tanh() nonlinearity by a simpler one such as the f(y)=y for |y|≤1 type. At a cost of a little more residual noise the signals are still separated as has been shown in Figure 4 .
If we compare the EASI algorithm with the GHA PCA design we see that we need about the same hardware resources and get the same registered performance, but the EASI algorithm is much more robust and can separate many more signals than the PCA algorithm. 
ALTERNATIVE BSS ALGORITHMS
The EASI was generalized by Karhunen 28 et al. using two nonlinearities to introduce even more higher order statistic. These will lead to the generalized EASI algorithm that can be implemented as follows in where the nonlinearities such as those shown in Figure 2 can be selected via the parameters n1 and n2.
The "trick" used in the EASI algorithm to combine learning and whitening to have a robust algorithm can also be used with other ICA algorithms. For instance the nonlinear subspace learning algorithm (see literature 18 , p. 262) can be modified in such a way and we would get which also does not need the inputs to be white.
There are many more ICA algorithms available and, given the fact that each can be implemented with a different nonlinearity, we have many choices. There are also a couple of non-ICA algorithms that have been successful used in the BSS problem such as SOBI or AMUSE to name just two. These algorithms only use second order statistics, no nonlinearities, but require the computation of time delayed (cross) correlation information. One algorithm that is particular easy to implement and gave good results in our simulations is the Algorithm for Multiple Unknown Signals Extraction (AMUSE) and it works as follows: 
CHALLENGES IN ICA ALGORITHMS
We have seen that ICA methods offer a wider choice in solving tasks such as the BSS problem. However, there are still many challenges with the use of ICA algorithms. These start with the selection of the algorithm, its performance, and implementation cost. This also depends on the nonlinearity chosen that is closely related to the kurtosis of the signals; remember that the algorithms use the Hebbian learning method
and the sign σ = ±1 depends on whether the signals are sub-or supergaussian 29 . Remember from the kurtosis discussion that technical signals are most often subgaussian while natural signals such as speech signals are supergaussian; see literature 1 p. 539.
The next problem comes from the fact that ICA other than PCA do not show the PC in a special order; they may appear in any order at the output and unlike the PCA the number of sources and output signals must be the same.
Comparing different algorithm is also challenging, when considering the number of learning steps, computation effort, and residual errors. Many methods are in use here ranging from visual inspection to measuring the product of mixing matrix times reconstruction matrix AB→1. This product does not give a unit matrix since the components are in random order at the output. However, the product should only show a single nonzero value in each row and column since then a de-correlation has taken place.
EASI ARITHMETIC COUNTS AND HDL SYNTHESIS RESULTS
The EASI algorithm, as most ICA algorithms, is resource intensive. Already for a 4x4 system we ran out of embedded multipliers for 32 bit arithmetic implementations. Remember a 32x32 bit multiplier need four 18x18 bit embedded multipliers, or eight 9-bit multiplier blocks for the Altera devices. Let us have a closer look at the arithmetic count for the whole ICA system including the input mixing matrix A. The EASI system equations consist of the following matrix/vector operations: Multiplication of a scalar (SP) with a matrix requires N 2 multiplications and no additions
This gives to the following arithmetic count for systems of size 2 to 4 and in the general case as shown in Table 1 . 
-3N
Note that in the computation of H the diagonal component from the outer product cancel each other out such that 2N less adder and multipliers are needed than the matrix operations indicate.
When implementing with FPGA software the following optimization done by the Altera tools have to be considered A Simple multiplier operations such as 1.5s or 0.75s are replaced by add operations such as 1.5s=s+s>>1 or 0.75s=s>>1+s>>2, where >> is a binary shift that does not required resource since it is hardwired. Only constant coefficient that require substantial number of adder such as a coefficient 1/3=0.3333 is implemented using embedded multipliers.
Resource sharing is applied whenever possible. E.g., if one outer product such as H12 generates f1*y2 and H21 requires y2*f1 then this product is only computes once and the multiplier is shared between this two equations. Table 2 shows synthesis results for the N=2, 3 and 4 systems for Altera Quartus II 13.0. The first result row lpm_mul shows the true number of multipliers that are implemented. The constant multipliers for A are implemented most often via CSD code, and for larger size the resource sharing in particular for computation of H becomes more substantial when compared to the Total Effort row from Table 1 . The true multiplier number is 20% to 24% lower than original estimated from the arithmetic count. The second row shows the true number of 9-bit multiplier blocks used for synthesis. In worst case the 9-bit row would be 8 times that of the lpm_mul. However, we observe this only for N=2. For N=3 drops to an average of 5.4, while for N=4 the device does not have enough 9 bit blocks such many multipliers are implemented with LEs. The number of logic elements (LEs) is shown in row 3. Note that due to the fact that the resize operation is used after the products are added and the product has 64 bit, such that LE even for N=2 is already large.
The maximum frequency is shown in row 4 and was measured with the TimeQuest tools using slow 85C model. The design Fmax drops down from 20 to 13 MHz since we do not pipeline the matrix operations.
Compile time, shown in row 5, also becomes essential for N=4 due to the fact that substantial number of multipliers are design via LEs. The last row 6 shows the Altera device used. These are devices that can be found on the popular DE2 boards and can be compiled with the web Quartus II version that is free of charge.
CONCLUSION AND FUTURE STUDIES
The Algebraic ICA, Fast ICA, and EASI algorithms were examined and compared. The best algorithm required the least complexity and fewest resources while effectively separating mixed sources. The best algorithm chosen was the EASI algorithm. The EASI algorithm was further analyzed by looking and stabling the best fit mixing matrix, nonlinearity, number of resources, and a variation of the algorithm. The algorithm and parameters were set from these results to create the best set up for its implementation on hardware. The EASI ICA was implemented with a Altera Cyclone FPGA in an Altera Quartus II VHDL compact coding environment using new VHDL 2008 data types SFIXED. A source and mixed signal test bench and a nonlinearity implementation test bench were developed to ensure the EASI algorithm was correctly implemented in hardware. Simulations were run to ensure the hardware implementation matched the MatLab/Simulink simulations. The values were similar to the values generated from the MatLab simulation. There were truncation errors present and slight discrepancy of the simulation output values because the hardware implementation utilized fixed-point integers, while MatLab values are floating-point numbers.
Future goals include the design of a vector/matrix machine that allows the design of larger EASI systems than N=4 and a MatLab/Simulink based HDL design using the advanced block sets that allow an easy resource sharing when using floating-point implementation.
