The main focus of the paper is to bring out the differences in performance related issues of Fast-ICA algorithm associated with floating point and fixed point digital signal processing ( 
INTRODUCTION
Blind source separation (BSS) involves separating a number of unknown sources from a set of observed mixtures of those sources. The problem of BSS arises in diverse fields such as image processing, biomedical signal processing and speech processing [1] [2] [3] [4] [18], where independent component analysis (ICA) [5] methods have been successfully applied. Some applications involving speech, acoustic noise, biomedical signals, etc., require real time processing. Fast-ICA, a commonly used ICA algorithm based on fixed point iteration, is suitable for real time operation because of its fast convergence [6].
The present work investigates the feasibility of integrating the Fast-ICA algorithm into high end consumer devices. For DSP chips embedded in handheld devices such as mobile phones and pagers, fixed point implementation is preferred because of its lower cost and lower power consumption compared with floating point. Few papers report on issues related to real time implementation of ICA algorithms. Charonsek and Sattar [7] propose a design method for an ICA based BSS algorithm on an FPGA platform to suit the real time environment. Shyu and Li [9] have demonstrated the implementation of the Fast-ICA algorithm on an FPGA using floating point arithmetic; the main focus of their paper is on the reduced execution time and high accuracy achievable through the use of a hierarchical design procedure and the 32 bit floating point format. The hardware realization of a source separation algorithm known as DUET (Degenerate Unmixing Estimation Technique) has been proposed in [8]. DUET employs time-frequency masking methods to separate an arbitrary number of sources from just two mixtures and is reported to be computationally inexpensive. That paper highlights the performance related issues associated with implementation of DUET on floating point and fixed point DSP processors.
In this paper the ICA algorithm has been tested on a 32-bit fixed point as well as on a 32-bit floating point DSP processor. The major manufacturers in the DSP industry are Texas Instruments (TI), Analog Devices and Motorola. TI and Analog Devices offer both fixed point and floating point DSP families while Motorola offers fixed point DSP families. We selected TI for our experimentation as its products are most widely used. The performance of the algorithm on both platforms has been evaluated in terms of accuracy of separation and execution time. Accuracy of separation is judged by the coefficient of correlation between the original source and the separated source, while execution time is the time taken by the processor to execute the algorithm.
The organization of the paper is as follows: Section II brings out the problem under consideration. Section III gives an overview of the Fast-ICA algorithm. Section IV outlines approaches and considerations for implementation of Fast-ICA on a DSP processor. The issues involved while migrating from the floating point to fixed point platform have also been discussed in this section. Section V reports performance of the Fast-ICA algorithm on both floating point and fixed point processors. Section VI is the conclusion of the work.
PROBLEM FORMULATION
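The basic BSS model employing linear instantaneous mixing, with the number of sources equal to the number of sensors (n = m), may be expressed as:
$$\mathbf{x}(k) = \mathbf{A}\,\mathbf{s}(k) \tag{1}$$
The separated signals are obtained from the observations through a separating matrix W:
$$\mathbf{y}(k) = \mathbf{W}\,\mathbf{x}(k) \tag{2}$$
Equation (2) may be expanded as:
$$\mathbf{y}(k) = \mathbf{W}\mathbf{A}\,\mathbf{s}(k) = \mathbf{G}\,\mathbf{s}(k) \tag{3}$$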
G is known as the global system matrix and has dimension (n × n). Ideally, G should be a generalized permutation matrix, i.e., each of its rows and each of its columns contains only one non-zero element.
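The generalized permutation property can be checked numerically. The following is a minimal sketch, assuming G is formed as the product of a separating matrix W and the mixing matrix A (the usual definition, but an assumption here). The 2 × 2 matrices and the helper names `matmul` and `is_generalized_permutation` are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def is_generalized_permutation(g, tol=1e-6):
    """True if every row and every column of g has exactly one entry above tol."""
    n = len(g)
    rows_ok = all(sum(abs(x) > tol for x in row) == 1 for row in g)
    cols_ok = all(sum(abs(g[i][j]) > tol for i in range(n)) == 1
                  for j in range(n))
    return rows_ok and cols_ok


# Hypothetical mixing matrix A and a separating matrix W that recovers the
# sources in swapped order and with different scales and signs.
A = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.5, -0.5], [4.0, -2.0]]

G = matmul(W, A)   # global system matrix
print(G, is_generalized_permutation(G))
```

In practice the off-diagonal entries of an estimated G are small rather than exactly zero, so the tolerance `tol` would have to be chosen relative to the dominant entry in each row.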
FAST-ICA ALGORITHM
Fast-ICA algorithm is one of the most popular methods used to solve problems in BSS. It is easy to use and more reliable as it does not depend upon any user defined parameters. Experimental results establish the fact that it outperforms most of the ICA algorithms in convergence speed. The algorithm uses a fixed point iteration scheme to find the local maxima of the cost function:
where G is a non-linear function and E stands for expectation.
The cost function to be maximized can be based on mutual information, likelihood, some approximation of non-Gaussianity, or variations of these properties. A widely used contrast function is based on kurtosis and may be expressed as:
where:
under the constraint ||w|| =1 and F is a penalty factor due to the constraint. The learning rule has the form: As the third term in (5) can be written in the form (scalar × w), the final form of (5) is
The scalar term in the above equation is insignificant and the effect can be eliminated with normalization.
In the basic Fast-ICA method, the observed mixture signals are preprocessed and then whitened before being subjected to the separation algorithm. The mathematical relationships can be described as:
$$\mathbf{x} = \mathbf{A}\mathbf{s}, \qquad \mathbf{v} = \mathbf{V}\mathbf{x}$$
where x is the observed signal, s is the source signal, V is the whitening matrix, A is the mixing matrix and v is the whitened signal. Here, v = VAs, where B = VA is an orthogonal matrix, i.e.,
$$\mathbf{B}\mathbf{B}^{T} = \mathbf{I}$$
The objective is to determine B responsible for separating the independent signals.
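The whitening step can be illustrated with a short sketch. The source does not say how the whitening matrix V is computed; the version below uses a Cholesky factor of the sample covariance (V is the inverse of L, where the covariance equals L Lᵀ), which is one valid whitening transform and not necessarily the authors' choice. The mixing matrix is the one quoted for the test cases; the random uniform sources and all function names are assumptions made for illustration.

```python
import math
import random


def center(x):
    """Subtract the mean from every channel of x (a list of channels)."""
    out = []
    for ch in x:
        mean = sum(ch) / len(ch)
        out.append([val - mean for val in ch])
    return out


def covariance(x):
    """Sample covariance of zero-mean data x (n channels, m samples each)."""
    n, m = len(x), len(x[0])
    return [[sum(x[i][t] * x[j][t] for t in range(m)) / m for j in range(n)]
            for i in range(n)]


def cholesky(c):
    """Lower-triangular L with L L^T = c for a symmetric positive definite c."""
    n = len(c)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = c[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def whiten(x):
    """Return v = V x with V = L^{-1}, so that the covariance of v is I."""
    xc = center(x)
    low = cholesky(covariance(xc))
    n, m = len(xc), len(xc[0])
    v = [[0.0] * m for _ in range(n)]
    for t in range(m):
        for i in range(n):  # forward substitution solves L v = x
            acc = xc[i][t] - sum(low[i][k] * v[k][t] for k in range(i))
            v[i][t] = acc / low[i][i]
    return v


# Hypothetical sources mixed with the matrix A used in the test cases.
rng = random.Random(0)
s = [[rng.uniform(-1.0, 1.0) for _ in range(2000)] for _ in range(3)]
A = [[0.0891, 0.3906, -0.3408],
     [-0.8909, -0.6509, 0.8519],
     [0.4454, 0.6509, -0.3976]]
x = [[sum(A[i][j] * s[j][t] for j in range(3)) for t in range(2000)]
     for i in range(3)]
v = whiten(x)
cov_v = covariance(v)   # should be close to the identity matrix
```

A whitening matrix built from the eigenvalue decomposition of the covariance would serve the same purpose; the Cholesky form is used here only because it is short to write without a linear algebra library.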
Fixed Point Algorithm for ICA:
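The steps of Fast-ICA for separating one independent component are as follows.
1. Pre-whiten the observed data x to obtain v.
2. Take a random initial vector w(0) of norm 1 and set j = 1.
3. Let $\mathbf{w}(j) = E\{\mathbf{v}\,(\mathbf{w}(j-1)^{T}\mathbf{v})^{3}\} - 3\,\mathbf{w}(j-1)$. The expectation operator can be estimated using a large number of samples.
4. Divide w(j) by its norm, i.e., normalize w(j).
5. If $|\mathbf{w}(j)^{T}\mathbf{w}(j-1)|$ is not close to 1, then set j = j + 1 and repeat step 3. Otherwise output vector w(j).
6. Using w(j), one of the separated signals is given by $y = \mathbf{w}(j)^{T}\mathbf{v}$.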
To estimate n independent components, the above algorithm is run n times.
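The one-unit iteration can be sketched in a few lines. This is a minimal illustration, assuming the standard kurtosis-based update w(j) = E{v (w(j-1)ᵀv)³} − 3w(j-1) with the expectation replaced by a sample mean. The test data are hypothetical: two unit-variance uniform sources rotated by an orthogonal matrix B, so the data are only approximately white. The function name, tolerance and iteration limit are likewise assumptions.

```python
import math
import random


def one_unit_fastica(v, w0, tol=1e-9, max_iter=100):
    """Kurtosis-based fixed-point iteration on whitened data v (n x m)."""
    n, m = len(v), len(v[0])
    norm = math.sqrt(sum(c * c for c in w0))
    w = [c / norm for c in w0]
    for _ in range(max_iter):
        # Update: w_new = E{ v (w^T v)^3 } - 3 w, expectation as a sample mean.
        w_new = [-3.0 * c for c in w]
        for t in range(m):
            y3 = sum(w[i] * v[i][t] for i in range(n)) ** 3
            for i in range(n):
                w_new[i] += v[i][t] * y3 / m
        # Normalize the new vector.
        norm = math.sqrt(sum(c * c for c in w_new))
        w_new = [c / norm for c in w_new]
        # Converged when |w(j)^T w(j-1)| is close to 1 (the sign may flip).
        dot = abs(sum(a * b for a, b in zip(w_new, w)))
        w = w_new
        if abs(dot - 1.0) < tol:
            break
    return w


# Hypothetical data: unit-variance uniform sources mixed by a rotation B.
rng = random.Random(1)
M = 5000
half_width = math.sqrt(3.0)   # uniform on [-sqrt(3), sqrt(3)] has unit variance
s = [[rng.uniform(-half_width, half_width) for _ in range(M)] for _ in range(2)]
theta = 0.6
B = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
v = [[B[i][0] * s[0][t] + B[i][1] * s[1][t] for t in range(M)] for i in range(2)]

w = one_unit_fastica(v, [1.0, 0.0])
# |w^T b| for the best-matching column b of B; close to 1 on success.
alignment = max(abs(sum(w[i] * B[i][c] for i in range(2))) for c in range(2))
```

The convergence test uses the absolute value of the inner product because, for sources with negative kurtosis, the sign of w alternates between iterations while its direction settles.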
To ensure that a different independent component is estimated each time, an orthogonalizing projection is added inside the iteration loop given above. Each vector w_i (i = 1, 2, …) found in the above process is a column of the orthogonal matrix B. Thus the independent components are estimated one by one by projecting the current solution w(j) onto the space orthogonal to the columns of B found previously. Define B̃ as the matrix whose columns are the previously found columns of B. The projection operation is added to the beginning of step 4 above, which now becomes:
4. Let $\mathbf{w}(j) = \mathbf{w}(j) - \tilde{\mathbf{B}}\tilde{\mathbf{B}}^{T}\mathbf{w}(j)$, then normalize w(j).
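The projection can be sketched as follows, assuming the previously found vectors are orthonormal (which holds when each one was normalized and deflated in turn). The function name and the example vectors are hypothetical.

```python
import math


def deflate_and_normalize(w, found):
    """Project w onto the space orthogonal to the vectors in `found`, then
    normalize.

    `found` holds the previously estimated columns of B (the columns of
    B-tilde). For orthonormal columns, removing the components one at a time
    equals w - B~ B~^T w.
    """
    w = list(w)
    for b in found:
        dot = sum(bi * wi for bi, wi in zip(b, w))
        w = [wi - dot * bi for wi, bi in zip(w, b)]   # w = w - b (b^T w)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


# Hypothetical example: one component already found along the first axis.
found = [[1.0, 0.0, 0.0]]
w = deflate_and_normalize([3.0, 4.0, 0.0], found)
```

Here the component of w along the first axis is removed, so the result is the unit vector along the second axis.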
TESTING
The Fast-ICA algorithm has been tested on both synthetic signals and audio signals on two platforms:
• TI 6713: 32 bit floating point platform.
• TI 6416: 32 bit fixed point platform.
The mixing matrices are generated randomly. The performance of the algorithm is compared for floating point and fixed point implementations in terms of execution speed and accuracy of separation.
Floating point implementation
The TMS320C6713 was selected as the target processor [10], [17] for developing the algorithm. The device is a 32 bit processor based on the high performance very long instruction word (VLIW) architecture. Operating at 225 MHz, the C6713 delivers up to 1350 million floating point operations per second (MFLOPS), 1800 million instructions per second (MIPS) and, with dual fixed/floating point multipliers, up to 450 million multiply-accumulate operations per second (MMACS). It has 264 KB of on-chip memory consisting of a 4 KB Level 1 program cache, a 4 KB Level 1 data cache and 256 KB of Level 2 memory/cache. Further details on the DSK and chip are available on the TI website [10].
The signals are downloaded to the real time floating point platform. Before being subjected to the separation algorithm, the signals are mixed using the randomly generated mixing matrix and then whitened on the DSK itself.
Fixed point implementation
The TMS320C6416 was selected as the target processor for the fixed point implementation. The device is a 32 bit processor based on the second generation high performance very long instruction word (VLIW) architecture. With performance of up to 5760 million instructions per second (MIPS) at a clock rate of 720 MHz, the C6416 possesses the operational flexibility of high speed controllers and the numerical capability of array processors. It can perform four 16-bit multiply-accumulates (MACs) per cycle for a total of 2880 million multiply-accumulate operations per second (MMACS), or eight 8-bit MACs per cycle for a total of 5760 MMACS. The C6416 also has 1056 KB of on-chip memory consisting of a 16 KB L1 program cache, a 16 KB L1 data cache and 1024 KB of L2 memory/cache. Full details on the DSK and chip are available on the TI website [11].
In the first case, a fixed point implementation equivalent to the floating point system was carried out in C using the built in routines of the Code Composer Studio v3.1 IDE [12]. This is the same as emulating the floating point program on the fixed point processor (C6416). Although the accuracy of separation is high, the execution time is large.
In the second case, to reduce the execution time while maintaining an acceptable level of separation, the fixed point code is derived through manual fixed point programming. The manual code optimization involves steps such as [13], [15]:
• Replacement of floating point variables by fixed point ones, encoded as integers.
• Selection of appropriate fixed point format for scaling to avoid overflows and to reduce loss of precision.
• Implementation of arithmetic operations such as addition, subtraction, multiplication, division, square root, shifting, truncation and change of exponent in fixed point arithmetic using a series of preprocessor macros [13].
The following subsections highlight the basic concepts of fixed point arithmetic used in developing the algorithm on the fixed point platform [15], [16].
Fixed point representation
A fixed point number can be thought of as an integer multiplied by a power of two with a negative exponent, i.e., $x = I \times 2^{-b}$, where I is the stored integer and b is the number of fractional bits. The precision of a fixed point number is the smallest difference between two successive numbers, e.g., for the 16.16 format the precision is $1/2^{16}$.
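The representation can be made concrete with a small sketch of the 16.16 format. Python is used only for illustration (the implementation described in the text is in C), and the names `FRAC_BITS`, `to_fixed` and `to_real` are hypothetical.

```python
FRAC_BITS = 16            # b: number of fractional bits in the 16.16 format
ONE = 1 << FRAC_BITS      # fixed point encoding of 1.0


def to_fixed(x):
    """Encode a real number as a 16.16 fixed point integer (round to nearest)."""
    return int(round(x * ONE))


def to_real(q):
    """Decode a 16.16 fixed point integer back to a real number."""
    return q / ONE


precision = to_real(1)                    # smallest step between two values
largest = to_real((1 << 31) - 1)          # largest value in a signed 32 bit word
error = abs(to_real(to_fixed(0.1)) - 0.1) # rounding error of one conversion
```

With round-to-nearest conversion the error of a single conversion is at most half the precision, which is why the choice of format trades range against accuracy.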
Fixed point arithmetic rules [16]
The rules for doing some basic arithmetic operations on fixed point integers are listed below:
• Conversion from real to fixed point numbers: multiply by $2^{b}$ and store the result as an integer.
• Operations that combine a multiplication by $2^{b}$ with integer multiplication or division: as the intermediate result is expected to produce overflow, store it in a 64 bit integer.
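The need for a wide intermediate result can be seen in a short sketch of 16.16 multiplication and division. This is an illustration under stated assumptions, not the macros used by the authors: Python integers are unbounded, so the 32 bit limit is checked explicitly with the hypothetical helper `fits_int32`, and the function names are invented for the example.

```python
FRAC_BITS = 16                 # b for the 16.16 format
INT32_MAX = (1 << 31) - 1
INT32_MIN = -(1 << 31)


def fits_int32(q):
    """True if q can be stored in a signed 32 bit integer."""
    return INT32_MIN <= q <= INT32_MAX


def fx_mul(a, b):
    """Multiply as integers, then rescale by 2**-b with a right shift."""
    wide = a * b               # would need a 64 bit integer on the DSP
    return wide >> FRAC_BITS


def fx_div(a, b):
    """Pre-scale the dividend by 2**b, then divide as integers."""
    wide = a << FRAC_BITS      # would also need a 64 bit integer
    # Python floors the quotient; C truncates toward zero for negative values.
    return wide // b


three = 3 << FRAC_BITS         # 3.0 in 16.16
two = 2 << FRAC_BITS           # 2.0 in 16.16

product = fx_mul(three, two)               # 6.0 in 16.16
quotient = fx_div(three, two)              # 1.5 in 16.16
raw_product_fits = fits_int32(three * two) # the unscaled product overflows
```

Even for operands as small as 3.0 and 2.0 the unscaled product exceeds the 32 bit range, although the final rescaled result fits comfortably.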
EXPERIMENTAL RESULTS & DISCUSSION
The experimental results have been demonstrated for three different test cases:
• Synthetically generated signals with no noise added during mixing.
• Sound signals generated by musical instruments.
• Synthetically generated signals with noise added during mixing.
The following three implementations of the Fast-ICA algorithm have been tested for the above cases:
• Implementation on the floating point processor TI 6713.
• Floating point program emulated (migration using the inbuilt method) on TI 6416.
• Implementation of the fixed point algorithm (code optimization through manual fixed point programming) on TI 6416.
Test Case 1 comprises three signals generated synthetically using MATLAB code: a sine wave with a frequency of 800 Hz, a square wave with a frequency of 700 Hz and a saw-tooth wave with a frequency of 600 Hz. The signals were sampled at 10 kHz and mixed using the mixing matrix given by:
A = [0.0891 0.3906 -0.3408; -0.8909 -0.6509 0.8519; 0.4454 0.6509 -0.3976]
Fig. 1, Fig. 2 and Fig. 3 show the source signals, the mixed signals and the separated signals obtained on the floating point DSP (C6713), respectively. Fig. 4 shows the separated signals obtained from the floating point emulation on the fixed point DSP platform (C6416). Fig. 5 shows the separation results when Fast-ICA was coded using fixed point arithmetic and implemented on the fixed point processor (C6416). Fig. 6 presents, as a bar graph, a comparison of the execution time against the number of samples for the three implementations. Table 1 gives the separation results in terms of correlation coefficients between the original source and the separated source after implementation on the DSP.
Test Case 2 comprises sound signals from three musical instruments, i.e., violin, drums and piano, recorded at a sampling frequency of 8 kHz to serve as source signals. 19200 samples have been considered for experimentation. The samples of the signals were mixed in the same way for all implementations using the mixing matrix A specified for Test Case 1. Fig. 7, Fig. 8 and Fig. 9 show the original sound signals, the mixed sound signals and the separated sound signals obtained on the floating point DSP (6713), respectively. Fig. 10 shows the separated signals obtained from the floating point program emulated (without any optimization) on the fixed point DSP (6416). Fig. 11 presents the results of the optimized Fast-ICA implemented on the fixed point DSP (6416).
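The Test Case 1 signals and their mixing can be sketched as follows. The frequencies, sampling rate and mixing matrix are taken from the text; the unit amplitudes, zero initial phases, the 100-sample length and the waveform helper functions are assumptions made for illustration (the authors generated their signals in MATLAB).

```python
import math

FS = 10_000   # sampling rate in Hz (from the text)
N = 100       # number of samples; assumed here for illustration


def sine(f, t):
    return math.sin(2 * math.pi * f * t)


def square(f, t):
    """Square wave with values +1 and -1 and a 50 percent duty cycle."""
    return 1.0 if (f * t) % 1.0 < 0.5 else -1.0


def sawtooth(f, t):
    """Saw-tooth wave rising linearly from -1 to +1 over each period."""
    return 2.0 * ((f * t) % 1.0) - 1.0


# Mixing matrix A as given for the test cases.
A = [[0.0891, 0.3906, -0.3408],
     [-0.8909, -0.6509, 0.8519],
     [0.4454, 0.6509, -0.3976]]

times = [k / FS for k in range(N)]
sources = [[sine(800, t) for t in times],
           [square(700, t) for t in times],
           [sawtooth(600, t) for t in times]]

# x = A s, computed sample by sample.
mixed = [[sum(A[i][j] * sources[j][k] for j in range(3)) for k in range(N)]
         for i in range(3)]
```

The same `mixed` array can be passed through whitening and the fixed point iteration to reproduce the separation experiment in outline.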
Table 2 and Table 3 display, respectively, the correlation coefficients for the separated sound signals and the execution time required for 19200 samples for the three implementations.
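The accuracy measure used in these tables, the correlation coefficient between an original source and a separated source, can be computed as in the sketch below. Taking the absolute value to account for the sign ambiguity of ICA is an assumption; the text does not state how sign is handled. The function name and the sample data are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical source and a "separated" copy that is scaled and sign-flipped,
# as ICA recovers sources only up to scale and sign.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
separated = [-2.5 * x for x in source]

r = correlation(source, separated)
score = abs(r)   # separation quality independent of sign
```

Because the correlation coefficient is invariant to scale, a perfectly separated signal scores 1 in absolute value regardless of the amplitude at which it is recovered.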
Test Case 3 comprises the three synthetically generated signals specified in Test Case 1, exposed to additive Gaussian noise of 27 dB during the mixing stage. The same mixing matrix A was used for mixing. The same three implementations were carried out as for the first two test cases. Fig. 12 presents the results of implementation on the floating point DSP (6713). Fig. 13 displays the separation results of the optimized fixed point Fast-ICA implemented on the fixed point DSP (6416). Table 4 shows the performance indices [14] obtained from the implementation of Fast-ICA on the floating point DSP (6713) and from the implementation of Fast-ICA in optimized fixed point form on the fixed point DSP (6416). The execution times for 100 samples for both implementations are almost the same as those of Table 1.
[Table row: Fixed-point implementation on 6416 with optimization, 1.6297]
[Figure: amplitude versus time sample]
CONCLUSION
This paper presented a comparative study of the Fast-ICA algorithm implemented on a fixed point platform and on a floating point processor. The accuracy and speed were found to be acceptable. In addition, the fixed point processor needs less space and consumes less power. More work needs to be done in this direction to embed these codes in portable consumer devices without further deterioration of energy efficiency.
