Abstract-This paper considers the implementation of a signal subspace based mobile user tracking system that utilizes an efficient Conjugate Gradient (CG) based step-by-step adaptation scheme. First, we estimate the computational complexity of the different units of the tracking system. Based on these estimates, we partition the implementation task into two parts: software and hardware. Finally, for the hardware implementation of the tracking unit, a systolic architecture is proposed. With the aid of the proposed systolic array, the time complexity of the tracking unit is reduced to O(M).
I. INTRODUCTION
The channel parameter tracking problem arises in numerous situations. One example is mobile user tracking, in which the spatial beamforming procedure must be carried out continuously for each user. In [1], a step-by-step update scheme of the CG method was implemented for the mobile user tracking system. It was shown that the proposed CG based tracking system has better tracking performance in terms of faster and smoother convergence and smaller misadjustment.
In this paper, we consider the VLSI implementation of the tracking system of [1], and focus on developing efficient systolic architectures that are suitable for real-time applications. This paper is organized as follows. Section II briefly presents the structure of the CG based tracking system. In Section III, the computational complexity of the system is evaluated. In Section IV, implementation issues of the tracking system are discussed. Furthermore, for the tracking unit a novel systolic architecture is designed that reduces the required computational time by an order of magnitude. Finally, concluding remarks are provided in Section V.
II. SYSTEM MODEL
Figure 1 illustrates the overall system model of our tracking system [1]. In this section, we briefly describe the function of each unit.

A. Tracking Unit
The signal subspace tracking problem is formulated as a quadratic cost function, for which a step-by-step update scheme has been implemented [1].
For the step-by-step update scheme, the modified CG (MCG) algorithm has been utilized [2]. In the MCG algorithm of Table I, α(n) is the step size that minimizes the cost function through the line search procedure along the search direction p_{n-1}. The residual vector g(n) points in the direction of steepest descent. λ_f is the forgetting factor, and η should satisfy (λ_f - 0.5) ≤ η ≤ λ_f [2]. The factor β(n) ensures that R-orthogonality is preserved between search directions.
[Table I: MCG algorithm. Set initial conditions: w(0); for n = 1, 2, ...]
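Since the body of Table I is not reproduced here, the following is only a minimal sketch of one sample-by-sample MCG recursion, assuming the commonly published form of the algorithm (exponentially weighted correlation matrix, step size scaled by η, Polak-Ribiere-type factor β). The function names `mcg_track`, `vdot`, `matvec` and `demo`, the regularization constant `delta`, and the synthetic reference signal `d` are hypothetical and chosen for illustration; the authors' exact recursion and initial conditions may differ.

```python
import random

def vdot(a, b):
    """Inner product a^H b of two complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def matvec(A, v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def mcg_track(X, d, lam=0.99, eta=0.6, delta=1e-2):
    """One CG iteration per snapshot. Assumed form, not the paper's Table I.

    X: list of M-element snapshots x(n); d: reference samples d(n).
    lam is the forgetting factor, eta must lie in [lam - 0.5, lam].
    """
    M = len(X[0])
    x0 = X[0]
    # Assumed initial conditions: w(0) = 0, g(0) = b(0), p(1) = g(0).
    R = [[x0[i] * x0[j].conjugate() + (delta if i == j else 0.0)
          for j in range(M)] for i in range(M)]
    w = [0j] * M
    g = [xi * d[0].conjugate() for xi in x0]
    p = list(g)
    for x, dn in zip(X[1:], d[1:]):
        # R(n) = lam * R(n-1) + x(n) x^H(n)
        R = [[lam * R[i][j] + x[i] * x[j].conjugate() for j in range(M)]
             for i in range(M)]
        v = matvec(R, p)                              # v(n) = R(n) p(n)
        alpha = eta * vdot(p, g) / vdot(p, v).real    # step size
        e = dn - vdot(w, x)                           # error with old weights
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        g_new = [lam * gi - alpha * vi + xi * e.conjugate()
                 for gi, vi, xi in zip(g, v, x)]      # residual update
        diff = [a - b for a, b in zip(g_new, g)]
        beta = vdot(diff, g_new) / vdot(g, g).real    # keeps R-orthogonality
        p = [gi + beta * pi for gi, pi in zip(g_new, p)]
        g = g_new
    return w

def demo(num_samples=600, M=4, seed=0):
    """Distance between tracked weights and hypothetical true weights."""
    rng = random.Random(seed)
    cg = lambda s=1.0: complex(rng.gauss(0, s), rng.gauss(0, s))
    w_true = [cg() for _ in range(M)]
    X = [[cg() for _ in range(M)] for _ in range(num_samples)]
    d = [vdot(w_true, x) + cg(0.01) for x in X]
    w = mcg_track(X, d)
    return sum(abs(a - b) ** 2 for a, b in zip(w, w_true)) ** 0.5
```

Calling `demo()` returns the Euclidean distance between the final weight vector and the synthetic target. Note that the only O(M²) operation inside the loop is the matrix-vector product `v`, which is the term the hardware partition is concerned with.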
In our system derivation, for the sake of simplicity, the initially unknown antenna response vector a(θ_k) has been replaced with a weight vector estimate w(n). The weight vector w(n) provided by the tracking unit converges to the desired steering vector, which correlates best with the desired user signal.
B. DOA Extraction Unit
In the Direction-of-Arrival (DOA) extraction unit, the new tracking angle estimates θ_k(n) (k = 1, ..., N) are computed through the Least Squares (LS) fitting criterion, which is based on small deviations in the array manifold [1].
The LS criterion is based on the linear model and can be expressed as:
where q(n) consists of array samples of θ_k, and H is the M×1 observation vector.
C. Beamforming Unit
In this unit, the following conventional beamforming
III. COMPLEXITY OF THE SYSTEM
In this section, the computational complexity of the different units is estimated and compared. In the tracking, DOA, and beamforming units, the most computationally intensive operations are the calculation of the step size α, Eq. (1), and Eq. (2), respectively. In Table II, the order of computational complexity of the different units of the tracking system is calculated for N sources.
[Table II: Order of computational complexity of the units of the tracking system. Surviving entry: O(NM). M: number of antennas; N: number of sources.]
As can be seen from Table II, the tracking unit has the highest order of complexity. The core of this unit is the sample-by-sample CG algorithm. Compared to the conventional CG algorithm, also referred to as Block Conjugate Gradient (BCG), in the MCG algorithm the computation of the residual vector g(n) and the factor β(n) is more complex and requires a higher number of vector inner products. Next, we study and compare the computational complexities of the sample-by-sample CG and the BCG algorithms. The results are shown in Table III.
It is clear that the computational complexity of the BCG depends on the number of iterations I and that, for a large M, the BCG is I times more complex than the sample-by-sample CG algorithm.
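The factor-of-I relation for large M can be illustrated with a rough multiplication count. This is a sketch under stated assumptions: each CG iteration is modelled as one M×M matrix-vector product plus a handful of length-M vector operations, and the constants `c_cg` and `c_bcg` are illustrative placeholders, not the entries of Table III.

```python
def cg_mults(M, c_cg=8):
    """Multiplications per sample for the sample-by-sample CG (one iteration).

    c_cg is a placeholder for the number of length-M vector operations.
    """
    return M * M + c_cg * M

def bcg_mults(M, I, c_bcg=5):
    """Multiplications per sample for the BCG with I iterations.

    c_bcg is a placeholder; fewer inner products per iteration are assumed.
    """
    return I * (M * M + c_bcg * M)

def relative_cost(M, I, c_cg=8, c_bcg=5):
    """Ratio of BCG cost to sample-by-sample CG cost."""
    return bcg_mults(M, I, c_bcg) / cg_mults(M, c_cg)
```

With these placeholder constants, `relative_cost(4, 4)` is below I because the extra inner products of the sample-by-sample algorithm still matter for a small array, while `relative_cost(1000, 4)` is close to I because the M² term dominates.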
IV. IMPLEMENTATION OF THE SYSTEM
For real-time applications, the conventional DSP-based implementation methods are not sufficient to meet the demand for high sampling rates. Consequently, for the implementation of units with high computational complexity, application-specific integrated circuits (ASICs) should be utilized.
As can be seen from Table II, in our system the most computationally intensive block is the tracking unit. In this unit, the order of complexity for N sources is O(NM²). In Figure 2, the hardware (HW)/software (SW) partitioning of our system is illustrated.
In this section, we discuss the implementation of the HW partition that is needed for the MCG algorithm and focus on developing an efficient VLSI array processor that is suitable for real-time applications. For this purpose, we design a systolic array that targets the most computationally intensive block of the MCG algorithm.
A. Review of the Implementation Techniques
As can be seen from Table I, in our tracking unit the most computationally intensive operations are the matrix computations. Furthermore, in order to meet the demand for high sampling rates and to achieve acceptable execution speed, the conventional serial implementation methods are not sufficient. Thus, parallel architectures should be utilized.
For the matrix-vector computations needed in the MCG algorithm of Table I
B. Systolic Implementation
In this section, we design a systolic architecture that reduces the time complexity of the MCG algorithm to O(M).
As discussed in the previous section, due to the serial nature of the algorithm, there is a very low degree of parallelism in the algorithm. Furthermore, due to the iterative nature of the algorithm and the requirement for different resetting schemes for β [2], direct mapping of the MCG algorithm to an ASIC is not practical. Therefore, our systolic architecture targets the matrix-vector and vector-vector products needed in the calculation of the step size α and the factor β.
Consider the calculation of the step size α:
For simplicity, we introduce the new variable v(n) as follows:
Due to the sample-by-sample update scheme in the MCG algorithm, the correlation matrix R(n) varies with every sample.
However, when calculating the weight vectors for N individual sources, R(n) remains the same and therefore, for N iterations the same R(n) is used. As a result, for the systolic architecture a 2D array implementation is adopted. The elements of R(n) = [r_ij(n)] (i, j = 1, ..., M) are preloaded into this array processor and remain constant for N iterations.
Now, consider the following vector-vector multiplications that are needed in the MCG algorithm.
For the realization of Eq. (5) and Eq. (6), a linear array is selected. For synchronization purposes, the linear array is placed below the 2D array. Figure 3 illustrates the proposed systolic array when M=4. In Figure 3b, the cell function of each Processor Element (PE) is illustrated.
Furthermore, this architecture utilizes the availability of the residual vector g(n) and performs the following vector inner product needed in the calculation of β:
$$gg(n-1) = g^{H}(n-1)\,g(n-1)$$
As can be seen from Table I, for the calculation of the residual vector g(n), the matrix-vector multiplication of Eq. (4), i.e. the vector v(n), is required. For utilizing v(n), two methods can be used. One method is to keep the elements of v(n) by allocating a local memory to each PE2 and then sequentially transfer them to the host from the last PE2 of the linear array.
The second method is to slightly modify the PE2s. This can be achieved by adding an extra output port to the PE2s, as illustrated in Fig. 4. The total number of PEs required in this systolic architecture is M² + M.
For the implementation of the complex multipliers needed in the PEs, the Strength Reduction (SR) transformation technique has been utilized [7]. By utilizing the SR transformation, the total number of real multiplications needed in a complex multiplier is reduced to only three. To clarify this further, consider the complex multiplication required in PE1, i.e. pr = (pr_R + j pr_I) = p · r. By utilizing the SR technique we have:
As can be seen from Eq. (8) and Figure 5, by utilizing the SR transformation the total number of real multiplications needed in a complex multiplication is reduced to only three. This is at the expense of having three additional adders. However, it is well known that multiplications are more complex than additions and consume much more power as well. In fact, for a single complex multiplication, power reductions of up to 25% can be achieved [7]. Thus, the SR transformation can result in remarkable savings in consumed power and silicon area.
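The trade of one multiplier for three adders can be checked with a short sketch. It uses one standard form of the strength reduction identity for a complex product; the exact grouping of terms in Eq. (8) may differ, and the function name `complex_mult_sr` and its argument names are hypothetical.

```python
def complex_mult_sr(a, b, c, d):
    """Compute (a + jb)(c + jd) with three real multiplications.

    A direct evaluation needs four multiplications and two additions;
    this form needs three multiplications and five additions.
    """
    k = d * (a - b)            # shared product, multiplication 1
    real = k + a * (c - d)     # multiplication 2: equals a*c - b*d
    imag = k + b * (c + d)     # multiplication 3: equals a*d + b*c
    return real, imag
```

For example, `complex_mult_sr(1, 2, 3, 4)` evaluates (1 + 2j)(3 + 4j) and returns the real and imaginary parts as a pair, which can be compared against Python's built-in complex product.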
In order to calculate the throughput of the systolic array, we assume that one time step of the global clock corresponds to the processing time required for each PE. For the initialization of the PE1s, M time steps are needed. Thus, the total computation time required by the array is 3M steps. Figure 6 illustrates the flow of data in the proposed systolic architecture for different time steps.
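The 3M figure can be reproduced with a schedule-level simulation. The sketch below is an assumption about the dataflow, not the cell design of Figure 3: the elements of R are held stationary in an M×M grid of PE1s, the elements of p enter skewed by one step per row, partial sums travel down each column, and a linear array of PE2s below the grid accumulates p^H v. The function name and the wavefront rule `i + j == t` are illustrative.

```python
def systolic_matvec_inner(R, p):
    """Simulate v = R p and p^H v on an assumed M x M + M systolic array.

    Returns (v, p^H v, total_time_steps).
    """
    M = len(p)
    preload_steps = M              # initialization of the PE1s, as stated in the text
    col_sum = [0j] * M             # partial sum travelling down column i
    v = [0j] * M
    pv = 0j                        # running inner product in the linear array
    compute_steps = 0
    for t in range(2 * M):
        # 2D array: the PE1 in row j, column i fires when i + j == t.
        for i in range(M):
            j = t - i
            if 0 <= j < M:
                col_sum[i] += R[i][j] * p[j]
        # Linear array: PE2 number i fires one step after column i finishes.
        i = t - M
        if 0 <= i < M:
            v[i] = col_sum[i]
            pv += p[i].conjugate() * v[i]
        compute_steps += 1
    return v, pv, preload_steps + compute_steps
```

Under this assumed schedule the last PE2 fires at step 2M of the compute phase, so the total of M preload steps plus 2M compute steps agrees with the 3M steps quoted above. When the same R(n) is reused for N sources, only the compute phase would need to be repeated.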
V. CONCLUSIONS
In this paper, the implementation of a signal subspace based mobile user tracking system that utilizes an efficient sample-by-sample CG algorithm was discussed. First, the computational complexity of the different units of our mobile user tracking system was estimated. Based on these estimates, for a more realistic implementation, the tracking system was partitioned into two parts: HW and SW. For the hardware implementation of the tracking unit, a systolic architecture was proposed. With the aid of this systolic array, the time complexity of the most computationally intensive unit was reduced to O(M). Furthermore, for the implementation of the complex multipliers needed in the PEs, the Strength Reduction transformation technique was utilized. As a result, remarkable savings in consumed power and silicon area were achieved. Future research should be directed towards mapping the system onto a fixed number of processors when the number of antennas M is large.
