Abstract
receiver. These algorithms have extremely high complexity and they put increased pressure on the next generation receivers, which have stringent time, power and size constraints.
One of the main computational bottlenecks in the base-station receiver is the estimation and detection of the transmitted bits from the received signal. Multiuser channel estimation and detection is being considered as part of the advanced receiver structures for next generation communication systems. For our evaluation, we choose one of the computationally intensive algorithms designed for W-CDMA multiuser channel estimation and detection [7]. This algorithm is based on a joint multiuser estimation and detection technique which eliminates the need for conventional extraction of parameters in channel estimation and thus gives gains both in performance, in terms of Bit Error Rate (BER), and in computational complexity. The algorithm includes chip-level despreading, which is one of the main bottlenecks in the receiver. An implementation of this algorithm on a Texas Instruments C6x DSP simulator [9], chosen as an example of current processor technology, does not meet real-time constraints. We develop techniques to accelerate the implementation of this algorithm. A task decomposition of the algorithm is carried out to expose the data dependencies between its tasks and to take advantage of the available pipelining and parallelism.
The analysis also shows how frequently we can update the algorithm based on decision feedback.
The main contributions of this paper are twofold. First, we develop techniques to accelerate the implementation of wireless communication algorithms in hardware, and these techniques are independent of the final hardware mapping of the algorithm. We show this specifically for the joint estimation and detection algorithm. Second, we show that an application-specific approach with multiple processing elements is more effective than a single-processor system in meeting the real-time constraints of the base-station receiver.
Joint Estimation and Detection

Channel Model
The channel estimation and detection block in the base-station receiver is shown in Figure 1. We assume an asynchronous Code Division Multiple Access (CDMA) based system with Binary Phase Shift Keying (BPSK) modulation, where the signal is multiplied by a short repeating spreading code before transmission. As the spread signal is sent through the channel, it experiences undesirable effects such as delays due to multiple paths, interference from other users, fading and noise. The detector needs to acquire synchronization with the input signal in order to correctly detect the incoming bit sequence. Hence, the parameters of the channel need to be estimated for proper detection. Channel estimation involves estimating and tracking the delays of each user's bits and the channel attenuation over the different paths. The channel estimation scheme uses the Maximum Likelihood principle [6] to estimate the channel parameters and feeds this information directly to the detector without actually extracting the channel parameters. The detection scheme is the Differencing Multistage Detection method [11], based on the principle of Parallel Interference Cancellation (PIC) [10]. The detector uses the information from the channel estimation block to remove the interference from other users.
The channel information is built on the basis of a priori information obtained by transmission of a pilot signal (b), which is a sequence of bits that is known at the receiver. The pilot signal received at the base-station (pilot) is compared with the known bits to form an estimate of the channel. The decisions from the multiuser detection block (d) are fed back to the channel estimation block along with the received data bits (data), delayed by the time required for detection, so that tracking can continue when the pilot signal is absent.
Real-time Requirements
Data transmission in the next generation wireless systems [5] is done in frames of 10 ms. The data can be transmitted at variable rates depending on the spreading factor (SF), as shown in Table 1. The table gives an example of the number of bits in a frame for spreading factors of 4, 32 and 256. We assume BPSK modulation for our algorithm. To operate in real time, bits must be detected at the rate at which they are transmitted. We choose a short Gold code sequence of length 31 for our spreading (the closest match to the proposed spreading factor of 32). This implies that the real-time requirement of our joint estimation and detection scheme is to detect input data bits at a rate of 128 Kbps.
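The relation between spreading factor, frame length and data rate can be sketched as follows. This is an illustrative calculation only: the chip rate of 4.096 Mchip/s is an assumption inferred from the stated 128 Kbps requirement at a spreading factor of 32, and the function names are hypothetical.

```python
# Assumed chip rate: 128 Kbps x SF 32 = 4.096 Mchip/s (inferred, not quoted from the paper).
CHIP_RATE_HZ = 4_096_000
FRAME_MS = 10  # frame length stated in the text


def bits_per_frame(spreading_factor, bits_per_symbol=1):
    """Bits carried in one frame; BPSK carries one bit per symbol."""
    chips_per_frame = CHIP_RATE_HZ * FRAME_MS // 1000
    return chips_per_frame // spreading_factor * bits_per_symbol


def data_rate_bps(spreading_factor, bits_per_symbol=1):
    """Bit rate the detector must sustain to keep up with transmission."""
    return bits_per_frame(spreading_factor, bits_per_symbol) * 1000 // FRAME_MS


for sf in (4, 32, 256):
    print(sf, bits_per_frame(sf), data_rate_bps(sf))
```

Under the assumed chip rate, a spreading factor of 32 gives 1280 bits per 10 ms frame, which corresponds to the 128 Kbps target used in the rest of the analysis.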
Computations Involved
The derivation of the joint estimation and detection algorithm is detailed in [7]. We include the algorithm here, without proof, to explain the computational aspects involved. The model for the channel can be expressed as
where $r_i \in \mathbb{C}^{N}$ are the received bits of all $K$ asynchronous users, spread with a spreading factor $N$,
are the bits of the $K$ users to be detected, $A_i \in \mathbb{C}^{K \times N}$ is the estimate of the channel containing information about the spreading codes, attenuation and delays from the various paths, $\eta_i$ is the noise, which is assumed to be additive white Gaussian noise (AWGN), and $i$ is the time index. The computations that occur during the synchronization phase [7, 
where $L$ is the length of the pilot sequence, $R_{br} \in \mathbb{C}^{2K}$ is the cross-correlation matrix between the synchronization bits $b_i$ and the received signal $r_i$, and $R_{bb}$ is the autocorrelation matrix. The channel estimate $A_i$ can be obtained by solving
Dropping the subscript $i$ for convenience, the matrix $A$ can be rearranged into its odd and even columns $A_0, A_1 \in \mathbb{C}^{K \times N}$, which correspond to the bits $b_{i-1}$ and $b_i$ in the estimate. In vector form,
It has been shown that detecting a block of bits simultaneously (multishot detection) can give performance gains [11]. Multishot detection is also near-far resistant, as it accounts for the interference from both the overlapping symbols of the interfering users. In order to do multishot detection, the above model should be extended to include multiple bits. Let us consider $D$ bits at a time ($i = 1, 2, \ldots, D$).
So, we form the multishot received vector $r$ of length $ND$ by concatenating the $D$ vectors $r_i$ ($i = 1, 2, \ldots, D$).
where $y^{(1)}$ and $d^{(1)}$ are the soft and hard decisions after the first stage of the joint detector and $S \in \mathbb{R}^{KD \times KD}$ contains the diagonal elements of $A^H A$. These computations are iterated for $l = 1, 2, \ldots, M$, where $M$ is the maximum number of iterations.
The structure of $A^H A \in \mathbb{C}^{KD \times KD}$ is as shown
The hard decisions, $d$, which are made at the end of the final stage, are fed back to the synchronization block and to the rest of the processing blocks in the receiver.
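To make the role of the soft decisions, hard decisions and the diagonal matrix $S$ concrete, the following is a minimal sketch of a conventional multistage parallel interference cancellation loop. It is not the differencing variant of [11] and it is not taken from [7]; it assumes real-valued quantities, takes the matched-filter output $A^H r$ and the matrix $A^H A$ as given, and all names are hypothetical.

```python
def hard_decision(value):
    """BPSK hard decision on a soft value."""
    return 1 if value >= 0 else -1


def pic_detect(gram, matched, stages):
    """Conventional multistage PIC (illustrative sketch).

    gram    : A^H A as a list of rows (real-valued here for simplicity)
    matched : matched-filter output A^H r
    stages  : number of cancellation stages after the matched filter
    """
    n = len(matched)
    decisions = [hard_decision(v) for v in matched]  # stage 0: matched filter
    for _ in range(stages):
        # Soft decision: subtract the interference estimated from the other
        # bits, i.e. y = A^H r - (A^H A - S) d, with S = diag(A^H A).
        soft = [
            matched[k] - sum(gram[k][j] * decisions[j] for j in range(n) if j != k)
            for k in range(n)
        ]
        decisions = [hard_decision(v) for v in soft]
    return decisions


# Two users, the second much stronger than the first (near-far case).
gram = [[1.0, 1.5], [1.5, 9.0]]
sent = [1, -1]
matched = [sum(g * b for g, b in zip(row, sent)) for row in gram]  # noiseless A^H r
print(pic_detect(gram, matched, stages=0))  # matched filter alone
print(pic_detect(gram, matched, stages=1))  # after one cancellation stage
```

In this noiseless toy case the matched filter alone misdetects the weak user, and a single cancellation stage recovers the transmitted bits. Every bit in a stage uses only the previous stage's decisions, which is what allows the per-user parallelism discussed in the task decomposition.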
Task Decomposition and Implementation
The algorithm is implemented on a TI C6x DSP simulator [9], assuming a TI TMS320C6701 (C67) floating-point processor. This processor is taken as an example of current generation processor technology for our analysis. The C67 is one of the recent DSPs from TI; it has a high-performance VLIW (Very Long Instruction Word) architecture and has been proposed for wireless base-stations. It has a 32-bit architecture with 8 functional units, consisting of 2 multipliers, 4 ALUs and 2 load/store units. It has hardware support for IEEE single and double precision floating-point instructions and can perform 2 multiply-accumulate (MAC) operations per cycle. The algorithm was written in C.
The algorithm was written in a memory-efficient manner that avoids transposes and uses in-place computations. The entire code and data fit in the internal memory of the DSP. In this initial implementation, LU decomposition was used to calculate (4). We use the TI C Compiler ver 3.0 to generate the assembly code for the DSP. The highest compiler optimization levels recommended by TI [8] were used. The optimizations perform software pipelining, loop unrolling and other program level optimizations to exploit 
Task Decomposition
The sequential implementation of the entire algorithm on the DSP does not meet real-time constraints. In fact, the data rate achieved by the detection block alone, assuming a single-stage iteration, falls short of the requirement by a factor of 6. So, a task decomposition of the algorithm is carried out to find the data dependencies and to identify all available sources of pipelining and parallelism. A coarse-grained pipelined-parallel task decomposition of the joint estimation and detection algorithm is shown in Figure 2. The input to the channel estimation block on the left is either the known pilot bits (b) and the received pilot bits (pilot), or the detected data bits (d) and the received data bits delayed by the time required for detection (data'). The dotted blocks (I-IV) represent pipelined operations, whereas the blocks inside a dotted block represent operations that can be done in parallel. The input data bits stream continuously into the receiver, which has to ensure that the received data stream is processed continuously so as to meet the real-time constraints. However, the channel estimate can be updated less frequently so that the requirements of the detection are met. (We neglect the effect of channel estimation on bit error rates for this purpose.) The parts of multiuser detection which depend on the input data are the calculation of $A^H r$, as in (7), and the multistage detection loop (8-13). An order-of-complexity analysis of the algorithm was also carried out to find the bottlenecks in each block.
Simulations and Analysis
An in-depth profiling of the various blocks was carried out using the clock function in the C6x DSP. The cycle count for the various blocks is as shown in Table 2 for more fine grain parallelism from the above task partition graph. Table 3 shows the advantages of various levels of parallelism (Pl) and pipelining (Pp). Let A refer to the calculation of $A^H r$ in block III and B to block IV. Let (A + B Sequential) be the present solution obtained. If A and B were pipelined (A B), the required computation time becomes the maximum of A and B. Next, $A^H r$ can be computed for each user in parallel, as each row of $A^H$ corresponds to a user, reducing the time to 885 cycles. This moves the bottleneck to block B (Pl(A) B). Hence, block B is also unrolled into different stages. The first stage now has the most complexity and becomes the new bottleneck, needing 3367 cycles (Pl(A) Pp(B)). It has been shown [11] that each successive stage in B requires less computation than the previous stage. Hence, fewer or less powerful processing elements need to be used for these stages. Each stage can also be split across multiple processing elements in a manner similar to A. This reduces the cycles needed to 225, putting the bottleneck back on A (Pl(A) PlPp(B)). A and B after this step are shown in Figure 3.
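The bottleneck reasoning in this paragraph can be sketched as below. The cycle counts are the ones quoted in the text; the function and the stage labels are hypothetical, and the sketch assumes ideal pipelining with no communication overhead.

```python
def limiting_stage(stage_cycles, pipelined=True):
    """Return (name, cycles) of what limits throughput.

    Run sequentially, the cost is the sum of all stages; when the stages
    are pipelined, the slowest stage alone sets the rate.
    """
    if not pipelined:
        return ("sequential", sum(stage_cycles.values()))
    name = max(stage_cycles, key=stage_cycles.get)
    return (name, stage_cycles[name])


# A computed per user in parallel, B unrolled into pipelined stages.
after_unrolling_b = {"A": 885, "B stage 1": 3367}
# Each stage of B additionally split across processing elements.
after_splitting_b = {"A": 885, "B (split stages)": 225}

print(limiting_stage(after_unrolling_b))  # B stage 1 limits the rate
print(limiting_stage(after_splitting_b))  # the bottleneck moves back to A
```

Each refinement step targets only the stage returned by this kind of check, which is why the analysis alternates between restructuring A and restructuring B.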
Meeting Real Time Constraints
The data rates which can be met with different levels of pipelining and parallelism are shown in Figure 4. The figure shows the variation in the achieved data rates with the number of users. We assume that the effective number of stages of the multistage detector is 3 ($M_e = 3$). As the level of pipelining and parallelism increases, we observe an increase in the data rates. The data rates from (Parallel A)(Pipe B) satisfy the requirements for a lower number of users ($\leq 10$), as this configuration is limited by the complexity of the first stage, which is $O(K^2)$. By having $K$ processing elements for the first stage, the bottleneck shifts back to A ((Parallel --
Judging from the time requirements for block I and block II, we can update block II once in 27 updates to block I. The frequency of updates is determined by the amount of error that can be tolerated in the detection. If the updates are not frequent enough to keep up with the fading of the channel, the performance of the system will degrade in terms of the bit error rate. More frequent updates, once in 14 bits, can be achieved by further partitioning the matrix inverse into 2 separate tasks. Here, the key idea is to use the amount of parallelism necessary to satisfy the bit error rate tolerance levels. Alternate methods could also be used for computing the inverse to reduce the complexity and make more updates feasible.
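The update ratios quoted above follow from comparing the time taken by the slower block with the time taken by the faster one. A minimal sketch, assuming that partitioning divides the slow block's work evenly and without overhead; the cycle counts below are hypothetical values chosen only to reproduce the ratios in the text, not measurements from the paper.

```python
def update_interval(slow_cycles, fast_cycles, partitions=1):
    """Number of fast-block updates that elapse per slow-block update.

    Assumes the slow block splits evenly into `partitions` parallel tasks.
    Integer ceiling division is used throughout.
    """
    per_task = -(-slow_cycles // partitions)  # ceil(slow_cycles / partitions)
    return -(-per_task // fast_cycles)        # ceil(per_task / fast_cycles)


# Hypothetical cycle counts giving a 27:1 ratio between the two blocks.
print(update_interval(27_000, 1_000))                # one slow update per 27 fast updates
print(update_interval(27_000, 1_000, partitions=2))  # two parallel tasks: once per 14
```

Halving the work does not halve the interval exactly because the interval is counted in whole updates, which matches the step from 27 to 14 in the text.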
Hardware Mapping
The above analysis with task partitioning is independent of the final hardware mapping of the processing elements.
We assume that the processing elements in the critical part (block IV and $A^H r$) are equivalent to the functional units in a C67 for that particular operation, because the C67 is used as the basis for our timing results. The processing elements could be mapped to different architectures, such as a single ASIC, multiple processors, or a combination of a processor with an ASIC or FPGA. The mapping could also use a DSP core with coprocessor structures for the critical parts. Also, if there exist many processing elements in parallel where a single element dominates the computation, such as block III, where computing $A^H r$ takes the same time as the other 3 matrix products taken together (see Table 2), all the remaining processing elements could be mapped to a single processor. Thus, the load of elements that have idle times could be distributed to other processing elements.
The other assumptions include ideal communication overhead, no restriction on the number of processing elements available and the feasibility of designing such an application-specific system. The number of processing elements depends on the number of users ($K$), which is variable. Allocating elements for the maximum number of users may not lead to optimum utilization. Hence, reconfigurable architectures supporting a varying number of users should also be considered.
Future Work
The dynamic range requirements for the joint estimation and detection algorithm are being analysed. In the initial version, the algorithm is implemented in floating point due to the possible loss in precision involved in LU decomposition. From the analysis of the differencing multistage detector [11], we expect a precision range of less than 24 bits. A fixed-point implementation with a lower precision range could benefit from the VLIW and SIMD type of fine-grained parallelism offered by recent DSP and general purpose architectures. Also, matrix-oriented architectures [3], such as a vector processor with SIMD, which offer 2 levels of parallelism, could be beneficial to these applications. Another idea is to explore special systems that take advantage of the complex arithmetic data involved, such as using redundant complex number systems (RCNS) for an ASIC architecture [2].
We develop acceleration techniques to implement key computationally intensive baseband algorithms in hardware. The joint multiuser channel estimation and detection algorithm is considered for this purpose. A detailed task partition of this algorithm along with its complexity analysis is shown with the help of a C6x DSP simulator. The available parallelism and pipelined tasks in the algorithm are exploited to satisfy the real-time constraints. We discuss mapping issues of the task partitions in hardware. Such an application specific design with multiple processing elements is more effective than a single processor in meeting the real time requirements of next generation communication systems.
