The multistage detection algorithm has been widely accepted as an effective interference cancellation scheme for next generation Wideband Code Division Multiple Access (W-CDMA) base stations. In this paper, we propose a real-time VLSI implementation of this detection algorithm in the uplink system, where we have achieved both high performance in interference cancellation and computational efficiency. When interference cancellation converges, the difference of the detection vectors between two consecutive stages is mostly zero. We recode the bit estimates so that multiplications by the non-zero differences can be easily implemented in hardware as arithmetic shifts. However, the convergence of the algorithm depends on the number of users, the interference and the signal-to-noise ratio, and hence the detection has a variable execution time. By using just two stages of the differencing detector, we achieve predictable execution time with performance equivalent to at least eight stages of the regular multistage detector. A VLSI implementation of the differencing multistage detector is built to demonstrate the computational savings and the real-time performance potential. The detector, handling up to eight users with 12-bit fixed point precision, was fabricated using a 1.2 µm CMOS technology and can process 190 Kbps/user for 8 users.
I. Introduction
The fast growing cellular telephony industry provides higher capacities for more subscribers each year, which in turn requires complex signal processing techniques and sophisticated multiple access methods to meet these demands. Direct Sequence Code Division Multiple Access (DS-CDMA) has been recognized as one of the best multiple access schemes for wireless communication systems [1]. The Wideband CDMA system [2] discussed in this paper is based on the short code DS-CDMA scheme. We place particular emphasis on the uplink (from mobiles to the base station) system, where all the subscribers share the common channel (shown in Figure 1). In such an environment, the only way to distinguish these users is to use orthogonal or nearly orthogonal codes (spreading sequences) to modulate the transmitted bits.
Any single desired user in the CDMA uplink system experiences direct interference from the other users in the same cell and neighboring cells. This effect is called Multiple Access Interference (MAI), which is the major limitation in capacity for the current IS-95 CDMA standard. The other related problem is called the near-far problem. When a user is far from the base station, it is likely that the user's signal would be overshadowed by the signals of users who are near the base station. In the IS-95 standard, perfect power control is utilized, which ensures that the received signal power of every user within the cell is equal. This requires a complicated control system on both base stations and mobile phones. Users at the far end of the cell usually consume extremely large amounts of power, which would inevitably shorten the battery life.
The assumption of simply considering all the other users as noise leads to the MAI and near-far problems [3] . One viable scheme is to use the cross-correlation information of all the users to do the linear or non-linear multiuser detection [4] shown in Figure 1 , which requires a short code spreading scheme so that the cross-correlation information is determined. In a short code system, the spreading sequence is repetitive bit after bit (shown in Figure 2 ) with different codes for each user. The channel estimation block in Figure 1 is an essential part in a W-CDMA uplink system to estimate the delay and the amplitude information of each user. There are many advanced algorithms for channel estimation, such as maximum-likelihood estimation and subspace parameter tracking. Most of the proposed communication algorithms in W-CDMA systems consist of various matrix and vector level operations. Advanced computer arithmetic techniques, such as CORDIC, online arithmetic units [5] , fast multiplier structures and so on, are especially valuable to the optimization and implementation of these algorithms. In this paper, we focus on the implementation of the multiuser detector block by using computer arithmetic techniques to reduce the complexity.
One group of multiuser detectors is based upon interference cancellation (IC), especially parallel interference cancellation (PIC). The concept is to cancel the interference generated by all users other than the desired user. Lower computation demand and hardware related structures are the major advantages of this strategy. One of the most effective PICs comes from the iterative multistage method, first proposed by Varanasi and Aazhang [6] . The inputs of one particular stage are the estimated bits of the previous stage. After interference cancellation, the new estimates, which should be closer to the transmitted bits, are fed into the next stage. Later researchers developed this multistage idea and introduced some other types of PICs [7] . However, almost all the existing multistage based algorithms neglect the fact that as the iterations progress, the solution becomes more and more invariant, i.e. more and more elements in the output vector turn out to be the same as the elements in the input vector. Ideally at the last iteration stage, the output and the input should be identical if the algorithm converges. Therefore in the last several stages, the multistage detector generates an output which is almost identical to its input. This is a substantial waste of computation power and increases the system delay.
Lin [8] developed a differential matched filter and presented an FPGA implementation [9] , where they used the differential information in the FIR filter's coefficients to mitigate the complexity. This idea is important to our research on the complexity reduction for the multistage detector.
In this paper, we propose a differencing multistage detection algorithm. Unlike the conventional multistage detector, the number of computations in each stage is not constant, but decreases dramatically stage after stage, which exactly reflects the characteristic of the iterative algorithm.
Therefore the complexity is reduced, while in the meantime, the high performance of the interference cancellation of the multistage detector is preserved.
We have implemented both the conventional and the proposed differencing multistage detector in a single ASIC with a select function. This is because the differencing multistage detector uses the conventional detector as its first stage. Recent researchers also proposed various kinds of CDMA related matched filter, detector and decoder structures [9, 10, 11]. Compared to their approaches, our design focuses on the multiuser detector for the next generation W-CDMA and arithmetic level optimization. Our implementation was fabricated by MOSIS in 1.2 µm CMOS technology.
In the next section, we present the mathematical model of the multiuser communication system and our new differencing multistage detection algorithm. We will also analyze the convergence and fixed-point word length issues. An ASIC hardware implementation of this algorithm for real-time communication systems is shown in section III.
II. Differencing Multistage Detector

A. Multiuser communication model
We assume a multiuser binary phase shift keying (BPSK) modulated DS-CDMA synchronous communications system. We could also extend this model to a general asynchronous system by adding the impact from adjacent bits, where an appropriate channel estimation block is required.
The channel is a single path channel with additive white Gaussian noise (AWGN). . Here because we use 
Equation (2) can also be expressed in a simpler matrix notation on substituting (1),
where vectors 
We can normalize the auto-correlation coefficients in (4) in our multistage detection algorithm because all the estimated bits are +1 or -1 within the multistage detector (we are interested only in the sign of these bits). The amplitude of each user would not affect the final hard decision.
However, if we need to provide soft decision output for a later decoding block, we should also compute the real values of the auto-correlation coefficients.
The cross correlation matrix h can be split into three parts, as in (5): 
Our differencing multistage detector is based on estimating the transmitted bits from (3) using a non-linear method.
C. Derivation of the differencing multistage detector
The multistage detector is an interference cancellation scheme. In each stage of the multistage detector, PIC removes the component of other users (p j in (7)) from the received signal in parallel to obtain a better estimated signal for one particular user. 
We have several observations from the above algorithm. After each stage, as in (7), we calculate the difference of the estimated bits in two consecutive stages, i.e. the input of each stage becomes this difference, which is called the differencing vector. By subtracting the estimated hard decision vectors of two consecutive stages represented by (7), we obtain the differencing update (8). Using this differencing algorithm, computations can be saved by computing (8) instead of (7).
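The relationship between the conventional stage and the differencing stage can be checked numerically. The following is only a minimal sketch under stated assumptions, not the implementation described in this paper: it assumes a normalized stage of the form "soft output = matched filter output minus the off-diagonal cross-correlation terms applied to the previous hard decisions", and the function names (`pic_stage`, `differencing_stage`), the three-user matrix `H` and the vector `y` are hypothetical values chosen for illustration.

```python
def sign(v):
    # Hard decision: map a soft value to +1 or -1 (ties go to +1).
    return 1 if v >= 0 else -1

def off_diag_product(H, x):
    # Product of the off-diagonal part of H with x, using plain lists.
    K = len(x)
    return [sum(H[i][j] * x[j] for j in range(K) if j != i) for i in range(K)]

def pic_stage(H, y, b_prev):
    # Conventional stage (assumed form): recompute the full interference term.
    interference = off_diag_product(H, b_prev)
    return [yi - ii for yi, ii in zip(y, interference)]

def differencing_stage(H, soft_prev, b_prev, b_prev2):
    # Differencing stage (assumed form): correct the previous soft output
    # using only the non-zero entries of the differencing vector.
    K = len(b_prev)
    diff = [b_prev[j] - b_prev2[j] for j in range(K)]  # entries are 0 or +/-2
    soft = list(soft_prev)
    for j, d in enumerate(diff):
        if d == 0:
            continue  # zero entries are bypassed entirely
        for i in range(K):
            if i != j:
                soft[i] -= H[i][j] * d
    return soft

# Hypothetical example data (not taken from the paper).
H = [[31, 7, -9], [7, 31, -1], [-9, -1, 31]]
y = [20, 3, 12]

b0 = [sign(v) for v in y]          # matched filter hard decisions
soft1 = pic_stage(H, y, b0)        # first stage is conventional
b1 = [sign(v) for v in soft1]
soft2_conventional = pic_stage(H, y, b1)
soft2_differencing = differencing_stage(H, soft1, b1, b0)
print(soft2_conventional, soft2_differencing)
```

In this example only one bit changes between the two stages, so the differencing stage touches a single column of `H` while producing the same soft outputs as the full recomputation.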
D. Numerical results
The differencing multistage detector is tested by Monte Carlo method with extensive simulations to estimate the convergence rate (shown in Figure 3 ), given that iterations are forced to stop at the eighth stage. We observe that the differencing and conventional multistage detectors have the same convergence pattern and both of them work more effectively when SNR is high. Also, we observe that eight stages are sufficient for most cases, which guides the implementation of this algorithm.
The BER for the differencing multistage detector is exactly the same as the conventional multistage detector through the simulations. This is because we do not change the framework of the iterative method, nor the convergence rate. Equations (7) and (8) are essentially equivalent to each other. The BER plot versus SNR and MAI for a ten-user and twenty-user system is shown in Figure 4 . These figures show that the performance of the matched filter degrades dramatically when MAI increases or the number of users increases, which is the near-far and multiple access interference problem. In contrast, the performance of the differencing multistage detector, for moderate MAI and number of users, approaches the bound of a single user system, which is given
The percentage of zeros in the differencing vector, which in turn signifies the reduction in complexity, is illustrated in Figure 5(a). In this figure, we see that the percentage of zeros in the differencing vector increases as the iterations progress, which shows that the iterations converge progressively. After the fourth stage, the percentage of zeros approaches 98% in a 15-user communication system. This result explicitly indicates that if we use the conventional multistage detector, almost 98% of the computation resource is unnecessary in the fourth stage. Thus we can achieve a 6X speedup in an eight stage system according to Figure 5(b). With more stages in the system to improve the BER, higher speedups are obtained relative to the conventional multistage detector.
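The link between the percentage of zeros and the speedup can be illustrated with a simple cost model. This is a rough sketch under assumptions of our own, not the method used to produce Figure 5(b): every conventional stage is assumed to cost one unit, the first stage of the differencing detector is conventional, and each later differencing stage is assumed to cost only the non-zero fraction of a unit. Control overhead is ignored, and the zero fractions below are hypothetical placeholders rather than values read from Figure 5(a).

```python
def estimated_speedup(zero_fractions):
    """Estimate speedup of the differencing detector over the conventional one.

    zero_fractions[k] is the assumed fraction of zeros in the differencing
    vector fed to stage k + 2 (stage 1 is always a conventional stage).
    """
    stages = 1 + len(zero_fractions)
    conventional_cost = float(stages)  # one unit per stage
    differencing_cost = 1.0 + sum(1.0 - z for z in zero_fractions)
    return conventional_cost / differencing_cost

# Hypothetical zero fractions for stages 2..8 of an eight stage system.
example = [0.80, 0.92, 0.98, 0.99, 0.99, 0.99, 0.99]
print(round(estimated_speedup(example), 2))
```

Under this model the speedup grows with the number of stages, because late stages add almost no cost to the differencing detector while each adds a full unit to the conventional one.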
III. Real-Time Implementation
The detector can be implemented in real-time by both DSPs and ASICs. Although high performance general purpose DSPs can meet the real-time requirements, they are not as cost-effective.
In commercial communication systems, sophisticated algorithms tend to be implemented by dedicated ASICs. These hardware implementations are potentially cheaper and faster with lower power consumption [12, 13, 14] . In this section, we present a fixed-point implementation analysis and our ASIC implementation of the differencing multistage detector.
A. Fixed-point implementation analysis
Converting an algorithm from floating point to fixed point requires two major procedures. First, we have to estimate the dynamic range of the input data and all the variables used in the algorithm.
Second, we have to find an optimized wordlength to represent numbers and truncate the results. In this section, we present an analysis of the fixed-point implementation of the differencing multistage detector.
Range estimation
The cross-correlation coefficients from the channel estimation block and the matched filter output from integrators are two major operands in the differencing multistage detector. Both are generated by high speed analog to digital (A/D) converters, which sample and digitize the analog input signals at the front end.
From the characteristics of the Gold code, we know that the maximum value of cross-correlation coefficients is the auto correlation of any particular spreading sequence, i.e., the range is bounded by the spreading gain, where the spreading gain is $2^r - 1 = 31$ with $r = 5$ if we use a Gold code of length 31.
The range of the user's amplitude depends on the dynamic range (or MAI) of the system. The relationship is the following,
The range estimation for the matched filter output is complicated because it is determined by SNR, MAI, and the number of users in the system. Since a matched filter treats all the interfering users as noise, the probability density function (PDF) of the matched filter output follows a Gaussian distribution, as illustrated in Figure 6 . The distribution is also symmetric, based on the assumptions of BPSK modulation, binary distribution of the source bits, and the binary symmetric channel. 
Wordlength analysis
From (9) and (10), we can conclude that the number of bits [14] needed to represent the result of matrix product
Here we assume a binary representation of the integers. If MAI = 10 dB and r = 5 (Gold code of length 31), the resulting wordlength bound indicates that at least eight bits are needed to represent any cross-correlation coefficient.
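The eight-bit figure can be reproduced with a short calculation. The formula below is an assumption on our part, chosen because it is consistent with the numbers quoted in the text; it is not a transcription of the paper's own wordlength relation. It assumes the coefficient magnitude is bounded by the spreading gain $2^r - 1$ multiplied by an amplitude ratio of $10^{\mathrm{MAI}/20}$, plus one sign bit. The function name `coefficient_bits` is hypothetical.

```python
import math

def coefficient_bits(r, mai_db):
    """Assumed wordlength bound for a cross-correlation coefficient.

    r       -- Gold code parameter, code length is 2**r - 1
    mai_db  -- dynamic range between users in dB (amplitude ratio assumed)
    """
    spreading_gain = 2 ** r - 1
    amplitude_ratio = 10.0 ** (mai_db / 20.0)
    magnitude_bits = math.log2(spreading_gain * amplitude_ratio)
    return math.ceil(magnitude_bits + 1)  # +1 for the sign bit

print(coefficient_bits(5, 10.0))  # Gold code of length 31, MAI = 10 dB
print(coefficient_bits(5, 0.0))   # perfect power control
```

With r = 5 and MAI = 10 dB this sketch gives eight bits, matching the value stated above; the perfect power control case is included only for comparison.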
For the matched filter output, the number of bits needed is nine in a perfect power control case, and ten in a MAI = 10 dB case for up to 20 users (shown in Figure 7 (a)). In Figure 7 , we can also observe that if the number of users is small, SNR will dominate the variation of the dynamic range.
When more users are active in the system, MAI will determine the number of bits required.
For some applications, the optimized wordlength might not follow the relation in (13), but will usually be smaller than this bound. The optimized wordlength is determined by simulation, in which the minimal mean square distortion is set corresponding to a particular performance requirement.
B. Complexity analysis
Further investigations show that the differencing vector p has over 80% zeros after the first iteration in general (shown in Figure 5), so it can be regarded as a sparse vector. Since we have reduced all the multiplication operations to simple additions and shifts, dedicated multipliers are not necessary. However, advanced computer arithmetic techniques, such as full carry look-ahead adders, online arithmetic units [5], etc. are essential to achieve the real-time performance.
C. Prototyping the differencing multistage detection algorithm
The structure of the first three stages of the differencing multistage detector is shown in Figure 8. In the first stage, the PIC uses the previous estimates (from the matched filter output) to generate a new vector of estimated bits. We need a conventional multistage detector as the first stage, so that two initial vectors are obtained for the differencing method.
After the first stage, the differencing multistage detector starts to use the differencing vector p as the input, which is generated by subtracting the input hard decision from the previous hard decision. Soft decision inputs and outputs are generated in parallel for each user and all users are detected in a serial manner. The timing of inputs and outputs is controlled by a handshaking mechanism.
The input numbers are in two's complement format and they are stored in the data register bank.
At the same time, the hard decisions are acquired from the sign bit of the soft decision and the differencing vector is generated by combinational logic. The recoder block (highlighted in Figure 9) implements the key features of the differencing multistage detector by selecting all the non-zero elements and tagging their addresses. The timing for the accumulation is scheduled according to the positions of the non-zero elements. If an element is not zero, the recoder will pick out the corresponding cross-correlation data, and update all the soft decisions by subtracting or adding it, according to the sign of the differencing vector's element. Loading, shifting, accumulating and writing back are organized as a simple pipeline machine, managed by a two-phase clock. The pipeline will not stall because no data or control dependencies exist. Finally the soft and hard decisions are passed one by one to the next stage under a handshaking protocol.

Table 1 summarizes our prototype chip specifications. To simplify the hardware design, we have focused on fixed-point implementation of a synchronous system and the design is based on an eight-user Gold code spreading system. However, it can be extended to a random code, asynchronous system with a variable number of users. We choose an eight-user system since all the control logic is primarily binary counters. Therefore, a number of users that is a power of 2 would be most efficient. The input data bus is limited by the pin count of our prototype chip. In order to meet the fixed point word length requirement, as determined in the analysis in Section III.A, we choose 10 bits as the input precision. The detector allows us to detect eight users in a MAI = 15 dB and SNR = 6 dB environment. The internal data bus is wider than the input or output bus to ensure that no overflow would occur during intermediate computations.
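The recoder and accumulation steps described above can be mimicked in software with integer additions and shifts only. The sketch below is an illustration under assumptions, not a model of the fabricated chip: the function names `recode` and `accumulate` are hypothetical, the user's own (diagonal) coefficient is assumed to be excluded from the update, and the pipeline, register banks and word-length limits are not modeled.

```python
def recode(b_new, b_old):
    # Tag the address and sign of every non-zero differencing element.
    # A change from -1 to +1 is tagged +1, a change from +1 to -1 is tagged -1.
    return [(j, 1 if new > old else -1)
            for j, (new, old) in enumerate(zip(b_new, b_old))
            if new != old]

def accumulate(soft, H, tagged):
    # Update all soft decisions using shifts and add/subtract only.
    soft = list(soft)
    for j, s in tagged:
        for i in range(len(soft)):
            if i == j:
                continue  # assumption: diagonal term is not part of the update
            shifted = H[i][j] << 1  # multiply by 2 as an arithmetic shift
            if s > 0:
                soft[i] -= shifted
            else:
                soft[i] += shifted
    return soft

# Hypothetical three-user example (integer, two's complement style values).
H = [[31, 7, -9], [7, 31, -1], [-9, -1, 31]]
soft_prev = [22, -3, 22]
tags = recode([1, -1, 1], [1, 1, 1])
print(tags, accumulate(soft_prev, H, tags))
```

Only the tagged addresses cause any work, so the number of accumulation steps follows the number of non-zero elements rather than the number of users.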
D. Chip specifications
Figure 10(a) shows the actual chip die photo. The chip has five major blocks: recoder, 12-bit carry look-ahead adder, register banks for cross-correlation coefficients, soft decision registers and the address information of the non-zero elements. Some programmable logic arrays (PLAs) and temporary registers are necessary for control and pipeline management [15] . Figure 10 
E. Two chip multistage detector with predictable delay
Our chip implements a single stage of the conventional/differencing multistage detector. A complete multistage detector is implemented by simply cascading two chips together with a proper feedback path and glue logic. The flow of data between the two chips is controlled by a simple handshaking mechanism, since we know that each detection iteration will take no more time than the previous one. The first chip conducts the conventional multistage detection. As a complete matrix-vector operation (7) is performed in the conventional detector, its delay is constant. The second chip is configured as a differencing multistage detector, the output of which is fed back to its own input. Since the number of clock cycles required decreases with each iteration after the first differencing multistage detection, multiple iterations of interference cancellation can be run on the second chip within the processing latency of the first chip. The throughput is determined by the clock rate of both chips and the delay is simply that of two stages of the conventional multistage detector. Figure 11 shows the computational savings obtained by using the differencing technique over the conventional detection scheme, in terms of the number of differencing iterations that are possible within a single iteration of the conventional method. For the worst operating case at SNR = 4 dB and MAI = 0 dB, the two chip differencing system can execute at least seven iterations in the time taken for two iterations of a conventional detector. Figure 11 also shows that when the SNR increases, the computational savings are higher and more iterations of the differencing scheme are possible. This is due to the reduction in noise, resulting in lower BER and faster convergence in the detection process. Also, it can be seen that higher MAI (10 dB) results in faster convergence and hence more iterations can be performed for higher MAI. This is because MAI = 0 dB implies the equal power case (worst case) for all users. It should be noted that Figure 11 only conveys the computational savings due to the differencing scheme, and that eight iterations are sufficient in most cases for convergence.
A cascade-mode two chip differencing multistage detector is shown in Figure 12. Two ASICs are cascaded in a chain, driven by the same clock. From our hardware testing (shown in Figure 10(b)), the two chip system delay with the differencing algorithm is less than 70 cycles. Working at a clock rate of 12.5 MHz, the system delay is about 6 µs, much less than that of the conventional multistage detector, which is around 48 µs for eight stages. Using our design, the system can reach a throughput up to 190 Kbps with proper buffering. This rate meets the 144 Kbps requirement of the W-CDMA communication proposals [2].
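The delay figures quoted above follow from simple clock arithmetic, which the short sketch below reproduces. The helper names are hypothetical, and the comparison between the cycle budget per bit and the measured delay is our own reading of the numbers, not a derivation given in the text.

```python
def cycles_to_us(cycles, clock_hz):
    # Convert a cycle count to microseconds at the given clock rate.
    return cycles / clock_hz * 1e6

def cycles_per_bit(clock_hz, bit_rate_bps):
    # Number of clock cycles available in one bit period.
    return clock_hz / bit_rate_bps

CLOCK_HZ = 12.5e6   # clock rate stated in the text
DELAY_CYCLES = 70   # upper bound on the two chip delay stated in the text

print(cycles_to_us(DELAY_CYCLES, CLOCK_HZ))  # delay in microseconds
print(cycles_per_bit(CLOCK_HZ, 190e3))       # cycles per bit at 190 Kbps
```

At 12.5 MHz, 70 cycles correspond to 5.6 µs, in line with the "about 6 µs" quoted above, and a 190 Kbps bit period spans roughly 66 clock cycles.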
F. Scalable ASIC design
Our hardware implementation shows the real-time performance in the communication system.
We estimate the size of a commercial base station detector chip in Table 2. If we design a chip which can handle 30 asynchronous users (upper limit for a Gold code of length 31 system), it would require three full carry look-ahead adders as the ALU. The cross-correlation matrix has
elements, each one of which has 8-bit precision (according to Section III.A). We could expand the data bus width to 16 bits in order to accommodate higher MAI. Total number of register cells are
. If a conservative static register cell consists of approximately 10 transistors, the total number of transistors would be around 100K.
IV. Conclusion
In this paper, we have focused on the real-time implementation issues for the multistage detector.

[Figure caption, pin legend: ... is the final hard decision; "Hin" and "Hout" represent hard decision input and output respectively; "Sin" and "Sout" represent soft decision input and output respectively; "1/2" selects the first stage or later stages; "HS" is the hand-shaking port.]
