Abstract
Introduction
Next generation communication systems based on W-CDMA (Wideband Code Division Multiple Access) are being designed to increase data rates by orders of magnitude and enhance performance significantly in order to support real-time multimedia services [1]. However, the increased complexity required by the algorithms to support multimedia capabilities has not been supported by hardware to achieve real-time implementations. Detection is one of the core baseband processing operations in a digital communication receiver and is used to find the transmitted bits from the source after the signal has been corrupted by noise and other interference present in the channel. The data rates that can be achieved in a communication system are dependent on the speed of detection; therefore, it is critical to accelerate detection to meet real-time performance requirements.
Several advanced schemes have been proposed for detection [12, 16] for future communication receivers that provide better performance in terms of lower bit error rates. Despite their sub-optimality, these algorithms have not been implemented in practical systems due to their high implementation complexity [14]. In this paper, we demonstrate the use of non-conventional redundant number representations using on-line arithmetic to accelerate the implementation of traditional as well as advanced detection algorithms to help meet real-time requirements for next generation communication systems.
The physical baseband layer in a digital communication receiver involves operations to detect and decode the transmitted information bits. Sophisticated algorithms for channel estimation, detection and decoding are applied at the receiver to determine the transmitted bits. Based on these high-precision operations, a hard decision (a sign-based test) is made on the transmitted bits for detection.
The main motivations for on-line arithmetic have been to eliminate carry-propagate addition, reduce interconnection bandwidth between modules and allow parallelism between several operations. With a serial data flow, on-line arithmetic can be pipelined to implement sophisticated algorithms. As carry-propagation is eliminated, on-line operations can be overlapped. On-line arithmetic has been shown to provide a speed-up of 2-16X [6] for conventional numerical operations. The implementation tradeoffs related to the applicability of on-line arithmetic are its need for a non-conventional number system, conversions to and from a conventional system [10], and the inherently serial nature of the operations.
In this paper, we show that on-line arithmetic also has immense potential for use in detection in communication systems. We present on-line implementations for both traditional and advanced detection algorithms. Because communication systems are special-purpose applications, there is no need to maintain a conventional number representation. On-line arithmetic can be used in a communication receiver with almost the same numerical accuracy and significant performance acceleration in terms of speed and latency reduction. Also, an MSDF (most significant digit first) representation allows us to stop computations as soon as the first non-zero MSD (sign) has been calculated. This not only avoids the computation of the successive digits, but also eliminates the need for back-conversion to a conventional number system.
Background on on-line arithmetic
On-line arithmetic algorithms [5, 7] work in a digit-serial manner, producing the result in an MSDF fashion. To generate the first output digit, $\delta$ digits of the input are required. Thereafter, with each digit of the input, a new digit of the result can be obtained. The on-line delay $\delta$ is typically a small integer, e.g. 1 to 4. Since the outputs are produced serially, successive operations can be pipelined with a latency of $\delta$ between them.
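The effect of overlapping dependent operations can be illustrated with a simple cycle count. The sketch below is a simplified timing model rather than a description of any particular implementation: it assumes one digit per cycle and that each operation emits a digit exactly $\delta$ cycles after the corresponding input digit arrives. The function names are hypothetical.

```python
def online_chain_cycles(num_ops, num_digits, delta=1):
    """Cycles until the last digit leaves a chain of dependent on-line operations.

    Assumed model: each operation adds `delta` cycles of delay, and digits
    stream through the chain at one digit per cycle.
    """
    return num_ops * delta + num_digits


def nonoverlapped_cycles(num_ops, num_digits):
    """Cycles if each digit-serial operation must finish before the next one starts."""
    return num_ops * num_digits


if __name__ == "__main__":
    # Three dependent operations on 8-digit operands with on-line delay 1.
    print(online_chain_cycles(3, 8, delta=1))
    print(nonoverlapped_cycles(3, 8))
```

Under this model the cost of chaining grows with the on-line delay rather than with the operand length, which is the property the pipelined detector relies on.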
In order to achieve MSDF operations, on-line algorithms need to use a redundant number system [2] for carry-free addition. The on-line representation of a number $x$ is given by [5]
where $X_j$ represents the value of the addends at step $j$, $r$ is the radix of the redundant number system and $\delta$ is the on-line delay. The digits $x_i$ belong to a redundant digit set $\{-\rho, \ldots, -1, 0, 1, \ldots, \rho\}$ (assumed symmetric), where $\rho$, with $r/2 \le \rho \le r-1$, represents the amount of redundancy in the number system. For our system, we shall assume $\rho = r-1$ for maximum redundancy, as this will allow the inputs to the system to be directly acceptable in redundant form for on-line operations. We choose a radix-4 system to reduce the on-line delay to $\delta = 1$ for multiplication and addition [5], as it requires the least number of gates [8]. In our algorithms, we shall assume fixed point inputs with 8-bit precision, as this is shown to be sufficient for most detection implementations [14]. On-line arithmetic, being word-length independent, will demonstrate better performance benefits with higher precision. The operations that need to be performed for detection are on-line addition and multiplication, which are presented in the Appendix.
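As an illustration of the maximally redundant radix-4 digit set and of why the sign is available early, the following sketch encodes a fraction as signed digits, most significant digit first, and reads the sign from the first non-zero digit. The encoding routine (rounding the scaled residual) is only one of many valid encodings in a redundant system and is an assumption of this sketch, as are all function names. The sign test is valid for finite digit strings because the remaining digits, each at most $r-1$ in magnitude, cannot outweigh a non-zero leading digit.

```python
from fractions import Fraction

RADIX = 4
RHO = RADIX - 1  # maximally redundant digit set {-3, ..., 3}


def to_msdf_digits(value, num_digits):
    """Encode a fraction with |value| < 1 as signed radix-4 digits, MSD first."""
    digits = []
    residual = Fraction(value)
    for _ in range(num_digits):
        residual *= RADIX
        # Round to the nearest integer and clamp to the allowed digit set.
        digit = max(-RHO, min(RHO, round(residual)))
        digits.append(digit)
        residual -= digit
    return digits


def from_msdf_digits(digits):
    """Value of a digit string whose first digit has weight 4**-1."""
    return sum(Fraction(d, RADIX ** (i + 1)) for i, d in enumerate(digits))


def first_nonzero_sign(digits):
    """Return (sign, digits_consumed), stopping at the first non-zero digit."""
    for consumed, digit in enumerate(digits, start=1):
        if digit != 0:
            return (1 if digit > 0 else -1), consumed
    return 0, len(digits)


if __name__ == "__main__":
    digits = to_msdf_digits(Fraction(-7, 16), 2)
    print(digits, from_msdf_digits(digits), first_nonzero_sign(digits))
```

Values close to zero produce leading zero digits, so more digits must be consumed before the sign is known.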
Detection algorithms implemented in a W-CDMA communication receiver
We first show the advantages of on-line arithmetic for a traditional and widely implemented matched filter detector [16] (or the single user detector) as an initial example. Then, we present one of the advanced multiuser detection algorithms proposed to achieve better performance in wireless communication receivers.
Synchronous matched filter detector
A synchronous detector [12, 16] implies that different users are transmitting their information to the receiver at the same time, i.e. all users are synchronized in time with respect to each other. This makes detection a simpler task, as the delays of the different users need not be accounted for. This is a valid assumption for the downlink, when a base-station is talking to a mobile receiver. A matched filter detector is also termed a single user detector, as it ignores the effect of interference due to the other users. The detection problem can be viewed as a least squares problem. Let $r_i \in \mathbb{R}^N$ be the received signal and $A \in \mathbb{R}^{N \times K}$ be the cross-correlation matrix obtained from channel estimation. $N$ is the length of the spreading code (also known as the spreading gain or the spreading factor) and $K$ is the number of users in the system. Let $d_i \in \{+1, -1\}^K$ be the bits of the $K$ users to be detected. Then, the system can be formulated as given below:

$$r_i = A d_i + \eta_i \tag{2}$$

where $\eta_i$ is the Additive White Gaussian Noise (AWGN) in the system. So, a least-squares solution to the problem can be derived as

$$\hat{d}_i = (A^H A)^{-1} A^H r_i \tag{3}$$
Equation (3) represents the decorrelating detector, which is a multiuser detector. A simple approximation to the decorrelating detector can be obtained by assuming that the cross-correlations are zero (the received signals for each user are independent), so that the $A^H A$ matrix is the identity. This gives rise to the single user detector, or matched filter detector, with hard decision outputs:

$$\hat{d}_i = \operatorname{sign}(A^H r_i) \tag{4}$$
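The hard-decision rule of the matched filter, taking the sign of the correlation of the received vector with each user's column of $A$, can be sketched as follows. This is a minimal real-valued illustration under stated assumptions: the 4-chip orthogonal codes, the noise values and the function names are hypothetical and are not taken from the system described here.

```python
def matched_filter_soft(A, r):
    """Soft outputs y = A^T r for a real-valued N x K matrix A (list of rows)."""
    num_users = len(A[0])
    return [sum(A[n][k] * r[n] for n in range(len(r))) for k in range(num_users)]


def matched_filter_detect(A, r):
    """Hard decisions: the sign of each soft output (ties resolved to +1)."""
    return [1 if y >= 0 else -1 for y in matched_filter_soft(A, r)]


if __name__ == "__main__":
    # Hypothetical example: N = 4 chips, K = 2 users with orthogonal codes.
    A = [[1, 1],
         [1, -1],
         [1, 1],
         [1, -1]]
    d = [1, -1]                      # transmitted bits
    noise = [0.1, -0.2, 0.05, 0.1]   # small illustrative noise samples
    r = [sum(A[n][k] * d[k] for k in range(2)) + noise[n] for n in range(4)]
    print(matched_filter_detect(A, r))
```

Only the sign of each accumulated sum is needed, which is why a most-significant-digit-first computation can stop early.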
Figure 2 shows the architecture of a single user matched filter detector.
Asynchronous multiuser detection
In the uplink, when multiple mobile users are communicating with the base-station, the desired user's bits receive interference from the past or future overlapping symbols of other users along with their current symbols. Here, we assume that the delays of the users are coarse-synchronized within one symbol duration. We consider multistage detection [15, 18], based on the principle of Parallel Interference Cancellation (PIC). This scheme cancels the interference from the other users successively in stages and is shown to have computational complexity quadratic in the number of users. The different delays and phase-shifts make the received signal $r_i$ and the channel estimates complex numbers. For an asynchronous system with BPSK modulation, the channel estimate can be arranged as $A_0, A_1 \in \mathbb{C}^{N \times K}$, which correspond to partial correlation information for the successive bit vectors $d_{i-1}, d_i \in \{+1, -1\}^K$ that are to be detected. In vector form, the received signal is given by equation (5).
Asynchronous matched filter detection
The bits $d_i$ of the $K$ users to be detected lie between the boundaries of the received signals $r_{i-1}$ and $r_i$. The matched filter output for the asynchronous case is given by equation (6).
We see that the asynchronous nature of the users increases the complexity of matched filter detection. Comparing equations (4) and (6), the asynchronous matched filter is more complex because it requires additional additions.
Multistage parallel interference cancellation
The multistage detector [13, 15, 18] uses the soft decisions $y_i$ of the matched filter to get an initial estimate of the bits and then subtracts the interference from all other users. The multistage detector performs parallel interference cancellation iteratively in stages: stage $l$ forms the interference-cancelled soft decision $y_i^{(l)}$ (equation (7)) and then takes the hard decision

$$d_i^{(l)} = \operatorname{sign}\left(y_i^{(l)}\right). \tag{8}$$
Equation (7) may be thought of as subtracting the interference from the past bits of the users who have more delay, and from the future bits of the users who have less delay, than the desired user (refer to Figure 3 and equation (5)). The left matrix $L \in \mathbb{R}^{K \times K}$ stands for the partial correlation between the past bits of the interfering users and the desired user, and the right matrix $R = L^T$ stands for the partial correlation between the future bits of the interfering users and the desired user. The detected bits are then sent to the decoder for retrieving the transmitted information.
Each stage of the multiuser detector uses only adders, because multiplication by single bits can be reduced to addition and subtraction. In order to form the various vectors such as $C d_i$ in equation (7), we can use an adder tree as shown in Figure 5. The multiplication by a bit can be replaced by a 2's complement negation when the bit is $-1$ (shown as 2C with a circle in the figure), and the operand is left unchanged when the bit is $+1$.
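A single cancellation stage built only from additions and subtractions can be sketched as below. The update form, subtracting $L d_{i-1}$, $C d_i$ and $R d_{i+1}$ from the matched filter output before taking the sign, is an assumption inferred from the description of the left, centre and right correlation terms; the exact definition of $C$, the numerical values and the function names are hypothetical.

```python
def signed_row_sum(row, bits):
    """Form sum_k row[k] * bits[k] with add/subtract only, since each bit is +1 or -1."""
    total = 0
    for coeff, bit in zip(row, bits):
        # A bit of -1 negates the operand (the "2C" block); +1 leaves it unchanged.
        total = total + coeff if bit == 1 else total - coeff
    return total


def pic_stage(y, L, C, R, d_prev, d_cur, d_next):
    """One assumed PIC stage: soft = y - L d_{i-1} - C d_i - R d_{i+1}, hard = sign(soft)."""
    soft = [
        y[k]
        - signed_row_sum(L[k], d_prev)
        - signed_row_sum(C[k], d_cur)
        - signed_row_sum(R[k], d_next)
        for k in range(len(y))
    ]
    hard = [1 if v >= 0 else -1 for v in soft]
    return soft, hard


if __name__ == "__main__":
    # Hypothetical two-user example with R equal to the transpose of L.
    L = [[0, 0.2], [0, 0]]
    R = [[0, 0], [0.2, 0]]
    C = [[0, 0.3], [0.3, 0]]
    y = [1.0, -0.5]
    print(pic_stage(y, L, C, R, d_prev=[1, 1], d_cur=[1, -1], d_next=[-1, 1]))
```

Each row sum corresponds to one adder tree, so no multiplier is needed inside a stage.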
Implementation of on-line single user and multiuser detectors
This section presents the advantages of implementing detection using on-line arithmetic, first for the traditional single user matched filter detector and then for a more advanced multiuser detector.
Synchronous matched filter detector
As seen from Figure 2, the computation performed in a matched filter is the sign of a matrix-vector multiplication.
Advanced multiuser detection
The timing schedule for multiuser detection is as shown in Figure 7. The figure shows the time steps involved during the computation of the asynchronous matched filter (equation (6)) and three stages ($S = 3$) of parallel interference cancellation (equations (7)-(8)). The figure shows the pipelining of the stages of the multiuser detector in the conventional arithmetic case, and the pipelining of both the PIC stages as well as within the stages for the on-line implementation. In order to compute the new $d_i$, the first stage of the detector must wait until $d_i$, $y_i$ and $d_{i+1}$ of the matched filter are available, as can be seen from equations (7)-(8). For the conventional arithmetic implementation, since the hard decision $d_{i+1}$ is available only at the end of the computation, the PIC stage needs to wait for 2 conventional matched filter operations $t_{CMF}$ before it can start its computation, where $t_{CMF} = (\log_2 N + 3) \cdot \log_2 d \cdot t_{conv}$.
Similarly, each stage of the PIC takes 2 conventional PIC operations $t_{CPIC}$ to start the new $d_i$, where $t_{CPIC} = (\log_2 K + 3) \cdot \log_2 d \cdot t_{conv}$, as shown in Figure 7. Hence, the overall system latency to generate the result is given by $(2S - 1) \cdot t_{CPIC} + 2 \cdot t_{CMF}$, where $S$ is the number of PIC stages. Also, as the stages are pipelined, the throughput of the conventional detector is $t_{CMF}$ since, typically, the spreading factor is greater than the number of users ($N > K$).
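The conventional timing expressions can be evaluated with a few lines of code. The sketch below reads them as $t = (\log_2 n + 3) \cdot \log_2 d \cdot t_{conv}$ for an adder tree with $n$ inputs and as a latency of $(2S - 1) \cdot t_{CPIC} + 2 \cdot t_{CMF}$; the function names and the parameter values in the example are hypothetical and chosen only for illustration.

```python
import math


def tree_op_time(fan_in, precision_bits, t_conv=1):
    """Time of one conventional tree-based operation: (log2(n) + 3) * log2(d) * t_conv."""
    return (math.log2(fan_in) + 3) * math.log2(precision_bits) * t_conv


def conventional_latency(num_stages, t_cpic, t_cmf):
    """Latency of the conventional multistage detector: (2S - 1) * t_CPIC + 2 * t_CMF."""
    return (2 * num_stages - 1) * t_cpic + 2 * t_cmf


if __name__ == "__main__":
    # Hypothetical parameters: 32 inputs per tree, 8-bit precision, unit t_conv.
    t_cmf = tree_op_time(32, 8)
    t_cpic = tree_op_time(32, 8)
    print(t_cmf, conventional_latency(3, t_cpic, t_cmf))
```

The latency is dominated by the stage count because each conventional stage must wait for complete results from the previous one.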
For the on-line multiuser detector, the first stage of the PIC can begin its computation as soon as the sign $d_{i+1}$ is obtained from the matched filter, without waiting for the completion of the digits of $y_{i+1}$. Thus, the first stage can begin its computation immediately after $t_{stop}$ of the $(i+1)$th computation. Similarly, each stage of the PIC can begin computing $d_i$ as soon as the $d_{i+1}$ of the previous stage has finished. Thus, starting after $t_{stop}$ helps to reduce the overall latency to $t_{MF} + m \cdot S \cdot t_{OL} + S \cdot t_{PIC}$. Again, the throughput for the on-line detector is $m \cdot t_{OL}$, as it was for the synchronous detector (Figure 6).
Assuming a three-stage detector ($S = 3$), we get a latency of 168 cycles and a throughput of 24 cycles for the conventional implementation. The on-line implementation has a reduced latency of 94 cycles and a throughput of 8 cycles. The speed advantages and comparisons with the synchronous single user detector are shown in Table 1. We can see that using on-line arithmetic achieves a 1.79X reduction in latency and a 3X speedup over the conventional detector implementation. Hence, the benefits of on-line arithmetic are greater with advanced detection algorithms, providing a significant decrease in latency and increase in speed. Savings in area are also possible due to the digit-serial nature of on-line computations, as we use only a single-digit radix-4 on-line adder and multiplier as opposed to a $d$-bit fully parallel conventional adder and multiplier for each element in the tree-based computations of the detection algorithms.
Extensions to higher modulation schemes
The algorithms for detection presented in this paper assumed BPSK transmission, which is currently used in IS-95 based CDMA systems. However, on-line arithmetic can also be used for QPSK (Quadrature Phase Shift Keying), proposed for future W-CDMA systems. For QPSK, the sign of the real as well as the imaginary component of the received signal is used to form the decision region in a manner similar to the BPSK scheme. On-line arithmetic can also be used in M-ary QAM (Quadrature Amplitude Modulation), proposed for wireless LAN (Local Area Network). This is because the decision regions in a square QAM are also squares. The knowledge of the square that the received signal will lie in can be obtained from the first few MSDs, and the successive operations can be stopped. The closer the detected signal is to the decision boundary, the longer it will take to obtain the sign, as the MSDs will be zeros. Thus, the throughput of the detector depends on the proximity of the detected signal to the decision boundary. Going to higher modulation schemes also implies that greater bit-precision is needed at the receiver. This need further motivates the use of on-line arithmetic for detection, as the throughput is now independent of the bit-precision. For higher M-PSK systems, the analysis becomes complicated as the decision becomes angle-based. However, it may be possible to work in polar coordinates and use on-line arithmetic. On-line algorithms for complex number arithmetic [11] using redundant complex number systems (RCNS) may also be utilized for these higher modulation schemes. The benefits of on-line arithmetic may reduce if interleaving or block computations requiring large latencies are performed on the received signal. We are investigating the case for higher modulation schemes as future work.
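Table 1: Latency and throughput (in cycles) of the conventional and on-line implementations of the single user and multiuser detectors.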
Summary and future work
This paper shows the potential benefits of on-line arithmetic for sign-based testing operations in a digital communication system. Specifically, we present the advantages of using on-line arithmetic for traditional and advanced detection algorithms for digital communication systems. A comparison of a single digit on-line multiuser detector to an 8-bit precision conventional multiuser detector shows a reduction in latency by 1.79X, a speedup of 3X in the throughput and possible savings in area. The overhead of returning to a conventional number representation is also not required. However, the throughput of the detector is not constant, as it depends on the proximity of the detected signal to the decision boundary (i.e. on the number of zeros in the MSDs).
The blocks preceding detection and decoding in Figure 1 can also be implemented using on-line arithmetic to form an entire communication receiver using on-line arithmetic. The inputs generated serially from the A/D converter are also inherently MSDF and hence, the on-line algorithms can increase the overall speed by overlapping computations with the conversion. Other signal processing applications involving sign-based computations can also benefit from an on-line arithmetic approach.
We are currently investigating a complete physical layer implementation of the communication receiver, with the estimation, detection and decoding chain of blocks using on-line arithmetic. The amount of precision required for the algorithms is different at intermediate stages of the blocks, and is also dependent on the channel conditions and the number of users in the system; hence, it may vary with time. An on-line implementation can be extremely beneficial in this scenario because the digit-serial nature of the computation allows the precision to be made programmable. Thus, a system with high communication data rates and latency reduction can be designed using on-line arithmetic for digital communication receivers.
Appendix
The appendix is used to describe the fixed point on-line addition and multiplication schemes used for detection.
On-line addition
The steps involved in fixed point on-line addition [4, 5] where the selection function is the same as in on-line addition.
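For orientation, a textbook carry-free addition of two radix-4 signed-digit operands, processed most significant digit first, can be sketched as follows. This is a generic signed-digit scheme and is not claimed to be the exact fixed point addition used for the detectors; the transfer thresholds and function names are assumptions of the sketch. Each output digit depends only on the current digit pair and the next one, which is consistent with an on-line delay of 1.

```python
from fractions import Fraction

RADIX = 4


def split_sum(s):
    """Split s = x + y (|s| <= 6) into (transfer, interim) with s = 4 * transfer + interim.

    The thresholds keep |interim| <= 2, so interim plus an incoming transfer
    always stays inside the digit set {-3, ..., 3}.
    """
    if s >= 3:
        return 1, s - RADIX
    if s <= -3:
        return -1, s + RADIX
    return 0, s


def online_add(x_digits, y_digits):
    """Add two equal-length MSD-first digit strings; the result has one extra leading digit."""
    out = []
    interim = 0  # interim sum of the previous (more significant) position
    for x, y in zip(x_digits, y_digits):
        transfer, w = split_sum(x + y)
        out.append(interim + transfer)  # digit is final once the next pair arrives
        interim = w
    out.append(interim)  # least significant digit has no incoming transfer
    return out


def value(digits, msd_exponent):
    """Value of a digit string whose first digit has weight 4**msd_exponent."""
    return sum(Fraction(d) * Fraction(RADIX) ** (msd_exponent - i)
               for i, d in enumerate(digits))


if __name__ == "__main__":
    x, y = [3, 3], [3, 2]          # 15/16 and 14/16 with MSD weight 4**-1
    s = online_add(x, y)
    print(s, value(s, 0), value(x, -1) + value(y, -1))
```

No carry propagates beyond one position, so the adder cost per digit is independent of the word length.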
Acknowledgments
