This paper presents algorithms and architecture designs that can meet real-time requirements of multiuser channel estimation and detection in future CDMA-based wireless base-station receivers. Sophisticated algorithms proposed to implement multiuser channel estimation and detection make their real-time implementation difficult on current Digital Signal Processor (DSP)-based receivers. A maximum-likelihood based multiuser channel estimation scheme requiring matrix inversions is redesigned from an implementation perspective for a reduced complexity, iterative scheme with a simple fixed-point VLSI architecture. A reduced-complexity, bit-streaming multiuser detection algorithm that avoids the need for multishot detection is also developed for a simple, pipelined VLSI architecture.
implementations either assume perfect channel estimation or assume single-user estimation using sliding-correlator type structures [8]. The detector implementations also assume that channel estimation is done in real-time and that the data rates depend only on the detector. However, many advanced multiuser channel estimation schemes have high computational complexity, even higher than that of multiuser detection, due to the matrix inversions involved, and cannot be performed in real-time. Also, algorithms for estimation and detection are block-computation based due to the need for repeated inversion updates for estimation [13] and multishot detection [5], [14], which makes their real-time implementation more difficult. Matrix-inversion-free schemes such as those based on conjugate gradient descent and recursive least squares (RLS) [15], [16], [17] exist in the literature. We have evaluated the applicability of such schemes for multiuser channel estimation and present one such scheme that has low computational complexity and is suitable for implementation. Jointly performing multiuser channel estimation and detection has been shown to have lower computational complexity and better error rate performance than performing multiuser estimation and detection separately [13]. Hence, we consider this joint algorithm for multiuser channel estimation and detection for redesign from a VLSI architecture perspective. Similar work on a joint channel estimation and detection scheme for TDMA systems with a systolic implementation for Kalman filtering is presented in [18], which also studies word-length effects and provides comparisons with LMS and RLS schemes.
In this paper, we present efficient algorithms for multiuser channel estimation and detection, designed from an implementation perspective and their mapping to real-time VLSI architectures. We redesign a multiuser channel estimation algorithm [13] , based on the maximum likelihood principle and present an iterative scheme, which is computationally effective, suitable for a fixed point implementation and is equivalent to matrix inversion in terms of error rate performance. A new bit-streaming multiuser detection scheme based on parallel interference cancellation is presented that avoids the need for multishot detection [5] , [14] , [19] for a simple bit-streaming pipelined VLSI architecture. Fixed-point implementations of the redesigned algorithms are presented. First, we determine the maximum data rate achievable with no area constraints. Then, we obtain the data rate achieved by an area-constrained architecture. Finally, we present area-time tradeoffs for real-time VLSI architectures to achieve the targeted data rates with minimum area overhead. Thus, the main contribution of this paper is to show real-time performance for multiuser algorithms by (1) designing the algorithms from a fixed-point architecture perspective, without significant loss in error rate performance, (2) task partitioning and (3) designing bit-streaming fixed point VLSI architectures to exploit available pipelining, parallelism and bit-level computations.
II. MULTIUSER CHANNEL ESTIMATION AND DETECTION

A. Real-time requirements
Data transmission in 3G wireless systems such as 3GPP or UMTS is possible at varying rates, from 32 Kbps to 2 Mbps, depending on the spreading factor ($N$), which varies from 256 (for vehicular traffic) to 4 (for indoor environments) respectively (for example, see [3]). The standards assume a chip rate of 4.096 Mcps and Quadrature Phase Shift Keying (QPSK) modulation (2 bits/symbol). We have assumed Binary Phase Shift Keying (BPSK) modulation (1 bit/symbol) in our work for simplicity. Hence, we target data rates in the range of 16 Kbps to 1 Mbps. However, our proposed algorithms as well as our work on fixed-point analysis, pipelining and parallelism can be extended to higher modulation schemes as well. We propose different architectures which explore area-time trade-offs in order to achieve these data rates. We seek to design architectures that meet real-time requirements to within an order of magnitude.
Specifically, we target architecture designs for different spreading gains $N$ to achieve data rates of 16 Kbps, 64 Kbps, 128 Kbps, 256 Kbps and 1 Mbps. Note that the reference to 3G systems is solely an example to illustrate important system features, such as the varying data rates which we seek to target and the use of training sequences for channel estimation.
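The relation between spreading factor and data rate used above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming only the chip rate and modulation orders quoted in the text; the function name `bit_rate_bps` is illustrative and not part of the original work.

```python
CHIP_RATE_CPS = 4.096e6  # chip rate quoted in the text (chips per second)


def bit_rate_bps(spreading_factor, bits_per_symbol=1):
    """Bit rate when each symbol is spread over `spreading_factor` chips."""
    return CHIP_RATE_CPS / spreading_factor * bits_per_symbol


# QPSK (2 bits/symbol) versus BPSK (1 bit/symbol) at the two extreme spreading factors.
for n in (256, 4):
    print(n, bit_rate_bps(n, bits_per_symbol=2), bit_rate_bps(n, bits_per_symbol=1))
```

With BPSK, the extremes $N = 256$ and $N = 4$ give 16 Kbps and about 1 Mbps, which is the target range stated above.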
B. Received signal model
We assume BPSK modulation and use direct sequence spread spectrum signaling, where each active mobile unit possesses a unique signature sequence (short repetitive spreading code) to modulate the data bits ($\pm 1$). The base-station receives a summation of the signals of all the active users after they travel through different paths in the channel. Multipath is caused by reflections of the transmitted signal that arrive at the receiver along with the line-of-sight component. These channel paths induce different delays, attenuations and phase-shifts on the signals, and the mobility of the users causes fading in the channel. Moreover, the signals from different users interfere with each other, in addition to the additive white Gaussian noise (AWGN) present in the channel. Multiuser channel estimation refers to the joint estimation of these unknown parameters for all users to mitigate these undesirable effects and accurately detect the received bits of different users. Multiuser detection refers to the detection of the received bits for all users jointly by canceling the interference between the different users. The performance of multiuser detection depends greatly on the accuracy of the channel estimates. The model for the received signal at the output of the multipath channel [13] can be expressed as
September 11, 2001 DRAFT
After eliminating terms that do not affect the maximization, the log likelihood function becomes
The channel estimate that maximizes the log likelihood satisfies the following equation:
The matrices $R_{bb}$ and $R_{br}$ are defined as follows:
Thus, the computations required to obtain the estimate are (i) the computation of the correlation matrices $R_{bb}$ and $R_{br}$ and (ii) the computation required to solve the linear equation in (3).
D. Multiuser detection
Multiuser detection cancels the interference from other users to improve the error rate performance compared to traditional single-user detection using only a matched filter [20]. We implement multistage detection [14], based on the principle of Parallel Interference Cancellation. This scheme cancels the interference from different users iteratively in stages and has been shown to have computational complexity quadratic in the number of users. It is also possible to feed the channel estimate matrix directly into the multistage detector instead of explicitly extracting the parameters.
D.1 Matched filter detector
The bits of the $K$ users to be detected lie between the boundaries of the received signals $r_i$ and $r_{i-1}$. The matched filter detector [5], [20] correlates the received signal with each user's signature. Hence, the matched filter detector can be represented as
The multistage detector uses the matched filter to get an initial estimate of the bits and then iteratively subtracts the interference from all other users.
D.2 Multistage detector
The multistage detector [14], [23] performs parallel interference cancellation iteratively in stages. The desired user's bits suffer from interference caused by the past or future overlapping symbols of different asynchronous users. Detecting a block of bits simultaneously (multishot detection) can give performance gains [5]. However, in order to do multishot detection, the above model should be extended to include multiple bits. Let us consider $D$ bits at a time (
where the soft decisions $y^{(l)}$ and the corresponding hard decisions are those obtained after stage $l$ of the multistage detector.
These computations are iterated for $l = 1, 2, \cdots, M$, where $M$ is the maximum number of iterations chosen for the desired performance. The hard decisions from the final stage are fed back to the estimation block in the decision feedback mode for tracking in the absence of the pilot signal. Detectors using differencing methods have been proposed [23] to take advantage of the convergence behavior of the iterations. If there is no sign change of the detected bit in succeeding stages, the difference is zero, and this fact is used to reduce the computations. However, this advantage is useful only for sequential execution of the detection loops, as in DSPs. Hence, we do not implement the differencing scheme in our design for a VLSI architecture.
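The stage structure described above can be illustrated with a deliberately simplified sketch. It assumes synchronous users, real-valued signals, no noise and known amplitudes, none of which holds for the asynchronous multipath model of this paper; the names `matched_filter` and `pic_detect` are hypothetical. It shows only the principle: a matched filter gives initial decisions, and each stage subtracts the interference reconstructed from the previous stage's decisions for all other users.

```python
def sign(x):
    return 1 if x >= 0 else -1


def matched_filter(r, codes):
    """Correlate the received chips with each user's signature sequence."""
    return [sum(c * x for c, x in zip(code, r)) for code in codes]


def pic_detect(r, codes, amps, stages=3):
    """Return the hard decisions after the matched filter and after each PIC stage."""
    k = len(codes)
    # Cross-correlations between the signature sequences.
    rho = [[sum(a * b for a, b in zip(codes[i], codes[j])) for j in range(k)]
           for i in range(k)]
    y_mf = matched_filter(r, codes)
    d = [sign(v) for v in y_mf]
    history = [d]
    for _ in range(stages):
        # Parallel cancellation: every user uses the previous stage's decisions.
        y = [y_mf[i] - sum(amps[j] * rho[i][j] * d[j] for j in range(k) if j != i)
             for i in range(k)]
        d = [sign(v) for v in y]
        history.append(d)
    return history


codes = [[1, 1, 1, 1], [1, 1, 1, -1]]   # illustrative signatures, not from the paper
amps = [1, 3]                           # user 2 is much stronger (near-far case)
bits = [1, -1]
r = [sum(amps[u] * bits[u] * codes[u][c] for u in range(2)) for c in range(4)]
history = pic_detect(r, codes, amps)
```

In this toy case the matched filter decides the weak user's bit incorrectly, and the first cancellation stage already corrects it.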
III. REAL-TIME ALGORITHMS FOR MULTIUSER CHANNEL ESTIMATION AND DETECTION
A. Iterative scheme for channel estimation
A direct computation of the maximum likelihood based channel estimate involves the computation of the correlation matrices $R_{bb}$ and $R_{br}$, and then the computation of the solution to (3), $R_{bb}^{-1} R_{br}$, at the end of the pilot. A direct inversion at the end of the pilot is computationally expensive and delays the start of detection beyond the pilot. This delay limits the information rate. In our iterative algorithm, we approximate the maximum likelihood solution based on the following ideas:
1. The product $R_{bb}^{-1} R_{br}$ can be directly approximated using iterative algorithms such as the gradient descent algorithm [16]. This reduces the computational complexity and is applicable in our case because $R_{bb}$ is positive definite (as long as $L \geq 2K$).
2. The iterative algorithm can be modified to update the estimate as the pilot is being received instead of waiting until the end of the pilot. Therefore, the computation per bit is reduced by spreading the computation over the entire training duration. During the $i$-th bit duration, the channel estimate is updated iteratively in order to get closer to the maximum likelihood estimate for a training length of $i$.
Therefore, the channel estimate is available for use in the detector immediately after the end of the pilot sequence.
The computations in the iterative scheme during the $i$-th bit duration are given below:
Since the gradient is known exactly, the iterative channel estimate can be made arbitrarily close to the maximum likelihood estimate by repeating step 3 and using a convergence parameter that is less than the reciprocal of the largest eigenvalue of $R_{bb}^{(i)}$. In our simulations, we observe that a single iteration during each bit duration is sufficient to come very close to the true maximum likelihood estimate by the end of the training sequence. The solution converges monotonically to the true estimate with each iteration and the final error is negligible for realistic system parameters. A detailed analysis of the deterministic gradient descent algorithm can be found in [16], [17], and a similar iterative algorithm for channel estimation for long code CDMA systems is analyzed in [24].
An important advantage of this iterative scheme is that it lends itself to a simple fixed point implementation, which was difficult to achieve using the previous inversion scheme based on maximum likelihood [13]. The multiplication by the convergence parameter can be implemented as a right-shift by making the parameter a power of two, since the algorithm converges for a wide range of parameter values [24].
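The streaming update can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses real-valued signals, $K$ synchronous users (the paper works with $2K$-dimensional bit vectors for asynchronous users), a noiseless pilot, unnormalized running sums for the correlation matrices, and one gradient step $A \leftarrow A - \mu (R_{bb} A - R_{br})$ per received pilot bit. The symbols $A$ and $\mu$ and the function name are illustrative choices.

```python
def iterative_channel_estimate(pilot_bits, received, shift):
    """One gradient-descent update per pilot bit: A <- A - mu * (R_bb A - R_br)."""
    k, n = len(pilot_bits[0]), len(received[0])
    mu = 2.0 ** -shift                      # power-of-two step: a right shift in fixed point
    r_bb = [[0.0] * k for _ in range(k)]    # running sum of b b^T
    r_br = [[0.0] * n for _ in range(k)]    # running sum of b r^T
    a = [[0.0] * n for _ in range(k)]       # channel estimate, one row per user
    for b, r in zip(pilot_bits, received):
        # Update the correlation matrices with the newly received pilot bit.
        for p in range(k):
            for q in range(k):
                r_bb[p][q] += b[p] * b[q]
            for c in range(n):
                r_br[p][c] += b[p] * r[c]
        # A single gradient step during this bit duration.
        grad = [[sum(r_bb[p][q] * a[q][c] for q in range(k)) - r_br[p][c]
                 for c in range(n)] for p in range(k)]
        a = [[a[p][c] - mu * grad[p][c] for c in range(n)] for p in range(k)]
    return a


# Illustrative two-user channel and a 64-bit periodic pilot (values are made up).
true_a = [[1.0, 0.5, 0.25, 0.0], [0.5, -1.0, 0.0, 0.25]]
p1, p2 = [1, 1, -1, -1], [1, -1, 1, -1]
pilot = [(p1[i % 4], p2[i % 4]) for i in range(64)]
received = [[sum(b[p] * true_a[p][c] for p in range(2)) for c in range(4)]
            for b in pilot]
estimate = iterative_channel_estimate(pilot, received, shift=7)
max_error = max(abs(estimate[p][c] - true_a[p][c])
                for p in range(2) for c in range(4))
```

Here the step $2^{-7}$ stays below the reciprocal of the largest eigenvalue of the accumulated autocorrelation matrix for the whole 64-bit pilot, so the estimate is available at the end of the pilot without any matrix inversion.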
The proposed iterative channel estimation can also be easily extended to track slowly time-varying channels. During the tracking phase, bit decisions from the multiuser detector are used to update the channel estimate. Only a few iterations need to be performed for a slowly fading channel, and the previous estimate serves as a very good initialization. The correlation matrices are maintained over a sliding window of length $L$ as follows,
B. Performance comparisons
Iterative algorithms have been proposed earlier for channel estimation and detection in [15] , [25] , [26] , [27] , [28] . In [15] and [25] , several iterative methods for general adaptive filter and equalizer applications are discussed in detail. Specific algorithms applicable for CDMA systems are developed in [26] , [27] , [28] , [29] . Most of these algorithms are based on the method of gradient descent or the method of least squares. These papers mainly target BER performance and they do not consider hardware complexity for a real-time implementation. In this paper, we propose an iterative channel estimation algorithm for multiuser channel estimation suitable for real-time implementation and we show that it has almost the same performance as schemes based on least squares.
As discussed in [15] , the gradient descent algorithms can be broadly classified into two categories, deterministic and stochastic gradient descent. The well known least mean square (LMS) algorithm is a stochastic gradient algorithm, where the actual gradient is not known and is approximated by an estimated noisy gradient. In this paper, we use the deterministic gradient descent algorithm from [15] , [16] , [17] , where the gradient of the objective function is known exactly, to solve the linear equation in (3).
The proposed iterative algorithm to obtain the ML estimate is related to the RLS approach for MMSE estimation. In both cases, the estimate for preamble length $l$ aims to minimize the squared error for that particular length $l$. However, we use the known gradient to obtain the estimate, as opposed to the RLS algorithm, which does not rely on gradient descent. Another difference between our iterative approach and RLS is that we use a sliding window update, whereas RLS uses an exponential weight factor update. For the case of AWGN, we note that the ML and MMSE estimation approaches lead to the same solution for obtaining the channel estimate.
A comparison of the performance of our iterative scheme against the RLS algorithm is shown in Figure 2. The simulations were performed for 8 equal power users with a spreading code of length 16 for an AWGN channel having 3 multipath reflections at 10 dB SNR. The Bit Error Rate (BER) is calculated using the channel estimates after the end of the pilot phase for two types of detectors, a Matched Filter Detector (MF) [5], [20] and a Multistage Multiuser Detector (MUD) [14]. The users are all transmitting at the same power over a static channel with 3 paths of relative strengths 1, 0.5 and 0.33. Although the detection algorithm can handle the near-far problem, we simulated the equal power scenario as it generates the worst case for multistage detection. To use a sliding window update, we choose an exponential weighting factor of 1 for RLS in our simulations. From Figure 2, it can be seen that our iterative scheme (ITER) performs almost as well as the RLS algorithm and the actual matrix inversion.
The value of the convergence parameter should be less than the reciprocal of the largest eigenvalue of $R_{bb}^{(i)}$ for convergence.
Since the maximum eigenvalue of $R_{bb}^{(i)}$ increases with the preamble length, a larger convergence parameter is possible for a smaller preamble length. Therefore, faster convergence can be achieved for smaller preambles. The maximum value of the convergence parameter that provides stability for a given preamble can be chosen at the receiver for fastest convergence.
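One way to pick such a step in practice is sketched below, purely as an illustration and not as part of the original design: the largest eigenvalue of a symmetric positive definite matrix is estimated by power iteration, and the largest power-of-two step strictly below its reciprocal is selected. The function names and the example matrix are hypothetical, and the fixed starting vector is assumed not to be orthogonal to the dominant eigenvector.

```python
def largest_eigenvalue(matrix, iterations=200):
    """Power iteration for a symmetric positive definite matrix (list of lists)."""
    n = len(matrix)
    v = [1.0 + i for i in range(n)]          # arbitrary nonzero starting vector
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector.
        lam = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                  for i in range(n))
    return lam


def power_of_two_shift(lam_max):
    """Smallest right shift s such that 2**-s is strictly below 1 / lam_max."""
    shift = 0
    while 2.0 ** -shift >= 1.0 / lam_max:
        shift += 1
    return shift


example = [[5.0, 1.0], [1.0, 5.0]]           # eigenvalues 6 and 4
lam = largest_eigenvalue(example)
shift = power_of_two_shift(lam)
```

For this example the bound is $1/6$, so the selected step is $2^{-3}$.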
Therefore, the performance of our iterative algorithm is almost the same as that achieved by the RLS algorithm or the exact ML algorithm. From Figure 2, we can see that the performance curves almost flatten out after a window length of 128 and henceforth, we use $L = 128$ as our window length for simulations. Since, for this window length, the larger convergence parameter and 1/1024 give the same performance, we will use 1/1024 henceforth in our simulations for greater stability.
Our iterative scheme is less computationally complex than RLS as we avoid the computation of the gain vector with every iteration. The RLS algorithm uses the matrix inversion lemma [15] to avoid matrix inversion but requires scalar division. Though the order of complexity in terms of multiplication and addition is the same for both the iterative scheme and RLS ($O(K^2 N)$ per bit), the RLS scheme requires $O(KN)$ more divisions. The complexity difference may be thought of as the additional complexity of finding a new gain vector for every iteration in RLS, compared to the fixed convergence parameter used in our iterative scheme. Our iterative scheme is also more suitable for a hardware implementation than RLS. In a systolic implementation, our proposed iterative algorithm uses only truncated multipliers and adders and does not require any special boundary cells. For implementation of RLS, matrix decomposition techniques such as QR have been used [15]. The QR decomposition can also be implemented efficiently in fixed-point using systolic arrays [30], [31]. However, the cells in the array (especially the boundary cells, which need to compute the Givens rotation) [15], [31] have more computational complexity than the cells used in our iterative algorithm.
Thus, we show that our proposed iterative algorithm has a lower computational complexity than RLS and is also more suitable for a hardware implementation. We now evaluate the performance of the iterative scheme with respect to the original ML scheme for different SNRs and for fading channels.
The analysis of the system for a multipath fading channel with tracking is shown in Figure 3. Here we see that the proposed tracking scheme based on the update equations (16)-(17) is able to effectively track the time-varying channel. The poor performance of the static channel assumption for this Rayleigh fading channel (with mobile velocity 10 km/h) at a carrier frequency of 1.8 GHz shows the importance of tracking. The simulation was done for 15 equal power users with a window length of 128 (and preamble length of 128). For faster fading, the window length needs to be decreased appropriately. The original channel estimation scheme requires a matrix inversion and matrix multiplication for every update, while the iterative scheme reduces the complexity to a matrix multiplication per update.
C. Pipelined detection
The multishot detection scheme [14], [32] described in the earlier section is block-based. Such a block-based implementation needs a windowing strategy and has to wait until all the bits needed in the window are received and are available for computation. This results in taking a window of $D$ bits and using it to detect $D - 2$ bits, as the edge bits are not detected accurately due to windowing effects. Thus, there are 2 additional computations per block and per iteration that are not used. The detection is done in blocks and the two edge bits are thrown away and recalculated in the next iteration. However, the stages in the multistage detector can be efficiently pipelined [19] to avoid edge computations and to work on a bit streaming basis. This is equivalent to the normal detection of a block of infinite length, detected in a simple pipelined fashion. Also, the computations can be reduced to work on smaller matrix sets. This can be done due to the block tri-diagonal nature of the matrix.
The detection can now be pipelined as shown in Figure 4. An example highlighting the calculation of bit 3 in the detector is shown. An initial estimate of the received signal is obtained using a matched filter detector, which depends only on the current and the past received bits. The stages of the multiuser detector need bits 2 and 4 of all users to cancel the interference for bit 3. Hence, the first stage can cancel the interference only after the matched filter estimates of bits 2 and 4 are available. The other stages have a similar structure. Hence, while bit 3 is being estimated from the final stage, the matched filter is estimating bit 9, the first stage bit 7 and the second stage bit 5. There are no edge bit computations in this scheme, so we save the 2 edge-bit computations out of every $D$ (a fraction $2/D$) per detection stage, where $D$ is the detection window length including the edge bits.
Also, instead of detecting a block of bits, each bit is detected in a streaming fashion, which reduces the worst-case latency and the memory requirements of block computation in proportion to the detection window length.
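The bit-streaming schedule in the Figure 4 example can be written down directly. The sketch below assumes a three-stage detector and takes the two-bit offset between consecutive units from the example in the text (final stage on bit 3, second stage on bit 5, first stage on bit 7, matched filter on bit 9); the function name is illustrative.

```python
def pipeline_schedule(final_bit, num_stages=3):
    """Bit index handled by each unit while the last stage outputs `final_bit`."""
    schedule = {}
    for stage in range(num_stages, 0, -1):
        # Each earlier stage runs two bits ahead of the stage that follows it.
        schedule[f"stage {stage}"] = final_bit + 2 * (num_stages - stage)
    schedule["matched filter"] = final_bit + 2 * num_stages
    return schedule


print(pipeline_schedule(3))
```

Every unit is busy on a different bit in each bit period, so no edge bits are recomputed and no block of received data has to be buffered before detection starts.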
D. Fixed-point implementation
We developed a model of the system in C++ using fixed-point "classes" in order to study the performance of the system with different precision requirements. The multiplications and addition operations were "over-loaded" so as to saturate if the available precision were to be exceeded. Since the received signal amplitude depends on the number of users in the system, the number of multiple path reflections, the spreading gain and the signal-to-noise ratio, the amount of precision required by the A/D converter is given by precision (in bits)
We study the effects of finite precision on the estimation and detection algorithms based on their performance using simulations. A detailed analysis of the algorithms for finite precision (as in [33]) is challenging and is not the focus of this paper. We present two simulation results of the algorithms for finite precision with different spreading gains. Figure 5 shows the bit error rate performance of the channel estimation and detection algorithms for a spreading gain of 16 with 8 users. Figure 6 shows the performance for a spreading gain of 32 with 15 users. In each case, we choose a preamble length of 128 and a convergence parameter of 1/1024 (chosen to be smaller than the reciprocal of the largest eigenvalue of $R_{bb}^{(i)}$ for all $i$ in order to ensure convergence).
Based on the simulations performed, we have made the following observations:
1. We see that 16-bit fixed point multiuser channel estimation and detection performs almost as well as floating point precision multiuser estimation and detection. In fact, for $N = 16$ and $K = 8$ the performance begins to degrade only at 13-bit precision and for $N = 32$ and $K = 15$ the performance degrades at 14-bit precision.
3. The finite precision of the computations has greater impact on the performance of multiuser algorithms than on single-user algorithms. The matched filter receiver starts degrading only at 8-bit precision. This is reasonable to expect, as the computations required for interference cancellation are more complex than those for matched filter detection. While matched filter detection requires just an inner product computation, multiuser detection requires us to solve a linear equation. Furthermore, significant performance gain is achieved in multiuser detection (compared to matched filter detection) with the extra precision.
4. Higher spreading gains and larger numbers of users imply a larger number of multiply-and-accumulates, which may easily saturate the multipliers and adders. Hence, we see that going from $N = 16$ to $N = 32$ shows a slight increase in precision requirements (from 14 to 16).
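The saturating arithmetic used in the fixed-point model can be illustrated with a small class. The authors' model was written in C++; the Python sketch below is only an analogue under assumed parameters (16-bit words with 8 fractional bits), and the class name `Fixed` and its methods are hypothetical.

```python
class Fixed:
    """Signed fixed-point value that saturates instead of wrapping around."""

    def __init__(self, raw, bits=16, frac=8):
        self.bits, self.frac = bits, frac
        hi = (1 << (bits - 1)) - 1
        lo = -(1 << (bits - 1))
        self.raw = max(lo, min(hi, raw))     # clamp to the representable range

    @classmethod
    def from_float(cls, x, bits=16, frac=8):
        return cls(int(round(x * (1 << frac))), bits, frac)

    def to_float(self):
        return self.raw / (1 << self.frac)

    def __add__(self, other):
        return Fixed(self.raw + other.raw, self.bits, self.frac)

    def __mul__(self, other):
        # The full-width product is truncated back to `frac` fractional bits.
        return Fixed((self.raw * other.raw) >> self.frac, self.bits, self.frac)


a, b = Fixed.from_float(1.5), Fixed.from_float(2.25)
total = (a + b).to_float()
product = (a * b).to_float()
saturated = (Fixed.from_float(100.0) + Fixed.from_float(100.0)).to_float()
```

Sweeping `bits` in a simulation built on such a type is one way to reproduce the kind of precision study reported in the observations above.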
IV. TASK DECOMPOSITION AND VLSI ARCHITECTURES
A. Task decomposition of multiuser channel estimation and detection
The various sub-blocks in the joint multiuser channel estimation and detection algorithm are shown in Figure 7. For the sake of convenience, we henceforth represent the current inputs $b_i$, $r_i$ as $b$, $r$ and $b_{i-L}$, $r_{i-L}$ as $b_0$, $r_0$ respectively. All the architectures assume single-cycle multiplication and addition, as both multiplication and addition can be implemented in $\log(n)$-type computations [34], where $n$ is the number of bits; the single-cycle assumption also helps us with the DSP comparisons. We assume that a Wallace or Dadda multiplier tree [34] is used for multiplication, requiring $O(n^2)$ 1-bit Full Adders (FA) for an $n$-bit multiplication. Since the multiplication by the convergence parameter in (15) (implemented as a shift) results in truncation of the output, a truncated multiplication using significantly less hardware [35] can be used. The delays of blocks such as multiplexers and gates are assumed to be included in the single-cycle delay. For an area estimate of the architectures, we consider the number of 1-bit FA cells in the design. It can be observed from Figure 7 that the bottlenecks in the pipeline are the matrix multiplications of $R_{bb}$ with the channel estimate for channel estimation and the calculation of the $L$ matrices for multiuser detection.
We explore different area-time tradeoffs to develop real-time architectures with minimum area overhead. We explain the design in detail for a time-constrained architecture which shows the upper bound on data rates with no constraints on hardware and then show that by constraining hardware, we are able to design different architectures to meet real-time requirements with minimum area overhead. We have considered only the computational complexity for our analysis and have ignored the analysis of the memory requirements. This is because the focus of our paper was on the computational complexity and area-time tradeoffs needed to meet real-time requirements. We have done an analysis for the memory requirements in previous work for channel estimation [36] . Figure 8 shows the achievable data rates and Figure 9 shows the transistor count for the architectures discussed below. We assume 28 transistors per 1-bit standard FA cell as in [34] .
B. Area-Time tradeoffs for channel estimation architectures
B.1 Time-constrained architecture
The block diagram of a time-constrained architecture is shown in Figure 10. In this architecture, the available parallelism in the algorithm is exploited to the maximum extent. Hence, all the elements needed to perform a parallel matrix multiplication are computed simultaneously. The entire matrix $R_{bb}$ and the channel estimate matrix are multiplied using an array of multipliers. The cross-correlation matrix $R_{br}$ is subtracted from the entire product matrix, the result is shifted and a new channel estimate is formed. Thus, as the time taken by the other computations is pipelined with the time for the multiplication, the output matrix can be formed in parallel every $\log_2(2K) + 1$ cycles using $4K^2 N$ multipliers. This is because each element of an $N \times N$ product matrix can be computed in $\log_2(N) + 1$ time using $N^3$ multipliers and a tree structure to compute the inner products [37] in a time-constrained architecture.
We also exploit the bit-level arithmetic and parallel structure of the correlation matrices to form the correlation matrices simultaneously within a cycle. Since the auto-correlation matrix update is a symmetric matrix and all the diagonal elements are 1's (each bit multiplied by itself gives 1), we need to compute only the strictly upper triangular (or lower triangular) part of the auto-correlation matrix. Also, as the updates are all +1's or -1's, they can be obtained from a simple single-bit XNOR gate structure. As the auto-correlation matrix is always updated and down-dated by $\pm 1$'s, increment/decrement counters can be used in place of general adders in our design. Also, the elements in the cross-correlation update are $+r$ or $-r$ and hence, the vector $r$ can be directly added to or subtracted from every column of the cross-correlation matrix based on the sign of the corresponding bit.
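The XNOR observation can be checked with a short sketch. It assumes bits are stored as 0/1 with 0 representing $-1$ and 1 representing $+1$ (a storage convention chosen here for illustration), and computes only the strictly upper triangular part of the update, as described above; the function names are hypothetical.

```python
def xnor(a, b):
    """1 when the two stored bits are equal, else 0."""
    return 1 - (a ^ b)


def autocorr_updates(bits01):
    """Strictly upper-triangular +/-1 updates for a bit vector stored as 0/1."""
    k = len(bits01)
    return {(p, q): (1 if xnor(bits01[p], bits01[q]) else -1)
            for p in range(k) for q in range(p + 1, k)}


def to_pm1(bit01):
    return 1 if bit01 else -1


bits01 = [1, 0, 1]
updates = autocorr_updates(bits01)
# Cross-check against ordinary multiplication of the +/-1 values.
reference = {(p, q): to_pm1(bits01[p]) * to_pm1(bits01[q])
             for p in range(3) for q in range(p + 1, 3)}
```

Each entry of `updates` would drive one increment/decrement counter in hardware, so no multiplier is needed for the auto-correlation update.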
The area requirements for the time-constrained architecture are shown in Figure 9. The area requirements range up to about $10^{12}$ transistors. This is a highly aggressive solution with today's technology, and it is not feasible to devote so many FA cells just to channel estimation, which is only a part of the complete receiver. However, it establishes the theoretical minimum time requirement obtained by exploiting the available parallelism as $\log_2(2K) + 1$, which is the time required to do the parallel multiplication and the pipelined integration with the other blocks. We require $2KN(2K-1)$ adders for doing the recursive doubling [37] in $\log_2(2K)$ time (adding $2K$ elements in $\log_2(2K)$ time requires $(2K-1)$ adders) and $2KN$ adders for the subtraction following the multiplication. The data rates achieved by this fully parallel architecture are shown in Figure 8. We can see that we are able to get 1 to 2 orders of magnitude more performance than necessary using the amount of parallelism in the algorithms. Therefore, we propose better area-time tradeoffs more closely matched to the target data rates in Section IV-B.3.
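The adder count and depth of the recursive-doubling step can be verified with a small sketch, which assumes a non-empty input and counts one adder per pairwise sum; the function name is illustrative. For $2K = 16$ inputs it uses $2K - 1 = 15$ adders in $\log_2(2K) = 4$ levels.

```python
def tree_sum(values):
    """Sum a non-empty sequence pairwise; return (total, levels, adders used)."""
    level, adders, steps = list(values), 0, 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)                   # one adder per pair at this level
        if len(level) % 2:
            nxt.append(level[-1])            # odd element passes through unchanged
        level, steps = nxt, steps + 1
    return level[0], steps, adders


total, levels, adders = tree_sum(range(1, 17))   # 2K = 16 products of one inner product
```

Repeating this tree for each of the $2KN$ elements of the product matrix gives the $2KN(2K-1)$ adders quoted above.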
B.2 Area-constrained architecture
For an area-constrained architecture, we assume that only a single multiplier and adder are available. Thus, the matrix-matrix multiplication, performed serially, takes $4K^2 N$ cycles. The data rates achieved and area requirements for this architecture are shown in Figures 8 and 9. We see that though the serial architecture uses very little area, it falls below real-time requirements by 1 to 2 orders of magnitude.
B.3 Data rate targeted area-time tradeoffs
In this section, we use part of the available parallelism to achieve real-time performance with minimum area overhead. We use a vector multiplier that calculates each row of the multiplication in parallel. This is shown in Figure 8 and Figure 9 as AREA-TIME1. Thus, the multiplication now takes $2KN$ cycles at a $2K\times$ increase in the number of multipliers. This appears to meet real-time requirements up to $N = 32$, as seen in Figure 8. Beyond $N = 32$, however, greater amounts of parallelism need to be used to meet real-time requirements: we found that 16 columns of the matrix additionally need to be computed in parallel. This implies that the matrix multiplication is done in $KN/8$ cycles at a further $16\times$ increase in the number of multipliers. This is shown in Figure 8 and Figure 9 as AREA-TIME2.
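The cycle counts of the four channel estimation architectures can be compared side by side. The sketch below assumes the matrix product is a $2K \times 2K$ matrix times a $2K \times N$ matrix, so that the serial count is $4K^2 N$, the row-parallel count is $2KN$, and 16-column parallelism divides that by 16; these expressions are a reading of the text and the function name is hypothetical.

```python
import math


def estimation_cycles(k, n):
    """Cycles per matrix multiplication for each channel estimation architecture."""
    return {
        "area-constrained": 4 * k * k * n,                   # single multiplier
        "AREA-TIME1": 2 * k * n,                             # one row of 2K products per cycle
        "AREA-TIME2": (2 * k * n) // 16,                     # plus 16 columns in parallel
        "time-constrained": math.ceil(math.log2(2 * k)) + 1, # adder tree plus one cycle
    }


cycles = estimation_cycles(k=8, n=16)
```

For $K = 8$ and $N = 16$ this spans roughly three orders of magnitude between the serial and the fully parallel design, which is the range the intermediate tradeoffs are meant to cover.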
C. Area-Time tradeoffs for multiuser detection architectures
C.1 Time-constrained architecture
A detailed task partition of the blocks for multiuser detection is shown in Figure 11. The blocks consist of a matched filter detector, which provides the initial hard and soft ($y$) estimates to the parallel interference cancellation stages. A three-stage detector is chosen for implementation as it provides sufficient convergence [23].
An array of parallel multipliers is used for computing the entire matched filter estimate 
C.3 Data rate targeted area-time tradeoffs
The area-time complexity for multiuser detection is found to be similar to that for channel estimation and hence, we use the same type of area-time tradeoffs as before. This is shown in Figure 8 and Figure 9 as AREA-TIME1. Thus, the multiplication now takes $KN$ cycles at a $K\times$ increase in the number of multipliers. This can potentially meet real-time requirements up to $N = 32$, as observed in Figure 8. Beyond $N = 32$, however, greater amounts of parallelism need to be used to meet real-time requirements: we found that 16 columns of the matrix also need to be computed in parallel. This implies that the multiplication is done in $KN/16$ cycles at a further $16\times$ increase in the number of multipliers. This is shown in Figure 8 and Figure 9 as AREA-TIME2.
V. RESULTS AND COMPARISONS
A. Computational savings
The computational advantages of the newly proposed schemes over the previous schemes are shown in Table I. Similarly, for comparing the detection schemes, we assume that a window of $D$ bits needs to be detected.
For every window, we save $O(2MK^2)$ computations, assuming an $M$-stage detector, as the edge bits do not need to be calculated. A fully pipelined time-constrained detector can reduce the time requirements to $O(\log_2(N) + 3)$ by exploiting available parallelism. Note that the enhanced algorithms, as seen from Table I, do not have inherent computational savings but are designed to benefit from exploiting parallelism and pipelining in an architecture. Thus, significant benefits in performance can be achieved by enhancing the existing schemes for channel estimation and detection with schemes that have an efficient hardware implementation and exploit the available parallelism.
B. Comparisons with DSPs
Though DSPs and general purpose processors with MMX-enhanced instruction sets exploit byte-wide parallelism, they are inefficient for processing bits. Storing bits as bytes on such processors is inefficient, as a large overhead is involved in packing and unpacking these bits. Also, the compiler may not replace bit-level multiplications with additions and subtractions. Using a control structure instead also limits the utilization of available parallelism. Forming bit-level matrix updates in parallel with XNOR gates is much simpler and more effective than performing them as sequential multiplications on DSPs. This poor performance is due to the computation of a matrix multiplication per received bit on the DSP.
The frequency of updates to the channel estimates can be reduced for slow fading channels for better time performance. Similarly, detection takes ¾¼Ñ× for all 32 users. The low data rate performance of the detector is because we consider a more realistic and complete system with continuous updating of channel estimates to the detector as compared to a static channel assumption and neglecting effects of channel estimation in other detector DSP implementations [6] , [23] .
VI. CONCLUSIONS
We first present computationally efficient algorithms to meet real-time requirements of multiuser channel estimation and detection in future wireless base-stations. Existing algorithms for multiuser channel estimation and detection are redesigned from an implementation perspective for a reduced complexity solution. The maximum likelihood based channel estimation algorithm requiring matrix inversions, block-based computations and floating point accuracy is redesigned as an iterative scheme, which has a simpler fixed point VLSI architecture and reduced complexity. Multiuser detection is also redesigned for a pipelined structure that reduces the memory requirements and the worst-case latency in proportion to the detection window length. The edge bit computations in the block scheme are eliminated, saving 2 of every $D$ bit computations per detection stage.
We then present fixed point, real-time VLSI architectures for multiuser channel estimation and detection. The proposed VLSI architecture schemes can be integrated with DSP architectures as co-processor support [39] to build single DSP base-station solutions. Bit-level extensions [40] can also be similarly
The matrix inversion based scheme assumes a static channel and is not updated with decision feedback, while the iterative scheme is updated every bit. The convergence parameter is chosen as 1/1024. A pilot sequence of 128 bits was used initially to obtain the channel estimates.
