In this paper, we propose a reduced-complexity and power-efficient System-on-Chip (SoC) architecture for adaptive interference suppression in CDMA systems. The adaptive Parallel-Residue-Compensation architecture leads to significant performance gain over conventional interference cancellation algorithms. The multi-code commonality is exploited to avoid direct Interference Cancellation (IC), which reduces the IC complexity from $O(K^2 N)$ to $O(KN)$. The physical meaning of complete versus weighted IC is applied to clip the weights above a certain threshold so as to reduce the VLSI circuit activity rate. Novel scalable SoC architectures based on simple combinational logic are proposed to eliminate dedicated multipliers, with at least 10× saving in hardware resources. A Catapult C High Level Synthesis methodology is applied to explore the VLSI design space extensively and achieve at least 4× speedup. A multi-stage Convergence-Masking-Vector combined with clock gating is proposed to reduce the VLSI dynamic power consumption by up to 90%.
I. Introduction
Multiple Access Interference (MAI) is one of the major limiting factors to system capacity in CDMA systems. In [1], [2], a Parallel Interference Cancellation (PIC) algorithm was developed to detect multiple users simultaneously by completely cancelling the estimated MAI from all remaining users. Since it is much simpler than the Maximum-Likelihood (ML) multi-user detector, it has been well accepted as one of the most practical algorithms for real-time implementation. A multi-stage real-time VLSI architecture based on this algorithm was reported in [3]. Related implementation schemes are found in [4] and [5]. However, when the interference estimation is not accurate (e.g., when the system load is high or the receiver is in the early detection stages), cancelling the wrong estimate may even add more interference to the signal. This leads to the so-called "ping-pong" effect in the conventional PIC algorithms. In such situations, it is preferable not to cancel the entire estimated interference. Divsalar et al. [6] proposed a partial PIC (PPIC) algorithm by introducing a weight in each stage. The stage-specific weights are found by a trial-and-error computer search for all the users with the only intuitive constraint $0 < w_1 < w_2 < \cdots < w_m < 1$, where $w_m$ is the weight at stage $m$. A DSP prototype based on this improved algorithm was reported in [7]. However, because of the limited parallelism, the DSP-based prototype supported only a small number of users for relatively low-speed systems.
The intuitive weights applied in [6] are far from optimal because no optimization criterion is applied in finding them. To seek better weights, an adaptive PIC based on the Minimum Mean Squared Error (MMSE) criterion was proposed in [8]. The weight for each user in each stage is computed by an adaptive Normalized Least-Mean-Square (NLMS) algorithm. However, the NLMS algorithm increases the system complexity considerably, which makes real-time implementation very challenging. Although some VLSI architectures for DS/CDMA receivers can be found in [11], [12] and [13], VLSI architecture work for adaptive interference cancellation has not yet been reported in the literature. The extra complexity demands special treatment to meet the real-time requirement and the hardware resource limit.
To achieve the goal of real-time implementation with an efficient architecture, both algorithmic and architectural optimizations are explored in this paper. Power consumption is an essential consideration for both VLSI and DSP processor implementations, especially for mobile devices. An SoC design architecture has many advantages over general-purpose DSP processors by providing higher parallelism and pipelining, lower power consumption and more compact size. Algorithmic transformations such as pipelining and parallel processing can be used to reduce power consumption. On the other hand, shutting down some computation blocks leads to fewer instructions in a DSP and fewer cycles in a VLSI implementation. The power savings achieved in this manner can be significant but are very algorithm dependent. Properly addressing when and how to shut down can result in substantial improvement in energy efficiency with little or no loss in performance.
To design power saving schemes for the SoC architecture, we explore the physical meaning of the weights in the adaptive interference cancellation algorithm. A conventional complete PIC is considered as a special case of the adaptive scheme with weight '1' for all users. When the interference weight for one particular user in one stage is '1', it implies that the symbol of that user is estimated "almost" correctly and the interference from that user is "completely" cancelled. We then investigate the inter-stage features of the user-specific weights. In the early stages, the NLMS algorithm adjusts the weights more significantly since the symbol detection is less accurate than in later stages. The weight tends to converge to '1' in later stages as the MSE converges. After the first stage, only a small portion of the weights will diverge from the initial values. Thus, the distance of one user's weight from the initial value is used as an indicator of the accuracy of the symbol detection of that user. A Convergence-Masking-Vector (CMV) is generated by comparing each user's weight with a given threshold at each stage. The vector contains only flags (0 or 1) that indicate whether the weight has converged or not.
The CMV is combined with clock gating as a dynamic power management function for the multi-stage components of VLSI architectures. If the CMV indicates a convergence, then there is no need to update the weight for this user at all later stages and the corresponding components in later stages are shut down. Simulation shows that the active rate can be dropped to 60% after stage 1 and to 10% after stage 2 with a threshold of 90%, which leads to negligible performance loss. This gives 40% dynamic power savings in stage 1 and 90% savings after stage 2.
There exist many area/time tradeoffs in the SoC architecture [12]. The conventional VHDL-based design methodology is very time consuming and makes it difficult to explore the design space extensively [15]. In this paper, a Catapult C High Level Synthesis (HLS) design methodology is proposed to explore the design space extensively using layered parallelism and pipelining [9]. Special bit-ware VLSI units based on the simple Sumsub-Mux Unit (SMU) are proposed for the bottleneck design blocks to eliminate the use of dedicated ASIC multipliers. This reduces the hardware complexity dramatically and achieves efficient tradeoffs between parallel and pipelined architectures. The most area/time efficient VLSI architectures are implemented in a Field Programmable Gate Array (FPGA) prototyping system, giving at least 10× saving in hardware resources over a multiplier-based design and at least 4× speedup over the conventional area-constrained architecture.
The paper is organized as follows. In section II, we present the system model and conventional receivers. In section III, the symbol level adaptive PRC is presented followed by the power efficient CMV architecture. Section IV describes the SoC architecture design methodology, system partitioning and the derivation of the SMU combinational logic. The VLSI architecture for the dominant weight updating and PRC modules based on SMU blocks are presented in section V. In section VI, we provide the VLSI design space exploration results and the fixed-point bit error rate performance from emulations. The paper concludes in section VII.
II. System Model and Conventional Receiver
We consider a synchronous multi-code CDMA system using QPSK modulation. The $n$th symbol for the $k$th user at the transmitter is mapped to a constellation point using a group of two bits. The symbol output at the modulator is $s_k(n) \in \{(\pm 1 \pm j)/\sqrt{2}\}$ with equal probability, where $j = \sqrt{-1}$. In an AWGN channel, we denote the received complex baseband signal at the $i$th chip of the $n$th symbol by $r(i)$. We focus on the $n$th symbol and omit the symbol index for notational simplicity in the following.
By collecting the $N$ chip samples in one symbol duration into a vector, we form the received signal vector, which is despread by the multi-user matched filter.
The matched filter output is then corrected by the channel estimation phase and sent to a multiuser demodulator. At the demodulator, the estimated bits of the $k$th user are detected by hard decision.
A. Complete Multistage PIC
A multi-user Parallel-Interference-Cancellation (PIC) algorithm was proposed in [1] for simultaneous detection of all users. By assuming the bit estimates of the $(m-1)$th stage as the transmitted bits for each user, it estimates the interference at the $m$th stage for each user by reconstructing the transmitted signal of all users excluding the user itself. Here $\hat{s}_j^{(m-1)}$ denotes the modulator symbol output for user $j$ at the $(m-1)$th stage, obtained from the hard-decision bits of the $(m-1)$th stage. The estimated interference is subtracted completely from the received signal for each user. The corrected signal is despread and demodulated as in (4) to generate a more accurate estimate of the bits. This process is repeated iteratively for multiple stages.
B. Partial PIC Receiver
It is pointed out in [6] that if the estimation of the early stages is not accurate enough, the complete PIC can even add more interference to the signal. To achieve more accurate interference cancellation, a partial weight is introduced for each stage. The weights are chosen based on the intuition that the estimation from the earlier stages is less accurate than that from later stages, so less interference should be cancelled early on. The intuitive weights are found by a trial-and-error computer search with the only constraint that $w_1 < w_2 < \cdots < w_m$. The more accurate signal used for demodulation of each user is generated by adding the partially interference-cancelled signal and a weighted soft input signal of the previous stage as in (5).
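The difference between complete and partial cancellation can be illustrated with a minimal Python sketch. It is not the receiver of [1] or [6]: it assumes unit channel gains, no noise, real ±1 spreading codes and the unnormalised alphabet {±1 ± j}, and the function names (`despread`, `hard_qpsk`, `pic_stage`) are hypothetical. With `weight=1.0` the stage performs complete PIC; with `0 < weight < 1` the cancelled output is blended with the soft output of the previous stage.

```python
def despread(chips, code):
    """Matched filter: correlate the chips with one user's +-1 code, scaled by 1/N."""
    return sum(x * c for x, c in zip(chips, code)) / len(code)


def hard_qpsk(z):
    """Hard decision onto the unnormalised QPSK alphabet {+-1 +-1j} (assumption)."""
    return complex(1 if z.real >= 0 else -1, 1 if z.imag >= 0 else -1)


def pic_stage(r, codes, soft_prev, weight=1.0):
    """One PIC stage for all users.

    weight=1.0 gives complete PIC; 0 < weight < 1 blends the cancelled
    output with the previous soft output (partial PIC).
    """
    hard_prev = [hard_qpsk(z) for z in soft_prev]
    soft_new = []
    for k, code_k in enumerate(codes):
        cleaned = []
        for i, r_i in enumerate(r):
            # Reconstructed interference from every user except user k.
            mai = sum(hard_prev[j] * codes[j][i]
                      for j in range(len(codes)) if j != k)
            cleaned.append(r_i - mai)
        y = despread(cleaned, code_k)
        soft_new.append(weight * y + (1.0 - weight) * soft_prev[k])
    return soft_new


# Two users with correlated length-4 codes, no noise, unit channel gain.
codes = [[1, 1, 1, 1], [1, 1, 1, -1]]
sent = [1 + 1j, -1 + 1j]
r = [sum(s * c[i] for s, c in zip(sent, codes)) for i in range(4)]
stage0 = [despread(r, c) for c in codes]            # matched filter outputs
complete = pic_stage(r, codes, stage0)              # weight = 1
partial = pic_stage(r, codes, stage0, weight=0.5)   # stage weight 0.5
```

In this noise-free toy case the matched filter decisions are already correct, so complete cancellation recovers the transmitted symbols exactly, while the partial stage moves the soft outputs only half of the way.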
III. Low Power Adaptive PRC
Despite the performance gain of PPIC over the complete PIC, the intuitive weighting scheme is far from optimal. For better accuracy, it is preferable to choose individual weights for each user depending on the accuracy of the symbol detection. To achieve this, a set of weights is introduced in [8] for each user in each stage. By defining a cost function in terms of the squared Euclidean distance between the received signal $r(i)$ and the weighted sum of all users' estimated signals, the optimal weights are obtained by minimizing the MSE of the cost function, where the weighted sum of all users' hard-decision symbols at the $m$th stage is given by (7). The weights are computed by the NLMS recursion in (8), where $\mu$ is the step size and $\Omega^{(m-1)}$ is the input vector to the NLMS algorithm. The interference for each user in the adaptive PIC is estimated in a direct form for all the $K$ users. A more accurate chip-level signal is then generated for each user, from which more accurate symbols are detected.
A. Adaptive Parallel Residue Compensation
Since the computational complexity determines the cost of the necessary hardware resources, such as the number of functional units, it is one of the most important considerations in the implementation of PIC schemes. The complexity of the direct-form PIC in one chip for $K$ users is $4K(K-1)$ real multiplications, $2K(K-1)$ real additions and $2K$ subtractions. Moreover, there is one "if" statement, which is mapped to a hardware comparator, in each user loop. This makes the loop structure irregular and not very suitable for pipelining. Considering the regularity of the computations for all users, we change the order of "interference estimation" and "interference cancellation". The new architecture has the following steps:
1. "Weighted-Sum-Chip Function": by summing up all users' weighted signals, we get the weighted estimate of the received signal in chip-rate samples, as in (9).
2. "Residue Error Generation": a common residual signal for all users is generated by a single subtraction from the original signal.
3. "Parallel Residual Computation": in the final step, this residual error is compensated to each user to get the interference-cancelled chip signal.
The aforementioned procedure constructs a "Chip-Level" PRC (CL-PRC) structure if the multi-user "chip-matched filter" is carried out on the corrected chip signals directly as described. However, by jointly considering the matched filter and the residue compensation step in (9), (10) and (11), the 0th-stage multi-user matched filter output can be utilized to generate the "Symbol-Level" PRC (SL-PRC) architecture. The "spreading" and then "matched filter" procedure for the weighted symbols of each user is redundant at chip level. We only need to do matched filtering for the weighted-sum chips as in (12). The soft-decision matched filter output of the corrected signal is finally generated at the symbol level as in (13). The optimally weighted symbol in (9) can be computed and stored in registers or memory arrays before the spreading in (9).
The complexities of one-stage PRC and matched filters for the three different schemes are summarized here. It is seen that the interference cancellation complexity is reduced from the order $O(K^2 N)$ to $O(KN)$, which is linear in the number of users. Overall, the symbol-level PRC has the minimum complexity. Although it is similar to the chip-level PRC, the loop chain for the chip index is more compact and regular for scheduling the pipelined and parallel architecture than the chip-level processing, which leads to a faster VLSI design.
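The reordering of "interference estimation" and "interference cancellation" can be checked numerically. The following Python sketch assumes that the chip-level replica of user $j$ is already available as `replicas[j][i]` and that the weights are real; these names and the toy data are illustrative only. The direct form evaluates a sum over the other $K-1$ users for every user, whereas the residue-compensation form computes one weighted sum and one residue shared by all users.

```python
def direct_form(r, replicas, weights):
    """Per-user cancellation: subtract every other user's weighted replica."""
    users = range(len(replicas))
    return [[r[i] - sum(weights[j] * replicas[j][i] for j in users if j != k)
             for i in range(len(r))]
            for k in users]


def residue_compensation(r, replicas, weights):
    """PRC: one weighted sum, one common residue, then a per-user add-back."""
    users = range(len(replicas))
    chips = range(len(r))
    # Step 1: weighted-sum chip signal shared by all users.
    weighted_sum = [sum(weights[j] * replicas[j][i] for j in users) for i in chips]
    # Step 2: common residue from a single subtraction per chip.
    residue = [r[i] - weighted_sum[i] for i in chips]
    # Step 3: compensate the residue with each user's own weighted replica.
    return [[residue[i] + weights[k] * replicas[k][i] for i in chips]
            for k in users]


# Toy data: three users, two chips (values are arbitrary).
replicas = [[1 + 1j, -1 - 1j], [1 - 1j, 1 - 1j], [-1 + 1j, 1 - 1j]]
weights = [1.0, 0.5, 0.75]
r = [0.5 + 2j, -1.25 + 0.5j]
a = direct_form(r, replicas, weights)
b = residue_compensation(r, replicas, weights)
```

Both functions return the same interference-cancelled chip signals; only the operation count per chip differs, growing with $K^2$ in the direct form and with $K$ in the residue-compensation form.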
B. Algorithmic Optimization for Low Power VLSI Architecture
Low power consumption is an important factor in lowering system cost [17]-[22]. Average power determines the battery life while peak power affects the reliability. The sources of power consumption in CMOS technology include the switching current (dynamic power), the short-circuit current and the leakage currents. The average power consumption of a CMOS gate due to switching is proportional to the switching activity, the load capacitance, the square of the supply voltage and the clock frequency. This relation suggests many strategies for increasing the energy efficiency at various abstraction levels, from the algorithmic level down to the layout level. The working space for a joint algorithm and SoC architecture optimization is depicted in Fig. 1. On the algorithm side, the focus is the system performance in terms of bit error rate etc. On the architecture side, the focus is the VLSI performance in terms of the real-time cycle number, silicon area and dynamic power dissipation. To convert an algorithm to a real-time architecture, we work at different levels: from the floating-point algorithm to the behavioral model, the bit-true model, the RTL model and the gate-level netlist. To achieve the power saving, we propose a CMV based on the algorithmic feature of the weights to provide a control logic signal that stops the multi-stage NLMS adaptation. We can employ either a central control unit or a set of distributed control units. It is more favorable to use a set of distributed control units since long and fast control signals are eliminated. These distributed control units also become much simpler and faster and consume less power. The control unit is applied to shut down the system and enter an idle state.
C. Stochastic Convergence Masking Vector
Because of the MMSE criterion in the adaptive PRC scheme, the mean squared error converges in the NLMS update recursion. If the weight for the $m$th stage is very close to the initial value of the NLMS algorithm, the MSE has converged and the interference from that particular user has been cancelled out at the $m$th stage. Thus, there is no need to continue the weight update and interference cancellation in the later stages.
To analyze the stochastic feature of the user-specific weights, the normalized optimal weights versus chip index for stages 1, 2 and 3 are depicted. If the weight of a user converges for a group of chips at the $(m-1)$th stage, the weight for this user and symbol also tends to converge at the $m$th and later stages. This makes sense because the weight in the adaptive NLMS PRC is user, symbol and stage specific. The convergence of the weight depends on the confidence level of the correctness of the symbol detection. If the symbol is already detected correctly, then a normalized weight "1" can cancel the interference from this user "completely". There is no need to continue the cancellation for the particular user symbol in later stages.
With more stages, more weights converge to the normalized "1" (in the case of BPSK, many of them converge in the second stage). Thus, as more interference is cancelled and the signal gets cleaner in later stages, the majority of the weights will be close to the normalized weight "1". This is demonstrated by the probability distribution function (PDF) for stages 1, 2, 3 and 4 in Fig. 3. For the later stages, only a small portion of the weights needs to be updated and only a small portion of the interference needs to be cancelled. This also gives a metric to control the hardware utilization. In the first stage, we set the initial weights to $\alpha$ and update the weights according to the NLMS recursion in (8). For each user, we set the CMV entry $V_k^{(1)}$ by comparing the weight with the threshold. $V_k^{(1)} = 1$ means that the $k$th user has converged to a "correct" symbol detection, so there is no need to continue the detection of this symbol for this user in later stages. Otherwise, $V_k^{(1)} = 0$ indicates that the receiver needs to continue the multistage weight update for the $k$th user. We also separate the weighted-sum chip signal in (9) into two terms: the converged term and the not-converged term.
Then the symbols are detected from (12) and (13) with the interference cancelled. For the users whose CMV flag is set, there is no need to detect these symbols at later stages. Otherwise, we initialize and update the weight from the NLMS update equation (8).
Unlike the differential multi-stage implementation for the complete PIC in [3], the proposed weighted PRC guarantees convergence of the MSE. For the differential complete cancellation in [3], if the interference is estimated incorrectly, the mistake is locked in at later stages. In contrast, the simulation results show that large power savings are achieved using the CMV in the proposed adaptive PRC with negligible loss in performance.
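The masking rule can be sketched in a few lines of Python. This is an illustrative model only: it assumes real weights normalized by the initial value `w_init`, a relative threshold such as 0.9 for "90%", and a flag that stays set once a user has converged; `update_cmv` and `active_rate` are hypothetical helper names, and the active rate here is a per-symbol fraction of unmasked users rather than the time-based measure used for the hardware.

```python
def update_cmv(weights, prev_cmv, threshold, w_init=1.0):
    """Return the CMV for the next stage.

    A flag is 1 if the user was already masked, or if its normalized
    weight has reached the threshold in the current stage.
    """
    return [1 if prev or (w / w_init) >= threshold else 0
            for w, prev in zip(weights, prev_cmv)]


def active_rate(cmv):
    """Percentage of users whose NLMS/PRC units are still running."""
    return 100.0 * cmv.count(0) / len(cmv)


# Stage 1: two of four users are close enough to the normalized weight 1.
cmv1 = update_cmv([0.95, 0.5, 1.0, 0.8], [0, 0, 0, 0], threshold=0.9)
# Stage 2: masked users are skipped, so their stale weights are ignored.
cmv2 = update_cmv([0.2, 0.92, 0.3, 0.7], cmv1, threshold=0.9)
rates = (active_rate(cmv1), active_rate(cmv2))
```

A lower threshold masks users earlier and saves more power, at the risk of freezing a weight that had not actually converged.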
IV. Pipelined Multi-stage SoC Architecture
To meet the challenges and reap the rewards of SoC design, engineering teams need a scalable verification solution that addresses all aspects of the design cycle and reduces the verification gap. In this section, we focus on the hardware implementation of the NLMS based adaptive PRC architecture. The following issues will be addressed: design methodology, hardware resource/architecture constraints and system partitioning etc.
A. Catapult C HLS Architecture Scheduling
Functional verification is a critical bottleneck for SoC implementations. We apply an efficient Catapult C HLS methodology [9] from Mentor Graphics to investigate various pipelined architectures and different levels of parallelism. Catapult C provides architecture scheduling to generate efficient RTL on different resource/timing requirements. Configurable parallelism is enabled by assigning the number of FUs according to area/time constraints. The best solution would be the smallest design meeting the real-time requirements. A pipeline controller also generates the control logic for the multi-stage pipelined processing to reduce processing latency.
The power management unit is also responsible for the generation of the clocks, which are supplied to the rest of the design. Clock gating is a commonly used technique to reduce dynamic power dissipation by gating off clock signals to registers, latches and clock regenerators. An example logic block is shown in Fig. 6 . Gating may be done when there is no required activity to be performed by logic whose inputs are driven from a set of storage elements. Since output values from the logic will be ignored, the storage elements feeding the logic can be blocked from updating to prevent irrelevant switching activity in the logic. The "START DET" and "SHUT DOWN" signals are designed in a pattern to serve as the pulse into the T-flip flop to generate an enable signal output. This enable signal "AND"s with the inverse of the CMV for the k th user at the m th stage to generate an enable signal for the clock. It is worth noting that, in order to prevent glitches in the clock network, for each enable signal we must introduce a latch, which contributes an overhead in energy consumption. However, this overhead is negligible compared with the overall system complexity. Consequently, in order to reduce the energy dissipation, a simple circuit detects the occurrence of convergence events for each user at each stage. It also detects the convergence event of the earlier stages. When this occurs, the clocks to the NLMS and the PRC modules are blocked and no further weight calculations are performed.
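A behavioral Python model can make the gating logic concrete. It is a sketch under stated assumptions rather than the circuit of Fig. 6: the T flip-flop is modeled as a toggled boolean, the glitch-preventing latch is assumed to be transparent while the clock is low, and the class and method names are hypothetical.

```python
class GatedClock:
    """Behavioral model of a latch-based clock gate for one user and stage."""

    def __init__(self):
        self.run = False      # T flip-flop state, toggled by START_DET / SHUT_DOWN
        self.latched = False  # output of the glitch-preventing latch

    def pulse(self):
        """A START_DET or SHUT_DOWN pulse toggles the T flip-flop."""
        self.run = not self.run

    def tick(self, clk, cmv_flag):
        """Return the gated clock level for the given clock level and CMV flag."""
        enable = self.run and not cmv_flag   # enable AND (NOT CMV)
        if not clk:
            # The latch is transparent only while the clock is low, so the
            # enable cannot change (and glitch) during the high phase.
            self.latched = enable
        return bool(clk) and self.latched


gate = GatedClock()
trace = [gate.tick(1, 0)]        # not started yet: clock blocked
gate.pulse()                     # START_DET
trace.append(gate.tick(0, 0))    # low phase: enable is latched
trace.append(gate.tick(1, 0))    # high phase: clock passes
trace.append(gate.tick(1, 1))    # CMV set mid-phase: pulse is not chopped
trace.append(gate.tick(0, 1))    # low phase: latch picks up the new enable
trace.append(gate.tick(1, 1))    # converged user: clock blocked
```

The fourth entry of `trace` shows why the latch is needed: a CMV change during the high phase does not truncate the current clock pulse.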
C. System Level Partitioning
Because the transmitter design is relatively simple, we focus on the receiver design architecture.
The loop structures and the intrinsic timing in the algorithm need to be arranged well. FIFOs are applied to balance the processing latency in different chains. The input bit streams for the $K$ users are packed into single-word bit vector buffers $B_0$ and $B_1$, with the bit of user $k$ weighted by $2^{k-1}$, to save storage. The spreading codes for the $K$ users can also be combined to form a code vector ROM $C[i]$.
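The packing of one bit per user into a single word can be sketched as follows. The sketch assumes that the bit of user $k$ (counted from 1) occupies bit position $k-1$ of the word; `pack_bits` and `unpack_bits` are hypothetical names, with the latter playing the role of a de-packing step.

```python
def pack_bits(bits):
    """Pack one bit per user into an integer; user k occupies bit k-1."""
    word = 0
    for k, b in enumerate(bits, start=1):
        word |= (b & 1) << (k - 1)
    return word


def unpack_bits(word, num_users):
    """Recover the per-user bits from a packed word."""
    return [(word >> (k - 1)) & 1 for k in range(1, num_users + 1)]


packed = pack_bits([1, 0, 1, 1])      # users 1, 3 and 4 send a '1'
restored = unpack_bits(packed, 4)
```

The same layout can hold the chip-$i$ code bits of all users, so that one memory read delivers the bits of every user at once.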
D. Pipelined Weight-Updating-Block
The NLMS is a major design bottleneck since it involves divisions and multiplications with feedback structures as (8) . This design block takes the input vector for the chip-based complex NLMS algorithm and computes the optimal weights for all the users in each symbol. Although it is relatively straightforward to synthesize the high-speed architectures for feed-forward-only signal processing structures such as the conventional PIC, it is considerably more difficult to synthesize similar architectures when there is a feedback structure. In the NLMS adaptation, the error of the weighted hard-decision signal is used to adjust the weight coefficients in real time.
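The feedback structure can be seen in a minimal Python sketch of one chip-level update. It uses the textbook complex NLMS recursion, which is assumed here to correspond to (7) and (8); the function name `nlms_step` and the toy values are illustrative. The error of the weighted replica is fed back into every weight, so the next chip cannot be processed before the current update has finished.

```python
def nlms_step(w, omega, r_i, mu):
    """One complex NLMS update for a single chip.

    w     : current weights, one per user
    omega : input vector for this chip (one entry per user)
    r_i   : received chip sample
    mu    : step size
    """
    # Weighted sum of the replicas and the resulting error.
    e = r_i - sum(wk * ok for wk, ok in zip(w, omega))
    # Squared norm of the input vector, used for normalization.
    norm = sum(ok.real ** 2 + ok.imag ** 2 for ok in omega)
    # Feedback: every weight is corrected with the same error term.
    w_new = [wk + (mu / norm) * ok.conjugate() * e for wk, ok in zip(w, omega)]
    return w_new, e


w0 = [1 + 0j, 1 + 0j]
omega = [1 + 1j, -1 + 1j]
r_i = 0.5 * omega[0] + 1.0 * omega[1]     # toy received chip
w1, err = nlms_step(w0, omega, r_i, mu=1.0)
posterior = r_i - sum(wk * ok for wk, ok in zip(w1, omega))
```

With entries of the form ±1 ± j, the squared norm equals twice the number of users, so when that number is a power of two the normalization reduces to a bit shift.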
There are two top-level loop structures L1, L2 corresponding to the equations in (7) and (8).
The L1 loop is the recursive loop for the chip-basis updates in each symbol. L2 updates the weight estimates from registers to memory blocks when one symbol is ready. The loops are mapped to hardware units as shown in the block diagram in Fig. 8. In the L1 loop there are two second-level loops corresponding to the user indices: L1.1 computes the weighted estimate of the received signal based on the current weights, and L1.2 computes the iterative weights for the $K$ users. According to the loop structures for the code index $k$ and chip index $i$, the NLMS block can be partitioned into two major functions: the Weighted-Sum-Function (WSF) as in equation (7) and the Weight-Adaptation-Function (WAF) as in equation (8). In the WSF sub-block, the estimated hard-decision bits are extracted from the bit vectors $B_0$ and $B_1$ by the De-Packing Unit (DPU) block. The $\Omega$ vector is generated from the estimated bits and the spreading code vector $C[i]$ using the same Modulator-Spreader-Unit (MSU) as in the transmitter. This vector is then stored either in memory blocks or register files. In the same loop structure, the Chip-Weighting-Unit (CWU) and Complex-Add-Unit (CAU) generate the weighted sum of the replica as in (7). This replica of the received signal is then subtracted from the received chip samples to form the residual error. The norm of the input vector used in the NLMS normalization is a constant, so the division can be implemented by a right-shift of $\log_2(2K)$ bits. Since the step size $\mu$ does not need to take a very precise value to avoid a loss in performance, we can combine $\mu$ and the norm into one coefficient and right-shift only by $\log_2(K)$ bits.
E. Bit-ware Sumsub-Mux-Unit
A conventional design of the Spreading-Unit (SU) and Chip-Weighting-Unit (CWU) in (7) and (8) utilizes explicit dedicated multipliers for all the involved multiplications. The circuit is shown in Fig. 9 for two users. Each SU has 2 multipliers and each CWU has 4 multipliers. Moreover, there are 2 adders for each CWU, and a pipelined CAU tree layout is required for a fully pipelined summation of $K$ users. The complexity is still rather high, with $6K$ multiplications for the loop. However, the real and imaginary parts of $\Omega(k)$ take values only in $\{\pm 1\}$, so these multiplications do not require general-purpose multipliers.
The actual value of $\Omega(k)$ can be determined from a truth table based on the input bits of the spreading code and the hard-decision bits. By using $\{0, 1\}$ instead of $\{\pm 1\}$ to represent $\Omega(k)$ as well, the logic design reduces to one-bit operations on the uint1 type, where uint1 denotes the unsigned one-bit data type.
The multiplication by $\Omega^{(m-1)}$ with 2-bit values of $\{\pm 1\}$ can then be implemented with MUX circuits controlled by the decoder of $\Omega(k)$ with 1-bit values $\{0, 1\}$. The multiplications in (7) are thereby replaced by sum/subtract and multiplexing operations.
The same structure can be used for the multiplications by $\Omega$ in (8), with the outputs for the different "Sel" signals listed in Table 2.
The WSF and WAF blocks for the NLMS algorithm then can be integrated with these basic design blocks. An example with two SMUw and SMUe engines in parallel is shown in Fig. 11 .
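The idea behind the Sumsub-Mux unit can be illustrated with a small Python model. It is a sketch, not the logic of Fig. 10 or Table 2: the bit mapping 0 → +1 and 1 → −1, the XOR-based select decoder and the names `sel_bits` and `smu` are assumptions made for illustration. The product of a complex weight with a value of the form ±1 ± j is always one of the four combinations of the sum and the difference of its real and imaginary parts, so a 4-way selection replaces the four real multiplications.

```python
def sel_bits(code_bit, b0, b1):
    """Select decoder: with 0 -> +1 and 1 -> -1, a product of signs is an XOR."""
    return code_bit ^ b0, code_bit ^ b1


def smu(wr, wi, sel_re, sel_im):
    """Multiply (wr + j*wi) by (+-1 +-j) using one sum, one difference and a mux."""
    s = wr + wi   # sum path
    d = wr - wi   # difference path
    table = {
        (0, 0): (d, s),     # times ( 1 + j)
        (0, 1): (s, -d),    # times ( 1 - j)
        (1, 0): (-s, d),    # times (-1 + j)
        (1, 1): (-d, -s),   # times (-1 - j)
    }
    return table[(sel_re, sel_im)]


# Exhaustive check against an ordinary complex multiplication.
wr, wi = 3, -5
mismatches = 0
for c in (0, 1):
    for b0 in (0, 1):
        for b1 in (0, 1):
            omega = (1 - 2 * c) * complex(1 - 2 * b0, 1 - 2 * b1)
            expected = complex(wr, wi) * omega
            if complex(*smu(wr, wi, *sel_bits(c, b0, b1))) != expected:
                mismatches += 1
```

The exhaustive loop covers all eight combinations of code bit and data bits; in hardware the table lookup corresponds to multiplexers fed by the sum, the difference and their negations.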
In the WSF function, the "SELdecoder" takes $C[i]$ and $B_0$, $B_1$ to generate the select signals for the SMUw. The SMUw takes its input from the temporary weight memory block. A CAU adds the two portions of paths to get the total weighted-sum chip signal. It is then subtracted from the original received signal to generate the error, which is input to the SMUe module in the WAF block. After multiplication by $\mu_{normed}$, it is adjusted by the weights from the previous iteration and written back to the memory. In this way, each engine acts as a single processor for serial processing of $K/2$ users. Dramatic optimization in the VLSI area and timing closure can be achieved with this design compared to the conventional multiplier-based design, as shown later.
Figure 11: The data path of the SMU-based NLMS architecture using WSF and WAF.
F. Weighted Matched Filter and PRC Architecture
Another major block is the Weighted-Sum-Matched-Filter and Residue-Compensation block denoted by equations (9). Similar to the NLMS block, a symbol-level Sumsub-MUX-Unit for Weighted-Symbol (SMUws) is designed with bit-ware combinational logic to generate $w_s[k]$. In this case, the Weighted-Symbol (WS) SMU is controlled only by the "SelDecoder" triggered by the $B_0$ and $B_1$ vectors. A MUX controlled by the spreading codes and the accumulator forms the equivalent Weight-Matched-Filter-Unit (WMFU). This generates the optimal weighted-sum chip signal $r_{w,opt}$ if the WMFU is accumulating partial results based on the user index $k$. This design module is shown in Fig. 12. Based on this basic design module, the complete data path logic block diagram for the Weighted-Sum-Matched-Filter and Residue-Compensation process denoted by equations (9)-(13) is shown in Fig. 13. This figure shows an example with two parallel Processing Elements (PEs) built from the combinational logic. The $K$ users are split into two groups of $K/2$ users. The users in each group utilize one PE in serial. In each PE, the optimal weights for one symbol are input into the SMUws module to form the weighted symbol $w_s[k]$ and the weighted sum. Notice that in this design, we do not have to use general-purpose multipliers. The very simple combinational-logic bit-level VLSI architecture can achieve a much higher clock rate, as we will show later. This allows more time for the processing of each user and each chip and avoids a fully parallel layout of duplicate hardware design units on the order of the user number $K$. This feature gives more flexibility in designing a configurable VLSI architecture, which is important and challenging for the multi-code CDMA system.
V. Simulation and Emulation Results

A. Floating-point Performance
In Fig. 14, the BER performance versus the number of users at a fixed Signal-to-Noise Ratio (SNR) of 4 dB is shown for stage 2 using random codes. The spreading factor (SF) is 64. For the PPIC case, a set of intuitive weights satisfying $w_{m-1} < w_m$ is simulated. The PPIC starts to outperform the complete PIC once the number of users exceeds 12. The proposed APRC outperforms both the PIC and PPIC significantly. It can be concluded that when the system load is low, the complete PIC works fairly well. However, when the system load is high, PPIC starts to outperform complete PIC. The adaptive PRC, on the other hand, outperforms both the PIC and PPIC over a wide range of system loads. This demonstrates the superior performance of the adaptive PRC algorithm.
B. Fixed-point Implementation and Performance
The VLSI implementation requires fixed-point arithmetic. Reducing the bit width reduces the design size, hardware complexity and power consumption almost linearly. However, the stability of the algorithm and the performance may suffer from excessive finite word length effects due to overflow and quantization noise, unless all signals are scaled properly and sufficient word length is assigned. It is therefore important to find a reduced word length with negligible performance degradation. Because of the multiplications and divisions in the algorithm, overflow should be avoided by scaling the result back to the correct word length. Meanwhile, enough precision should be kept to avoid underflow and divide-by-zero errors. The fixed-point performance for QPSK modulation is shown in Fig. 15 and Fig. 16.
C. Performance and Complexity Tradeoff Using CMV
The BER performance using the convergence-masking vector for dynamic power management is compared with the original APRC algorithm in Fig. 17 and Fig. 18 for different stages and relative thresholds. In both figures, the spreading factor is set to 16 and the SNR is set to 12 dB. Fig. 17 shows the performance for a 10-user system while the number of users in Fig. 18 is 14. For a 10-user system, when the threshold is set to 50% at stage 1, 70% at stage 2 and 90% at stages 3 and 4, the performance drop is negligible. However, if the threshold is below a certain level, e.g., 80% at stage 4 or 40% at stage 2, significant performance degradation is observed when compared with the original PRC scheme. For a 14-user system, the performance degradation is less sensitive to the threshold level because the BER floor of the 14-user system is higher than that of the 10-user system. If the threshold is set to 95%, the BER is almost the same as for the "always-update" adaptive PRC scheme.
D. Active Rate
The active rate of one component, $\zeta_{act} = \tau_{act}/T_{all} \times 100\%$, is the percentage of time during which the component is not shut down through clock gating. Thus, the active rate is an indicator of the power savings for the pipelined VLSI architecture. In Fig. 19, we show the active rate of each stage under different threshold levels. The plain solid curve is the active rate of stage 1, the solid curve with squares is stage 2, the diamond curve is stage 3 and the dotted curve is stage 4. The simulation results in Fig. 17 and Fig. 18 demonstrate that different thresholds can be applied to different stages. If we choose a threshold of 75% for stage 1, 90% for stage 2 and 95% for stages 3 and 4, this leads to a 35% active rate for stage 1, a 10% active rate for stage 2 and roughly a 5% active rate for stage 3. For stage 4, the active rate increases to 5% only if the threshold is set above 96%. The low active rate indicates that large power savings can be achieved over the original design with little or no loss in system performance.
VI. SoC Architecture Design Space Exploration & Synthesis Results
A. Resource Mapping and Architectural Constraints
There are tradeoffs among the speed and size by using different storage hardware. If register files are applied to map the arrays, they can be accessed in parallel in one cycle. However, if the data arrays are mapped to memory block, only one entry can be accessed for a single memory block. Sometimes, the memory access race problem could stall the pipeline and force the design to process in serial. This increases the latency and reduces the processing speed. So register files tend to provide more parallelism. On the other hand, if multiple register files need to share the functional units, MUXs need to be applied in front of each input of the functional units. For a multi-user PIC system, they could be very large MUX with up to N inputs, where N is the spreading factor. In FPGAs, the large MUX could be even larger than the functional units such as the adders themselves. This can be a major contribution to the design size. Size reduction 
B. Scalable Architecture for the NLMS Module
We first explore the different mapping and pipelining options for the NLMS block. Table 3 shows the design space exploration for the resource mapping and architecture constraints of the multiplier-based NLMS weight update block. The abbreviation codes used in the table have the following meaning: "MEM" means that the array variables are stored in memory; "REG" indicates that the arrays are stored in register files. For the loop architecture constraint, "-" denotes no special processing such as pipelining, so the loop remains rolled for serial processing; "P" means that the loop is pipelined and "UR" means that the loop is unrolled. Dedicated ASIC multipliers are used for all the multiplications involved. Table 4 shows the corresponding netlist synthesis specifications. Solution 1 uses memory blocks for both the I/O interface and the local variables. All the loop structures remain rolled with no pipelining. This solution represents the area-constrained design because all the multiplications can reuse multipliers serially without MUXs at the inputs of the multipliers. However, it is also the slowest design, with a latency of 856 cycles, because the RAM bandwidth is limited. When the RAMs are replaced by register files, as in solution 2, the design is somewhat faster, with a latency of 611 cycles, but it is also larger. In solution 3, the loops are pipelined and the design is much faster, with 307 cycles, but this requires more complex control logic and more multipliers for pipelined processing. To meet the 160-cycle latency requirement, register files have to be used everywhere and the two major loops need to be unrolled in parallel, as shown in solution 4. Although its latency is only 147 cycles, the design is rather large, with 91 dedicated ASIC multipliers. Without specialized VLSI circuit design, the algorithm is therefore not well suited for practical implementation.
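The selection step in this exploration can be sketched as a simple filter over the candidate solutions. The snippet below uses only the latencies quoted in the text for the multiplier-based NLMS block; the dictionary layout and the helper names are assumptions for illustration, and the remaining columns of Tables 3 and 4 are not reproduced.

```python
# Latencies (in cycles) of the four multiplier-based NLMS solutions as quoted
# in the text; other columns of Tables 3 and 4 are not reproduced here.
NLMS_LATENCY_CYCLES = {1: 856, 2: 611, 3: 307, 4: 147}
LATENCY_BUDGET_CYCLES = 160


def feasible_solutions(latencies, budget):
    """Return the solution ids whose latency does not exceed the budget."""
    return sorted(sol for sol, cycles in latencies.items() if cycles <= budget)


def speedup(latencies, baseline, candidate):
    """Latency ratio baseline/candidate (values > 1 mean candidate is faster)."""
    return latencies[baseline] / latencies[candidate]


print(feasible_solutions(NLMS_LATENCY_CYCLES, LATENCY_BUDGET_CYCLES))  # [4]
print(round(speedup(NLMS_LATENCY_CYCLES, 1, 4), 2))  # 5.82
```

Only solution 4 passes the 160-cycle filter, which matches the observation that full unrolling with register files is needed for the multiplier-based design.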
In Table 5, the design architectures based on the SMU combinational logic circuit are compared under different resource mapping and architecture constraints. Solution 1 corresponds to the most area-constrained design. However, with a latency of 529 cycles, it cannot meet the real-time requirement. Solution 2 only replaces the memory blocks with register files and has the same computation structure. It is slightly faster, with a latency of 439 cycles, but the MUXs needed for the remaining multipliers lead to a higher CLB count. Solution 3 uses register files for local arrays and memory for the interface. To meet the latency requirement, we unroll the L1.1 and L1.2 loops and design the logic circuits using the SMU units jointly for all 10 users. Pipelining is achieved for the modules built from simple combinational gates. It can be seen that solution 3 meets the timing requirement with 151 cycles at a 59 MHz clock rate. Nine dedicated multipliers are still used for the remaining multiplications. Compared with the fastest design using dedicated multipliers, it achieves a 10× saving in the number of multipliers.
C. Scalable Architecture for the PRC-MFB Module
Tables 6 and 7 present the scalable specifications for the multiplier-based and SMUws-based architectures of the PRC-MFB module, respectively. Solution 1 represents the most compact design in terms of CLB consumption because it uses memory blocks for all arrays and no advanced architecture is applied to the loops at any level. Solution 2 uses register files for all arrays, again with no pipelining. It is slightly faster but also larger. Solution 3 uses partial loop pipelining/unrolling at different loop levels based on the algorithm structure. Solution 4 first unrolls the L1.1 and L1.2 loops and then pipelines L1 with an initiation interval of 2. For the multiplier-based design, although solution 3 is much faster than solutions 1 and 2, it still does not meet the 160-cycle constraint. Only the fully pipelined architecture in solution 4 meets the 160-cycle requirement, with 16 ASIC MULTs. This gives a design with a latency of 35 cycles, which is much faster than necessary.
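The effect of pipelining a loop with a given initiation interval can be approximated with the usual first-order latency model. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the model ignores stalls and memory conflicts, and the numbers are not taken from Tables 6 and 7.

```python
def rolled_latency(trip_count, body_cycles):
    """Latency of a rolled loop: iterations execute strictly one after another."""
    return trip_count * body_cycles


def pipelined_latency(trip_count, body_cycles, initiation_interval):
    """Latency of a pipelined loop: a new iteration starts every
    `initiation_interval` cycles, and the last one needs `body_cycles` to drain."""
    if trip_count <= 0:
        return 0
    return (trip_count - 1) * initiation_interval + body_cycles


# Illustrative numbers only (not taken from Tables 6 and 7).
TRIPS, BODY, II = 10, 4, 2
print(rolled_latency(TRIPS, BODY))         # 40
print(pipelined_latency(TRIPS, BODY, II))  # 22
```

A smaller initiation interval shortens the latency further, at the cost of more functional units running concurrently, which is the area/time tradeoff explored across the solutions.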
VII. Conclusion
In this paper, we propose a novel low-power and low-complexity SoC architecture for multistage adaptive interference cancellation in CDMA systems. The Parallel-Residue-Compensation architecture, which avoids direct interference cancellation, is optimized to remove redundant computations for efficient VLSI design. The CMV is combined with clock gating for dynamic power management of the SoC architecture. Efficient VLSI architectures are designed based on combinational logic circuits to avoid the use of dedicated ASIC multipliers. The SoC design space is explored using a Catapult C HLS design methodology, which leads to an area/time-efficient architecture. The VLSI architectures demonstrate efficient hardware resource usage and significant power savings while meeting the real-time requirements.
