Abstract-In this paper, a digital signal processor (DSP) with a programmable correlator array architecture is presented for third-generation wireless communication systems. The programmable correlator array can be reconfigured as a chip match filter, code group detector, scrambling code detector, or RAKE receiver, with low power consumption taken into consideration. The architecture and instruction set of the proposed DSP are specially designed for several key operations of wireless communications, such as channel estimation for RAKE combining, the Viterbi algorithm, and finite-impulse response filtering. According to the performance evaluation results, the proposed DSP outperforms previously presented communication digital signal processors on several crucial operations of wireless applications. A chip of the proposed DSP was implemented using a hybrid design method in which the timing-critical components were full-custom designed and the other parts were cell-based designed in TSMC 0.35-μm CMOS 1P4M technology. We believe that the proposed system architecture would be useful for upcoming 3G mobile terminal applications.
I. INTRODUCTION

A. Third Generation Wireless Communication System
WIRELESS communications have become increasingly popular in recent years, and the demand for data transmission over the mobile radio interface is also growing rapidly. Some earlier communication standards for speech transmission, such as GSM, have therefore been extended to support data transmission [1]. Even so, most third-generation mobile communication proposals adopt wide-band code division multiple access (WCDMA) [2], [3] in order to support future multimedia transmission, which requires a much higher data rate and wider bandwidth. Fig. 1 shows a brief illustration of the WCDMA system. On the transmitter side, the baseband data is first convolutionally encoded and then spread by a pseudonoise (PN) sequence. The spread signal is transmitted through the RF module. On the receiver side, the signal received by the RF module is first de-spread by correlators and then convolutionally decoded to recover the transmitted data. In this paper, we propose a system architecture that includes a digital signal processor (DSP) with a programmable correlator array for the de-spreading and convolutional decoding of a WCDMA receiver.
B. Correlator Array and DSP
In CDMA systems, correlators are needed to de-spread the received signal [4], [5]. The input data is spread by a PN sequence, and the receiver recovers the original symbols by calculating the correlation between the received signal and the PN sequence. The receiver adjusts the timing offset to search for the maximum correlation value. This process is very time-consuming, so modern CDMA receivers usually use many correlators to perform a parallel search [5]. Furthermore, in order to support higher data rate services, multicode DS-CDMA (MC-CDMA) [6] is suggested in ETSI UMTS [7] and TR45.5 CDMA2000; that is, one user transmits data using several codes. As a result, correlators play an important role in CDMA systems. In addition, to reuse the same correlator array effectively, it is important to design a reconfigurable correlator array architecture that can be used for both code acquisition [8] and code tracking [9]. The reconfigurability issues are discussed in detail in Section II.
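The timing-offset search described above can be illustrated with a short Python sketch. It is only a behavioral model under simplifying assumptions (a hypothetical random ±1 code of 64 chips, a noiseless cyclic delay, and a serial search); the names `correlate` and `search_offset` are illustrative and do not come from the proposed design.

```python
import random

def correlate(rx, code, offset):
    """Correlate the received chips with the code at one timing offset."""
    return sum(rx[(offset + i) % len(rx)] * code[i] for i in range(len(code)))

def search_offset(rx, code):
    """Serial search: try every offset and keep the largest correlation."""
    return max(range(len(rx)), key=lambda off: correlate(rx, code, off))

random.seed(1)
code = [random.choice((1, -1)) for _ in range(64)]  # hypothetical PN sequence
true_offset = 23
# Received signal: the code cyclically delayed by true_offset chips, no noise.
rx = code[-true_offset:] + code[:-true_offset]
best = search_offset(rx, code)
print(best, correlate(rx, code, best))
```

A bank of parallel correlators evaluates several of these offsets at once, which is why the search time falls as the number of correlators grows.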
In addition to reconfigurability, low-power design must also be considered because a correlator array usually occupies a large chip area. Existing methods to reduce the power consumption of a correlator array include using a sign-magnitude number system [5] and using redundant codes to create zero terms [10], [11]. In this paper, we introduce a low-power tri-code correlator based on the latter concept. The programmable correlator array architecture is then built from the tri-code correlator.
In wireless devices, programmable digital signal processors are widely used as baseband signal processing engines to provide the necessary system flexibility and upgradability. In most CDMA systems, the DSP is used to process the symbol-rate data [12]. Because a much higher data rate is required in third-generation wireless communication systems, a more powerful DSP is needed to accelerate symbol-rate data processing and meet the system requirements. For these purposes, we propose a DSP with a specially designed architecture and instruction set for third-generation wireless communications. This paper is organized as follows. The programmable correlator array architecture is presented in Section II. Section III describes the detailed architecture of the proposed DSP. In Section IV, the performance of the proposed DSP is evaluated against several previous designs. The design flow and chip implementation issues of the DSP are discussed in Section V. Finally, a summary concludes this paper.
II. PROGRAMMABLE CORRELATOR ARRAY
A. Tri-Code Correlator
A traditional correlator can only correlate the input signal with a code of +1 or −1. In order to improve programmability and reduce power consumption, the tri-code correlator shown in Fig. 2 is developed.
The LSB of the CODE signal represents the actual code with which the input data is correlated. In the hardware implementation, 0 represents a code of +1 and 1 represents a code of −1. The MSB of the CODE signal gates the clock signal of the accumulator. If the MSB of CODE is set to 1, the correlator behaves as a traditional correlator. Otherwise, the correlator stops correlating the input signal. Thus, this is a tri-code correlator that can correlate with +1, −1, and 0. If two tri-code correlators are connected, with some preprocessing of the two input codes as shown in (1), they form a dual-code correlator that can reduce power consumption [13].
$$C_A = \frac{C_1 + C_2}{2} \quad \text{and} \quad C_B = \frac{C_1 - C_2}{2} \tag{1}$$
In (1), $C_1$ and $C_2$ are the two input codes. After correlation, the outputs of the two correlators are added to obtain the correlation value for code $C_1$ and subtracted to obtain the correlation value for code $C_2$.
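A small Python sketch may help to show why the dual-code arrangement saves accumulator activity. It assumes that the preprocessing forms the half-sum and half-difference of the two ±1 codes, which is one reading consistent with the add and subtract step described above; the function names and the update counter are hypothetical and only model clock gating in software.

```python
def tri_code_correlate(samples, tri_code):
    """Accumulate only where the code is nonzero; a zero models a gated clock."""
    acc, updates = 0, 0
    for x, c in zip(samples, tri_code):
        if c != 0:          # MSB of CODE set: the accumulator is clocked
            acc += x * c
            updates += 1
    return acc, updates

def dual_code_correlate(samples, c1, c2):
    """Correlate with two +/-1 codes using two tri-code correlators."""
    a = [(p + q) // 2 for p, q in zip(c1, c2)]   # assumed preprocessing
    b = [(p - q) // 2 for p, q in zip(c1, c2)]   # values are +1, -1, or 0
    ra, na = tri_code_correlate(samples, a)
    rb, nb = tri_code_correlate(samples, b)
    return ra + rb, ra - rb, na + nb

samples = [3, -1, 4, 1, -5, 9, -2, 6]
c1 = [1, 1, -1, -1, 1, -1, 1, -1]
c2 = [1, -1, -1, 1, 1, 1, -1, -1]
r1, r2, updates = dual_code_correlate(samples, c1, c2)
print(r1, r2, updates)
```

At every chip exactly one of the two preprocessed codes is zero, so the two accumulators together are updated once per chip instead of twice, while the sum and difference still recover both correlation values.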
B. Programmable Correlator Array
According to the Third Generation Partnership Project (3GPP) drafts, the demodulator operates in two phases. The first phase is code acquisition [14], [15] and the second phase is code tracking. During the acquisition phase, three steps are performed in turn. First, a chip match filter is needed to match the primary synchronization channel in order to find the slot position. Next, 17 correlators are needed to find 16 consecutive slot numbers so that the code group information and frame boundary can be obtained by looking up the code group table. Finally, 16 correlators are needed to identify the scrambling code if there are 16 codes in each group. After these three steps, code acquisition is complete and the code-tracking phase can start. During the tracking phase, a simple early-late delay lock loop is used, and a RAKE receiver is also needed to mitigate the multipath effect. If all of the above operations are to be performed on a single architecture, a reconfigurable correlator array that can be programmed as different architectures for different phases or steps must be considered.
Based on the above analysis, a programmable correlator array architecture with low-power consideration is proposed to meet the requirements of the two phases, as shown in Fig. 3, where the TCC in Fig. 3(a) is the tri-code correlator described above. This architecture resembles the correlator array architecture in [16], but the proposed one saves considerable chip area because it does not need the data unit delay used in [16]. According to simulation results from EPIC TimeMill, our architecture reduces power consumption by about 20% compared with [16].
In Fig. 3(b), the CODE signal is 17 by 2 bits wide. The 17 pairs of codes can be fed into 17 correlator bank elements (CBEs) separately through multiplexers. We now show how the architecture can be reconfigured to meet the requirements of the acquisition and tracking phases.
1) Slot Synchronization: During slot synchronization, the primary synchronization code is fed into the correlator bank, and the codes from each CBE are passed to the next-stage CBE. With this approach, the correlator array can substitute for a chip match filter, which would consume a large chip area, although slot synchronization may take more time to finish.
2) Frame Synchronization: During frame synchronization, 17 secondary synchronization codes are needed as the CODE signals for the 17 CBEs. These 17 codes can also be preprocessed using (1) before being fed into the correlator bank in order to reduce power consumption. After correlation, the outputs are sent to the DSP for postprocessing to obtain the actual correlation values.
3) RAKE Receiver: The architecture can also be configured as a RAKE receiver. The scrambling codes with different delays can be combined as the CODE signal. After correlation, the outputs are sent to the DSP for channel estimation and RAKE combining.
III. DSP
The proposed DSP is specially designed for symbol-rate I/Q channel data processing, such as channel estimation, RAKE combining, Viterbi algorithm, and finite-impulse response (FIR) filtering operation that is widely used in communication systems. The design issues for the proposed DSP are discussed in the following paragraphs.
A. WCDMA System Simulation
Before designing the DSP architecture, we first simulate the WCDMA system. From the simulation, several key operations that are well suited to execution on the DSP can be extracted. Fig. 4 shows the proposed complete system diagram of the WCDMA receiver, which contains a correlator array, a digital signal processor, code generators, and a system controller. The controller manages the baseband processing tasks, provides the code seeds to the code generators, and handshakes with the DSP. The code generators receive the code seeds and generate the codes for the correlator array. Chip-rate processing is handled by the correlator array, and symbol-rate processing is handled by the DSP.
After the correlator array architecture is designed, we use the C language to simulate the baseband part of the WCDMA system based on the 3GPP documents. The simulation program contains three major parts: transmitter, channel model, and receiver. The transmitter contains four physical channels: the synchronization channel, the primary common control physical channel, the secondary common control physical channel, and the dedicated physical channel. Each channel is spread by its own channelization code, which preserves orthogonality. After spreading, the channels are summed and multiplied by the scrambling code. The resulting signal, summed with the synchronization codes, is then fed into a pulse shaping filter. The channel model includes the multipath effect, Rayleigh fading, AWGN, and carrier offset. The delay profile suggested in the 3GPP document for four paths is used.
We first perform the floating-point simulation and then the fixed-point simulation. From the simulation results, we find that the error introduced by 6-bit word length precision is acceptable for the correlator output to the DSP. During the simulation, we also find that a few key operations account for a large share of the work in each decoding phase. Based on these results, we extract these key operations and then define special instructions and a processor architecture for them. The details are discussed in the following sections.
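The spreading chain of the simulated transmitter (channelization, summation, scrambling) can be sketched in Python as follows. This is a simplified real-valued model, not the authors' C simulation: it assumes Walsh-Hadamard rows as the orthogonal channelization codes, one symbol per channel, and a hypothetical ±1 scrambling sequence, whereas the 3GPP scrambling code is complex-valued.

```python
def walsh(n):
    """Rows of a Sylvester-Hadamard matrix of size n (a power of two)."""
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-v for v in r] for r in rows]
    return rows

def transmit(symbols, codes, scramble):
    """Spread one symbol per channel, sum the channels, then scramble."""
    sf = len(scramble)
    summed = [sum(s * c[i] for s, c in zip(symbols, codes)) for i in range(sf)]
    return [v * scramble[i] for i, v in enumerate(summed)]

def despread(chips, code, scramble):
    """Descramble, correlate with one channelization code, and normalize."""
    return sum(x * s * c for x, s, c in zip(chips, scramble, code)) // len(code)

all_codes = walsh(8)
codes = all_codes[1:4]                       # three channels, spreading factor 8
symbols = [1, -1, 1]
scramble = [1, -1, -1, 1, -1, 1, 1, -1]      # hypothetical scrambling sequence
chips = transmit(symbols, codes, scramble)
recovered = [despread(chips, c, scramble) for c in codes]
print(recovered)
```

Because the channelization codes are mutually orthogonal, each channel is recovered from the summed signal without interference from the others in this noiseless setting.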
B. DSP Architecture Overview
Fig. 5 shows the block diagram of the proposed DSP. The DSP uses a modified Harvard architecture with two data memories and one program memory, each with a 16-bit address space. The data memories use a 16-bit word length and the program memory uses a 28-bit word length. Each data memory has its own address generator (AG), which has eight index registers and a modulo addressing mode.
The datapath contains four components: the arithmetic logic unit (ALU), the multiply-accumulate unit (MAC), the barrel shifter (SFT), and the comparator (CMP). The inputs of the ALU, the SFT, the accumulator of the MAC, and the CMP are 40 bits wide, and the inputs of the multiplier are 16 bits wide. The output of the datapath can be stored in two 40-bit registers (D0 and D1) or in the two data memories.
There are five pipeline stages in the DSP: instruction fetch, instruction decode, operand read, execution, and write back. One instruction can be executed in one clock cycle.
C. SubWord Parallel (SWP)
According to our simulation results for the WCDMA system, 6-bit word length precision is sufficient for the correlator output. Therefore, a normal 16-bit DSP datapath can be split into two 8-bit data paths for the I channel and the Q channel, respectively. With this subword parallel (SWP) architecture, symbol-rate I/Q channel data processing can be effectively accelerated. The proposed DSP can thus perform two 8-bit operations or one 16-bit operation at a time. In some special cases, it also supports two 16-bit operations when the inputs of the datapath come from the two 40-bit accumulators. The data format is shown in Fig. 6.
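The subword-parallel idea can be illustrated with a minimal Python sketch. It assumes that the I sample occupies the high byte and the Q sample the low byte of a 16-bit word, and that the addition wraps around without saturation; the actual data format is the one defined in Fig. 6, and the helper names are hypothetical.

```python
def pack_iq(i, q):
    """Pack two signed 8-bit values into one 16-bit word (I in the high byte)."""
    return ((i & 0xFF) << 8) | (q & 0xFF)

def to_signed8(v):
    """Interpret the low 8 bits of v as a two's-complement number."""
    v &= 0xFF
    return v - 256 if v & 0x80 else v

def unpack_iq(word):
    """Split a 16-bit word back into its signed I and Q bytes."""
    return to_signed8(word >> 8), to_signed8(word)

def swp_add(a, b):
    """Two independent 8-bit additions; no carry crosses the byte boundary."""
    hi = ((a >> 8) + (b >> 8)) & 0xFF
    lo = ((a & 0xFF) + (b & 0xFF)) & 0xFF
    return (hi << 8) | lo

result = unpack_iq(swp_add(pack_iq(5, -3), pack_iq(-7, -2)))
print(result)
```

The key point is that the carry chain is cut between the two bytes, so one 16-bit datapath pass processes the I and Q samples at the same time.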
D. Instruction Set
The proposed instruction set can be classified into four kinds: computation, data movement, program flow, and special instructions.
1) Computational Instructions:
These instructions contain datapath operations such as ALU, CMP, MAC, and SFT operations. The instruction format can be simplified as in Table I .
In Table I, X and Y denote input operands, which can be internal registers or data memory address pointers. If X or Y is a pointer, the data memory content at the pointed address is read out and fed into the datapath. D denotes the destination of the datapath output, which can also be a register or a data memory address pointer. A and B are also output destinations, but they can only be memory address pointers. The detailed computational instructions are listed in Tables V-VIII of the Appendix.
2) Data Movement Instructions: These instructions handle data movement in the DSP. There are five formats of data movement: memory to register, register to register, register to memory, memory to I/O bus, and I/O bus to memory. The detailed instruction information is listed in Table IX of the Appendix.
3) Program Flow Instructions: These instructions change the program counter value, for example branch, loop, jump, and return. The detailed instruction list is shown in Table X of the Appendix.
4) Special Instructions:
These instructions are specially designed for wireless communications. Each is discussed in detail in Section III-E. The special instruction list is shown in Table XI of the Appendix.
E. Special Instructions for Third Generation Wireless Communications
1) Channel Estimation for RAKE Combining:
The main idea of channel estimation is to use the known pilot symbols to obtain the current mobile channel response. This information can then be used for maximum-ratio combining, which is the essence of the RAKE receiver [2]. Complex arithmetic such as multiplication and multiply-accumulation is therefore needed. On a normal 16-bit DSP, it takes six cycles to complete a complex multiplication, which consists of four multiplications and two addition/subtraction operations. Because the proposed DSP supports both 8-bit and 16-bit data formats, it has four 8-bit multipliers, as shown in Fig. 7. It can therefore perform a complex multiplication or multiply-accumulation (complex MUL/MAC) in one instruction cycle. The results of a complex MUL/MAC are two 16-bit words for the real and imaginary parts. These two words can be stored in the two 16-bit data memories or saved in a 40-bit accumulator. The data flow of the complex MUL/MAC is shown in Fig. 8.
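The role of the complex MUL/MAC in channel estimation and RAKE combining can be sketched in Python as follows. The sketch assumes noiseless reception, integer (re, im) pairs, and hypothetical pilot symbols and channel taps; it only shows the arithmetic (four multiplications and two additions/subtractions per complex MAC), not the instruction itself.

```python
def complex_mac(acc, a, b):
    """acc + a*b on (re, im) pairs: four multiplies, two add/subtract steps."""
    (ar, ai), (br, bi) = a, b
    re = ar * br - ai * bi
    im = ar * bi + ai * br
    return (acc[0] + re, acc[1] + im)

def conj(z):
    """Complex conjugate of an (re, im) pair."""
    return (z[0], -z[1])

def correlate_conj(xs, ys):
    """Accumulate x * conj(y) over paired sequences, one complex MAC per pair."""
    acc = (0, 0)
    for x, y in zip(xs, ys):
        acc = complex_mac(acc, x, conj(y))
    return acc

# Channel estimation: received pilots are the known pilots times the tap h.
pilots = [(1, 1), (1, -1), (-1, 1), (-1, -1)]    # hypothetical pilot symbols
h = (3, -2)                                      # hypothetical channel tap
rx_pilots = [complex_mac((0, 0), h, p) for p in pilots]
h_est = correlate_conj(rx_pilots, pilots)        # proportional to h

# Maximum-ratio combining of two fingers carrying the same data symbol.
taps = [(3, -2), (1, 2)]
symbol = (1, -1)
fingers = [complex_mac((0, 0), t, symbol) for t in taps]
combined = correlate_conj(fingers, taps)         # symbol times sum of |tap|^2
print(h_est, combined)
```

Both steps reduce to the same multiply-by-conjugate-and-accumulate loop, which is why a single-cycle complex MAC benefits channel estimation and combining alike.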
2) Viterbi Algorithm: Convolutional encoding is the most popular error correction coding method in wireless communication systems. Convolutionally encoded data is decoded by applying the Viterbi algorithm to a trellis diagram [17]. There are two steps in the Viterbi decoding process, as discussed in [18]. The first step is the metric update, which uses the add-compare-select (ACS) operation. Because this step is the most time-consuming part of Viterbi decoding, the TI C54x uses special instructions to speed up the ACS operation. Even so, it still needs five instruction cycles to complete one butterfly calculation. We use SWP with a single instruction stream, multiple data stream (SIMD) [19] architecture to accelerate the ACS operation, as shown in Fig. 9. The MAC bypasses the multipliers and behaves as a 24-bit subtractor/adder and a 16-bit adder/subtractor. The ALU behaves as a 24-bit adder/subtractor and a 16-bit subtractor/adder. At the same time, the comparator selects the minimum of the previously calculated path distances and saves it to the data memories. In other words, two ACS operations can be finished in one cycle if the SWP and SIMD features are used together. This is also the reason for separating the CMP from the ALU.
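The butterfly computed by the dual-ACS operation can be written as a short Python sketch. It assumes the common symmetric case in which the two branches of a butterfly carry metrics +bm and −bm, which is one reading of the adder/subtractor pairs described above; the function names are illustrative only.

```python
def acs(path_a, branch_a, path_b, branch_b):
    """Add-compare-select: return (surviving metric, decision bit)."""
    cand_a = path_a + branch_a
    cand_b = path_b + branch_b
    return (cand_a, 0) if cand_a <= cand_b else (cand_b, 1)

def butterfly(pm_lo, pm_hi, bm):
    """Two ACS operations that share one pair of old path metrics.

    Assumes the branch metrics of a butterfly are +bm and -bm, so the four
    candidates are pm_lo + bm, pm_hi - bm, pm_lo - bm, and pm_hi + bm.
    """
    new0 = acs(pm_lo, bm, pm_hi, -bm)
    new1 = acs(pm_lo, -bm, pm_hi, bm)
    return new0, new1

result = butterfly(10, 7, 2)
print(result)
```

The four additions and subtractions are independent, so they can run on parallel adders while the comparator picks the two survivors, and the two decision bits are what a trace back step later consumes.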
The dual-ACS operation requires two memory reads and two memory writes per cycle, which constrains the choice of data memory type: at least one 2R2W-port memory or two 1R1W-port memories are needed. According to [20], a four-port SRAM is about three to four times larger than two two-port SRAMs. Thus, two 1R1W-port memories are preferred. Given the indexing of the metric data shown in Fig. 10(a), the exclusion graph on the left of Fig. 10(b) is drawn under the assumption that only two 1R1W-port data memories are available. Following this exclusion graph, the metric data must be arranged as shown on the right-hand side of Fig. 10(b). In this way, the data flow is smooth for each trellis stage. Without this arrangement, the design would need either one 2R2W-port SRAM as in [21], which consumes much more area than two 1R1W-port SRAMs, or two cycles to write back the new metric data, as the TI C54x does.
During the ACS operation, the two flags produced by the CMP must be stored in a transition table for the later trace back operation. The two result bits are therefore stored in a 16-bit shift register that shifts two bits per cycle. Once the shift register is full, its content is moved to the transition table in data memory.
The next step of the Viterbi algorithm is trace back. Here we follow the architecture in [21] and support constraint lengths from 5 to 9. As shown in Fig. 11, the middle shift register is configurable from four bits to eight bits in order to support constraint lengths from 5 to 9, and a special instruction called "CFGTRC" is executed to configure it before the trace back instruction "TRCBK" is run. The four most significant bits of this shift register are used as the four least significant bits of the data memory address when the constraint length equals 9. The four least significant bits of this shift register are used as the shift amount for the barrel shifter in order to shift the transition data right to the correct state. As a result, the least significant bit of the barrel shifter output is the decoded data, which is stored in a 16-bit shift register.
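The interplay between the stored decision bits and the trace back step can be illustrated with a small, self-contained Python decoder. It is a software sketch under stated assumptions, not a model of the CFGTRC/TRCBK hardware: it uses a hypothetical constraint-length-3 code with generators 7 and 5 (octal), hard-decision Hamming metrics, and a state convention in which the newest input bit is the LSB of the state.

```python
K = 3                          # constraint length (small, for illustration)
GENS = (0b111, 0b101)          # assumed rate-1/2 generator polynomials
NSTATES = 1 << (K - 1)

def parity(v):
    """Parity (XOR of all bits) of a non-negative integer."""
    return bin(v).count("1") & 1

def encode(bits):
    """Rate-1/2 convolutional encoder; the newest bit is the LSB of the state."""
    state, out = 0, []
    for u in bits:
        reg = (state << 1) | u
        out.append(tuple(parity(reg & g) for g in GENS))
        state = reg & (NSTATES - 1)
    return out

def metric_update(received):
    """ACS pass: returns the final path metrics and per-stage decision bits."""
    inf = 10 ** 9
    pm = [0] + [inf] * (NSTATES - 1)           # start in state 0
    decisions = []
    for rx in received:
        new_pm, dec = [], []
        for n in range(NSTATES):
            cands = []
            for d in (0, 1):                   # d selects the predecessor
                prev = (n >> 1) | (d << (K - 2))
                reg = (prev << 1) | (n & 1)
                expected = tuple(parity(reg & g) for g in GENS)
                dist = sum(a != b for a, b in zip(rx, expected))
                cands.append(pm[prev] + dist)
            d = 0 if cands[0] <= cands[1] else 1
            new_pm.append(cands[d])
            dec.append(d)
        pm = new_pm
        decisions.append(dec)
    return pm, decisions

def trace_back(decisions, state):
    """Walk the decision table backwards from the given final state."""
    bits = []
    for dec in reversed(decisions):
        bits.append(state & 1)                 # input bit that led to this state
        state = (state >> 1) | (dec[state] << (K - 2))
    return bits[::-1]

msg = [1, 0, 1, 1, 0, 0, 1, 0]
coded = encode(msg + [0] * (K - 1))            # tail bits return to state 0
coded[3] = (coded[3][0] ^ 1, coded[3][1])      # inject one channel bit error
final_pm, table = metric_update(coded)
decoded = trace_back(table, 0)[:len(msg)]
print(decoded, final_pm[0])
```

In each trace back step the current state selects one decision bit from the stored table and that bit is shifted into the state register, which mirrors the shift-register and barrel-shifter mechanism described above, although the bit ordering in the actual hardware may differ.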
3) FIR Filtering: FIR filtering is a widely used operation in many communication systems [22]. Since the proposed MAC architecture supports 8-bit data format operations, it can easily speed up FIR filtering by a factor of two by activating the left two 8-bit multipliers, the 32-bit adder below the cross-bar, and the last-stage 40-bit adder in the MAC. The data flow of this dual FIR operation is shown in Fig. 12, and the memory arrangement of the input data and coefficients is shown in Fig. 13. The input data sequence is stored in one data memory in packed form, and the filter coefficients are stored in the other data memory, also in packed form. Fig. 14 shows example assembly code for a 33-tap FIR filter. Although there is some overhead for initializing the addressing registers (addresses 0-10), it can be ignored if the input data stored in data memory 0 (addressed by Am0) is large enough.
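The dual FIR data flow can be modeled with a short Python sketch. It computes a single output sample as a dot product of packed data and packed coefficients (the coefficients are assumed to be stored already time-reversed), with the first value of each pair in the high byte and zero padding for an odd tap count; these packing details are assumptions, since the actual arrangement is the one given in Fig. 13.

```python
def s8(v):
    """Interpret the low 8 bits of v as a signed two's-complement value."""
    v &= 0xFF
    return v - 256 if v & 0x80 else v

def pack(values):
    """Pack pairs of signed 8-bit values into 16-bit words (first in high byte)."""
    if len(values) % 2:
        values = values + [0]                  # pad an odd-length sequence
    return [((a & 0xFF) << 8) | (b & 0xFF)
            for a, b in zip(values[0::2], values[1::2])]

def dual_mac(acc, xw, hw):
    """One cycle: two 8-bit products are summed and added to the accumulator."""
    return acc + s8(xw >> 8) * s8(hw >> 8) + s8(xw) * s8(hw)

def fir_output(x_words, h_words):
    """One FIR output sample from packed data and packed coefficients."""
    acc = 0
    for xw, hw in zip(x_words, h_words):
        acc = dual_mac(acc, xw, hw)
    return acc

x = list(range(-16, 17))                       # 33 hypothetical input samples
h = [((7 * k) % 11) - 5 for k in range(33)]    # 33 hypothetical coefficients
y = fir_output(pack(x), pack(h))
cycles = len(pack(h))
print(y, cycles)
```

With two taps consumed per word, a 33-tap output needs 17 multiply-accumulate steps in this model instead of 33.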
F. Other Features
During the slot synchronization or frame synchronization procedure, we need to compare the sum of the squares of the I channel and Q channel data at different time delays. Thus the operation $I^2 + Q^2$ is needed. The proposed MAC architecture supports this operation easily by modifying the input data of the FIR instruction. The same data, whose format is shown in Fig. 4, is fed into both input ports of the MAC, and the control signals of the MAC are the same as those of the FIR instruction.
The proposed DSP has five interrupt vectors that are useful to get data from correlator array. Besides, the DSP has a general input-output port that can be used to configure the correlator array.
Zero-overhead looping is also supported by the proposed DSP. The "DO #addr" instruction specifies the loop ending program address and sets the address following this instruction as the loop starting address. The program then executes between the loop starting address and the loop ending address. Before this instruction is used, the 16-bit register "COUNT" must be loaded with the number of loop iterations. Every time the program counter reaches the ending address, the COUNT value is automatically decremented by one until it becomes zero; at the same time, the program sequencer sets the next program address to the loop starting address. Thus, no extra cycles are wasted while executing loop code.
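The sequencing behavior of the zero-overhead loop can be illustrated with a small Python model. The exact exit condition is an assumption: here the loop body runs COUNT times and execution falls through once COUNT reaches zero. The opcodes other than DO are placeholders and do not refer to the actual instruction set.

```python
def run(program, count):
    """Hypothetical sequencer model; program is a list of (op, arg) pairs."""
    pc, cycles, trace = 0, 0, []
    loop_start = loop_end = None
    while pc < len(program):
        op, arg = program[pc]
        cycles += 1                       # every instruction takes one cycle
        next_pc = pc + 1
        if op == "DO":
            loop_start, loop_end = pc + 1, arg
        else:
            trace.append(op)
        if pc == loop_end:                # checked by the sequencer, not by code
            count -= 1
            if count > 0:
                next_pc = loop_start      # jump back without a branch instruction
        pc = next_pc
    return trace, cycles

program = [("DO", 2), ("MAC", None), ("ADD", None), ("STORE", None)]
trace, cycles = run(program, 3)
print(trace, cycles)
```

The cycle count equals one DO, three passes over the two-instruction body, and one trailing instruction, so the loop control itself costs nothing beyond the initial DO.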
IV. PERFORMANCE EVALUATIONS
The proposed DSP has been compared with the TI C54x [23] and C55 [24], LODE from Atmel [25], and MDSP-II [26], which are also designed for communication applications. Table II summarizes the performance of these DSPs in terms of convolutional decoding, FIR filtering, and complex arithmetic.
The first three rows cover convolutional decoding for GSM, IS-95, and 3G WCDMA, respectively. Since convolutionally encoded data is decoded using the Viterbi algorithm, as mentioned before, we calculate the processor MIPS required for the Viterbi algorithm in each communication standard. The number of instruction cycles required to process a single bit with the Viterbi algorithm, including the ACS and trace back operations, is calculated first. This figure is then multiplied by the bit rate supported in each standard, such as 9.6 kbit/s (kbps) in GSM and 384 kbps in WCDMA, to obtain the total required processor MIPS. The fourth row uses the N-tap FIR filtering operation to measure the instruction cycles required on each DSP. The fifth row compares the instruction cycles required for one complex MUL/MAC operation on each DSP.
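The MIPS calculation described above can be expressed as a short Python sketch. The numbers it produces are illustrative and are not the entries of Table II: the constraint length of 9 for WCDMA and the trace back cost per bit (set to zero here) are assumptions, and the function name is hypothetical.

```python
def viterbi_mips(constraint_length, bit_rate_bps, acs_per_cycle,
                 traceback_cycles_per_bit=0.0):
    """Estimate the required MIPS as cycles per decoded bit times bit rate."""
    states = 2 ** (constraint_length - 1)      # one ACS per state per bit
    cycles_per_bit = states / acs_per_cycle + traceback_cycles_per_bit
    return cycles_per_bit * bit_rate_bps / 1e6

# Assumed example: K = 9, 384 kbps, two ACS operations per cycle.
dual_acs = viterbi_mips(9, 384_000, acs_per_cycle=2)
single_acs = viterbi_mips(9, 384_000, acs_per_cycle=1)
print(dual_acs, single_acs)
```

Under these assumptions the metric update alone needs 128 cycles per bit with dual ACS, so the estimate scales linearly with both the number of states and the bit rate.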
From Table II, we find that the proposed DSP achieves about four to eight times the performance of the other DSPs in convolutional decoding and two to four times the performance in complex arithmetic. It also matches the performance of DSPs that have two MAC units when performing the FIR filtering operation.
Because the proposed MAC uses the SWP architecture with four 8-bit multipliers, it is slightly slower than a normal MAC that contains only one 16-bit multiplier. According to our simulation results, the proposed MAC is about 14% slower than a normal MAC. The performance comparison can therefore be normalized, as shown in Table III.
V. DESIGN FLOW AND CHIP IMPLEMENTATION
As previously discussed, the WCDMA system simulation is carried out first. From the simulation, several key operations that are well suited to execution on the digital signal processor are extracted. We then target these operations when defining the instruction set and designing the architecture of the DSP so that they are accelerated efficiently. The complete design flow is shown in Fig. 15. The design issues of system simulation, key operation extraction, and instruction set definition have been discussed in Section III, so we focus on the remaining design issues in Sections V-A to V-D.
A. Hardware C Simulation
After the instruction set is defined, a cycle-accurate simulator written in the C language (hardware C) is developed to make sure the instruction set works as expected. Developing the simulator also provides some timing information for the proposed DSP architecture. Any problem in the architecture can be found early by running test programs on the simulator and can be fixed easily. Without this simulator, the total design time would increase considerably if the DSP architecture had to be modified during HDL development, because design bugs are easily introduced during HDL design.
The simulator interface is shown in Fig. 16. Users can load a DSP program file or write a DSP program directly on the right-hand side of the simulator. The execution results can be exported to a file, and the contents of some internal registers and memories can be displayed directly in the simulator. As shown in the upper-right corner of Fig. 16, the simulator can run for a set number of cycles or run to a desired program address. It is thus very useful for developing and debugging application programs.
Besides, since the simulator is written in hardware C, it can be mapped to HDL design directly. The relation of hardware C and hardware pipeline is shown in Fig. 17 , which shows the last stage in hardware pipeline should be executed first in C program. 
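The reason for evaluating the last pipeline stage first can be seen in a minimal Python sketch of a five-stage pipeline model. The sketch is written in Python rather than hardware C for brevity, uses the stage names of Section III-B, and models only the movement of instructions through the pipeline registers; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "OR", "EX", "WB"]

def step(latches, fetch):
    """Advance one clock; latches[k] holds the instruction in stage k."""
    # Update the last stage first so that each stage reads its predecessor's
    # old value before that predecessor is overwritten in the same cycle.
    for k in range(len(latches) - 1, 0, -1):
        latches[k] = latches[k - 1]
    latches[0] = fetch
    return latches[-1]                 # instruction completing write back

def simulate(program):
    """Return (instruction, cycle in which it completes write back) pairs."""
    latches = [None] * len(STAGES)
    done, cycles = [], 0
    # Append bubbles so that the last instruction can drain from the pipeline.
    for instr in list(program) + [None] * (len(STAGES) - 1):
        finished = step(latches, instr)
        cycles += 1
        if finished is not None:
            done.append((finished, cycles))
    return done

result = simulate(["a", "b", "c"])
print(result)
```

If the stages were updated in program order instead, the newly fetched instruction would propagate through all five latches within a single simulated cycle, which does not match the behavior of clocked hardware registers.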
B. Cell-Based Design
After the instruction set is verified, we translate the hardware C into an HDL design. Synopsys Design Compiler is then used to synthesize the gate-level HDL with a 0.35-μm cell library as the synthesis library. The synthesis report shows that the maximum clock frequency is up to 40 MHz and that the critical path is in the MAC unit.
C. Full-Custom Design
After synthesis, the timing-critical multipliers and adders in the DSP datapath are manually designed using a full-custom design flow, while the other parts remain in the cell-based design flow. With this approach, the performance of the DSP is improved by about 50% compared with the pure cell-based version.
D. Chip Implementation
After the full-custom portion is completed, automatic P&R is performed, followed by layout verification (DRC and LVS). Finally, we use EPIC TimeMill and PowerMill for postlayout simulation to obtain the chip features. Table IV summarizes the chip features of the DSP. The chip layout is shown in Fig. 18.
TABLE V ALU INSTRUCTION LIST
TABLE VI CMP INSTRUCTION LIST
TABLE VII MAC INSTRUCTION LIST
TABLE VIII SFT INSTRUCTION LIST
VI. CONCLUSION
In this paper, an enhanced DSP with a programmable correlator array architecture is proposed for third-generation wireless communication systems. The low-power programmable correlator array can be reconfigured as a chip match filter, code group detector, scrambling code detector, or RAKE receiver. The proposed DSP is designed with several special instructions to accelerate symbol-rate data processing operations such as channel estimation, the Viterbi algorithm, and FIR filtering. Since the processor MIPS needed for these key operations are effectively reduced, we believe that the proposed architecture would be useful in upcoming 3G systems, which support much higher data rates than 2G systems.
APPENDIX
See Tables V-XI.
TABLE IX DATA MOVEMENT INSTRUCTION LIST
TABLE X PROGRAM FLOW INSTRUCTION LIST
TABLE XI SPECIAL INSTRUCTION LIST
