This paper describes a 16-bit programmable fixed-point digital signal processor, called MDSP-II, for mobile communication applications. The instruction set of MDSP-II was determined after a careful analysis of the GSM (Global System for Mobile communications) baseband functions. An application-specific hardware block called MCA (Mobile Communication Accelerator) was incorporated on-chip to accelerate the execution of the key operations frequently appearing in Viterbi equalization.
I. Introduction
Programmable digital signal processors (DSP's) have been widely used to reduce the development cost and the time-to-market in many applications [1]. In the case of mobile communication systems such as GSM (Global System for Mobile communications) and IS-136, however, DSP's have only been applied to specific areas such as the speech codec because of their limited performance. To overcome this limitation, a number of application-specific DSP's have been developed. For example, Lucent's DSP1618 performs Viterbi decoding using a co-processor [2], which supports various decoding modes with control registers at the cost of chip area. On the other hand, the TMS320C54x supports specific instructions for Viterbi decoding [3], which makes it very popular for mobile communication. However, since it has only one multiplier, it is difficult to handle the multiplication and accumulation (MAC) of complex numbers, which is quite often used in such applications as channel equalization. Recently, several DSP's that can support two MAC operations per cycle have been developed. For example, the VLIW (Very Long Instruction Word)-based TMS320C6x, which lacks a dedicated MAC unit but has two multipliers and six arithmetic units, performs MAC operations by using separate multiply and add instructions [4]. Also, the dual MAC unit in Lucent's DSP16000 performs two multiplications and two accumulations in one cycle, and supports instructions for Viterbi decoding [5].
This paper describes MDSP-II, a 16-bit DSP with a special functional block called MCA (Mobile Communication Accelerator), developed using the MetaCore framework [6], which is a design framework for generating application-specific instruction set processors (ASIP's) for DSP applications, along with the generator set for software tools such as a compiler, an assembler, and an instruction set simulator. The design of MDSP-II started with a basic hardware architecture and a basic instruction set provided by the MetaCore DSP design framework. The hardware architecture and the instruction set of MDSP-II have been customized for mobile communication applications, including the application-specific functional block called MCA, the details of which can be found in [6].
MDSP-II was designed to perform several crucial mobile communication functions faster and with less processing power, through the adoption of MCA, a special functional block, and an instruction set optimized for mobile communication applications. The remaining processing power can therefore be used for other features such as echo cancellation and speech recognition for voice dialing. Even the complex half-rate vocoder function, which enhances channel utilization, can be implemented with the remaining processing power. At the very least, the remaining processing power helps to reduce power consumption, making MDSP-II more competitive for portable applications, simply by performing the same functionality at a reduced clock rate.
In this paper, we first analyze the key algorithms in mobile communication applications that require many clock cycles, and then extract the frequent and time-consuming operations in order to effectively reduce the total number of clock cycles required. The architectures of MDSP-II and MCA are described along with the key operations supported. Finally, experimental results on the proposed accelerator are shown, together with the overall performance improvement of MDSP-II in GSM, one of the major mobile communication applications.
GSM is the European standard for digital cellular systems. We selected GSM as a major application for analysis and benchmarking because it is one of the most popular mobile communication standards in the world and many of its algorithms are similar to those of other standards. The baseband processing blocks of GSM are shown as shaded in Fig. 1, where functional blocks of relatively low complexity, such as interleaving/deinterleaving and enciphering/deciphering, are not included. As a reference for estimating the number of clock cycles necessary for each block, we chose MDSP-II without MCA, because the performance of MDSP-II without MCA is similar to that of the TMS320C5x, a widely used fixed-point DSP. For both voice coding and voice decoding, a regular pulse excitation long term predictor (RPE-LTP)-based voice coder was employed. In the channel decoding, the Viterbi algorithm was used to decode a convolutional code. For equalization, which fights the inter-symbol interference due to multi-path fading, the Viterbi equalization algorithm based on maximum likelihood sequence estimation (MLSE) was used. Fig. 2 shows the required processing power of each component of the GSM baseband functions. Most general-purpose DSP's cannot support all the baseband functions, as the total required processing power for GSM itself is 53 million instructions per second (MIPS).
Furthermore, future systems will demand even more processing power, aggravating the already tight power budget. Fig. 2 shows that nearly 80% of the total processing power (42 MIPS out of the total 53 MIPS) is consumed just for equalization. Therefore, it is important to accelerate the execution of equalization for performance improvement. The detailed behavior of the Viterbi equalization, the required processing power analysis, and three key operations are explained in the following section.
II. Key Operations of Mobile Communication
A. Viterbi Equalization for GSM
[...] states at time k-1. For each state transition, there is one possible input vector and the corresponding channel output reference value. At time k, the difference between the k-th received signal value and the reference value is called the branch metric, and its accumulation along a path is called the path metric. After obtaining the branch metrics, the Viterbi algorithm computes the path metrics of the two paths entering each state and selects the path with the smaller path metric, called the survivor. Finally, the transmitted sequence is estimated by tracing back the survivors. A comprehensive tutorial on the Viterbi equalization is given in [7].
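The steps described above (branch metric, path metric, survivor selection, and trace-back) can be illustrated with a minimal software sketch. It is not the GSM equalizer: it assumes real-valued symbols of +1 or -1, a real-valued channel with at least two taps, a squared-difference branch metric, and an all -1 starting state. The function name `viterbi_equalize` and its parameters are hypothetical and chosen only for illustration.

```python
from itertools import product


def viterbi_equalize(received, taps):
    """Toy MLSE of a +/-1 sequence sent through a real FIR channel.

    Assumptions (not from the paper): real signals, len(taps) >= 2,
    and the channel memory initially holds -1 symbols.
    """
    memory = len(taps) - 1
    # A state lists the most recent `memory` symbols, newest first.
    states = list(product((-1, 1), repeat=memory))
    start = (-1,) * memory
    metric = {s: (0.0 if s == start else float("inf")) for s in states}
    history = []  # per time step: surviving predecessor of every state

    for r in received:
        new_metric, pred = {}, {}
        for s in states:
            best = None
            # Two paths enter state s; they differ in the oldest symbol.
            for oldest in (-1, 1):
                prev = s[1:] + (oldest,)
                window = (s[0],) + prev  # newest symbol first
                reference = sum(h * x for h, x in zip(taps, window))
                branch = (r - reference) ** 2       # branch metric
                candidate = metric[prev] + branch   # path metric
                if best is None or candidate < best:
                    best, pred[s] = candidate, prev  # keep the survivor
            new_metric[s] = best
        metric = new_metric
        history.append(pred)

    # Trace back the survivors from the best final state.
    state = min(metric, key=metric.get)
    decided = []
    for pred in reversed(history):
        decided.append(state[0])
        state = pred[state]
    return decided[::-1]
```

The inner loop over the two entering paths is the add-compare-select step, and the branch metric line is where a complex-valued equalizer would use the square distance instead.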
B. Computational Considerations
Again referring to Fig. 3, the received signal is complex, consisting of in-phase and quadrature components. As a consequence, complex multiplication and accumulation (complex MAC) is frequently used for channel estimation and reference generation. Each complex MAC needs 6 clock cycles in general-purpose DSP's, and about 9 MIPS is necessary for the complex MAC operations in GSM. The square distance calculation for obtaining the branch metric also requires heavy computation: one such operation needs 10 clock cycles in general-purpose DSP's, and a total of 20 MIPS is spent on the distance calculation in GSM. The ACS (add-compare-select) is a key operation of the Viterbi algorithm, which is used for both equalization and channel decoding. In GSM, the ACS operations for equalization and decoding need about 10 MIPS and 3 MIPS, respectively.
Based on the above analysis of the required processing power in GSM applications, three key operations (complex MAC, square distance calculation, and ACS) are identified for acceleration using application-specific hardware in the design of MDSP-II. Details of the architectures of MDSP-II and MCA are explained in the following section.
III. MDSP-II
A. Architecture of MDSP-II
[...] and instructions are stored in the program memory (PMEM). The heart of the architecture consists of three execution units operating in parallel, i.e., a fixed-point data processing unit (DPU), a memory address generation unit (AGU), and a program control unit (PCU). A peripheral unit (PU) is also included for communication with external devices. A brief explanation of each unit is as follows.
DPU, which consists of various basic hardware units, has been designed to optimize the time-critical inner-loop functions of most DSP algorithms. With the help of the MAC unit, the MCA unit performs such operations for the Viterbi equalization as complex MAC, distance calculation, ACS operation, and trace-back operation. The details of these operations will be explained in the following section.
AGU performs the effective address calculations necessary for fetching operands in memory. It can generate two addresses and modify them in one clock cycle, simultaneously with the operation of DPU. AGU contains eight address registers (ar0, ar1, ..., ar7), each of which can be controlled independently to support the three addressing modes generally used in DSP algorithms, i.e., linear, modular, and bit-reverse modes.
PCU supports zero-overhead loop control using hardware-DO as well as branching, subroutine control, and exception handling using a program counter (PC) and an instruction register (IR).
PU has two serial ports (SIO0 and SIO1), a timer (TIMER), and an interrupt controller (INT).
Fig. 5 shows three instruction formats of MDSP-II with the corresponding examples. INST denotes the mnemonic of an instruction, and ACC denotes the accumulator (a0 or a1), while S1 and S2 denote either a memory operand or an accumulator. When used as a memory operand, S1 or S2 must specify both the address pointer (e.g., ar0) and its update rule (e.g., ++ for auto-increment). The optional part (+/-) is used to discriminate between MAC+ (multiply and add) and MAC- (multiply and subtract) instructions. If ACC is not specified as the destination, the relevant special register is used as the destination.
The first example shows the usage of the ADD instruction, which is common to arithmetic and logic instructions. The MAC+ instruction performs two operand fetches from the memory locations pointed to by address registers ar0 and ar1, followed by two address register updates (ar0=ar0+1, ar1=ar1+1) and two arithmetic operations (multiply and add). HDIS (Hamming distance calculation) shows an example where the destination is implicitly specified.
[Fig. 5. Instruction formats of MDSP-II with examples]
[...] ;; special register = S1 op S2
[Example]
ADD a1, *ar0 ;; a1 = a1 + (*ar0)
MAC+ a0, *ar0++, *ar1++ ;; a0 = a0 + (*ar0++ x *ar1++)
HDIS *ar0++, *ar1++ ;; Hamming distance between *ar0++ and *ar1++ is stored in a special register
B. MAC Unit and MCA Unit
[...] The MCA unit executes the instructions for complex MAC, square distance calculation, and ACS of the Viterbi algorithm, with the help of the MAC unit. The MCA unit consists of a 16x16 multiplier, a 36-bit adder, a Hamming-distance unit, a queue, a product register (CPR), two operand latches (CS1, CS2), and three 36-bit accumulators (CACC0, CACC1, CACC2).
IV. Accelerated Key Operations
In this section, the behavior of each key operation is described in detail.
A. Complex MAC
Since the MAC operation occurs very often in most DSP applications, every DSP has a MAC unit. In the equalization of GSM, the received signal is a complex number composed of in-phase and quadrature terms. However, complex MAC, which is necessary in the Viterbi equalization, cannot be efficiently handled in a traditional MAC unit due to the difficulty of maintaining both the real part and the imaginary part. In MDSP-II, complex MAC is performed by assigning the MAC unit to calculate the real-part value and the MCA unit to calculate the imaginary-part value.
The complex MAC operation is given by Eq. 1 [...]. One complex MAC corresponds to evaluating the real and imaginary parts of the i-th entry and adding them to, or subtracting them from, the corresponding accumulator. Therefore, one complex MAC is composed of eight data loads, four multiplications, and four accumulations. Compared with general-purpose DSP's with only one MAC unit, where at least four clock cycles are necessary for executing one complex MAC operation, MDSP-II executes one complex MAC operation in two clock cycles by exploiting the MCA unit as well as the MAC unit. Fig. 7 shows the whole datapath for the complex MAC operation, consisting of the MAC unit and the MCA unit. The real part is calculated in the MAC unit and its result is accumulated in the first accumulator ACC0, whereas the imaginary part is calculated in the MCA unit with its result accumulated in CACC0.
Even though there are eight data loads in Eq. 1, four duplicate data loads can be eliminated if the two latches (CS1, CS2) are properly controlled to reuse the previously latched values.
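The split between the real-part and imaginary-part accumulations can be sketched in software as follows. This is only an illustration of the arithmetic, not of the datapath or its timing: the operand names a, b, c, d and the product form (a + jb)(c + jd) are assumptions, since the operand layout of the original equation is not reproduced here, and the function name `complex_mac` is hypothetical.

```python
def complex_mac(xs, ys):
    """Accumulate the sum of (a + jb) * (c + jd) using only real arithmetic.

    xs and ys are sequences of (real, imaginary) pairs. Each entry costs
    four multiplications and four accumulations, as counted in the text.
    """
    acc_re = 0  # real part; plays the role of ACC0 in the MAC unit
    acc_im = 0  # imaginary part; plays the role of CACC0 in the MCA unit
    for (a, b), (c, d) in zip(xs, ys):
        acc_re += a * c  # multiply and add
        acc_re -= b * d  # multiply and subtract
        acc_im += a * d  # multiply and add
        acc_im += b * c  # multiply and add
    return acc_re, acc_im
```

Each entry uses every operand twice (a and b against both c and d), which is why latching the operands can remove four of the eight loads.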
The behavior of complex MAC is further explained with the program and data structure shown in Fig. 8. The program for complex MAC is just a repeating sequence of 'CMAC1 *ar0++, *ar1++' and 'CMAC2 *ar0++, *ar1++'. The INIM instruction must be executed beforehand to initialize all the MCA-related registers. The behavior of the complex MAC operation can be easily understood with Table I, which is produced based on the behavior of the CMAC1 and CMAC2 instructions shown in Fig. 9.
B. Distance Calculation for the Viterbi Algorithm
There are two types of distances that are evaluated in the Viterbi algorithm of GSM. One is the square distance for equalization and the other is the Hamming distance for channel decoding.
B.1 Square Distance for Equalization
[...] A general-purpose DSP needs about 10 instructions for this calculation. However, MDSP-II can calculate one distance every two cycles using the datapath shown in Fig. 10. To calculate one square distance, two one-cycle instructions, i.e., SDIS1 and SDIS2, need to be executed in sequence, where the behavior of each instruction is described in Fig. 11.
The behavior of the square distance calculation is as follows. After the difference, a-c, is obtained by the ALU in the MAC unit and squared by the multiplier in the MCA unit, the squared value, (a-c)^2, [...]
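As a point of reference, the square distance between a complex received value and a complex reference value can be written as the sum of two squared differences. The sketch below assumes the received value is (a, b) and the reference value is (c, d); only the difference a-c is named in the text, so b and d, as well as the function name `square_distance`, are hypothetical.

```python
def square_distance(received, reference):
    """Squared Euclidean distance between two (in-phase, quadrature) pairs."""
    a, b = received
    c, d = reference
    diff_i = a - c  # in-phase difference
    diff_q = b - d  # quadrature difference (assumed symbol names)
    # Each difference is squared, and the two squares are accumulated.
    return diff_i * diff_i + diff_q * diff_q
```

The two subtract-and-square steps suggest why the operation maps naturally onto a pair of instructions, although the exact division of work between them is not specified in this sketch.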
B.2 Hamming Distance for Channel Decoding
The Hamming distance of two integer numbers is used as the metric in the convolutional decoder. The Hamming distance can be obtained simply by counting the number of 1's after the exclusive-OR operation of two 16-bit numbers. Counting the number of 1's is loop-intensive if implemented in software. Instead, 1's counting is implemented in one clock cycle using a Wallace tree [10] of depth 6. The Wallace tree is composed of 15 full-adder [...]
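For comparison with the hardware approach, the operation itself (exclusive-OR followed by counting the 1's) can be expressed in a few lines of software. This sketch only shows the result that the hardware computes; it does not model the Wallace tree. The function name `hamming_distance16` is hypothetical.

```python
def hamming_distance16(x, y):
    """Hamming distance between two 16-bit words."""
    diff = (x ^ y) & 0xFFFF       # bits that differ, limited to 16 bits
    return bin(diff).count("1")   # count the 1's in the difference
```

For example, `hamming_distance16(0b1011, 0b0010)` compares two words that differ in two bit positions.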
C. ACS Add-Compare-Select of the Viterbi Algorithm
A part of the trellis diagram is shown in Fig. 13, where A, B, and C are states at each time, p0 is the path metric up to A, and p1 is the path metric up to B. d0 is the distance between A and C, while d1 is the distance between B and C. The path metric up to C is, therefore, the smaller of p0+d0 and p1+d1 [9]. Given the path metrics and distances, the path metric of C is determined by the ACS operation, which is composed of two additions, one comparison, and one selection.
MDSP-II performs two additions simultaneously in one cycle using the MAC unit and the MCA unit, which is called the dual addition mode. Two instructions support the dual addition mode: DADD1 for the Hamming distance and DADD2 for the square distance. Two are needed because the latency of the Hamming distance calculation is 1 cycle, whereas the latency of the square distance calculation is 2 cycles. The behavior of the ACS operation is as follows, where the behavior of the related instructions DADD1, DADD2, and ACS is shown in Fig. 14. When the square distance calculation has been done using the SDIS1 and SDIS2 instructions, two distance values, i.e., d0 and d1, are stored in CACC0 and CACC2, respectively. Then, the DADD2 instruction is executed such that ACC0 eventually contains p0+d0 and CACC0 contains p1+d1, as shown in Fig. 15. For the Hamming distance calculation, CACC0 and CACC1 contain d0 and d1, respectively, which are [...]
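The ACS operation on the quantities p0, d0, p1, and d1 defined above can be sketched as follows. The returned selection bit and the tie-breaking rule (prefer the path from A when both sums are equal) are assumptions made for illustration, and the function name `acs` is hypothetical.

```python
def acs(p0, d0, p1, d1):
    """Add-compare-select for one trellis state.

    Returns the new path metric and a selection bit: 0 if the path
    through A survives, 1 if the path through B survives.
    """
    cand0 = p0 + d0     # first addition
    cand1 = p1 + d1     # second addition
    if cand0 <= cand1:  # comparison (ties go to path 0, an assumption)
        return cand0, 0  # selection
    return cand1, 1
```

Recording the selection bit for every state and time step is what makes the later trace-back of the survivors possible.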
V. Experimental Results
MDSP-II was fabricated using a 0.6 µm TLM CMOS process as a 9.7 mm × 9.8 mm die and operates at up to a 55 MHz clock with a 5-stage pipeline. Details are shown in Table II, and the photomicrograph of MDSP-II is shown in Fig. 17. We compared MDSP-II with the TMS320C54x, which is one of the most popular DSP chips for mobile communication applications. The result of the performance comparison for the Viterbi decoding is shown in Table III, where r denotes the code rate and * denotes that GSM-specific generator polynomials were used. The constraint length is 5 for all cases.
MDSP-II performs the Viterbi decoding in 0.8 MIPS for all cases. MDSP-II is superior to the TMS320C54x except for the case r=1/2*, where the trellis diagram is symmetric so that the special instructions of the TMS320C54x, which are applicable only to a symmetric trellis diagram, can be used to simplify the decoding [3]. However, this symmetry of the trellis diagram does not exist in the Viterbi equalization, which is the most time-consuming portion [...]
VI. Conclusions
In this paper, we presented a 16-bit DSP called MDSP-II, designed for GSM and based on an accelerator called MCA (Mobile Communication Accelerator), which supports the complex MAC operation and the Viterbi algorithm. In MDSP-II, the inclusion of MCA has produced a performance improvement of 170% for GSM at an additional cost of 5.1% in chip area. Because the input/output interface of MCA is very simple and is controllable at the instruction level, it can be used as a functional block in designing other appropriate application-specific DSP's.
