Abstract - This paper describes the architecture and the performance of a new programmable 16-bit Digital Signal Processor (DSP) engine. It is developed specifically for next generation wireless digital systems and speech applications. Besides providing a basic instruction set, similar to current day 16-bit DSP's, it contains extra architectural features and unique instructions, which make the engine highly efficient for compute-intensive tasks such as vector quantization and Viterbi operations. The datapath contains two Multiply-Accumulate units and one ALU. The external memory bandwidth is kept to two data busses and two corresponding address busses. Still, the internal bus network is designed such that all three units operate in parallel. This parallelism is reflected in the performance benchmarks. For example, an FIR filter of N taps will take N/2 instruction cycles compared to N for a general purpose 16-bit DSP, and it requires only half the number of memory accesses of a general purpose DSP.
INTRODUCTION
Next generation wireless digital systems and speech applications require a performance not offered by current day DSP's. To accommodate more users and to improve the quality of the speech, more complex algorithms are introduced. Examples are the Japanese PSI-CELP, the GSM-Half rate and the IS-95 standards.
The most basic characteristic of a DSP is the clock rate at which the processor operates. From a system design viewpoint the clock rate is determined by two main factors: 1) The application and the system level requirements and 2) The complexity of the required algorithms. Usually, in a complex system such as a digital cellular phone, it is desired to have a single crystal generating the master clock. This reduces overall cost and minimizes other issues such as different clock frequencies interfering with each other in the system. Hence, a suitable clock frequency for the system has to be found from which all other clocks are generated for different modules. The modules range from an audio codec to an analog front end ASIC to DSP's where the baseband signal processing takes place. The second factor determining the master clock rate is the complexity of the algorithms that are residing on the DSP. Almost always, this alone is the deciding factor for the master clock since it will have the highest speed requirement. For example, in a digital cellular phone, this can range from 40 MHz to 60 MHz depending on the standard used and the system features included.
One of the main things a system designer worries about is how much the system clock can be reduced without affecting performance. Lower speed means less power consumption and hence longer battery life for a cellular phone. However, reducing clock speed means either not adding extra features which might distinguish the product from others or sacrificing performance. It is to this effect that different DSP architectures play a crucial role.
Two paths are possible to improve performance: increasing the clock speed at which the DSP runs, by reducing propagation delays in the circuits and adding deeper and deeper instruction pipelining, or keeping the clock speed fixed and increasing the parallelism in the DSP operations. Two terms are used in the industry to describe the performance of a DSP: MIPS (Million Instructions Per Second) and MOPS (Million Operations Per Second). Let's assume that a DSP can fetch one instruction per clock cycle. In a pipelined machine this means that effectively one instruction is performed per clock cycle. Of course, in reality, some instructions effectively require two clock cycles or more, such as most program flow instructions and instructions that are more than one word long. MIPS measures the number of instructions the DSP can process in one second. For the DSP in this example, a clock rate of 20 MHz will generate a performance very close to 20 MIPS.
In a general purpose DSP, an instruction is defined as one fundamental signal processing operation such as an add or multiply-accumulate. In this case the MOPS figure will almost be equivalent to the MIPS figure. Hence, to increase the MOPS figure, the MIPS should be increased which in turn will require higher clock rates. On the other hand, when parallelism is employed, an instruction can incorporate several operations such as subtract and square-accumulate in parallel. In this case the MOPS figure will be several times higher than the clock rate or the MIPS figure.
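The relation between clock rate, MIPS and MOPS described above can be sketched numerically. This is a minimal illustration, not a benchmark: the function names are hypothetical, and the figure of three operations per instruction is an assumed example of a parallel instruction, not a measured property of any processor.

```python
def mips(clock_mhz, cycles_per_instruction=1.0):
    """Million instructions per second for a pipelined machine.

    Assumes one instruction is issued every `cycles_per_instruction` cycles.
    """
    return clock_mhz / cycles_per_instruction


def mops(clock_mhz, ops_per_instruction, cycles_per_instruction=1.0):
    """Million operations per second: MIPS times operations per instruction."""
    return mips(clock_mhz, cycles_per_instruction) * ops_per_instruction


# General purpose DSP: one operation per instruction, so MOPS equals MIPS.
general_purpose = mops(20, ops_per_instruction=1)

# Parallel DSP (assumed 3 operations per instruction) at the same clock rate.
parallel = mops(20, ops_per_instruction=3)
```

At the same 20 MHz clock, the first case gives 20 MOPS and the second 60 MOPS, which is the sense in which parallelism raises MOPS without raising the clock rate.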
In reference [1] it has been shown that the clock speed, and hence MIPS, will not keep increasing as it has in the past decade and will flatten out in the next decade. However, the emergence of application or domain specific DSP's, which have specialized functions and parallelism, will dominate and will keep the MOPS figure on the rise instead. This is the strategy applied in the design of the presented DSP core.
Next Generation DSP's: Programmable DSP's were introduced in the early 80's for compute intensive applications. These applications run mostly under tight real-time constraints and involve a large number of operations which are in tight repetitive loops involving many memory accesses.
The basic Harvard DSP architecture consists of a datapath able to perform a multiply-accumulate in a very efficient way, and employs a separate program bus and data bus to improve the memory bandwidth, as shown in Figure 1. In the 1st generation processors, such as the TMS32010, z and x are mapped on the same accumulator; y and J come from the data memory. But since the processor has only one data bus and hence one data fetch per cycle, this instruction takes 2 cycles to execute.
In the second generation processors, such as the TMS320C25, part of the program memory can be mapped onto the data memory (RAM). Therefore during a single instruction repeat operation, the program bus is freed and is used to fetch data from the data memory. In this case two data items can be fetched in one cycle and hence a multiply-accumulate instruction takes 1 cycle to execute.
Typical for the 3rd generation is the addition of extra addressing features, such as circular buffers, and special functions, such as bit manipulations. These features are introduced to address new and more complex algorithms and support the emerging digital cellular standards, such as the North-American Digital Cellular IS-54.
To increase the performance even further, extra computational units are added to operate in parallel. For example, the MAC datapath and the ALU datapath can be split and operate in parallel. With the increase in the number of datapath units came an increase in the memory bandwidth. For example, if each datapath unit operates on two data items, four data items have to be brought to the inputs to keep both datapath units operational every clock cycle. This increase in memory bandwidth resulted in an increase in local and global busses and, in particular, the two separate data busses to the data memory. This is the trend in the 4th generation DSP's.
The DSP core presented in this paper can be considered a fifth generation processor. Unique to this DSP are its Dual-MAC and ALU, which operate in parallel. The internal bus structure, the two data busses which connect to the data RAM, and the instruction set are built such that all three units are kept working in parallel. The details are described in Section 2.
Alternative Approaches: Several approaches are used to improve DSP performance. The first and most obvious solution is to increase the clock rate. This is achieved by scaling the technology and increasing the pipeline depth of the processor. This is applied in the DSP56300 [2] by implementing a seven stage pipeline. However, a deep pipeline increases the penalty for program flow instructions, such as branches, calls, etc. A high clock frequency increases the power consumption and also makes the memory accesses difficult, because only small on-chip memories can be accessed in a single clock cycle. This problem is addressed in [3], where the multiply-accumulate unit operates at double the clock frequency of the memory accesses. The drawback is that memory management becomes difficult. Two 32-bit words are read from memory and split in the datapath unit. Then the datapath operates on the even and the odd data units consecutively.
A second approach is to organize the datapath units in a VLIW (Very Large Instruction Word) architecture [4]. Because each unit operates completely independently of the others, this requires a full crossbar bus network, where each output is connected to each input. Moreover, the memory bandwidth has to be very wide to allow multiple memory accesses at the same time (worst case: the sum of all inputs and outputs). Finally, the instruction word becomes very wide because each of the units is programmed independently and each operand has to be specified. This is an expensive solution, both in terms of area and power.
A third approach, which avoids the long instruction words, is the selection of a SIMD (Single Instruction Multiple Data) machine. This approach is chosen in [5] to implement the PSI-CELP speech codec algorithm. Data words, twice the width of the datapath, are read from memory and the high and low words go to different units. This requires a memory bandwidth which is double that of a single-MAC processor. Moreover, both units execute exactly the same instruction, making it difficult to program.
Instead of taking a VLIW or SIMD approach, a domain-specific DSP has been selected in Lode. It has been designed with cellular and speech processing applications in mind. A predecessor of this DSP core is described in [6]. It was designed specifically for the PSI-CELP algorithm. It also has a Dual-MAC and ALU datapath. It has however a VLIW architecture, a two stage pipeline and a very different bus network.
LODE ARCHITECTURE
The Lode architecture is shown in Figure 2.
Bus Network: When three units operate in parallel, care must be taken that all three units have operands to work on. In general this would require a threefold increase in memory bandwidth compared to only one datapath unit, making this solution too expensive in terms of area and power.
Instead of choosing a general VLIW or SIMD architecture, a domain-specific DSP has been designed. Therefore, the possible input combinations are restricted while making sure that all three units can operate in parallel. The selection of input combinations and the bus structure has been made such that typical DSP operations, and especially operations for wireless communications and speech processing, are optimally supported. Internally, the bandwidth is increased by the introduction of a special delay register, keg. Furthermore, small local connections are provided to route the data from one unit to the next. Local connections are more effective from a power consumption viewpoint. The flow of the data through the processor is illustrated for several instructions in Section 3.
Data Memory Busses:
Two address busses and two data busses are provided to connect the Lode core with the data memory. This data memory is outside the core. Its size and organization depend on the application. The outside memory architecture should be such that it supports two accesses (two reads, or one read and one write) each clock cycle.
Address Generation Units:
The DSP core has two independent address generation units (AGU). In VLIW terminology, these can be considered as two extra datapath units. Hence, in total 5 units operate in parallel! The two AGU's share 8 pointer registers and 8 pointer modifier registers. When these registers are not in use by the AGU's, they can be used by the other datapath units as general registers. This is also true for the other special registers available. The AGU's support the following addressing modes: memory direct, register indirect, register direct, short (8-bit) and integer (16-bit) immediate data, and special support for double precision operations.
Pipeline:
The number of pipeline stages follows the flow of data. There are five pipeline stages: fetch, decode, read, execute and write. During fetch, an instruction is fetched from the program memory. This instruction is decoded during the decode stage. During the read stage, data is read from data memory and placed in the pipeline registers M0, M1. In parallel, addresses are post-modified. During execute, data is taken from M0, M1, the accumulators or the registers, an operation is performed and the result is placed in one of the accumulators. During the write stage, data is written back to memory.
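The five-stage flow can be visualized with a small occupancy table. The sketch below is an idealized model that assumes one instruction enters the pipeline per clock cycle with no stalls or multi-cycle instructions; the function name and the table layout are illustrative choices, not part of the Lode design.

```python
STAGES = ["fetch", "decode", "read", "execute", "write"]


def pipeline_schedule(num_instructions):
    """Return one dict per clock cycle mapping each stage to the index of the
    instruction occupying it (None when the stage is empty).

    Idealized: every instruction spends exactly one cycle in each stage.
    """
    total_cycles = num_instructions + len(STAGES) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            index = cycle - depth  # instruction that entered `depth` cycles ago
            row[stage] = index if 0 <= index < num_instructions else None
        schedule.append(row)
    return schedule


# Three instructions need 3 + 5 - 1 = 7 cycles to drain through the pipeline.
schedule = pipeline_schedule(3)
```

In cycle 2 of this example, instruction 0 is in the read stage while instruction 1 is decoded and instruction 2 is fetched, which is the overlap that lets a pipelined machine complete close to one instruction per cycle.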

INSTRUCTION SET
The instruction set reflects the choice for a domain-specific DSP. A general VLIW architecture requires a minimum of three function fields to specify the operations on each datapath unit, and a minimum of six source and destination fields to specify how the inputs and outputs are connected. Add to these the fields controlling the AGU's. The Lode instruction set reflects the bus network and the options for inputs and outputs. When all datapath units are in use, the instruction describes how the data flows through the processor.
Example 1: If only one datapath unit is used, the selection of inputs and outputs is very wide. For example, a single "multiply-accumulate" operation has the following syntax: [,ppm]; where an and am are any accumulators, op0 and op1 can be data from memory, from another accumulator, from a pointer register or from any special register. ppm describes the address post-modification in case op0 or op1 are read from memory.
Example 2:
The "Dual multiply-accumulate" instruction, useful in the computation of FIR filters, has the following syntax:
where the first MAC receives two data operands from data memory, and the second MAC receives the same operand, op0, but receives as second input the contents of the delay register keg. This is used to compute two filter outputs at the same time.
As a result, a block FIR computation requires only N/2 instruction cycles per output and only half of the memory accesses compared to a single MAC implementation.
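The cycle and memory-access counts can be checked with a behavioral model. The sketch below is an assumption-laden illustration rather than the Lode microarchitecture: it takes op0 to be the coefficient and op1 the data sample, lets the delay register hold the sample fetched in the previous cycle, and preloads that register with one extra read. All function and variable names are hypothetical.

```python
def fir_single_mac(coeffs, x, n):
    """One output y[n] = sum_k coeffs[k] * x[n - k] on a single MAC.

    Returns (y_n, cycles, memory_reads); two operands are read per cycle.
    """
    acc, cycles, reads = 0, 0, 0
    for k, c in enumerate(coeffs):
        acc += c * x[n - k]
        reads += 2
        cycles += 1
    return acc, cycles, reads


def fir_dual_mac(coeffs, x, n):
    """Two outputs, y[n] and y[n + 1], in one pass over the coefficients.

    Assumed model: MAC 0 multiplies the coefficient by the sample just read;
    MAC 1 multiplies the same coefficient by the delay register, which holds
    the sample read one cycle earlier (preloaded with x[n + 1]).
    """
    delay = x[n + 1]          # preload of the delay register (one extra read)
    reads = 1
    acc0 = acc1 = 0
    cycles = 0
    for k, c in enumerate(coeffs):
        sample = x[n - k]     # the only two memory reads this cycle: c, sample
        reads += 2
        acc0 += c * sample    # contributes to y[n]
        acc1 += c * delay     # contributes to y[n + 1]
        delay = sample        # sample becomes next cycle's delayed operand
        cycles += 1
    return acc0, acc1, cycles, reads


coeffs = [1, 2, 3, 4]
x = list(range(10))
y5, y6, dual_cycles, dual_reads = fir_dual_mac(coeffs, x, 5)
```

For N = 4 taps, the dual-MAC model produces two outputs in 4 cycles (N/2 per output) with 9 reads, where two single-MAC passes would need 8 cycles and 16 reads.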
Example 3:
The "square distance and accumulate" instruction is executed in one cycle. This is a basic function of the vector quantization process in speech compression algorithms. It is used to perform the following operation:
For this instruction, the AMU and one MAC unit are used with the following syntax:
The input data operands, op0 and op1, are routed through the AMU, and the result is placed in a3. Next, the square of a3 is taken in the MAC unit and the result is accumulated in a0. asr specifies the input shift amount.
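The data flow just described can be sketched as a behavioral model. This is an illustration under assumptions: the AMU operation is taken to be a subtraction, as is usual for a squared-distance computation in vector quantization, and asr is assumed to be an arithmetic right shift applied to the difference before squaring. Neither detail is confirmed by the text, and the function name is hypothetical.

```python
def square_distance_accumulate(x, y, asr=0):
    """Accumulate the squared differences of two equal-length vectors.

    a3 models the AMU result and a0 the MAC accumulator. The subtraction and
    the placement of the shift are assumptions made for this sketch.
    """
    a0 = 0
    for op0, op1 in zip(x, y):
        a3 = (op0 - op1) >> asr   # AMU: difference, arithmetic shift right
        a0 += a3 * a3             # MAC: square a3 and accumulate in a0
    return a0


distance = square_distance_accumulate([3, 5, 2], [1, 1, 4])
```

Each loop iteration corresponds to one instruction in this model, so a vector of length L would take L cycles once the pipeline is full.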
Example 4:
The "add-compare-select (ACS)" operation for the Viterbi butterfly. The dual MAC structure can also be used to do dual addition and dual subtraction combinations by bypassing the multipliers. This, in parallel with the AMU performing a maximum or minimum operation, will efficiently compute the ACS butterfly operation of a radix 2 Viterbi core in four clock cycles. For example,
uses the dual MAC unit to add and subtract op1 from the respective accumulators; simultaneously the AMU finds the maximum of the two accumulators, a0 and a1, and stores the result into memory. The max instruction will implicitly store a decision bit, depending on whether a0 or a1 was the maximum, into the a2 accumulator. Several flavors of this instruction are supported, thus achieving the efficient 4 cycle Viterbi butterfly ACS operation.
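A radix 2 ACS butterfly can be written functionally as follows. This is a generic textbook formulation, assuming a symmetric branch metric (plus bm on one branch, minus bm on the other) and a maximum-metric convention; it does not reproduce the Lode instruction sequence or its four-cycle schedule, and the names are hypothetical.

```python
def acs_butterfly(pm0, pm1, bm):
    """Add-compare-select for one radix 2 Viterbi butterfly.

    pm0, pm1: path metrics of the two predecessor states.
    bm: branch metric (assumed antisymmetric across the two branches).
    Returns (new_pm_a, new_pm_b, decisions), where each decision bit is 0 if
    the path from pm0 survived and 1 if the path from pm1 survived.
    """
    # First target state: dual add/subtract, then compare-select.
    a0, a1 = pm0 + bm, pm1 - bm
    new_a, bit_a = (a0, 0) if a0 >= a1 else (a1, 1)

    # Second target state: the signs of the branch metric are swapped.
    a0, a1 = pm0 - bm, pm1 + bm
    new_b, bit_b = (a0, 0) if a0 >= a1 else (a1, 1)

    return new_a, new_b, [bit_a, bit_b]


new_a, new_b, decisions = acs_butterfly(10, 7, 2)
```

The two add/subtract pairs correspond to the work the dual MAC does with its multipliers bypassed, and each compare-select with its decision bit corresponds to the max operation in the AMU.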
Example 5: "Galois field (GF) operations." Two specialized instructions in the AMU allow Galois field operations such as division and multiplication to be performed on polynomials over GF(2). This allows 1 cycle per bit CRC calculations, again a basic operation in error control coding for many data fields in the digital cellular standards. An example of such an instruction is, a0 = gfdiv(a1 <asr);
where the a1 accumulator contains the divisor polynomial over GF(2) and the a0 accumulator contains the dividend. This instruction can be repeated up to 40 times (40-bit accumulator) before saving and reloading a0. At the end of the operation, a2 will have the quotient of the division and a0 will have the remainder. For CRC calculation, the remainder is the result of interest.
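Bit-serial polynomial division over GF(2) can be sketched as below, with one loop iteration per dividend bit to mirror the one cycle per bit figure. This is a generic long-division model using Python integers as bit vectors; it does not model the 40-bit accumulators, the register allocation, or the exact semantics of gfdiv, and the function names are hypothetical.

```python
def gf2_divmod(dividend, divisor, nbits):
    """Divide two GF(2) polynomials given as integers (bit i = coefficient of x^i).

    Processes the `nbits` dividend bits MSB first, one per iteration.
    Returns (quotient, remainder).
    """
    degree = divisor.bit_length() - 1
    quotient = 0
    remainder = 0
    for i in range(nbits - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # shift in next bit
        q_bit = (remainder >> degree) & 1
        if q_bit:
            remainder ^= divisor      # subtraction over GF(2) is XOR
        quotient = (quotient << 1) | q_bit
    return quotient, remainder


def crc_remainder(message, message_bits, generator):
    """CRC of `message`: remainder of message * x^degree divided by generator."""
    degree = generator.bit_length() - 1
    _, rem = gf2_divmod(message << degree, generator, message_bits + degree)
    return rem


# (x^3 + x + 1) / (x + 1) = x^2 + x, remainder 1
q, r = gf2_divmod(0b1011, 0b11, 4)
```

Appending the CRC remainder to the message yields a codeword that divides evenly by the generator, which is the property a receiver checks.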
Example 6: Bit manipulation instructions, such as the bit test instruction, allow simultaneous testing and saving of the bit in question in one cycle, thus making interleaving and de-interleaving operations very efficient.
PERFORMANCE
The Lode core maximizes performance and keeps the clock rate low by using parallelism in its datapath. The main features of the Lode core, as described previously, are its dual MAC structure and its complex AMU. At its best all three units can operate in parallel together with the two address generation units.
It is known that parallelism can reduce the energy consumption at the cost of increased area. In this case, the dual MAC structure, with the delay register in between, doubles the speed of filtering operations such as FIR and IIR. The total number of multiplications to compute an FIR, and hence the energy spent on them, is the same as for an implementation on a single MAC DSP. However, energy is saved because only half of the memory accesses are needed compared to a single MAC DSP. Similarly, the number of instruction cycles, and hence the control overhead, is reduced to half compared to a single MAC DSP.
The area overhead is only in the extra MAC. The number of busses or memory ports is similar to a single MAC DSP. The Lode core has a total of 45,000 gates. This corresponds to the core shown in Figure 2 and includes neither data nor program memory. One MAC consists of about 7,000 gates, which corresponds to a 15% area overhead. Note however that memory area dominates in current-day DSP implementations. The core will typically occupy 1/3 of the area or even less. Hence, depending on the application, the overhead in area is around 5% or less.
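The overhead figures follow from simple ratios, sketched below using the gate counts quoted above. Two simplifications are assumed: the overhead is computed relative to the 45,000-gate core, and gate count is treated as proportional to area. The function name is hypothetical.

```python
CORE_GATES = 45_000   # Lode core, without data or program memory
MAC_GATES = 7_000     # approximate size of one MAC unit


def mac_area_overhead(core_fraction_of_chip=1.0):
    """Fraction of area attributable to the extra MAC.

    core_fraction_of_chip: share of the total chip area taken by the core
    (1.0 means the overhead is expressed relative to the core alone).
    """
    return (MAC_GATES / CORE_GATES) * core_fraction_of_chip


core_overhead = mac_area_overhead()          # relative to the core
chip_overhead = mac_area_overhead(1 / 3)     # core is one third of the chip
```

These evaluate to roughly 15.6% of the core and 5.2% of the chip, consistent with the rounded 15% and 5% quoted in the text; a smaller core share lowers the chip-level figure further.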
Table 2 shows the performance of the Lode core in some basic operations commonly found in wireless digital cellular applications as well as many speech compression algorithms. The results are compared to general purpose 16-bit fixed point DSP's, such as AT&T 1616, TI C~X, ADI 21xx and Motorola 56xxx.
CONCLUSION
In this paper, we presented a 16-bit fixed point DSP engine for wireless communications. The datapath units, the bus network and the instruction set are designed such that the compute intensive blocks of wireless cellular systems and speech compression algorithms require a smaller number of instruction cycles, resulting in "lower MIPS", thus allowing the addition of useful product differentiating features in software. Yet, the external memory bandwidth was kept to two memory ports. It is estimated that this DSP core runs at a 40 MHz clock frequency in a 0.5 µm CMOS technology.
ACKNOWLEDGMENTS
The design and implementation of this DSP was a team effort. The following persons are gratefully acknowledged for their contributions: Mark Jensen, Harlan Neff, Kambiz Homayounfar, Gordon Jacobs.
