Abstract
Introduction
The large number of emerging radio standards and the convergence of wireless products have led to an increased interest in Software Defined Radio (SDR). Increased flexibility is also needed in baseband processors in order to meet the requirements for time to market and product lifetime. Power consumption continues to be very important.
To handle the requirements of demanding applications such as wireless networking, third/fourth generation mobile telephony, and digital video broadcasting, a high degree of parallelism is needed in the baseband processor. For many applications, general DSP processors cannot reach the required performance, cost, and power consumption so new architectures are needed.
We are proposing an architecture based on an application specific DSP core and a number of flexible accelerators, connected via a configurable network.
An approach based on hardware acceleration of selected algorithms both increases the parallelism in the system, and improves the efficiency since the processor core can focus on the tasks more suitable for software implementation, such as multiply-accumulate based operations.
Other programmable solutions, such as [1], [2], and [3], are typically based on highly complex VLIW and/or multiple processor cores. The approach described here leads to lower control overhead and reduced memory requirements, resulting in lower area and power consumption.
Our approach also leads to a much higher degree of hardware reuse, between algorithms and between standards, than a fixed function implementation. It can therefore reach a lower silicon cost for a multi-standard baseband processor than a direct-mapped ASIC would, even with the program memory and instruction decoding overhead taken into account.
Section 2 reviews the properties of algorithms and processing required for the physical layer in a typical radio system and their implications for the choice of architecture. Sections 3 to 7 describe the architecture in general. In sections 8 and 9 a demonstrator implementation for Wireless LAN is described.
Overview of Baseband Algorithms
Most baseband processing jobs are performed in similar ways in many different radio standards. In order to build an efficient baseband processor it is essential to analyze these jobs and make sure that they can be executed efficiently, while at the same time keeping enough flexibility to accommodate the differences that do occur between standards. A few important properties of baseband processing are listed below.
Few data dependencies
The processing flow in the baseband processor is to a large extent fixed, and data dependencies are limited to parameters such as data rate and packet size, typically extracted from the header of the data packet. There are essentially no backward dependencies between successive steps in the processing flow. This makes it possible to achieve a high degree of task-level pipelining. Figure 1 illustrates task-level pipelining in an OFDM receiver. The absence of data dependencies also makes it possible to minimize the data memory requirements to the order of a few symbols (i.e. a few FFT blocks in an OFDM system, or a few times the spreading factor in a CDMA system).
The memory access patterns in baseband processing are also to a large extent fixed and regular. Data is typically treated as vector elements and accessed sequentially. The amount of housekeeping data and state variables is typically small and can in many cases be kept in registers. With the exception of bit reversed addressing for FFT computation and modulo addressing for efficient implementation of circular buffers, only a small number of simple addressing modes are useful.
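The two non-trivial addressing modes named above can be sketched in a few lines of Python. This is only an illustration of the address sequences involved; the function names and parameters are hypothetical and do not describe the address generators of the actual processor.

```python
def bit_reversed(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (FFT input/output reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def modulo_addresses(start: int, step: int, length: int, count: int) -> list:
    """Addresses into a circular buffer of `length` words, wrapping at the end."""
    return [(start + i * step) % length for i in range(count)]


# Bit-reversed address order for an 8-point FFT (3 address bits).
fft_order = [bit_reversed(i, 3) for i in range(8)]      # [0, 4, 2, 6, 1, 5, 3, 7]

# Six sequential reads from a 4-word circular buffer, starting at address 2.
ring = modulo_addresses(start=2, step=1, length=4, count=6)  # [2, 3, 0, 1, 2, 3]
```

Both sequences depend only on a counter and a few configuration values, which is why such address generation can be kept local to a memory.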
The proposed architecture achieves a large degree of parallelism since the DSP core and a large number of accelerators can operate simultaneously in a pipelined manner. Memory size and memory accesses are minimized by eliminating the need for data buffers when sending data between accelerators.
Complex valued, convolution based computations
A very significant part of the computations in a baseband processor is so-called convolution-based processing: FIR/IIR filtering is used for symbol shaping in the transmitter and for decimation and channel compensation in the receiver. Auto- and cross-correlation are used for packet detection, frequency and timing offset estimation, and channel estimation. Despreading in direct sequence spread spectrum (DSSS) systems can also be seen as a cross-correlation. OFDM systems rely on efficient computation of fast Fourier transforms (FFT).
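To make the common structure of these operations concrete, the following Python sketch computes a cross-correlation as a sequence of complex multiply-accumulate steps. The reference sequence and all names are made up for illustration; they are not taken from any standard or from the firmware described later.

```python
def cross_correlate(samples, reference):
    """Slide `reference` over `samples`; each output is a sum of complex MACs."""
    n_out = len(samples) - len(reference) + 1
    out = []
    for lag in range(n_out):
        acc = 0j
        for k, ref in enumerate(reference):
            acc += samples[lag + k] * ref.conjugate()  # one complex MAC
        out.append(acc)
    return out


# Hypothetical known sequence, embedded in the input with a delay of two samples.
reference = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
samples = [0j, 0j] + reference + [0j, 0j]

peaks = [abs(v) for v in cross_correlate(samples, reference)]
best_lag = peaks.index(max(peaks))  # position of the correlation peak
```

FIR filtering and despreading have the same inner loop with different coefficients, which is why a data path optimized for complex MAC operations covers all of them.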
Traditional DSP processors can generally implement these types of functions efficiently thanks to multiply-accumulate (MAC) units and specialized bus architectures and addressing modes. However, unlike other applications, the baseband processing is based almost entirely on complex valued data (in-phase and quadrature-phase value pairs). In a programmable baseband processor it is therefore beneficial to optimize the data path and instruction set for convolution based calculations on complex data, even at the expense of slightly larger overhead for real-valued calculations. The DSP core in the presented baseband processor natively handles complex data, which results in reduced program size and control overhead for complex valued calculations.
Limited set of bit-level operations
Most remaining computations are bit-level operations for mapping/demapping, channel coding, interleaving, scrambling, and error checking. These are typically not suitable for software implementation and some, e.g. Viterbi/turbo channel decoding, are very demanding. However, these operations tend to be similar between most standards and for many of them very efficient hardware implementations (such as linear feedback shift registers) exist. These facts often make it suitable to implement these functions as flexible/configurable hardware accelerators that can be reused between multiple standards.
Architecture overview
Figure 2 gives an overview of the architecture. The system is controlled by a specialized processor core, which is connected to a number of accelerators and interfaces via a configurable network. Accelerator and network configuration is carried out by dedicated assembly instructions or via a control and status register file.
The DSP Core
An application specific DSP core has been developed for the baseband processor. The DSP core can be described as a rather simple RISC-like processor with the following enhancements:
• A powerful dual complex MAC unit, including two 12x12-bit complex multipliers (i.e. 8 real-valued multipliers), two complex 32-bit adders, and two complex 16-bit adders. The MAC unit can execute e.g. two complex multiply-accumulate operations or one complete radix-2 FFT butterfly plus rounding and saturation each clock cycle.
• An instruction set optimized for baseband processing, including a novel type of instructions operating on vectors of complex numbers.
• Dedicated instructions for network and accelerator configuration and control.
The MAC unit together with the vector instructions has proven to result in very compact assembly code for implementation of baseband algorithms, see table 1. The vector instructions are processed by a dedicated vector execution control unit, which allows control flow and network/accelerator control instructions to be executed in parallel with the vector operations, as illustrated by the assembly code example in figure 3. The DSP core is described in more detail in [4].
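As an illustration of why a complex multiplier corresponds to four real-valued multipliers, the sketch below builds a complex multiplication from real operations and uses it in a radix-2 butterfly. It is a functional model only: the decimation-in-time form of the butterfly is an assumption, and the rounding and saturation performed by the hardware are omitted.

```python
def cmul(a_re, a_im, b_re, b_im):
    """Complex multiply built from four real multiplications and two additions."""
    return a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re


def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly on (re, im) tuples: a +/- w*b."""
    t_re, t_im = cmul(b[0], b[1], w[0], w[1])
    return (a[0] + t_re, a[1] + t_im), (a[0] - t_re, a[1] - t_im)


# Example with twiddle factor w = -j: (3 + 4j) * (-j) = 4 - 3j.
top, bottom = butterfly((1, 2), (3, 4), (0, -1))
# top == (5, -1), bottom == (-3, 5)
```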
The Accelerator Network
Memories, accelerators, and external interfaces are connected to the core via the interconnection network. The network is in principle a crossbar switch which is configured by the core using dedicated assembly instructions. This eliminates the need for an arbiter and addressing logic, thus reducing the complexity of the network and the accelerator interfaces, still allowing many concurrent communications.
The complexity of the network can be reduced substantially from the original full crossbar since many units only need to communicate to a subset of other units. In most cases the network would in practice be divided into two: one in which the data are complex valued samples and one using "bit-based" data. Typically the DSP core and/or a mapper/demapper accelerator would act as a bridge between the two sub-networks.
Each accelerator has one read port and one write port to the network. A connection is set up by connecting one read port to one write port. The reading unit requests one unit of data by asserting a ReadRequest signal during one clock cycle, and the transmitter uses a DataAvailable signal to indicate that new data is available. The reading unit may have up to two outstanding read requests, but must then halt until a DataAvailable signal is received. This protocol allows a new data item to be transferred every clock cycle but still provides sufficient flow control.
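The flow control described above can be explored with a small cycle-based model. The sketch below is a simplification under stated assumptions: the transmitter answers each request after a fixed `producer_latency` (a hypothetical parameter), and the ordering of events within a cycle is chosen for convenience. Only the ReadRequest/DataAvailable roles and the limit of two outstanding requests come from the text.

```python
from collections import deque

MAX_OUTSTANDING = 2  # limit on outstanding read requests stated in the text


def simulate(data, producer_latency=1, cycles=20):
    """Return (cycle, value) pairs for each DataAvailable pulse seen by the reader."""
    pending = deque()      # cycle at which each outstanding request can be answered
    source = deque(data)   # data held by the transmitting unit
    received = []
    for cycle in range(cycles):
        # Transmitter: answer the oldest request once its latency has elapsed.
        if pending and pending[0] <= cycle and source:
            pending.popleft()
            received.append((cycle, source.popleft()))
        # Reader: assert ReadRequest unless two requests are outstanding
        # or all data has already been requested.
        all_requested = len(received) + len(pending) >= len(data)
        if len(pending) < MAX_OUTSTANDING and not all_requested:
            pending.append(cycle + producer_latency)
    return received


fast = simulate([10, 11, 12, 13])                      # one item every cycle
slow = simulate([10, 11, 12, 13], producer_latency=3)  # reader halts at times
```

In this model a transmitter latency of one or two cycles sustains one transfer per cycle, while a longer latency makes the reader halt once two requests are outstanding.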
A chain of accelerators connected to each other via the network will automatically synchronize and communicate without any interaction by the processor. This allows truly concurrent operation of the core and any number of accelerators with zero synchronization overhead in the core. This also minimizes the number of memory accesses since no intermediate storage is needed when sending data between accelerators.
Network implementations other than this one could be used. For example, globally asynchronous, locally synchronous (GALS) solutions such as the one in [5] could be considered. However, since implementations of this architecture typically will have a small area and relatively low clock frequency, our implementation is sufficient and has the benefit of very simple interfaces and no handshaking overhead. It also allows the use of a standard synchronous design flow.
Memory Architecture
Using a number of small data memories gives enough memory bandwidth to keep the core/CMAC and accelerators fully occupied. The network always gives a unit (core or accelerator) exclusive memory access, thereby eliminating stall cycles due to access conflicts. After finishing a task, the entire memory containing the output can be "handed over" to an accelerator or interface by reconfiguration of the network. This eliminates data moves between memories.
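The idea of handing over a memory instead of copying its contents can be illustrated with a toy model. The class, method, and unit names below are hypothetical; the sketch only shows that changing the connection transfers access to the same storage and that exactly one unit has access at a time.

```python
class Network:
    """Toy model: each memory is connected to exactly one unit at a time."""

    def __init__(self, memories):
        self.memories = memories   # memory name -> list of words
        self.owner = {}            # memory name -> name of the connected unit

    def connect(self, memory, unit):
        """Reconfigure the network; no data is moved."""
        self.owner[memory] = unit

    def access(self, memory, unit):
        """Return the memory contents if `unit` is the connected unit."""
        if self.owner.get(memory) != unit:
            raise PermissionError(f"{unit} is not connected to {memory}")
        return self.memories[memory]


net = Network({"dm0": [0] * 4})
net.connect("dm0", "core")
net.access("dm0", "core")[0] = 42      # the core writes its output
net.connect("dm0", "accelerator")      # hand-over by reconfiguration
handed_over = net.access("dm0", "accelerator")
```

After the hand-over the accelerator sees the value written by the core, and an access attempt by the core raises an error because it is no longer connected.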
Memory addressing is distributed to separate address generators in each memory. Avoiding centralized addressing logic in the DSP core improves modularity and scalability. Customized memories and addressing modes can be added without redesigning the DSP core. Addresses and addressing modes are configured using the same interface as for accelerator configuration. No addressing information needs to be sent over the network.
Reducing memory sizes and memory accesses was a major focus in the design, since a large part of the power consumption in a programmable architecture takes place in the memories. The small, and thereby fast, on-chip memories and the moderate clock frequency eliminate the need for caches, which avoids a lot of control overhead. Furthermore, the execution time is completely predictable, which is a major advantage in hard real-time systems.
Choosing Accelerators
A key issue is the choice of accelerators. This has previously been discussed in [6]. The main factors to consider are: 1) the relation between the area that would be occupied by the accelerator and the cycle cost for a pure software implementation of the function, and 2) the extent to which the accelerator can be reused between standards. The reuse factor can often be improved by adding configurability to the accelerator.
Acceleration leads to reduced power consumption by reducing the overhead for execution of non-software-friendly operations. It also decreases the clock frequency requirements by increasing the parallelism in the system. Thereby, additional power savings can be made since the supply voltage may be lowered. However, these gains must be carefully weighed against the added silicon cost of the accelerator.
Preferably, clock gating should be used to reduce the power consumption of accelerators when they are not used. The implementation is fairly straightforward due to the inherent modularity of the architecture.
As briefly mentioned above, operations like convolutional encoding and scrambling will often be accelerated since they have very simple (and similar) hardware implementations but high overhead when implemented in software. Viterbi and turbo decoding are very demanding operations and will have to be accelerated in most cases in order to reach reasonably low clock frequencies.
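The contrast between hardware and software cost is visible in a bit-level model of a scrambler. The sketch below assumes the generator polynomial x^7 + x^4 + 1 of the 802.11a frame-synchronous scrambler; the default initial state is arbitrary, and the code is an illustration rather than a description of the accelerator. In hardware this is a 7-bit shift register and two XOR gates, whereas software spends several instructions per bit.

```python
def scramble(bits, state=0b1011101):
    """Additive scrambler: XOR the data with a 7-bit LFSR sequence (x^7 + x^4 + 1)."""
    out = []
    for bit in bits:
        feedback = ((state >> 6) ^ (state >> 3)) & 1  # taps at x^7 and x^4
        state = ((state << 1) | feedback) & 0x7F      # shift the feedback bit in
        out.append(bit ^ feedback)
    return out


data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
scrambled = scramble(data)
restored = scramble(scrambled)  # same operation and initial state descrambles
```

Because the scrambling sequence is simply XORed onto the data, applying the same function twice with the same initial state returns the original bits.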
The ADC/DAC interface accelerator contains a configurable decimation and symbol shaping filter, a rotor for carrier frequency offset compensation, and a configurable packet detector based on autocorrelation. The packet detector will wake the core from idle mode when an incoming frame preamble is detected. These functions can be reused between many standards. They also have to run continuously for a large part of the time, and the filtering is quite demanding at high sample rates.
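A common way to build an autocorrelation-based packet detector is to correlate the input with a delayed copy of itself, which responds to any periodic preamble. The sketch below uses that delay-and-correlate scheme together with a rotor; it is an assumption about the general technique, not a description of the accelerator. The preamble, the threshold of 0.8, and the phase step are made-up values.

```python
import cmath


def rotate(samples, phase_step):
    """Rotor: multiply sample n by exp(j * phase_step * n)."""
    return [s * cmath.exp(1j * phase_step * n) for n, s in enumerate(samples)]


def delay_correlate(samples, start, period, window):
    """Correlate a window with its copy `period` samples later; return (corr, metric)."""
    corr = 0j
    energy = 0.0
    for k in range(window):
        early = samples[start + k]
        late = samples[start + k + period]
        corr += early.conjugate() * late
        energy += abs(late) ** 2
    metric = abs(corr) / energy if energy > 0 else 0.0
    return corr, metric


# Hypothetical periodic preamble received with a frequency offset of 0.1 rad/sample.
short = [1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
preamble = short * 4
received = rotate(preamble, 0.1)

corr, metric = delay_correlate(received, start=0, period=4, window=8)
detected = metric > 0.8                    # threshold is an arbitrary choice
offset_estimate = cmath.phase(corr) / 4    # phase advance per sample
corrected = rotate(received, -offset_estimate)
```

The phase of the correlation result equals the phase advance over one preamble period, so the same computation yields both the detection metric and an estimate that the rotor can use for compensation.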
Implementation for WLAN
The described architecture has been proven in an implementation for a converged 802.11a/b/g baseband processor.
The manufactured baseband processor has a program memory size of 4096x16 bits. Four identical 256x32 bit data memories for complex data are connected to the network. Each of these memories consists of two interleaved memory banks, allowing two consecutive addresses (vector elements) to be accessed in parallel. These memories also have FFT addressing support. A 2048x32 bit coefficient memory connected directly to the core is used for FFT and filter coefficients, look-up tables, and other data not processed by accelerators. Using dual memory banks instead of dual port memories saves power.
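How two interleaved banks serve two consecutive addresses in one access can be sketched as follows. The mapping of the least significant address bit to the bank number is an assumption made for illustration; the text only states that the banks are interleaved.

```python
def bank_and_row(address):
    """Map a word address to (bank, row): even addresses in bank 0, odd in bank 1."""
    return address & 1, address >> 1


def read_pair(banks, address):
    """Fetch vector elements `address` and `address + 1` in a single access."""
    (b0, r0), (b1, r1) = bank_and_row(address), bank_and_row(address + 1)
    assert b0 != b1  # consecutive addresses always fall in different banks
    return banks[b0][r0], banks[b1][r1]


# A 256-word memory modelled as two single-port banks of 128 words each.
banks = [[0] * 128, [0] * 128]
for addr in range(256):
    bank, row = bank_and_row(addr)
    banks[bank][row] = addr   # store the address itself as test data

pair = read_pair(banks, 5)    # (5, 6)
```

Each bank handles only one access per cycle, so single-port memories are sufficient for the paired read.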
As described above, the ADC/DAC interface contains a decimation filter, rotor and packet detector. Other accelerators that are reused between 11a and 11b standards are the scrambler and the MAC-layer interface. Furthermore the Viterbi decoder and interleaver for the 802.11a standard must be accelerated. Acceleration of the Walsh transform used for reception at the highest data rate in the 802.11b standard resulted in a decrease of the required frequency for the 11b receiver from approximately 160 to 110 MHz. Finally an accelerator for BPSK/QPSK/16-QAM/64-QAM mapping and demapping was also implemented.
Results
Firmware was implemented for 802.11a and 11b transceivers. Results in terms of memory usage and required clock frequency for different modules can be found in table 1. The instruction set has proven to be very efficient. Only about half of the available program memory is required to store the entire 11a and 11b transceiver firmware on chip. Data memory requirements are also about half of the available data memory. Parts of the firmware as well as most of the data memory are shared by Rx and Tx modules, so the actual requirements are less than the sum of the numbers in the table.
An 802.11a/b/g baseband processor demonstrator chip, with accelerators for ADC/DAC interface and front-end processing, demapping, interleaving, scrambling, CRC, Walsh transform, and MAC-layer interface, was implemented and manufactured using a 0.18 µm CMOS standard cell library. The chip features and measured performance can be found in table 2. Figure 4 shows a die photo.
The chip functions correctly up to at least 220 MHz, implying that power can be saved by reducing the supply voltage in a converged 802.11a/b/g transceiver running at 160 MHz, which is the required frequency for 54 Mbit/s reception in 802.11a/g.
The processor is flexible enough to also support standards with lower data rate, such as GSM/GPRS and Bluetooth. Firmware for kernel functions of these standards has been developed. 
Conclusions
An architecture for programmable baseband processing has been presented. The architecture is based on an application-specific DSP core and a set of accelerators connected via a configurable network.
The interconnection network allows a high degree of parallelism since connected units can synchronize and communicate independently of the DSP core. It also minimizes data memory size and memory accesses since there is no need for buffer memory when sending data between accelerators and entire data memories can be transferred between different computation units by reconfiguration of the network.
By providing a good tradeoff between flexibility and performance, minimized data and program memory requirements, and a high degree of hardware reuse, the presented architecture enables very area- and power-efficient implementations of multi-standard radio baseband processors.
Efficient hardware implementation of selected algorithms leads to reduced power consumption due to reduced control overhead and lower frequency requirements as well as reduced program code size.
The instruction set has proven to result in efficient firmware for both OFDM (802.11a) and DSSS (802.11b) as well as Gaussian frequency shift keying (Bluetooth/GSM) systems.
Implementation results indicate that the silicon cost will be smaller than for existing programmable and non-programmable solutions. The measured power consumption is low considering that low-power design techniques such as clock gating were not used in the fabricated chip.
