Abstract-Programmability will be increasingly important in future multi-standard radio systems. In this paper we present an enhanced baseband processor architecture capable of efficiently supporting simultaneous multi-standard operation. Our DSP processor is based on the SIMT (Single Instruction stream Multiple Tasks) architecture which allows concurrent vector tasks to be executed on the processor controlled by only a single narrow instruction stream. By profiling and mapping GSM and WLAN (IEEE 802.11g) to the architecture we show that simultaneous support for the above mentioned standards can be accomplished with 245 MHz clock frequency and 1359 words of complex data memory on the given architecture.
I. INTRODUCTION
Programmable baseband processors are necessary to enable flexible multi-standard radio systems. Programmability can also be used to quickly adapt to new and updated standards since a pure ASIC solution will not be flexible enough. ASIC solutions for multi-standard baseband processors are also less area efficient than their programmable counterparts since processing resources cannot be efficiently shared between different operations.
In our judgment, EVP-16 [1] represents the state of the art in programmable baseband processors today. However, we believe that our SIMT architecture outperforms EVP-16 in the following aspects. The EVP-16 architecture is based on a VLIW scheme with code compaction (both "horizontally" and in time), whereas the SIMT architecture can reach the same utilization of its data-path using only a narrow instruction flow for a given baseband program. The SIMT architecture also reduces the complexity of the control-path and memory system compared to the EVP-16, yielding lower area and power.
Traditionally the research focus has been on enabling multistandard baseband processors and the issue of efficiently being able to execute several standards simultaneously has not been covered. By using a single programmable baseband processor to execute several baseband processing programs at the same time we can:
* Improve hardware reuse. (Save silicon area.)
* Share software kernel functions. (Save program memory.)
* Utilize shared information such as link state and channel parameters. (Improve performance.)
However, as baseband processing is a hard real-time problem with latency requirements on the 1 µs scale and is computationally heavy, special attention must be paid to achieve efficient multi-standard support with low scheduling overhead. One goal of this work is to identify what could be done in hardware to support efficient execution of several radio standards at once. In this paper we only consider reception tasks in GSM and Wireless LAN, as reception is generally more demanding than transmission.
In this paper we first present the SIMT architecture, which is suitable for running the presented standards separately. The SIMT principle is presented in Section III. Then Section IV describes how to divide baseband processing algorithms into processing tasks. In Section V GSM and WLAN algorithms are profiled and mapped to the architecture. Scheduling is described in Section VI and results are presented in Section VII. Finally, conclusions are drawn in Section VIII.
II. BACKGROUND
Analysis of baseband processing [2] [3] the SIMT processor utilizes a number of parallel data-paths (both SIMD and scalar data-paths). However, compared to VLIW-based machines, only a narrow (16-bit) instruction stream is used to control the execution. This is accomplished by using vector instructions that run over time in a SIMD execution unit, utilizing the vector structure found in most baseband tasks. To further reduce the control-path complexity, the SIMT architecture issues only one instruction word each clock cycle.
For example, a 128-sample complex dot-product can be calculated using only one 16-bit instruction when issued to a SIMD execution unit. This instruction will be executed over a number of cycles (64 clock cycles on a 2-way complex CMAC). In the next clock cycle another vector operation can be issued to another vector unit, thus providing highly parallel computing without the need for a complex control-path and high program memory cost. While the vector execution units are in use, the scalar controller can perform control-oriented tasks. This principle is illustrated in Figure 1. In Figure 1 a 64-sample complex dot-product is issued to the Complex Multiply-Accumulate (CMAC) unit, in the next cycle a 64-sample complex vector addition is issued to the Complex ALU (CALU), and then some miscellaneous control-oriented code is executed on the scalar (control) data-path.
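The cycle count quoted above follows from dividing the vector length by the number of lanes in the execution unit. The following Python sketch reproduces that arithmetic and the issue pattern described for Figure 1. It is only an illustration: the function names are hypothetical, each lane is assumed to consume one complex sample per cycle, and the CALU is assumed to be 2-way like the CMAC, which the text does not state.

```python
import math


def vector_cycles(num_samples: int, lanes: int) -> int:
    """Cycles a vector instruction occupies a SIMD unit.

    Assumption: each lane consumes one complex sample per clock cycle.
    """
    return math.ceil(num_samples / lanes)


def issue_timeline(ops, lanes=2):
    """Build a timeline for operations issued one per clock cycle.

    ops: list of (unit_name, num_samples) in issue order, each to a
    different unit. Returns (unit_name, issue_cycle, first_free_cycle).
    """
    timeline = []
    for issue_cycle, (unit, num_samples) in enumerate(ops):
        busy = vector_cycles(num_samples, lanes)
        timeline.append((unit, issue_cycle, issue_cycle + busy))
    return timeline


# 128-sample complex dot-product on a 2-way CMAC, as in the text.
dot_product_cycles = vector_cycles(128, 2)

# The two vector operations described for Figure 1 (lane count assumed).
figure1 = issue_timeline([("CMAC", 64), ("CALU", 64)])
```

Under these assumptions both units in `figure1` are busy at the same time from cycle 1 onward, although only one instruction word was issued per cycle.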
Vector instructions are enabled by the underlying architecture which consists of:
1) Small separate memory blocks with private address generators.
2) Vector execution units (Complex MACs and Complex ALUs).
3) A scalar data path (core) executing RISC instructions.
4) An on-chip network.
5) Accelerators for certain tasks.
When a vector instruction is issued, a corresponding memory bank is connected through the on-chip network to the execution unit, thus giving the execution unit exclusive read/write access to the memory. Since each execution unit is connected to its own memory blocks, concurrent vector operations can be supported on all execution units. By using an on-chip network, the number of memory accesses can be reduced since memories can be "handed over" to other execution units by a simple network reconnect [5]. The SIMT architecture is presented in Figure 2.
A. Context switch
To allow simultaneous multi-standard execution of different baseband programs, the controller core supports several computing contexts. The difference between the SIMT architecture and architectures without multi-cycle vector support is that the context switch will only affect the controller core and its data-path. For example, a task could issue a vector operation to a certain vector execution unit and in the next cycle perform a context switch, keeping the vector operation running on the vector execution unit. Control could be returned to the original context when the vector operation is completed, thus enabling a very efficient multi-tasking environment. This is especially useful in order to simplify programming and to hide the cycle cost of control oriented code with vector operations. However, a task scheduler must be used to coordinate the usage of the vector execution units.
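The context switch behaviour described above can be sketched as a small cycle-level model in Python. This is a simplified illustration, not the processor's actual control logic: the class names, the context labels, the one-cycle cost assumed for issuing and for switching, and the 20-cycle scalar workload are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class VectorUnit:
    name: str
    busy_until: int = 0  # first cycle at which the unit is free again


@dataclass
class Controller:
    cycle: int = 0
    context: str = "GSM"
    log: list = field(default_factory=list)

    def issue_vector(self, unit: VectorUnit, duration: int) -> None:
        # A unit can only accept a new vector operation when it is idle.
        if self.cycle < unit.busy_until:
            raise RuntimeError(f"{unit.name} is still busy")
        unit.busy_until = self.cycle + duration
        self.log.append((self.cycle, self.context, f"issue to {unit.name}"))
        self.cycle += 1  # assumed: issuing takes one instruction slot

    def switch_context(self, new_context: str) -> None:
        # Only controller state changes; running vector units are untouched.
        self.log.append((self.cycle, self.context, f"switch to {new_context}"))
        self.context = new_context
        self.cycle += 1  # assumed: single-cycle context switch

    def run_scalar(self, cycles: int) -> None:
        self.log.append((self.cycle, self.context, f"scalar code, {cycles} cycles"))
        self.cycle += cycles


cmac = VectorUnit("CMAC")
ctrl = Controller()
ctrl.issue_vector(cmac, 64)   # one context starts a 64-cycle vector operation
ctrl.switch_context("WLAN")   # the controller moves on in the next cycle
ctrl.run_scalar(20)           # the other context runs control code meanwhile
ctrl.switch_context("GSM")    # control returns to the original context
cmac_still_running = ctrl.cycle < cmac.busy_until
```

In this model the vector operation is still in flight when control returns, which shows how scalar work from another context can be hidden behind it.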
IV. TASK ANALYSIS
To aid profiling of GSM and WLAN reception baseband programs and to further facilitate scheduling in the processor, we introduce the task concept. This "task" should not be confused with the "task" in SIMT, which refers to a vector operation. Besides being used in scheduling, dividing a program into a chain of tasks helps identify data dependencies and the maximum tolerable latency of different operations during the profiling stage. In this paper we define a task to be an atomic operation that is executed on one execution unit. Hence a program is considered to be a sequence of atomic tasks.
A. Task classification
To assist profiling and scheduling, we define the concept of task-groups, i.e. sets of tasks with a common deadline.
Some task-groups typically associated with the radio configuration will have very short (2-5 µs) deadlines. Common to those tasks is that they need to be performed before any data can be stored in memory. Automatic Gain Control (AGC) and similar tasks often fall in this group. These tasks often start a chain of further processing tasks. The processor must have enough resources to manage all possible such tasks within the maximum latency time for each of them. A task-group is further divided into atomic tasks.
B. Task latency
Since baseband processing is a hard real-time problem, hard deadlines exist where data must be completely processed. These limiting factors are imposed by the radio front-end and higher data layers:
* AGC/AFC: The processor must perform Automatic Gain Control (AGC) and Automatic Frequency Control (AFC) tasks before the data are stored in memory for further processing.
* Higher protocol layers: In most packet systems such as WLAN, the standard stipulates the maximum allowed time from the end of a received packet to the beginning of the transmitted response packet.
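The task and task-group concepts defined in this section map naturally onto a small data structure. The Python sketch below is one possible representation, not the one used in the paper's tools: the class and field names are hypothetical, the cycle counts and the 4 µs deadline are made-up illustrative values, and the deadline check conservatively assumes that the tasks of a group run back to back without overlap.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """An atomic operation executed on one execution unit."""
    name: str
    unit: str
    cycles: int


@dataclass
class TaskGroup:
    """A set of tasks sharing a common deadline (in microseconds)."""
    name: str
    deadline_us: float
    tasks: List[Task]

    def total_cycles(self) -> int:
        return sum(task.cycles for task in self.tasks)

    def meets_deadline(self, clock_mhz: float) -> bool:
        # cycles / MHz gives microseconds; no overlap between tasks assumed.
        return self.total_cycles() / clock_mhz <= self.deadline_us


# Hypothetical immediate task-group with illustrative cycle counts.
agc_group = TaskGroup(
    name="agc",
    deadline_us=4.0,
    tasks=[Task("energy_estimate", "CMAC", 64), Task("gain_update", "core", 40)],
)
fits_at_245 = agc_group.meets_deadline(245.0)
fits_at_20 = agc_group.meets_deadline(20.0)
```

A check of this kind has to hold for every task-group that can be active at once, since the processor must serve all such groups within their individual latency limits.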
V. PROFILING
Profiling of WLAN and GSM programs visualizes resource usage and highlights task dependencies. During profiling, the programs are divided into chains of tasks that execute on the initially assumed hardware. Profiling results are then used to enhance and dimension the initially used hardware architecture accordingly. In Table I [5] . The processor core is also assumed to sustain one instruction per clock cycle. Due to the SIMT architecture this is true for most operations executing on the architecture.
B. Task analysis
Algorithm selection and analysis of WLAN and GSM reception yield the task division presented in Table II. The cycle count cited is the number of cycles the corresponding kernel operation will consume on the hardware.
Immediate tasks need to be performed before any data can be stored in memory. Dependency analysis gives the processing latency requirements presented in Table III. The table also
VI. SCHEDULING
Scheduling can be performed in many ways; most traditional scheduling principles are discussed in [6]. However, in this paper we have used a lightweight scheduler due to the low scheduling complexity and extreme performance requirements. The scheduler described below is only intended to perform basic scheduling to illustrate hardware performance.
As the baseband processing tasks consume about 40-100 cycles, the scheduling and task switch operations must be performed with a minimal cycle count so as not to dominate the processor load.
To efficiently support scheduling, special context switch instructions are inserted between tasks in the software at compile-time. This task partition could be performed automatically or by the programmer. Upon execution of such an instruction the processor performs a context switch to the supervisory code which then performs another context switch to issue the next task.
A. Scheduling principle
The scheduling principle used in this investigation is based on knowledge of the period in which GSM and WLAN tasks are initiated. From the GSM standard we know that we might receive a data burst every 576 µs. WLAN bursts can, however, be received more often; in the worst case, WLAN activity can be back-to-back on the radio channel. With this in mind and the latencies from Table III, we conclude that all WLAN tasks will have precedence over GSM tasks (shortest-period-first scheduling). This requires the processor to be able to serve the largest GSM task within the respective latency time.
The processor will execute the scheduler between each task in the GSM task flow. If a WLAN packet is detected, the task scheduler will start to execute the WLAN flow.
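The scheduling principle above can be illustrated with a short Python model. It is a sketch under stated assumptions rather than the scheduler used in this work: tasks are atomic, the scheduler only runs between GSM tasks, a detected WLAN packet triggers the complete WLAN flow before the next GSM task, and every task start is charged a fixed switch cost. The function name, the task names and all cycle counts in the example are hypothetical.

```python
from collections import deque


def run_schedule(gsm_tasks, wlan_flow, wlan_arrivals, switch_cost=12):
    """Simulate WLAN-before-GSM scheduling at task boundaries.

    gsm_tasks, wlan_flow: lists of (name, cycles).
    wlan_arrivals: cycle numbers at which a WLAN packet is detected.
    Returns (trace, end_cycle), where trace holds (start_cycle, name).
    Arrivals after the last GSM task has started are not served here.
    """
    cycle = 0
    trace = []
    pending = deque(sorted(wlan_arrivals))
    for name, cycles in gsm_tasks:
        # Scheduler invocation between GSM tasks: WLAN has precedence.
        while pending and pending[0] <= cycle:
            pending.popleft()
            for wlan_name, wlan_cycles in wlan_flow:
                cycle += switch_cost
                trace.append((cycle, wlan_name))
                cycle += wlan_cycles
        cycle += switch_cost
        trace.append((cycle, name))
        cycle += cycles
    return trace, cycle


# Illustrative numbers only: a WLAN packet arrives during the first GSM task.
trace, end_cycle = run_schedule(
    gsm_tasks=[("g1", 100), ("g2", 50)],
    wlan_flow=[("w1", 40)],
    wlan_arrivals=[30],
)
```

Because the packet arriving at cycle 30 has to wait for the running GSM task to finish, the longest GSM task bounds the extra latency seen by WLAN, which is why that task must fit within the WLAN latency requirement.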
VII. RESULTS
From Table II and Table III we can conclude that the maximum memory usage occurs when a GSM processing burst is interrupted by a WLAN packet. This requires a total of 1359 words of complex memory (671 words from GSM, 208 words from WLAN and 480 constant words; stacks etc. in the controller are not included).
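The memory figure is simply the sum of the three contributions listed above, as the following Python check shows. The dictionary keys are labels chosen for this example.

```python
# Complex data memory contributions in words, as quoted in the text.
memory_words = {"gsm": 671, "wlan": 208, "constants": 480}

# Worst case: a GSM burst is interrupted by a WLAN packet, so all
# three regions must be resident at the same time.
total_words = sum(memory_words.values())
```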
A. Resource usage
In Table IV the resource usage for different task groups is presented.
B. Peak load
The peak load of the processor will occur if a long WLAN packet is received just after the start of the synchronization task in GSM. Then the processor needs to simultaneously support 47 MIPS (GSM) along with 196 MIPS (WLAN) during peak conditions, as illustrated in Figure 3. The worst case load originates from the overall computing requirements over a GSM slot with full WLAN traffic. The case where WLAN tasks 1-2 (see Tables II and III) occur precisely after a subtask in GSM task 5 is issued will only require 56.25 MIPS (224 + 48 + 88 cycles in 6.4 µs).
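The load figures in this subsection can be reproduced with a few lines of Python. The sketch assumes one instruction per clock cycle, so that cycles per microsecond equal MIPS; the helper name `mips` is chosen for this example.

```python
def mips(cycles: int, window_us: float) -> float:
    """Required instruction rate, assuming one instruction per cycle.

    Cycles per microsecond equal millions of instructions per second.
    """
    return cycles / window_us


# Peak load: GSM and WLAN requirements add up.
peak_mips = 47 + 196

# Local case: WLAN tasks 1-2 arriving right after a GSM task 5 subtask.
local_mips = mips(224 + 48 + 88, 6.4)
```

The local case needs far less than the peak figure, so the sustained load over a GSM slot, not this short burst, sets the clock frequency.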
To reduce the power consumption, clock gating support is essential. As shown in Figure 3, not all computing capacity is needed at all times. Turning off the clock of individual execution units, accelerators and memories can thus save power.
C. Scheduling overhead
We have deliberately chosen a lightweight scheduling scheme in this paper to keep scheduling overhead low, at the cost of slightly over-designed hardware in order to maintain a guaranteed performance. By enhancing the core with single-cycle context switch instructions and mechanisms for resource allocation, scheduling overhead can be kept to a minimum. Under full load in both GSM and WLAN there will be at most 126 task switches per GSM slot (which is the largest scheduling period).
According to the peak load calculated in Section VII-B, the peak load of the processor is 243 MIPS. Each task switch will consume at most 12 cycles (context switching, network setup, etc.). This adds a load of at most 1512 cycles per 576 µs, which corresponds to 2.6 additional MIPS.
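The overhead estimate can be checked with the short Python calculation below, which uses only the figures quoted in this subsection. The variable names are chosen for this example, and one instruction per clock cycle is assumed when converting MIPS to a clock frequency.

```python
task_switches = 126      # maximum task switches per GSM slot
cycles_per_switch = 12   # maximum cost of one task switch
slot_us = 576            # GSM slot length in microseconds
peak_mips = 243          # peak processing load from Section VII-B

overhead_cycles = task_switches * cycles_per_switch  # 1512 cycles
overhead_mips = overhead_cycles / slot_us            # 2.625 MIPS
overhead_fraction = overhead_mips / peak_mips        # about 1.1 %
required_mhz = peak_mips + overhead_mips             # about 245.6 MHz
```

The result is roughly 1.1% of the peak load and brings the total to about 245.6 MHz, consistent with the approximately 245 MHz figure quoted for simultaneous GSM and WLAN reception.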
VIII. CONCLUSION
It is possible to manage both GSM and WLAN reception on the presented architecture at only 245 MHz using 1359 words of complex data memory. The scheduler and task switch mechanism only adds 1.1% cycle overhead. This extra overhead is compensated by the hardware reuse. This makes SIMT based baseband processors well suited for simultaneous multi-standard processing in mobile terminals.
