Terrestrial Digital Video Broadcasting (DVB-T) is currently being introduced in many European countries and is planned to supplement or replace current analogue broadcasting schemes in a large part of the world. It is also considered as an additional down-link medium for third generation UMTS mobile phones, where a special variant, DVB-H, is under development. Current DVB-T receivers are still built upon dedicated application-specific integrated circuits (ASICs). However, designing ASICs is a tedious and expensive task. We show that it is possible to implement a DVB-T receiver in software on an application-specific digital signal processor (AS-DSP). We analyze the computational requirements of a DVB-T receiver and investigate its potential for parallelization. Further, we present our AS-DSP, the M5-DSP, which is based on a novel architecture and design methodology, and report on implementing the core algorithms of a DVB-T receiver on it.
INTRODUCTION
Terrestrial Digital Video Broadcasting (DVB-T) [1] is currently being introduced in many European countries and is planned to supplement or replace current analogue broadcasting schemes in a large part of the world [2]. It is operational in the whole U.K. and has replaced analogue broadcasting completely in the area of Berlin in Germany. It is also considered as an additional down-link medium for third generation UMTS mobile phones [3] in case multiple users request the same data, e.g. video streams from a sports event or news. For handheld devices a new standard, DVB-H, is under development [4]. DVB-H is a time-sliced version of DVB-T in which data are not transmitted at all times. Hence, data rates and power consumption are reduced.
Most DVB-T receivers are still stationary devices, but the first mobile phone featuring a DVB-T receiver (Samsung SGH-P700) has just been introduced. All these devices are built upon application-specific integrated circuits (ASICs), which require large design, verification, and manufacturing efforts and expenses. Furthermore, they cannot easily incorporate the design changes that are frequently required in evolving standards such as DVB-H. We suggest that application-specific digital signal processors (AS-DSPs) could be a suitable means to implement flexible software solutions for various transmission standards [5, 6]. In this paper, we demonstrate how a computationally demanding receiver for DVB-T can be implemented on the M5-DSP, an AS-DSP which we designed for a DVB-T receiver. The M5-DSP was designed using a novel design methodology which allows for automatically generating the DSP cores, as presented in [7]. A scaled-down version of the M5-DSP featuring fewer data paths could also be used as a receiver for DVB-H.
T. Dräger is now with Signalion GmbH, Dresden, Germany. T. Richter is now with ADIT GmbH, Hildesheim, Germany. This work was sponsored in part by Deutsche Forschungsgemeinschaft (DFG) within SFB358-A6.
In section 2 we introduce DVB-T and analyze its computational requirements and its potential for parallelization. In section 3 we describe the architecture of the M5-DSP, and in section 4 we present implementation results.
TERRESTRIAL DIGITAL VIDEO BROADCASTING
DVB-T is an OFDM-based broadcasting system which allows for employing a single-frequency-network (SFN). Also, multiple video and audio streams can be transmitted over one frequency channel. Both techniques help save frequency resources over analogue broadcasting. The block diagram of a DVB-T transmission system is shown in figure 1 with the transmitter on top and the receiver below. We will focus on the receiver, since only very few transmitters have to be implemented for a broadcasting scheme. Some parameters of DVB-T are summarized in table 1. So far, only ASIC solutions for DVB-T receiver chips are available, e.g. [8, 9] . Also, a professional measurement receiver employing multiple DSPs, FPGAs, and ASICs has been presented [10] . 
Receiver Structure
The receiver consists of the algorithms depicted in the bottom part of figure 1. The computationally most demanding algorithms are marked in grey. They comprise the OFDM demodulation, Viterbi decoding, and Reed-Solomon decoding.
• The OFDM demodulation consists mainly of an FFT over 2k or 8k points, depending on the number of carriers [11] . The critical arithmetic operation in an FFT is the multiply-accumulate (MAC) operation.
• The Viterbi decoder requires decoding a 64-state convolutional code at a net data rate of 3 to 34 Mbit/s. The critical operation in the Viterbi decoder is the add-compare-select (ACS) operation for calculating the state metrics.
• A Reed-Solomon decoder is used as a second stage of error correction. It requires Galois-field arithmetic over GF(2^8). The critical operations are the GF-multiply operations.
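The three critical operations named in the list above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the M5-DSP implementation: the function names are hypothetical, and the field polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1) is an assumption that is not stated in this text.

```python
def butterfly(a, b, w):
    """Radix-2 FFT butterfly on complex values.
    The complex product b * w is where the multiply-accumulate work arises."""
    t = b * w
    return a + t, a - t


def acs(metric0, branch0, metric1, branch1):
    """Add-compare-select for one trellis state.
    Adds the branch metrics to both predecessor state metrics, keeps the
    smaller sum, and returns it with the index of the surviving predecessor."""
    cand0 = metric0 + branch0
    cand1 = metric1 + branch1
    if cand0 <= cand1:
        return cand0, 0
    return cand1, 1


def gf256_mul(x, y, poly=0x11D):
    """Shift-and-add multiplication in GF(2^8).
    poly=0x11D is an assumed field polynomial (x^8 + x^4 + x^3 + x^2 + 1)."""
    result = 0
    while y:
        if y & 1:
            result ^= x      # addition in GF(2^8) is XOR
        x <<= 1
        if x & 0x100:
            x ^= poly        # reduce modulo the field polynomial
        y >>= 1
    return result
```

Each function corresponds to one inner-loop step: the FFT repeats `butterfly` over all carrier pairs, the Viterbi decoder repeats `acs` for each of the 64 states per decoded bit, and the Reed-Solomon decoder repeats `gf256_mul` during syndrome and error calculation.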
Since the computational requirements of these algorithms will have a strong influence on the architecture of an AS-DSP, we analyze them in the following section.
Computational Requirements
The operation counts per second for the critical operations of the respective algorithms explained in the previous section are summarized in table 2. It can be seen that the OFDM demodulation and the Viterbi decoder have such high computational requirements that they exceed the computational power of most general-purpose DSPs. The parameters for table 2 were chosen for the operation mode which is currently employed in Germany. The requirements for the Viterbi and RS decoders can be even higher for higher-order modulation schemes and data rates.
Since the clock rate of an AS-DSP which performs all operations of the receiver algorithms serially would be excessive, we analyze the potential for parallelization to find out whether it can be used to achieve lower clock rates.
Parallelization
Performing computations in parallel allows for lower clock rates. However, computing algorithms in parallel is only possible if the algorithms bear no inherent data dependencies which limit parallel computation.

Of the three main algorithms, the FFT [12] and the Viterbi decoder [13] can be performed in parallel, while only the syndrome calculation of the Reed-Solomon decoder can be parallelized. The derandomizer can also be computed in parallel but requires very little computational power. For an assumed number of data paths P = 16 we estimated the number of operations per second as presented in table 3. This number of data paths was chosen since it results in a target clock rate for the processor of about 200 MHz. This is a reasonable clock rate for a standard-cell design with a simple three-stage pipeline, as will be elaborated later on.
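The relation between the number of data paths and the required clock rate can be sketched as follows. The model is a simplification that assumes each data path completes one operation per cycle, that parallelizable work divides evenly over the P data paths, and that the remaining work runs serially. The function name and the workload figures in the usage note are placeholders, not values taken from table 2 or table 3.

```python
def required_clock_hz(parallel_ops_per_s, serial_ops_per_s, num_paths):
    """Estimate the clock rate needed to sustain a given workload.

    parallel_ops_per_s: operations per second that can be spread over all paths
    serial_ops_per_s:   operations per second that must run on a single path
    num_paths:          number of data paths P
    Assumes one operation per data path per cycle (a simplification).
    """
    if num_paths < 1:
        raise ValueError("num_paths must be at least 1")
    return serial_ops_per_s + parallel_ops_per_s / num_paths
```

For example, a hypothetical workload of 3.0e9 parallelizable and 12.5e6 serial operations per second gives `required_clock_hz(3.0e9, 12.5e6, 16)` = 200e6, i.e. 200 MHz for P = 16, whereas the same workload on a single data path would require more than 3 GHz.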
THE M5-DSP
The M5-DSP was designed following a platform-based hardware-software co-design methodology introduced in [7]. The platform, depicted in figure 2, consists of a fixed control processing part and a scalable signal processing part where the functionality of the data paths can be tailored to suit the application.
Architecture
The control processing part consists of the program control unit (PCU) which performs operations like jumps, branches, and loops. It features a zero-overhead loop mechanism supporting two nested loops. Two address generation units (AGUs) are available. They serve two purposes: When the processor performs parallel computations on the signal processing part, they generate addresses for the dual-port data memory. When the processor executes serial control code, one AGU still performs address calculations while the other AGU performs microcontroller tasks.
The signal processing part consists of P slices where each slice comprises data memory, a register file, and a data path including an arithmetic-logic-unit (ALU), multiplier (MUL), and shifter (BS). An interconnectivity unit (ICU) connects the slices with each other and the control part of the processor.
All slices are controlled using the single-instruction multiple-data (SIMD) paradigm. This allows for efficiently controlling a large number of slices since only very little control overhead is required.
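The SIMD control scheme can be illustrated with a toy model in which one instruction is broadcast to all slices and each slice applies it to its own local memory. The class, the function, and the three operation names are hypothetical and serve only to show why control cost does not grow with the number of slices.

```python
class Slice:
    """One signal processing slice with private memory and one accumulator."""

    def __init__(self, memory):
        self.memory = list(memory)
        self.acc = 0


def broadcast(slices, op, address):
    """Issue a single instruction to every slice (SIMD).
    All slices execute the same operation at the same address,
    but each one works on its own data."""
    for s in slices:
        if op == "load":
            s.acc = s.memory[address]
        elif op == "add":
            s.acc += s.memory[address]
        elif op == "store":
            s.memory[address] = s.acc
        else:
            raise ValueError("unknown operation: " + op)
```

The instruction is decoded once, regardless of how many slices there are, so adding slices increases throughput without adding instruction fetch or decode logic.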
The M5-DSP features a simple three-stage pipeline with stages fetch, decode, and execute. No stages for read and write-back to and from the register file are required due to the special architecture of the data paths which is outlined in the following section.
Data Paths
The schematic of one of the data paths of the M5-DSP is depicted in figure 3 . It consists of a multiplexer network, marked light grey at the top, the functional units (FUs), marked medium grey, and dedicated output accumulators, marked in dark grey, at the bottom.
Each FU writes its output into its dedicated output accumulator located below it. This allows the FUs to be operated fully orthogonally, since no shared registers or busses pose structural hazards for parallel operation. As input operands, each FU can select the output accumulators of other FUs via its input multiplexers. If each FU can access every other FU's output, the resulting multiplexer network can become quite large. However, the connectivity can be reduced to those connections which are required by the application code. We first wrote our application code assuming full connectivity. As a second step we profiled the application code and removed all connections which were not used. The resulting connectivity can be seen in figure 3. Only connections where an arrow meets the input multiplexer are available. We also created a compiler-based tool to extract the required connections directly from the application in C [14]. This multiplexer network resembles the bypass network of superscalar processors. However, it is controlled fully in software and thus needs no control hardware, which consumes much area and power in superscalar processors.
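The profiling step described above amounts to collecting every source-to-destination pair that the application code actually uses and discarding the rest of the full multiplexer network. A minimal sketch follows, assuming a hypothetical program representation in which each VLIW instruction is a list of (destination FU, list of source FUs) entries. This is an illustration, not the tool from [14].

```python
# Functional units of one data path, as named in the text.
FUS = ("ALU", "MUL", "BS", "LST", "REG", "ICU")


def full_connectivity(fus=FUS):
    """All (source, destination) pairs of a fully connected multiplexer network."""
    return {(src, dst) for src in fus for dst in fus}


def used_connections(program):
    """Collect the (source, destination) pairs referenced by the program.
    program: iterable of instructions, each a list of (dst_fu, [src_fus])."""
    used = set()
    for instruction in program:
        for dst, sources in instruction:
            for src in sources:
                used.add((src, dst))
    return used


def prune(program, fus=FUS):
    """Return (kept, removed) connection sets after profiling the program."""
    full = full_connectivity(fus)
    used = used_connections(program)
    return full & used, full - used
```

With six FUs the full network has 36 connections. A program that only feeds LST into MUL, MUL and ALU into ALU, and ALU into BS keeps 4 of them and removes 32.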
Besides the usual FUs (ALU, MUL, and BS), we also included a load-store unit (LST), a unit for register file accesses (REG), and an interconnectivity unit (ICU) in the data path. Including the LST unit in the data path allows load operations to bypass the register file. This allows for a smaller register file with fewer ports compared to the data paths of RISC processors. The REG unit acts similarly to the LST unit but provides three independent accesses to the register file per cycle.
The ALU features special instructions for accelerating the calculation of FFTs and Viterbi decoders, which require a second output accumulator. These features were described in [13] and help reduce the instruction count of FFT and Viterbi-decoder below the instruction counts in table 3.
With these features the number of instructions for one ACS operation can be reduced from 6 down to 2. Hence, the operations count for Viterbi trellis calculation is reduced from 180 down to 60. Also, the multiplier supports Galois-field arithmetic as in [15] for the Reed-Solomon decoder.
Instruction Set Architecture (ISA)
Our M5-DSP features a very long instruction word (VLIW) instruction set architecture (ISA) which allows for controlling each FU of the data path in each cycle in parallel. The program control and address generation unit of the control part of the processor can be controlled by functional instruction words (FIWs) within the VLIW as well. The size of one VLIW instruction is 170 bits.
Memory
The M5-DSP features a Harvard-style memory architecture with separate data and program memory. The data memory is organized in slices. The required size of the data memory depends on the target application. For DVB-T, the trellis data of the Viterbi decoder require the largest intermediate storage with about 384 kbit. Since state and transition metrics and incoming data also need to be stored, we chose a data memory size of 1 Mbit. To achieve a high data throughput for the FFT we employ dual-port memory, allowing two independent read or write accesses in each cycle.
The program memory needs to store about 1k lines of code, as can be seen from the implementation results in section 4. Each line consists of one VLIW instruction of 170 bits. Hence, the size of the single-port program memory of the M5-DSP is 180 kbit. For implementing more control code of the DVB-T receiver, the program memory will have to be extended to about 2k lines.
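The program memory size follows directly from the instruction count and the instruction width. The short sketch below reproduces this arithmetic, assuming that "1k" and "2k" denote 1024 and 2048 instructions. The function name is hypothetical.

```python
def program_memory_bits(num_instructions, word_bits=170):
    """Program memory size in bits for a given number of VLIW instructions.
    word_bits=170 is the VLIW instruction width stated in the text."""
    return num_instructions * word_bits
```

Under this assumption, 1024 instructions occupy 174,080 bits, which is on the order of the 180 kbit quoted above, and extending the memory to 2048 instructions doubles this to 348,160 bits.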
IMPLEMENTATION RESULTS
We implemented both hardware and software for the DVB-T receiver algorithms.
Hardware
For the hardware implementation we created a VHDL model of the processor using our gencore tools as in [16] . The VHDL description was synthesized for a standard-cell library by Virtual Silicon TM for the 130nm 8-layer-metal UMC process using Synopsys Design Compiler TM . For place and route and back-annotation we used Cadence SoC Encounter TM . For ALUs and Multipliers we used Synopsys DesignWare TM components. For memories, SRAM macros from the Virtual Silicon TM library for the UMC 130nm process are used.
The resulting layout can be seen in figure 4. The data memories are located close to the data paths of their respective slices. The ICU, located in between the data paths of all slices, connects the data paths. The control processing part (PCU, AGU, and decoder) is hardly visible due to its small size. This confirms the efficiency of the SIMD paradigm.
Clock Rate: Our M5-DSP achieves a clock rate of 250 MHz, which exceeds our initial assumption of 200 MHz. The critical path is in the ALU.
Table 4. Die Size of the M5-DSP
Power Consumption: We estimated the power consumption using Synopsys Power Compiler to be about 300 mW. Again, a large portion of this power is consumed in the memories. This compares favorably to the power consumption of an ASIC like the LSI Logic L64782 [9], which consumes about 800 mW but also includes parts of the analog front-end. However, the ASIC does not provide any flexibility to accommodate changes in evolving standards as our M5-DSP does. Also, we have not implemented any clock gating yet, and the standard memory macros could still be replaced by lower-power memories like [17]. This leaves potential for future reductions of the power consumption, which will be exploited in a chip that will be taped out by the end of 2004.
Software
For the software implementation we created software development tools (assembler, linker, simulator, and debugger) using the EDGE TM toolsuite by LISAtek/CoWare and wrote the application software in assembly language. Work on a compiler to speed up this tedious task is under way. This section summarizes the performance data of our implemented algorithms.
Algorithm            Cycles
FFT 2k/8k            4675/22283
Viterbi Trellis      120768
Viterbi Traceback    11427
RS-Decoder           3700
Table 5. Cycle Counts of Receiver Algorithms
Table 5 shows the cycle counts for computing one FFT symbol, the Viterbi decoder for 16 blocks of 204 bytes, and the Reed-Solomon decoder for one block of 204 bytes. Please note that the application code for the Reed-Solomon decoder is not fully optimized yet. It should be possible to get the cycle count down to about 2000 cycles.
These cycle counts yield a workload for our processor of about 150 MIPS. Considering the clock rate of 250 MHz and the RS decoder that can still be improved, this leaves enough computational resources for implementing channel estimation and equalization, interleavers, and control code. The required lines of code, and hence program memory, for the respective receiver algorithms are shown in table 6.
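The step from per-block cycle counts to a workload per second can be sketched as follows. The cycle counts are those of table 5. The symbol period and the packet rate depend on the transmission mode and are left as parameters, since they are not given in this text, and the function name is hypothetical. The sketch also assumes one VLIW instruction per cycle, so it illustrates the method rather than reproducing the 150 MIPS figure.

```python
# Cycle counts from table 5.
CYCLES = {
    "fft_2k": 4675,
    "fft_8k": 22283,
    "viterbi_trellis": 120768,    # per 16 blocks of 204 bytes
    "viterbi_traceback": 11427,   # per 16 blocks of 204 bytes
    "rs_decoder": 3700,           # per block of 204 bytes
}


def workload_cycles_per_s(symbol_period_s, packets_per_s,
                          fft="fft_8k", viterbi_blocks=16):
    """Estimate processor cycles per second for the three core algorithms.

    symbol_period_s: duration of one OFDM symbol including guard interval
                     (assumed parameter, mode dependent)
    packets_per_s:   number of 204-byte blocks decoded per second
                     (assumed parameter, mode dependent)
    """
    fft_load = CYCLES[fft] / symbol_period_s
    viterbi_per_block = (CYCLES["viterbi_trellis"]
                         + CYCLES["viterbi_traceback"]) / viterbi_blocks
    viterbi_load = viterbi_per_block * packets_per_s
    rs_load = CYCLES["rs_decoder"] * packets_per_s
    return fft_load + viterbi_load + rs_load
```

Dividing the result by the clock rate of 250 MHz gives the fraction of processor time consumed, and the remainder is what is available for channel estimation, equalization, interleaving, and control code.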
CONCLUSIONS
We presented the implementation of a receiver for DVB-T on an application-specific DSP which we designed for this application. The DSP is capable of performing the computationally intensive algorithms in software by means of parallel execution and specially tailored data paths for Viterbi decoding, FFT, and Galois-field multiplication.
The implementation results show that the costs of the DSP are on par with commercially available ASICs but our M5-DSP also provides flexibility for accommodating future upgrades or changes of the standard by requiring only rather simple software changes instead of redesigns of an ASIC. A scaled-down version of the M5-DSP could also be used for implementing a receiver for the upcoming DVB-H standard. A silicon implementation is under way and will be taped out by the end of 2004.
ACKNOWLEDGMENTS
We would like to acknowledge the help of many students, in particular Rene Habendorf, Rene Beckert, Karsten Todtermuschke and Carsten Köckritz who wrote the assembly code for the receiver algorithms and Thomas Schuster who helped with the VHDL coding of the data paths.
