In this paper, we introduce a new verification platform with an ARM- and DSP-based multiprocessor architecture. Its simple communication interface with a crossbar switch architecture is suitable for a heterogeneous multiprocessor platform. The platform is used to verify the function and performance of a DVB-T baseband receiver using hardware and software partitioning techniques with a seamless hardware/software co-verification tool. We first present a dual-processor platform with an ARM926 and a Teak DSP, but it cannot satisfy the ETSI DVB-T standard specification EN 300 744. Therefore, we propose a new multiprocessor strategy with an ARM926 and three Teak DSPs synchronized at 166 MHz to satisfy the required specification of DVB-T.
I. Introduction
The current multimedia multitasking environment with video and audio processing requires complex signal processing in heterogeneous environments. Existing standards are frequently extended and new standards often appear. Therefore, to accommodate design changes flexibly, system on chip (SoC) implementation based on digital signal processors (DSPs) or general purpose processors (GPPs) is becoming more important. Because the number of operations and the amount of data that must be processed to satisfy the performance requirements of digital signal processing are gradually increasing, heterogeneous multiprocessors are increasingly in demand [1]. An efficient network is needed to reduce the communication overhead among IPs in an SoC [2], [3]-[6].
A digital TV baseband receiver [7] is a complex digital signal processing system that requires high-speed computation. Digital TV broadcasting specifications include DVB-T, ATSC, and cable TV (OpenCable). These standards are often extended to upgrade performance; therefore, the ability to change the design in software is needed. Recently, MorphoSys developed a single-chip software solution for DVB-T baseband receivers [8], and Perugia University introduced a software-based DVB-T implementation comprising three DSPs (TMS320), a microcontroller, a field-programmable gate array (FPGA), and an application-specific integrated circuit (ASIC) [9].
The platform proposed in this paper offers several advantages over previous solutions.
The proposed flexible and scalable software-hardware codesigned platform incorporates the advantage of the ARM926 platform (as a general-purpose processor) and multiple Teak DSP platforms (as co-processors), along with a communication interface (CI) module to meet the standard specification of EN 300 744 of DVB-T ETSI. We first profile the performance of each functional block of the DVB-T baseband receiver, and based on the profile analysis, we perform a hardware-software partition and map each functional block onto the proposed multiprocessor platform.
Our proposed CI with a crossbar switch forms a simple and regular structure with multiple channels to provide concurrent access to the multi-master processor platform. It is also reconfigurable and facilitates scalability and flexibility for future extension of applications. The proposed platform can verify the function and performance of DVB-T baseband receivers by using the Mentor Graphics Seamless co-verification environment (CVE), a hardware/software co-verification tool.
The remainder of this paper is organized as follows. Section II describes the hardware/software co-design environment of the dual-processor platform of the DVB-T baseband receiver. Section III gives an overview of the DVB-T baseband receiver and presents our multiprocessor platform with hardware/software co-design methodologies. Section IV presents a performance evaluation of our DVB-T receiver implementation. Finally, Section V presents our conclusions.
II. Hardware/Software Co-design Environment of Multiprocessor Platform of DVB-T Baseband Receiver
Recently, multiprocessor platforms have become more popular for high-end systems, such as digital TV, xDSL, game applications, and medical signal processing. A multiprocessor platform is defined as a system consisting of at least two processors sharing a memory, a hardware co-processor, and IPs, to provide four features: modularity, programmability, scalability, and reusability. Modularity and flexibility enable user-oriented design environments for the software-hardware partitioning structure and thus reduce the overall design complexity. Scalability and reusability enable additional functions to be added incrementally without changing the entire architecture. Our proposed platform and CI satisfy the above properties. The overall structure of our dual-processor platform is shown in Fig. 1.
ARM926 Platform
The ARM926 platform comprises a processor module, a memory module, and peripheral modules. Each module is designed using VHDL. The processor module comprises an ARM926 processor, an arbiter, an AMBA bus, and a decoder. The memory module comprises an SRAM, a dual-port RAM (DPRAM), and a memory controller. The peripheral module comprises a direct memory access (DMA) controller (4 kB/s), an APB bridge, a timer, and an interrupt controller. To verify the platform as shown in Fig. 2, we used Seamless CVE and a Linux operating system. We wrote a platform test program and compiled it with the operating system and the application program using an ARM compiler to generate an image file to be simulated.
Teak DSP Platform
As shown in Fig. 3, a Teak DSP is a 16-bit fixed-point DSP core with a dual-MAC structure. To communicate with the ARM926 platform, the Teak DSP platform needs two interfaces: the DPRAM and the CI. The DPRAM interface is used for direct data exchange and as a mailbox, and the CI is used as a channel between the IPs and the shared memory. Our platform uses the Z-space to realize the bus interface unit (BIU) for data exchange. While the X- and Y-space data memory interface unit (MIU) manages internal memory accesses, the BIU manages the communication with external blocks, such as the DPRAM and the CI. In this way, while the DSP core executes an operation on the X- and Y-RAMs, the DMA can transfer data needed by external devices. This can be used effectively for DVB-T baseband data transmission. In this platform, the P-RAM is the program memory, and the GPIO is used to generate protocol signals for external interfaces, such as the CI. The wait-state generator synchronizes external signals through the Z-space to absorb the different access times between the Teak DSP and the CI and between the ARM platform and the CI.
The Teak DSP platform is also verified using Seamless CVE, but unlike the ARM926 platform, the Teak DSP platform does not use an operating system; it runs only a start-up code. Hardware parts written in VHDL are simulated using ModelSim, while software parts written in C are verified using a Teak debugger. The start-up code includes an interrupt handler, a null handler, a stack, a heap, and virtual registers.
Dual-Processor Platform with Communication Interface
As the density of IPs in SoCs increases, the importance of efficient communication is growing. Many studies have been carried out with the goal of creating an efficient on-chip communication network. Communication topologies can be classified into two types: bus and network structures. An example of a bus-based topology is ARM's advanced microcontroller bus architecture (AMBA), which has been expanded to a single-layer advanced high-performance bus (AHB), a multi-layer AHB, and the AMBA advanced extensible interface (AXI). Currently, the multiprocessor platform with AMBA AHB is widely used with the ARM processor. However, the AMBA platform connects resources by using a commonly shared bus, which leads to a bottleneck problem. Network-based designs include Sonics' µNetwork, OpenCores' Wishbone, and NoCs [10]. These designs increase flexibility by extending existing communication structures with predefined, parameterized components.
Our platform includes a CI, which connects the main processor (ARM926), the co-processor (Teak DSP), the shared memory, and the functional block IPs. The CI is implemented with a crossbar switch architecture, which enables the effective co-processing of heterogeneous components, such as the AMBA bus of the ARM926, the BIU of the Teak DSP, the shared memory, and the peripheral IPs. Using this crossbar switch structure, each master (the ARM platform and the Teak DSP platform) can access a slave (the shared memory and the IPs) through an independent bus channel, provided that any given slave is not occupied by more than one master. In this way, we can resolve the bottleneck problem of conventional shared bus architectures. Figure 4 depicts a multiprocessor platform with a CI based on a crossbar switch cell structure (see Fig. 5). In Fig. 4, ARM denotes an ARM platform comprising an ARM926, a DMA, an arbiter and decoder, memory, and an AMBA AHB bus.
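To make the difference from a shared bus concrete, the following minimal Python sketch counts how many masters can be served in one cycle. It is a behavioral illustration only, not the VHDL implementation of the CI; the function names, the integer master identifiers, and the lowest-identifier tie-break are assumptions made for the example.

```python
def crossbar_grants(requests):
    """Return the set of masters served in one cycle by a crossbar.

    requests maps a master id to the slave it wants to reach.
    Each slave accepts exactly one master; masters that target
    different slaves are served concurrently.
    The lowest-id tie-break is a placeholder, not the CI policy.
    """
    owner = {}
    for master in sorted(requests):
        # First master to claim a slave keeps it for this cycle.
        owner.setdefault(requests[master], master)
    return set(owner.values())


def shared_bus_grants(requests):
    """Return the set of masters served in one cycle by a single shared bus."""
    # Only one transfer can use the bus, whatever the target slave is.
    return {min(requests)} if requests else set()
```

With requests `{1: "mem1", 2: "mem2", 3: "mem1"}`, the crossbar model serves masters 1 and 2 in the same cycle, whereas the shared-bus model serves only master 1; master 3 waits in both cases because its slave is already occupied.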
The CI comprises a crossbar switch module and a CI controller. The CI controller sends a grant signal in response to a communication request. Because the platform should be able to adapt to various applications, the CI must be easy to reconfigure. The CI is therefore designed as a regular 2-dimensional array of crossbar switches to enable flexible extension for specific applications.
The basic function of the CI controller is as follows. The CI connects a particular slave with a master when there is a request from the master to the slave. The CI controller has an arbitration function, which reassigns the ownership of a slave when a master requests that slave, according to the user's pre-defined priority. If two masters simultaneously send access signals to a slave, the access order is determined by a round-robin priority policy. The CI controller spends 2 additional clock cycles establishing the communication channel between master and slave. The state machine of the CI controller (see Fig. 6) is described in the following subsections, using as an example a case with 4 masters and 6 slaves.
A. Idle State
In the initialization state, that is, when the system is powered on or after reset, the CI controller is idle. In the idle state, it continuously accepts request signals from masters (processors) to slaves (shared memory and IPs). In this state, master 1 sends the CI controller access request signals, such as M1_S1_req, M1_S2_req, and M1_S6_req.
B. Round-Robin State
In this state, if there is no request to slave 1 from other masters, it transits to the idle state. If another master's access request is in standby mode, access permission to slave 1 is given to the next master based on the round-robin priority method. Access by other masters is prevented because one master has priority.
C. M1_Grant State (Master 1 Grant)
In this state, master 1 is granted access to slave 1. A grant signal is sent to master 1, and a path is connected between master 1 and slave 1 by controlling the crossbar switch cell.
III. Multiprocessor Platform of DVB-T Receiver
In this section we present the DVB-T baseband receiver. We then discuss hardware/software partitioning and multiprocessor scheduling with regard to our multiprocessor platform.
Basics of DVB-T Baseband Receiver
In DVB-T, OFDM technology is used because it can overcome the inter-symbol interference caused by multi-path channels; however, when the interference on a particular sub-channel is severe, various error correction techniques can be used in addition to the FFT to maintain efficiency. A block diagram of the DVB-T baseband receiver is shown in Fig. 7. In this section, we briefly describe each functional block. (For more details, see [11] and [12].)
To accurately restore the OFDM signal in the DVB-T receiver system, frequency synchronization should be performed first. Coarse frequency synchronization is followed by fine frequency synchronization. Coarse frequency synchronization uses a Classen algorithm, which estimates the frequency offset by using phase differences of the pilot tones included in two OFDM symbols separated by one symbol period. For fine frequency synchronization, we use a Beek algorithm, which estimates the frequency offset from the phase difference of two signal intervals using "atan2" after accumulating the correlated values of the guard interval and the cyclic prefix. The symbol synchronization structure tolerates the multi-path fading channel with only approximate symbol synchronization, because the FFT window offset may fall anywhere within a certain range of the guard interval. The signal is transmitted with a guard interval inserted between symbols to prevent inter-symbol interference. The FFT window location restoration extracts the valid samples, eliminating the guard interval from the received sample stream. The algorithm used in symbol synchronization uses the absolute value of the signal power difference. The frequency offset compensator multiplies three factors: the estimated frequency offset, the sampling time, and the received symbol signal. The FFT transforms parallel signals into a serial signal to restore the desired symbol. The FFT has the highest complexity in the OFDM system and is usually implemented using a hard-wired ASIC. The equalizer compensates the phase and amplitude distortion of received signals caused by channel effects.
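The fine frequency synchronization step can be illustrated with a short, noise-free Python sketch of a guard-interval correlation estimator of the kind described above. It is not the fixed-point implementation used on the platform. The function names are hypothetical, the symbol timing is assumed to be already known (the sample list starts at the first guard-interval sample), and the offset is expressed in units of the subcarrier spacing and assumed to be smaller than 0.5 in magnitude.

```python
import cmath
import math
import random


def beek_fine_cfo(r, n_fft, n_guard):
    """Estimate the fractional frequency offset from the guard interval.

    Each guard sample r[n] is a copy of r[n + n_fft], rotated by the
    frequency offset. The correlation is accumulated over the guard
    interval and its phase is read with atan2.
    """
    gamma = sum(r[n] * r[n + n_fft].conjugate() for n in range(n_guard))
    return -math.atan2(gamma.imag, gamma.real) / (2.0 * math.pi)


def compensate_cfo(r, n_fft, eps):
    """Remove a frequency offset eps by counter-rotating every sample."""
    return [x * cmath.exp(-2j * math.pi * eps * n / n_fft)
            for n, x in enumerate(r)]


def make_test_symbol(n_fft, n_guard, eps, seed=0):
    """Build one symbol with a cyclic prefix and apply an offset eps.

    The random body stands in for a real OFDM symbol; only the
    cyclic-prefix structure matters for the estimator.
    """
    rng = random.Random(seed)
    body = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_fft)]
    tx = body[-n_guard:] + body
    return [x * cmath.exp(2j * math.pi * eps * n / n_fft)
            for n, x in enumerate(tx)]
```

For example, `beek_fine_cfo(make_test_symbol(2048, 256, 0.12), 2048, 256)` recovers the injected offset of 0.12 subcarrier spacings for the 2k mode with a guard interval of 1/8, and applying `compensate_cfo` with that estimate leaves a residual offset close to zero. The multiply-accumulate loop corresponds to the correlation and accumulation, and the single `atan2` call to the phase extraction.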
Hardware/Software Partitioning and Multiprocessor Scheduling
The DVB-T baseband receiver is usually implemented in hardware because of the complexity of computation and frequent memory access. There are tradeoffs between power consumption, processing speed, adaptability, and development cost. Hardware-driven design is superior in terms of processing power and power consumption, whereas software-driven design is superior in terms of the flexibility of application modification, which reduces development time. In this study, each functional block of the DVB-T baseband receiver is mapped onto the software or hardware of the multiprocessor platform by evaluating its performance. To satisfy the required specifications of the receiver, we partition the functional blocks as shown in Fig. 8. We check the execution time of each functional block and partition the blocks into hardware and software, taking hardware sharing into consideration. In this regard, we consider the performance, including the data transfer overhead, and assign the computation-intensive parts that can be shared to hardware.
Figure 9 shows a shared memory structure. Each processor platform has to receive a grant from the bus arbiter (that is, the CI controller) to have access permission to the shared memory. For example, let us consider the case in which master 1 has access permission and masters 2, 3, and 4 are in standby mode. When master 1's access terminates, master 2 receives access permission. If master 1 then requests permission to access slave 1 again, according to the round-robin scheme, the order of access for the standby masters is master 3, master 4, and master 1.
Fig. 9. A shared memory structure and hardware-software partitioning (blocks: shared memory, processors, DMAs, crossbar switch cell).
If there is no request from the masters to slave 1, the state transits to idle. In this work, only the round-robin priority policy is used to decide the priority order. However, various other priority decision methods exist, and the choice affects performance in a multiprocessor platform with a shared memory.
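The round-robin behavior described for a single slave can be reproduced with a small behavioral model. The Python sketch below is an illustration under stated assumptions, not the CI controller itself: the class and method names are hypothetical, masters are numbered from 1, and the 2-cycle channel setup time is not modeled.

```python
class RoundRobinArbiter:
    """Behavioral model of round-robin arbitration for one slave."""

    def __init__(self, n_masters):
        self.n = n_masters
        self.owner = None        # None means the idle state
        self.pending = set()     # masters in standby
        self.last = n_masters    # so that master 1 is searched first

    def request(self, master):
        """A master asks for the slave; grant at once if the slave is idle."""
        self.pending.add(master)
        if self.owner is None:
            self._grant_next()

    def release(self):
        """The current owner finishes; pass the slave on or return to idle."""
        self.last = self.owner
        self.owner = None
        self._grant_next()

    def _grant_next(self):
        # Search masters in circular order, starting after the last owner.
        for step in range(1, self.n + 1):
            candidate = (self.last - 1 + step) % self.n + 1
            if candidate in self.pending:
                self.pending.discard(candidate)
                self.owner = candidate
                return
```

Replaying the example above (master 1 owns the slave, masters 2, 3, and 4 are in standby, and master 1 requests again after master 2 has been granted) yields the owner sequence 1, 2, 3, 4, 1, after which the model returns to idle because no request is pending.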
We first analyze each functional block of the DVB-T baseband receiver in terms of its computation requirements, the complexity of its hardware realization, and the effectiveness of parallel processing. We then allocate it to one of the processor elements of our multiprocessor platform as shown in Fig. 10. The ARM926 processor controls the whole system and performs scheduling. Signal processing is performed by three Teak DSPs, which function as follows: DSP1 handles signal compensation and removal of guard intervals, DSP2 handles the fine frequency synchronization and the FFT function, and DSP3 handles the coarse frequency synchronization.
In Fig. 10, the CI cell is a crossbar switch, and the CI controller connects a master with a slave at the master's request and serves as an arbiter that reassigns the right to occupy the designated slave when there is an access request to that slave. The reassignment is based on the priority policy. Note that we used two shared memories so that real and imaginary data can be handled separately.
IV. Implementation and Simulation Result
We implemented the DVB-T baseband receiver by applying the CI with the reconfigurable crossbar switch structure using the dual-processor and multiprocessor platform. The platform uses the hardware-software co-design environment of Seamless CVE. Our dual-processor platform verification is divided into hardware and software parts. The hardware parts include the ARM926 platform, the CI and the Teak DSP. The software parts include the operating system and the DVB-T baseband receiver algorithm.
To increase the design flexibility, we replaced many hardware parts with software. Within the symbol period of the 8 MHz channel, the processing speed depends on each functional block's data structure. In the 2k mode with a guard interval of 1/8 in an 8 MHz channel, the whole symbol period is 252 µs according to the standard specification EN 300 744 of DVB-T ETSI [3]. The floating-point model of the DVB-T functional blocks is replaced with a fixed-point model with 32-bit precision. To verify the performance of the fixed-point model, we measured the bit error rate (BER) with a 64QAM AWGN channel and a 64QAM Rician channel as shown in Figs. 11(a) and (b), respectively. Figure 12 shows the partitioning of functional blocks and the scheduling for the dual-processor implementation. The symbol synchronizing block takes 590 µs using software. It calculates the absolute value of the power difference between two sample signals. For the coarse frequency synchronizing algorithm, which calculates the amount of frequency movement due to the frequency offset by detecting sub-carriers of the FFT output, we obtained 312 µs after hardware/software partitioning. Figure 13 shows a hardware block diagram configured on the basis of the functional block evaluation results.
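The 252 µs budget follows directly from the DVB-T timing parameters, and the arithmetic can be checked with a few lines of Python. The sketch below assumes the elementary period of 7/64 µs that EN 300 744 defines for 8 MHz channels and an FFT size of 2048 for the 2k mode; the function names are illustrative only.

```python
from fractions import Fraction

# Elementary period T for an 8 MHz channel, in microseconds.
ELEMENTARY_PERIOD_US = Fraction(7, 64)


def symbol_period_us(fft_size, guard_fraction):
    """Total OFDM symbol period: useful part plus guard interval."""
    useful = fft_size * ELEMENTARY_PERIOD_US      # 224 us in the 2k mode
    return useful * (1 + guard_fraction)


def cycle_budget(period_us, clock_mhz):
    """Number of processor clock cycles available in one symbol period."""
    return int(period_us * clock_mhz)
```

`symbol_period_us(2048, Fraction(1, 8))` gives 224 µs + 28 µs = 252 µs. One symbol period therefore corresponds to 25,200 clock cycles at 100 MHz and 41,832 clock cycles at 166 MHz, and the software timings of 590 µs and 312 µs quoted above both exceed the budget if the blocks run one after another on a single processor.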
Hardware/Software Partitioning of Each Functional Block
The frequency synchronizer comprises the frequency offset compensator and the frequency offset estimator. The frequency offset compensator applies the result of the frequency offset estimators, that is, the Beek and Classen algorithms. The Beek algorithm estimates the frequency offset from the phase difference of two signal intervals using "atan2" after accumulating the correlated values of the guard interval and the cyclic prefix. Here, we implemented the multiplication for correlation and the addition for accumulation in the hardware block shown in Fig. 14, while "atan2", from which the Teak DSP computes the frequency offset, was realized in software on the Teak DSP.
Table 1. Performance evaluation of each functional block of the DVB-T baseband receiver implemented as shown in Fig. 4.
The coarse frequency synchronizer searches for the continuous pilot tones distributed among the received signals and estimates the frequency offset from their displaced positions. To identify a continuous pilot tone, we use the maximum correlation value within a range of signals before and after the pilot tone. The width of the signal range provides a tradeoff between performance and computation time. The signal range can be set by the Teak DSP to control the tradeoff. Note that the same hardware block is used for both the Beek and Classen algorithms. The FFT was realized in hardware (using Xilinx's Core library) as shown in Fig. 15.
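The pilot search performed by the coarse frequency synchronizer can be sketched as follows. This Python example is a simplified illustration of a pilot-correlation search, not the implementation evaluated in this paper: the function names, the pilot positions, the 4/3 pilot amplitude, the QPSK data, and the circular treatment of the band edges are all assumptions made for the example.

```python
import random


def coarse_offset(y_prev, y_curr, pilots, max_shift):
    """Find the integer carrier shift that maximizes pilot correlation.

    y_prev and y_curr are the FFT outputs of two consecutive symbols.
    Pilots carry the same value in both symbols, so their products add
    coherently at the correct shift, while data carriers do not.
    """
    n = len(y_curr)

    def metric(shift):
        acc = sum(y_curr[(p + shift) % n] * y_prev[(p + shift) % n].conjugate()
                  for p in pilots)
        return abs(acc)

    # max_shift sets the search width, trading accuracy for computation.
    return max(range(-max_shift, max_shift + 1), key=metric)


def make_symbols(n, pilots, shift, seed=1):
    """Generate two symbols whose spectra are displaced by `shift` carriers."""
    rng = random.Random(seed)
    qpsk = [1 + 0j, 1j, -1 + 0j, -1j]

    def one_symbol():
        tx = [rng.choice(qpsk) for _ in range(n)]
        for p in pilots:
            tx[p] = 4 / 3 + 0j           # boosted pilot, same in every symbol
        return [tx[(k - shift) % n] for k in range(n)]

    return one_symbol(), one_symbol()
```

With 256 carriers and pilots at the illustrative positions 0, 48, 54, 87, 141, 156, 192, and 201, a displacement of 3 carriers is recovered by `coarse_offset(prev, curr, pilots, 8)`. Widening or narrowing `max_shift` corresponds to the tradeoff between performance and computation time noted above.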
The equalizer comprises three parts: the distributed pilot position and value extraction; the equalizer filter coefficient update using the LMS algorithm; and equalization. Parts 1 and 2 are realized by software (Teak DSP) to control the performance, and the equalization, part 3, is realized by hardware as shown in Fig. 16. 
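As a reminder of what the coefficient update in part 2 involves, the following Python sketch shows a generic complex LMS update for a single-tap equalizer. It is a textbook form given under assumptions, not the filter structure used on the Teak DSP: a flat channel gain `h`, known pilot values, the step size `mu`, and all function names are chosen only for this example.

```python
def lms_update(w, x, d, mu):
    """One complex LMS step for a single equalizer coefficient.

    w: current coefficient, x: received pilot, d: known pilot value.
    Returns the updated coefficient and the error before the update.
    """
    e = d - w * x
    return w + mu * e * x.conjugate(), e


def train_one_tap(h, pilots, mu=0.5, w0=1 + 0j):
    """Adapt the coefficient over a sequence of pilots sent through gain h."""
    w = w0
    for d in pilots:
        x = h * d                 # noise-free received pilot
        w, _ = lms_update(w, x, d, mu)
    return w
```

For a channel gain of 0.8 + 0.3j and 100 alternating pilots of value +1 and -1, the coefficient converges to the inverse of the channel gain, so that multiplying a received sample by the coefficient (the equalization of part 3) restores the transmitted value. The step size must stay below 2 divided by the received pilot power for the update to converge.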
Simulation Results of Dual-Processor Platform
The simulation results for the proposed dual-processor platform are shown in Table 1. The ARM926 and the Teak DSP run synchronously at 100 MHz. As shown in Table 1 and Fig. 17, our dual-processor platform does not meet the system specification for the processing time of one symbol, which is 252 µs. Therefore, in the following subsection, we propose a multiprocessor platform with an ARM926 and three Teak DSPs to increase parallelism and thus improve performance.
Using Multiple DSPs
The extension from the dual-processor platform to the multiprocessor platform is accomplished by adding two more Teak DSPs as shown in Fig. 18. Since the CI is flexible and scalable, we only need to increase the number of crossbar switch cells and add the address decoder functions of the new components. Figure 19 shows the scheduling of power-on synchronization, and Fig. 20 shows the scheduling afterwards. The pipeline structure of the multiprocessor operates at 166 MHz. Based on the scheduling analysis, arithmetic operators such as the FFT, multiplier, and adder are implemented as hard-wired IPs. Taking interconnection bottlenecking into account (Table 2), the symbol period required for each functional block of our DVB-T baseband receiver is 252 µs, which meets the system requirement. Table 2 shows the results of using two shared memories. The memory access latency (MAL) is the time the processor takes to read data from a designated memory address and then write the data into a memory address. Modem data includes both real and imaginary data. Using two shared memories, task 1 writes I (real) data into memory 1 while task 2 simultaneously writes Q (imaginary) data into memory 2. Table 2 lists the measured MAL with one shared memory and with two shared memories for each functional block. The MAL with two shared memories is 40% lower than that with one shared memory, as shown in the table.
We compared our receiver with two existing systems: a DSP-based DVB-T receiver using three 32-bit DSPs [5] and a recently published heterogeneous reconfigurable system using 64 PEs [14]. Table 3 compares the processing times of the DVB-T receiver blocks in each system. Although we used only three 16-bit DSPs (at 166 MHz), the timing constraint of 252 µs was met for both the demodulator and the equalizer, as shown in Fig. 21. The performance was comparable to or better than that of the other systems, while fewer resources were used.
V. Conclusion
In this paper, we implemented and verified a new DVB-T baseband receiver multiprocessor platform using hardware-software co-design techniques. The platform comprises an ARM, three Teak DSPs, hardware IPs, and a CI to meet the system specification. To enable communication between heterogeneous processor platforms and provide scalability, a simple CI with a crossbar structure was proposed. The symbol period required for each functional block of our DVB-T baseband receiver is 252 µs, which fulfills the system requirement. The adoption of general purpose processors provides great flexibility in the implementation of the DVB-T baseband receiver. Furthermore, it enables the implementation of adaptive equalization schemes and synchronization schemes, in which the quality of the equalizer and the synchronization block varies according to channel characteristics and channel variations. The flexibility of the software approach, along with the obtained performance test results, demonstrates that the proposed multiprocessor platform is highly suitable for application to DVB-T baseband receivers. There is a question as to how the performance of such a system would compare with that of one which implements more of its functions on an FPGA (which is also programmable) and uses fewer DSPs. However, using an FPGA requires much design time; therefore, the time to market would be longer than with DSPs. Moreover, an FPGA cannot be integrated with other IPs into a single chip. Our main concern was to design a single-chip platform that provides the flexibility to evolve quickly to meet new standards. The platform can be implemented as a real SoC once the IP licenses of the ARM and the Teak DSP are acquired. The goal of this paper is to realize the DVB-T receiver using a multiprocessor SoC platform that would form the basis of a future SoC.
Fig. 19. DVB-T baseband receiver scheduling I using multiprocessor platform.
In our future work, we plan to address several issues. First, the data transfer overhead during software-hardware partitioning can be reduced using DMA. Second, a finer-grained scheduling algorithm could improve performance. Finally, the proposed configurable multiprocessor platform with crossbar switch architecture could be applied to other advanced multimedia applications, such as H.264, 3-D image processing, and voice recognition.
