Abstract
Introduction
HSDPA (High Speed Downlink Packet Access) is an evolution of WCDMA that is currently being developed within the 3GPP next-generation wireless communication standard as part of Release 5 [1]. Based on feedback information from the uplink, the base station decides which modulation and coding scheme, power allocation, and code allocation to use. By combining multichannel spreading codes, a multi-code CDMA system is achieved that provides data rates of up to 10 Mbps for the cellular mobile system, so as to support future wireless multimedia services.
The product development involves implementing advanced signal processing algorithms in a real-time, high-speed system with efficient resource usage. Although many algorithms have been proposed in the literature, not all of them are applicable in a real system because of their high computational complexity for real-time implementation.
Others need some modifications to be applicable to the product. Rapid prototyping of these algorithms can verify them in a real environment and identify potential implementation bottlenecks, which cannot easily be identified in algorithmic research. A working prototype can demonstrate to service providers the feasibility of HSDPA. Moreover, it can provide up-front analysis of implementation issues that may arise during the product development process. To meet fast-changing market requirements, a design methodology that can study different architecture tradeoffs efficiently is highly desirable.

Figure 1. The design flow using Precision C RTL generator.
In this paper, we derive an efficient design methodology that uses Precision C Synthesis [2] and HDL designer from Mentor Graphics. This design flow, shown in Fig. 1, starts with a C/C++ fixed-point algorithm and schedules the most efficient VLSI architecture in terms of the timing and area tradeoff, based on the architecture and resource constraints. RTL output can then be generated directly from a C/C++ level design. This makes the design easy to create and maintain, and makes it easy to transfer the technology to product development groups. The generated RTL source code can be imported into HDL designer to interface with other design blocks for high-level integration. Using this methodology, we rapidly prototyped several major algorithms in the HSDPA system, including computationally intensive signal processing blocks such as the chip-level equalizer, a configurable turbo interleaver for the 3GPP standard, and synchronization algorithms such as Clock Tracking and Automatic-Frequency-Control (AFC) [4, 5]. The implementation of Clock Tracking and AFC is studied in detail to demonstrate the efficiency of this methodology. Based on our experience, typical wireless communication algorithms can be prototyped with an efficiency enhancement of 2X-3X by using this methodology.

Fig. 2. System diagram for HSDPA prototype system.
2 HSDPA System Model
The primary objective of the HSDPA standard development is to provide enhanced data services while maintaining circuit-switched service to guaranteed-QoS users in the WCDMA cellular system. It features important enhancements including link adaptation and HARQ (Hybrid Automatic Repeat Request). The system diagram for the HSDPA prototype system is depicted in Fig. 2. In the transmitter, the host computer running the network layer protocols and applications interfaces with the DSP, which hosts the MAC layer protocol stack and handles the high-speed communication with the FPGAs. A DSP interface core in the transmitter reads the data from the DSP and adds the CRC code. After the turbo encoder, rate matching and interleaver, a QPSK/QAM encoder modulates the data according to the HARQ control signals. With the CPICH (Common Pilot Channel) and SCH (Synchronization Channel) information inserted, the signal is spread and scrambled with the PN long code and then sent to the RF transmitter. At the receiver, a searcher finds the synchronization point. Clock Tracking and AFC are applied for fine synchronization. After the rake receiver, the received symbols are demodulated and de-interleaved before rate dematching. Then, after a HARQ buffer, a turbo decoder decodes the soft decisions into a bit stream, which is sent to the upper layer applications. Other key algorithms such as channel estimation and the chip-level equalizer are applied to eliminate the distortions caused by the wireless multi-path and fading channels.
3 Architecture Scheduling
Real-time Technologies & Architectures
The realization of such a complicated end-to-end communication system depends heavily on task partitioning based on the real-time requirements and system resource usage, which stem from the complexity and computational architecture of the algorithms. High-level software solutions, such as general-purpose processors (GPP) and software-configurable DSP processors, e.g. TI's TMS320C6000 series, are preferable if applicable. However, although these two technologies provide higher flexibility and configurability, they are not fast enough for the PHY layer of the HSDPA system. We can seek hardware solutions such as FPGA and ASIC platforms for the high-speed lower layer algorithms, while leaving slower tasks to the DSP and GPP based on a smart task partitioning. Although an ASIC design is compact and less expensive when the product volume is large, it is not easy to reconfigure at the prototyping stage. An FPGA provides hardware programmability and the flexibility to study several area/timing tradeoffs in hardware architectures. It can easily realize the concept of system-on-chip with hardware configuration. In the functional unit (FU) layout architecture, we can map many FUs in parallel to achieve high pipeline performance. Although an instruction scheduler and multiple FUs are used in some advanced processor architectures, a processor architecture still achieves only instruction-level pipelining, while the layout architecture achieves FU-based pipelining through explicit design, which significantly improves performance.
Previous FPGA Solutions
The most fundamental method of creating a hardware design for an FPGA or ASIC is to use an industry-standard hardware description language (HDL), such as VHDL or Verilog, based on dataflow, structural or behavioral models. However, this design method is too low-level for system engineers to understand easily and relies heavily on off-line logic modeling. For a very complex design like the HSDPA system, design and troubleshooting can be very difficult. Graphical schematic design tools such as Hardware Design System (HDS) from Cadence or HDL designer from Mentor Graphics are more intuitive. However, the design is still manual and the intrinsic architecture tradeoffs need to be studied offline. It is not easy to change a design dramatically once the hardware architecture is laid out. Moreover, detailed knowledge of hardware components is still required because all of the hardware components need to be synthesizable for a hardware implementation. Manual optimization makes the timing and area tradeoffs of the design difficult to evaluate, especially when retiming is critical for high-speed designs. Some C/C++ level RTL tools such as SystemC attempt to combine a high-level abstraction with the ability to generate synthesizable RTL; however, the detailed hardware specification required in the language makes it difficult to use.
Precision C Design Flow
Precision C synthesizer is a new RTL design tool from Mentor Graphics that is optimized for hardware design. It is a true C/C++ level architecture scheduler. We were one of the first Beta users of the tool and one of the first in the industry to integrate Precision C in a complete rapid prototyping methodology for advanced wireless communication systems. Our rapid prototype design flow, shown in Fig. [...], consists of several steps.

(2). Algorithm test bench using C/C++: The matrix level computations must be converted to plain C/C++ code. All internal Matlab functions such as FFT, SVD, eigenvalue calculation, complex operations, etc., need to be translated into C/C++ functions using efficient arithmetic algorithms. In most cases, we use a test vector sampled from the ADC, which contains the distortion effects the algorithm is targeting, e.g., frequency offset. In C/C++, we model real-time system effects by following the exact data pattern of the actual hardware implementation, such as data streaming, buffering and interface design. After the floating-point version of the C/C++ functions matches the Matlab simulation, we convert them to a fixed-point version and study the word-length effects of the algorithm at the C/C++ level.

(3). Precision C Scheduling: By following some C/C++ design styles, both behavioral and RTL output can be generated in Precision C from the C/C++ level algorithm. The behavioral output is not synthesizable, but it has the same behavior as the RTL and is much faster than the RTL output in software simulation. By studying the parallelism in the algorithm, many of the FUs can be reused across the computational cycles.
In Precision C, we can add both timing and area constraints, and Precision C will schedule efficient architecture solutions according to the specified constraints. The number of FUs is assigned according to the timing/area constraints. Software resources such as registers and arrays are mapped to hardware components, and the necessary Finite State Machines (FSM) are generated. In this way, we can study several architecture solutions efficiently and achieve the flexibility and productivity of a DSP with the performance of an FPGA. More details about architecture scheduling are discussed in Section 3.

The early stage verifies that each individual algorithm produces the same result as ModelSim. Afterwards, the algorithm is integrated into the whole system, and we test the complete FPGA system in hardware. Finally, the FPGA designs are integrated with the DSP and host PC for an end-to-end system demonstration.
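The floating-point to fixed-point conversion and word-length study mentioned in the flow above can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the authors' code and not a Precision C API: the function names (`to_fixed`, `to_double`, `max_quant_error`) and the choice of round-to-nearest with saturation are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize x to a two's-complement word of `word_bits` bits with `frac_bits`
// fractional bits, rounding to nearest and saturating on overflow.
// (Assumed rounding/saturation policy; the paper does not specify one.)
std::int32_t to_fixed(double x, int word_bits, int frac_bits) {
    assert(word_bits >= 2 && word_bits <= 32);
    assert(frac_bits >= 0 && frac_bits < word_bits);
    const double scaled = std::round(std::ldexp(x, frac_bits));
    const double max_v = std::ldexp(1.0, word_bits - 1) - 1.0;
    const double min_v = -std::ldexp(1.0, word_bits - 1);
    if (scaled > max_v) return static_cast<std::int32_t>(max_v);
    if (scaled < min_v) return static_cast<std::int32_t>(min_v);
    return static_cast<std::int32_t>(scaled);
}

// Convert a fixed-point word back to floating point for comparison
// against the floating-point reference.
double to_double(std::int32_t q, int frac_bits) {
    return std::ldexp(static_cast<double>(q), -frac_bits);
}

// Largest absolute error introduced by quantizing an n-sample test vector.
double max_quant_error(const double* x, int n, int word_bits, int frac_bits) {
    assert(x != nullptr && n > 0);
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        const std::int32_t q = to_fixed(x[i], word_bits, frac_bits);
        const double err = std::fabs(x[i] - to_double(q, frac_bits));
        if (err > worst) worst = err;
    }
    return worst;
}
```

Running `max_quant_error` over the same test vector for several word lengths gives a quick view of how much precision each candidate word length costs before any hardware is scheduled.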
Architecture Scheduling
In general, more parallel hardware FUs mean a faster design at the cost of area, while resource sharing means a smaller area at the cost of speed. Even for the same algorithm, different applications may have different real-time requirements. For example, an FFT needs to be very fast in OFDM-based systems for a high data throughput rate, while it can be much slower in other applications such as a spectrum analyzer. The best solution is the smallest design that meets the real-time requirements in terms of clock rate, throughput rate, latency, etc. The goal of hardware architecture scheduling is to generate efficient architectures under different resource and timing requirements.
The programming style is essential for specifying the hardware architectures in the C/C++ program. Several high-level conventions are defined to specify the different architectures to be used. For example, arrays are mapped to memory, while scalar variables are mapped to a register file. Unlike SystemC, Precision C does not require very detailed knowledge of the hardware components. Here we only highlight some important features of the Precision C architecture. Precision C schedules architectures in two basic modes according to the behavior of the real-time system: the throughput mode or the block computation mode.
1). Throughput mode:
This mode assumes that there is a top-level main loop. In each computation period, the data is input into the function sample by sample, and the function executes once for each input sample. Usually, no handshaking signals are required. Temporary values are kept in static variables. The throughput is determined by the latency of the processing for each sample. Therefore this mode is more suitable for sample-based signal processing algorithms. Typical computations for this mode are the filtering and accumulation type computations in wireless systems.
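The sample-by-sample style described for the throughput mode can be illustrated with a small integrate-and-dump accumulator. This is a sketch in ordinary C++, assuming a hypothetical function name (`integrate_and_dump`) and a 256-chip symbol length taken from the case study later in the paper; it does not use any Precision C specific pragmas or types.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kChipsPerSymbol = 256;  // assumed accumulation length

// Called once per input chip by the top-level loop. State is held in
// static variables between calls, as the throughput mode expects.
// Returns true when a full symbol has been accumulated into *symbol_out.
bool integrate_and_dump(std::int16_t chip, std::int32_t* symbol_out) {
    static std::int32_t acc = 0;  // running sum, a register in hardware
    static int count = 0;         // chip counter within the symbol
    assert(symbol_out != nullptr);

    acc += chip;
    ++count;
    if (count == kChipsPerSymbol) {
        *symbol_out = acc;  // dump the accumulated symbol
        acc = 0;            // and restart for the next one
        count = 0;
        return true;
    }
    return false;
}
```

Because the function does a fixed, small amount of work per call, its per-sample latency directly bounds the achievable throughput, which is the property the text attributes to this mode.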
2). Block mode: In block mode, the function processes once after a block of data is ready. The input data are either arrays or vectors in the C code. The hardware interface uses RAM blocks to pass the data. Precision C generates FSMs for the write enable, the memory address/data bus and the control logic. Typical block computations are the FFT, turbo decoder, etc. Usually the throughput mode is used for the front-end pre-processing blocks because of their high-speed real-time requirements, while the block mode is used for lower-speed post-processing modules.

In Precision C, we first specify the general requirements on the clock rate, standard I/O and handshaking signals such as RESET, START/READY and DONE for a system. Then we specify the building blocks in the design by choosing different technology libraries, e.g. the RAM library and the Coregen library. This maps basic operations to efficient library components; for example, the C language operator "/" is mapped to a divider or a pipelined divider.

The keys to optimizing area and speed are loop unrolling, pipelining and resource multiplexing. Loop unrolling replicates the loop body, trading increased area for higher speed. Unrolling may create multiple copies of FUs, and these FUs can be used in parallel if there is no dependency between the computations. Pipelining is basically a computational assembly line where multiple operations are overlapped in execution [6]. The use of memory can affect the performance dramatically. In a C level design, arrays are usually mapped to memory blocks. We can also map the internal or external RAM/ROM blocks used in the algorithm.

In some cycles, some FUs might be idle. These FUs could be reused by identical computations that occur later in the algorithm. Thus there is a lot of potential resource multiplexing in an algorithm. Multiplexing FUs manually is extremely difficult when the algorithm is complicated.
In many cases, therefore, multiple FUs must be instantiated in a manual design even for independent computations. The resulting size can be several times larger than that of the Precision C solution at the same throughput. In Precision C, we specify the maximum number of cycles in the resource constraints. We can analyze the Bill-Of-Material (BOM) used in the design and identify the large FUs. We can then limit the number of these FUs and achieve very efficient multiplexing. In the scheduling result, we can study the computational dependencies. Usually the logic and multiplexers in the design determine the clock rate and the number of cycles. With the detailed reports on statistics such as the cycle constraints and timing analysis, we can easily study the high-level architectures of the algorithm and rapidly obtain the smallest design that still meets the timing requirements.
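The area/speed tradeoff between a rolled loop (one shared multiplier) and an unrolled loop (several parallel multipliers) can be seen in a block-mode style dot product. The sketch below is plain C++ with hypothetical names; how a particular scheduler maps each form to FUs is an assumption for illustration, and `mac_iterations` is only a rough proxy for a cycle count.

```cpp
#include <cassert>
#include <cstdint>

// Rolled form: one multiply-accumulate per iteration. A scheduler can map
// this to a single multiplier that is reused n times (small area, n steps).
std::int64_t dot_rolled(const std::int16_t* a, const std::int16_t* b, int n) {
    assert(a != nullptr && b != nullptr && n >= 0);
    std::int64_t acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += static_cast<std::int64_t>(a[i]) * b[i];
    }
    return acc;
}

// Unrolled by 4: the four products in each iteration are independent, so
// they can use four multipliers in parallel (larger area, n/4 steps).
std::int64_t dot_unrolled4(const std::int16_t* a, const std::int16_t* b, int n) {
    assert(a != nullptr && b != nullptr && n >= 0 && n % 4 == 0);
    std::int64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (int i = 0; i < n; i += 4) {
        acc0 += static_cast<std::int64_t>(a[i]) * b[i];
        acc1 += static_cast<std::int64_t>(a[i + 1]) * b[i + 1];
        acc2 += static_cast<std::int64_t>(a[i + 2]) * b[i + 2];
        acc3 += static_cast<std::int64_t>(a[i + 3]) * b[i + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}

// Number of multiply steps needed for n products with num_mults multipliers,
// ignoring dependencies and pipeline latency.
int mac_iterations(int n, int num_mults) {
    assert(n >= 0 && num_mults > 0);
    return (n + num_mults - 1) / num_mults;
}
```

Both functions return the same value; they differ only in how much parallel hardware the loop body exposes, which is the choice a cycle or FU constraint makes on the designer's behalf.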
Case Study: HSDPA Synchronization
In the complete HSDPA prototype system, we used Precision C to design several blocks with high computational complexity. As shown in Fig. 2, these blocks are the FFT-based chip-level equalizer with transmission diversity [7], the configurable turbo interleaver for the turbo encoder, and the Clock Tracking and AFC blocks. Because of space limitations, we use only the simple cases of Clock Tracking and AFC to demonstrate the features of architecture scheduling with Precision C.
Clock Tracking
1). Algorithm:
The mismatch between the transmitter and receiver crystals causes a phase shift between the received signal and the long scrambling code. The "Clock-Tracking" algorithm [4] tracks the code sampling point. The IF signal is sampled at the receiver and then down-converted with a digital demodulator at the local frequency. The separated I/Q channels are then down-sampled into four phase signals at the chip rate, which is 3.84 MHz. Taking one phase as the in-phase signal, we compute the correlation of both the earlier phase and the later phase with the de-scrambling long code according to the frame structure of HSDPA. When the correlation of one phase exceeds that of the other by more than a threshold, the sampling point is judged to need moving ahead or delaying by one-quarter chip. Thus the resolution of the code tracking is one quarter of a chip.
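The early/late comparison described above can be sketched as two small functions: one that descrambles and accumulates a block of chips into a correlation power, and one that compares the early and late powers against a threshold. The names (`Cplx`, `correlation_power`, `TrackFlag`, `early_late_decision`) are hypothetical, and comparing the power difference with the threshold is an assumed form of the test; the source only states that a threshold is used.

```cpp
#include <cassert>
#include <cstdint>

struct Cplx {
    std::int32_t re;
    std::int32_t im;
};

// Descramble n chips by multiplying with the conjugate of the code, sum the
// results (chip-matched filter as an accumulator), and return |sum|^2.
std::int64_t correlation_power(const Cplx* rx, const Cplx* code, int n) {
    assert(rx != nullptr && code != nullptr && n > 0);
    std::int64_t acc_re = 0;
    std::int64_t acc_im = 0;
    for (int i = 0; i < n; ++i) {
        // (a + jb)(c - jd) = (ac + bd) + j(bc - ad)
        acc_re += static_cast<std::int64_t>(rx[i].re) * code[i].re +
                  static_cast<std::int64_t>(rx[i].im) * code[i].im;
        acc_im += static_cast<std::int64_t>(rx[i].im) * code[i].re -
                  static_cast<std::int64_t>(rx[i].re) * code[i].im;
    }
    return acc_re * acc_re + acc_im * acc_im;
}

enum class TrackFlag { Early, InPhase, Late };

// Decide which phase is better aligned. The sampling point is only moved
// when one power exceeds the other by more than `threshold`.
TrackFlag early_late_decision(std::int64_t p_early, std::int64_t p_late,
                              std::int64_t threshold) {
    assert(p_early >= 0 && p_late >= 0 && threshold >= 0);
    if (p_early - p_late > threshold) return TrackFlag::Early;
    if (p_late - p_early > threshold) return TrackFlag::Late;
    return TrackFlag::InPhase;
}
```

The returned flag would then drive the quarter-chip adjustment of the sampling point; the threshold keeps noise on nearly equal powers from causing constant adjustments.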
2). Architectures:
The system interface for Clock Tracking is depicted in Fig. 5. At the down-sampling stage after the DDC (Digital Down-Converter Xilinx core), the in-phase, early and late phases are sent to the rake receiver and Clock-Tracking respectively. The long code is loaded from a ROM block. The Clock-Tracking algorithm computes both the early and late correlation powers after descrambling, chip-matched filtering, accumulation, etc.
This computationally intensive algorithm is suitable for Precision C scheduling. The C level function takes both the early and late phases as input and generates a flag indicating early, in-phase or late as output. This flag controls the adjustment signal of a configurable counter. The adjusted in-phase samples are then sent to the rake receiver for detection. Thus the code tracker is integrated with IP cores and the other HDL designer blocks (down-sampling, MUX, etc.).

The clock-tracking algorithm could also be implemented with a manual layout architecture in HDL designer. For rapid prototyping, we would most likely build a parallel architecture with duplicate FUs, as in Fig. 6. First, there is a descrambling procedure, which is a complex multiplication with the long code. Next comes a chip-matched filter, which is basically mapped to an accumulator. Then, after each symbol, we compute the power and accumulate it over each frame. Finally, a comparator makes the decision. Altogether, these blocks are duplicated for the early and late paths. This requires 16 multipliers and 12 adders. This architecture is optimal for fully pipelined computation in which a sample is input in every cycle. However, since our system uses a 38.4 MHz clock rate, only one sample arrives at the chip rate every 10 cycles. The pipeline is idle for the other 9 cycles and the resources are wasted.

With Precision C, we can schedule several solutions by setting different constraints, as in Table 1. In these designs, the FUs are multiplexed within the timing constraints. Because of the dependencies in the computation, there is a necessary latency before the first computation result comes out, even if we use many FUs. For example, in solution 1, although we use 8 multipliers and 6 adders, the best we can achieve is a latency of 7 cycles. The size is 5600 FPGA Look-Up-Tables (LUTs).
By setting the resource constraints and the maximum acceptable number of cycles (10 cycles), we obtain different solutions with sizes from 2000 down to 1300 LUTs. We can choose the smallest design, solution 4, for implementation while still meeting the timing constraint. In this design, a single multiplier is reused in each cycle while avoiding dependencies. Each multiplication is followed by an addition, and in cycles 2-9 the multiplications and additions are done in parallel. Moreover, we still meet the 10-cycle timing constraint easily. In solution 4, the hardware is used most efficiently. This is almost the minimal size theoretically achievable for this particular algorithm. The savings in hardware can also reduce the power consumption, which is a critical specification for mobile systems.
Automatic Frequency Control
The frequency offset is caused by the Doppler shift and by the frequency offset between the transmitter and receiver oscillators. It makes the received constellations rotate in addition to the fixed channel phases, thus dramatically degrading performance. Automatic Frequency Control (AFC) is a function that compensates for the frequency offset in the system. In an analog system, the function is implemented with an analog Phase-Locked-Loop component. However, for a definable software radio (DSR) type of architecture, the frequency offset is computed with a DSP algorithm and controlled by an NCO (Numerically Controlled Oscillator). Usually, the offset is very small and changes very slowly. When the signal passes through the channel, multi-path phases and attenuation factors are introduced in addition to the AWGN noise. Before the rake receiver can detect the signal correctly, the frequency offset must be estimated and compensated at the Digital Down-Converter (DDC). Here we discuss two algorithms briefly and focus on the hardware architecture of the applied algorithm with Precision C scheduling.
1). Transient phase estimation based AFC:
The transient phase offset estimation algorithm [5] assumes that after the rake receiver, the channel phase errors have been perfectly corrected by a rotation and weighting procedure. The channel parameters are estimated separately in advance. Then, by using the decision statistics after the rake receiver, we can compute the instantaneous phase with an ARCTAN function. We can then compute the frequency offset from the relation between the sampled phase and the sampling time. However, we found some of these assumptions invalid for our prototype environment, so we proposed a stable and efficient algorithm based on spectrum analysis.
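The relation between sampled phase and sampling time used by the transient phase approach is the standard one: a frequency offset Δf rotates the phase by 2πΔf·Δt between two samples taken Δt apart. A minimal sketch, assuming hypothetical function names and ideal, already channel-corrected decision statistics:

```cpp
#include <cassert>
#include <cmath>

const double kPi = std::acos(-1.0);

// Wrap a phase difference into the interval (-pi, pi].
double wrap_phase(double phi) {
    phi = std::fmod(phi, 2.0 * kPi);
    if (phi > kPi) phi -= 2.0 * kPi;
    if (phi <= -kPi) phi += 2.0 * kPi;
    return phi;
}

// Estimate the frequency offset in Hz from two complex decision statistics
// (re0, im0) and (re1, im1) taken dt_s seconds apart.
double freq_offset_hz(double re0, double im0, double re1, double im1,
                      double dt_s) {
    assert(dt_s > 0.0);
    const double phase0 = std::atan2(im0, re0);  // the ARCTAN step
    const double phase1 = std::atan2(im1, re1);
    const double dphi = wrap_phase(phase1 - phase0);
    return dphi / (2.0 * kPi * dt_s);
}
```

Because the phase difference is only known modulo 2π, the estimate is unambiguous only for offsets smaller in magnitude than 1/(2Δt), and it inherits any residual channel phase error directly, which is consistent with the sensitivity to assumptions noted in the text.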
2). Spectrum analysis based AFC:
The derivation of the algorithm is omitted; we only describe the computations in this algorithm for the FPGA architectural study. The frame structure of HSDPA is depicted in Fig. 8. There are 15 slots in each frame. In each slot, the first 5 bits are pilot symbols and the second 5 bits are control signals. Each symbol is spread by a 256-chip long code. So in the algorithm, we first use the long code to descramble the received signal at the chip rate. We then do the matched filtering by accumulating 256 chips. By using the conjugate of the local pilot, we get the dynamic phase of the signal with the frequency offset embedded. To increase the resolution, we finally accumulate each group of 5 pilot bits as one sample. The 5 control bits are skipped. Thus the sampling rate for the accumulated phase signals is reduced to 1500 Hz (15 slots per 10 ms frame). These samples are stored in a dual-port RAM for the spectrum analysis using an FFT. After the de-scrambling, matched filtering and accumulation, we get a very stable sinusoidal waveform for the frequency offset signal, as shown in the figure.

Fig. 9. Abstract Precision C scheduled processor architecture.
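The final step of the spectrum-analysis approach, locating the dominant tone in the 1500 Hz phase samples, can be sketched as a peak search over a discrete Fourier transform. For brevity this illustration uses a direct O(N²) DFT in place of an FFT, and the names (`make_tone`, `estimate_offset_hz`) as well as the block length are hypothetical; the source does not give the FFT size.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Test helper: a unit-amplitude complex tone at f_hz sampled at fs_hz.
std::vector<cd> make_tone(std::size_t n, double f_hz, double fs_hz) {
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> x(n);
    for (std::size_t t = 0; t < n; ++t) {
        x[t] = std::polar(1.0, two_pi * f_hz * static_cast<double>(t) / fs_hz);
    }
    return x;
}

// Return the frequency (Hz) of the strongest DFT bin. Bins above n/2 are
// interpreted as negative frequencies. Resolution is fs_hz / n.
double estimate_offset_hz(const std::vector<cd>& x, double fs_hz) {
    const std::size_t n = x.size();
    assert(n > 0 && fs_hz > 0.0);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::size_t best_k = 0;
    double best_power = -1.0;
    for (std::size_t k = 0; k < n; ++k) {
        cd sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = -two_pi * static_cast<double>(k) *
                                 static_cast<double>(t) / static_cast<double>(n);
            sum += x[t] * std::polar(1.0, angle);
        }
        const double power = std::norm(sum);  // squared magnitude
        if (power > best_power) {
            best_power = power;
            best_k = k;
        }
    }
    const double k_signed =
        (best_k <= n / 2) ? static_cast<double>(best_k)
                          : static_cast<double>(best_k) - static_cast<double>(n);
    return k_signed * fs_hz / static_cast<double>(n);
}
```

With a 1500 Hz sample rate, a hypothetical 64-sample block gives a resolution of about 23.4 Hz per bin; a longer block trades update latency for finer resolution.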
In this design, we have several tradeoffs to study. The phase accumulator has an architecture similar to that of the Clock-Tracking algorithm, so we focus on the architecture tradeoff of the FFT. Although Xilinx also provides a variety of FFT IP cores, they are usually intended for high-throughput applications and are considerably large. In our algorithm, we do not need the FFT to be very fast, so we can relax the timing constraint to get a very compact design. The complete AFC algorithm only needs to be updated once per frame, which is 10 ms. With Precision C scheduling, we can obtain several solutions with only 1 multiplier and 1 adder reused for each MULT and ADD operation. The latency is larger than that of the Xilinx core, but the area is smaller. Finally, for all three blocks and for FFTs of different sizes, we achieve the same minimal size of around 1000 LUTs, a saving of about 3X over the standard Xilinx core.

The architecture generated by Precision C is shown as a block diagram in Fig. 9. For a complicated algorithm such as the FFT, it can consist of several Processing Elements (PE) with different FUs. In each of the PEs, there are many multiplexing control signals and system FSMs for register latching and memory addressing. These PEs cooperate with other PEs either in parallel or in pipeline mode. In principle, it can be considered a joint processor and layout architecture that achieves hybrid high-level pipelining and FU-level pipelining. With multiple FPGAs working together, the system can easily be partitioned and ported to several FPGAs with different capacities and resources. This has computation power equivalent to multiple pipelined DSP processors working in parallel.
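To make the "slow but compact" FFT tradeoff concrete, the sketch below is a textbook iterative radix-2 FFT whose butterflies execute strictly one after another, which is the software analogue of reusing a single arithmetic unit. It is not the authors' scheduled design; the function name `fft_radix2` and the returned butterfly count are illustrative additions.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// In-place radix-2 decimation-in-time FFT (forward transform).
// Returns the number of butterflies executed, (n/2) * log2(n), which is a
// rough measure of the serial workload when one butterfly unit is reused.
std::size_t fft_radix2(std::vector<std::complex<double>>& a) {
    const std::size_t n = a.size();
    assert(n >= 1 && (n & (n - 1)) == 0);  // n must be a power of two

    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    const double pi = std::acos(-1.0);
    std::size_t butterflies = 0;
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = -2.0 * pi / static_cast<double>(len);
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> w =
                    std::polar(1.0, ang * static_cast<double>(k));
                const std::complex<double> u = a[start + k];
                const std::complex<double> v = a[start + k + len / 2] * w;
                a[start + k] = u + v;            // one butterfly:
                a[start + k + len / 2] = u - v;  // one multiply, two adds
                ++butterflies;
            }
        }
    }
    return butterflies;
}
```

For a hypothetical 64-point transform this is 192 butterflies. If the 38.4 MHz clock from the Clock Tracking design also applies here, a 10 ms frame offers 384,000 clock cycles, so even a fully serialized datapath has ample slack, which is why the timing constraint can be relaxed in exchange for area.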
Performance & Productivity
Precision C is even more advantageous for algorithms with significant memory access and intensive computations. In our HSDPA system, the other advanced algorithms scheduled by Precision C are the configurable turbo interleaver for the 3GPP standard and an FFT-based chip-level equalizer. The turbo interleaver has very complicated memory accesses. The equalizer solves a matrix inversion problem with Toeplitz structure. Even with greatly reduced complexity, the efficient FFT-based algorithm is still one of the most dominant algorithms in the transceiver design. This algorithm involves many element-wise FFT computations as well as sub-matrix inversion and multiplication, as shown in Fig. 11. There are many tradeoffs among the algorithms. With a manual design, it is quite possible to spend 1 year on a single solution that cannot be guaranteed to be the most efficient architecture in terms of the area/timing tradeoff. With Precision C, however, we can study several architectures in 3-4 months. In a manual layout, the clock rate is hard to predict, and it is especially difficult to do retiming for critical path balancing in large designs. An additional advantage of Precision C is that clock rates from 40 MHz to 200 MHz are configurable and predictable with automatic retiming.
Because of the complexity involved, we omit the detailed scheduling results from Precision C. The dramatic productivity improvement and the percentage of the major algorithms in an HSDPA system are shown in Table 3. The workload for manual HDL design is estimated based on our HDL designer experience.

Fig. 11. FFT-based chip-equalizer
Conclusion
In this paper we described a rapid prototyping methodology using Precision C scheduler and HDL designer. We used Precision C to build the backbone RTLs of new advanced algorithms in the HSDPA system and integrated them with Xilinx IP cores as well as HDL designer blocks with the schematic capture of HDL designer. We studied FPGA architecture tradeoffs efficiently and found the most efficient solution for a specific architecture/resource constraint. The productivity was improved dramatically by moving the workload from low-level logic layout to high-level abstraction of the architectural research.
We are grateful to Mentor Graphics for the provision of a Beta license for Precision C and for technical support to integrate the complete design flow.
