Abstract
Introduction
Applications in the field of digital communications are becoming increasingly diversified and complex. This trend is driven by the emergence of turbo-communications, which generalize the principle of iterative processing introduced by turbo codes [1]. Implementing turbo-communication systems (such as channel decoding, equalization, demodulation, synchronization or MIMO systems) is becoming crucial to meet current performance requirements in terms of transmission quality (e.g. throughput and error rates). In addition to the continuous development of new standards and applications in the digital communication domain, severe time-to-market constraints make it inevitable to resort to new design methodologies and to propose a flexible and efficient turbo-communication platform.
Good tradeoffs between flexibility and performance can be achieved by using programmable/configurable processors rather than ASICs. Concerning turbo decoding, several turbo-decoder implementations have been proposed in the last few years. Some of these implementations succeeded in achieving high throughput for specific standards with a fully dedicated architecture. In [2], the ASIC implementation enables high performance turbo decoding dedicated to 3GPP standards. In [3], a new class of turbo codes more suitable for high throughput implementation is proposed. However, such implementations do not take flexibility issues into account. Unlike these implementations, others include software and/or reconfigurable parts to achieve the required flexibility, at the cost of lower throughput [4]. In fact, the concept of the application-specific instruction set processor (ASIP) [9] constitutes an appropriate solution for fulfilling the flexibility and performance constraints of emerging and future applications, as shown in [10] and [11]. Despite the appropriateness of the ASIP concept, the execution speed of ASIP instruction set simulators (ISS) is too low to validate a complete system, especially in the case of digital communication applications, which require very long error rate simulations. Executing these simulations in a reasonable time requires running them on a hardware prototype. Therefore, in this context, system validation requires a proper prototyping flow.
In this work, we present a flexible and high performance ASIP model for turbo decoding and propose a validation flow of this ASIP from its high level description to the FPGA prototype.
The rest of the paper is organized as follows. The next section presents the turbo decoding algorithm to better understand subsequent sections. Section 3 details the proposed ASIP architecture model for turbo decoding. Section 4 describes the flow we propose to verify and prototype our processor. Then, this flow is illustrated in section 5 through an FPGA prototype on a development board. Finally, section 6 summarizes the results obtained and concludes the paper.
Convolutional Turbo Decoding
In iterative decoding algorithms, the underlying turbo principle relies on extrinsic information exchanges and iterative processing between different Soft Input Soft Output (SISO) modules. Using input information and a priori extrinsic information, each SISO module computes a posteriori extrinsic information. This a posteriori extrinsic information becomes the a priori information for the other modules and is exchanged via interleaving and deinterleaving processes.
For convolutional turbo codes [1], classically constructed with two convolutional component codes, the SISO modules process the BCJR or forward-backward algorithm [5], which is the optimal algorithm for the maximum a posteriori (MAP) decoding of convolutional codes (Figure 1). A BCJR SISO first computes branch metrics (or γ metrics), which represent the probability of a transition occurring between two trellis states (s': starting state, s: ending state). Note that a branch metric can be decomposed into an intrinsic part γ^i due to systematic information and a priori information, and an extrinsic part γ^e due to redundancy information. Then a BCJR SISO computes forward and backward recursions. Forward recursion (or α recursion) computes a trellis section k (i.e. the probability of all states of the trellis regarding the k-th symbol) using the previous trellis section and the branch metrics between these two sections, while backward recursion (or β recursion) computes a trellis section k using the future trellis section and the branch metrics between these two sections. With the max-log-MAP algorithm [6], it can be expressed:
Finally, the extrinsic information of the symbol k is computed for all decisions d_k from the forward recursion, the backward recursion and the extrinsic part of the branch metrics.
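For reference, a common textbook form of the max-log-MAP forward recursion, backward recursion and extrinsic computation, written with the notation used above, can be sketched as follows. The indexing convention (γ_k links section k-1 to section k) and the symbol Z_k^ext are assumptions made for this sketch and may differ from the authors' exact formulation.

```latex
\begin{align}
\alpha_k(s) &= \max_{s'} \bigl( \alpha_{k-1}(s') + \gamma_k(s', s) \bigr) \\
\beta_k(s') &= \max_{s} \bigl( \beta_{k+1}(s) + \gamma_{k+1}(s', s) \bigr) \\
Z_k^{\mathrm{ext}}(d) &= \max_{(s', s)\,:\,d_k = d} \bigl( \alpha_{k-1}(s') + \gamma_k^{e}(s', s) + \beta_k(s) \bigr)
\end{align}
```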
As presented in [15] , implementing an efficient turbodecoder requires a good exploitation of the parallelism. In turbo decoding with the BCJR algorithm, parallelism techniques can be classified at three levels: (1) BCJR metric level parallelism, (2) BCJR SISO decoder level parallelism, and (3) Turbo-decoder level parallelism. The first (fine grain) parallelism level concerns the processing of all metrics involved in the decoding of each received symbol inside a BCJR SISO decoder. Parallelism between these SISO decoders, inside one turbo decoder, belongs to the second parallelism level. The third (coarse grain) parallelism level duplicates the turbo decoder itself.
The following paragraph focuses on the BCJR metric level parallelism, since the ASIP presented in the next section only uses this level. This level exploits the inherent parallelism of the trellis structure, and also the parallelism of BCJR computations [7] .
Parallelism of trellis transitions
Trellis-transition parallelism can easily be extracted from the trellis structure, as the same operations are repeated for all transition pairs. In the log domain [6], these operations are either ACS operations (Add-Compare-Select) for the max-log-MAP algorithm or ACSO operations (ACS with a correction offset [6]) for the log-MAP algorithm.
Each BCJR computation (1)(2)(3) requires a number of ACS-like operations equal to half the number of transitions per trellis section. Thus, this number, which depends on the structure of the convolutional code, constitutes the upper bound of the trellis-transition parallelism degree.
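To make the ACS operation concrete, the following minimal sketch performs one max-log-MAP forward recursion step over a trellis section in plain Python. The function name, the tuple layout of the transitions and the toy two-state trellis are hypothetical and chosen for illustration only; they do not describe any code from this work. Every loop iteration is independent of the others, which is the property that trellis-transition parallelism exploits.

```python
NEG_INF = float("-inf")


def acs_forward_step(alpha_prev, transitions, gamma):
    """One max-log-MAP forward recursion step (illustrative sketch).

    alpha_prev  -- state metrics of the previous section, indexed by state
    transitions -- iterable of (start_state, end_state, branch_id) tuples
    gamma       -- branch metrics of this section, indexed by branch_id
    """
    alpha = [NEG_INF] * len(alpha_prev)
    for start, end, branch in transitions:
        candidate = alpha_prev[start] + gamma[branch]  # Add
        if candidate > alpha[end]:                     # Compare
            alpha[end] = candidate                     # Select
    return alpha


# Toy two-state trellis (hypothetical, not a code from any standard).
TOY_TRANSITIONS = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
print(acs_forward_step([0.0, 0.5], TOY_TRANSITIONS, [1.0, -1.0]))  # [1.0, 1.5]
```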
Parallelism of BCJR computations
A second metric parallelism can be orthogonally extracted from the BCJR algorithm through a parallel execution of the three BCJR computations.
Parallel execution of backward recursion and APP computations was proposed with the original Forward-Backward scheme, depicted in Figure 2.a. In this scheme, the BCJR computation parallelism degree is equal to one in the forward part and two in the backward part. To increase this parallelism degree, several schemes have been proposed [8]. Figure 2.b shows the butterfly scheme, which doubles the parallelism degree of the original scheme through the parallelism between the forward and backward recursion computations. This is performed without any memory increase and only BCJR computation resources have to be duplicated. Thus, BCJR computation parallelism is area efficient but limited in parallelism degree.
ASIP for Turbo Decoding

Context of Architectural Choices
As seen in section 2, hardware implementation achieving high throughput should first exploit the BCJR metric level parallelism. The complexity of convolutional turbo codes proposed in all existing and emerging standards is limited to eight-state double binary turbo codes or sixteen-state simple binary turbo codes. Hence, to fully exploit trellis transition parallelism (section 2.1) for all standards, a parallelism degree of 32 is required. The implementation of future more complex codes can be supported by splitting trellis sections into sub-sections of 32-parallelism degrees and by processing sub-sections sequentially. Regarding BCJR computation parallelism (section 2.2), we choose a parallelism degree of two instead of four (maximum). Using a parallelism degree of four with the butterfly scheme leads to underutilization of the BCJR computation units (two are only used half of the time).
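The parallelism degree of 32 quoted above follows from counting the transitions of one trellis section: each state has one outgoing transition per possible symbol value. The short sketch below reproduces this arithmetic; the function names and the 32-state example code are hypothetical and only illustrate the sequential sub-section processing mentioned in the text.

```python
import math


def transitions_per_section(n_states, bits_per_symbol):
    """Each state has 2**bits_per_symbol outgoing transitions."""
    return n_states * (2 ** bits_per_symbol)


def sequential_passes(n_transitions, parallelism=32):
    """Number of 32-wide sub-sections needed to cover one trellis section."""
    return math.ceil(n_transitions / parallelism)


print(transitions_per_section(8, 2))   # eight-state double binary code: 32
print(transitions_per_section(16, 1))  # sixteen-state single binary code: 32
# Hypothetical 32-state single binary code: 64 transitions, hence 2 passes.
print(sequential_passes(transitions_per_section(32, 1)))
```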
These parallelism requirements imply the use of specific hardware units. To implement these units while preserving flexibility, application-specific instruction set processors constitute a well-suited solution [9]. The use of such processors in embedded SoCs is becoming more and more mandatory due to the increase in application complexity and the emergence of new applications and ever-changing standards. Several approaches and frameworks are now proposed by EDA vendors for ASIP design. ASIP design involves, in addition to hardware cores, software development tools like simulators, compilers, assemblers, debuggers and linkers. In one approach, the designer is offered an environment in which predefined hardware elements can be selected and configured to enhance a predefined basic processor core according to the application needs. User-defined hardware blocks, together with the corresponding instructions, can be added to the processor [14]. In another approach, the designer has full freedom over the ASIP architecture and uses an Architecture Description Language (ADL) to specify the instruction set and the ASIP architecture [12] [13].
In this paper we used the Processor Designer framework from CoWare, which is based around the LISA ADL [12] , allowing the automatic generation of ASIP software development tools, and VHDL, Verilog and SystemC models for hardware synthesis and system integration.
The Architecture of the ASIP
The presented processor, dedicated to the BCJR algorithm, is an enhanced version of the ASIP proposed in [10] .
3.2.1 Global view and memory organization
The ASIP is mainly composed of operative and control parts besides its communication interfaces and attached memories (Figure 3.a). The operative part is tailored to process a window of 64 symbols by means of two identical BCJR computation units, corresponding to forward and backward processing in the MAP algorithm. Each unit produces recursion metrics and extrinsic information. The storage of recursion metrics produced by one unit, to be used by the other unit, is performed in cross memories. Another internal memory (config) contains up to 256 trellis descriptions, so that the processor can be configured for the corresponding standard. Incoming data that group systematic and redundant information of the channel, in addition to extrinsic information, are stored in external memories attached to the ASIP (input_data, info_ext). Depending on the application's requirements, the depth of incoming data memories can be scaled up to 65536 to cover all existing and emerging standards' frame-length specifications. The external future and past memory banks are used to initialize state metric values for the beginning and end of each window according to the message passing method. Each bank has two memories, one storing forward recursions and the other backward recursions. The depth of these memories can be scaled to the number of windows required with a maximum of 1024.
For all the external memories, memory latencies of one cycle in read/write access have been integrated in the ASIP pipeline. This specific feature has been added to cope with FPGA implementation requirements since embedded memory blocks are synchronous in an FPGA.
3.2.2 BCJR computation unit
Each BCJR computation unit is based on a Single Instruction Multiple Data (SIMD) architecture in order to exploit trellis transition parallelism. Thus, 32 adder nodes (one per transition) and 8 max nodes are incorporated in each unit (Figure 3.b). The 32 adder nodes are organized as a 4x8 processing matrix. In this organization, for an 8-state double binary code, the row and column of an adder node correspond respectively to the considered symbol decision and the ending state of the associated transition. For a 16-state simple binary code, transitions with ending states 0 to 7 are mapped on matrix nodes of row 0, if transition bit decision is 0, or matrix nodes of row 1, if transition bit decision is 1, whereas states 8 to 15 are mapped on nodes of rows 2 and 3. An adder node (Figure 3.c) contains one adder, multiplexers, one register for configuration (RT) and an output register (RADD). It supports the addition required in a recursion between a state metric (coming from the state metric register bank RMC) and a branch metric (coming from the branch metric register bank RG), and also the addition required in information generation since it can accumulate the previous result with the state metric of the other recursion coming from the register bank RC. The max nodes (Figure 3.d) are shared in the processing matrix so that the max operations can be performed on RADD registers either rowwise or columnwise, depending on the ASIP instructions. A max node contains three max operators connected in a tree. This makes it possible to perform either a four-input maximum (using the three operators) or two two-input maxima. Results are stored either in the first rows or columns of the RADD matrix or in the RMC bank to achieve recursion computation.
The BCJR computation unit also contains a GLOBAL ALU that computes extrinsic information, hard decisions and other global processing, and a Branch Metric (BM) generator, that performs branch metric calculations from extrinsic information register bank (RIE) and from channel information available in the pipeline registers (PR).
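The mapping of transitions onto the 4x8 adder-node matrix can be sketched as a pair of small functions. This is an illustrative reading of the description, not the ASIP's actual configuration: the function names are hypothetical, and for the 16-state code the choices "row 2 for bit 0, row 3 for bit 1" and "column = ending state modulo 8" are assumptions, since the text only states that states 8 to 15 use rows 2 and 3.

```python
def node_double_binary(decision, end_state):
    """8-state double binary code: row = symbol decision, column = ending state."""
    if not (0 <= decision < 4 and 0 <= end_state < 8):
        raise ValueError("decision must be in 0..3 and end_state in 0..7")
    return decision, end_state


def node_single_binary(bit, end_state):
    """16-state single binary code mapped onto the same 4x8 matrix.

    Ending states 0..7 use rows 0 (bit 0) and 1 (bit 1); ending states 8..15
    use rows 2 and 3 (bit-to-row order assumed, as is column = state % 8).
    """
    if bit not in (0, 1) or not (0 <= end_state < 16):
        raise ValueError("bit must be 0 or 1 and end_state in 0..15")
    return 2 * (end_state // 8) + bit, end_state % 8


# Both codes occupy each of the 32 adder nodes exactly once.
nodes_db = {node_double_binary(d, s) for d in range(4) for s in range(8)}
nodes_sb = {node_single_binary(b, s) for b in (0, 1) for s in range(16)}
print(len(nodes_db), len(nodes_sb))  # 32 32
```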
3.2.3 Control
The ASIP control part is based on a six-stage pipeline (Figure 3.e). Pipeline length was kept short in order to preserve some flexibility for further extensions. The stages are Fetch, Decode, Operand Fetch, Branch Metric, Execute, and Store. In comparison to [10], a Branch Metric (BM) stage has been added to the pipeline in order to anticipate the calculation of branch metrics performed in the BM generator, to increase the clock frequency of the ASIP.
The control part also requires several dedicated control registers. Thus, the window size is fixed in the register R_SIZE, and the address of the symbol currently processed inside the BCJR computation unit A (resp. BCJR computation unit B) is stored in the pipeline register ADDRESS_A (resp. ADDRESS_B). These addresses, as well as the program counter and the corresponding instruction, are then pipelined. In addition, the control architecture provides branch mechanisms and a Zero Overhead Loop (ZOL) fully dedicated to the butterfly scheme (see section 2.2). To lighten the ASIP instruction set, the ZOL mechanism is tightly coupled with address generation (see Figure 4). Thus, the first loop is performed while the address of the symbol processed by unit A is smaller than the address of the symbol processed by unit B. In case of odd window size, the middle symbol is processed by unit A when both addresses are equal. Finally, the second loop is performed while the address of the symbol processed by unit B is positive.
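The coupling between the ZOL mechanism and address generation described above can be sketched as a small generator. This is a behavioural illustration under stated assumptions, not the hardware implementation: the function name and the tuple layout are hypothetical, unit A is assumed to count upwards from address 0 and unit B downwards from the last address, "positive" is interpreted as "greater than or equal to zero", and both addresses are assumed to keep stepping after the middle symbol.

```python
def butterfly_schedule(window_size):
    """Yield (phase, addr_a, addr_b); addr_b is None for the middle symbol."""
    addr_a, addr_b = 0, window_size - 1
    # First loop: unit A moves forward, unit B moves backward, until they meet.
    while addr_a < addr_b:
        yield ("loop1", addr_a, addr_b)
        addr_a += 1
        addr_b -= 1
    # Odd window size: the middle symbol is handled by unit A alone.
    if addr_a == addr_b:
        yield ("middle", addr_a, None)
        addr_a += 1
        addr_b -= 1
    # Second loop: the units have crossed; it ends once B has passed address 0.
    while addr_b >= 0:
        yield ("loop2", addr_a, addr_b)
        addr_a += 1
        addr_b -= 1


for step in butterfly_schedule(5):
    print(step)
```

For a window of five symbols, this yields two first-loop steps, one middle step and two second-loop steps, so every address is visited once by each unit except the middle one.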
ASIP Instruction Set and code example
The instruction set of our ASIP architecture is coded on sixteen bits. The original version [10] contains 30 instructions that perform the basic operations (control, operative and IO) of the MAP algorithm. To increase performance, the ASIP was extended with compacted instructions that can perform several operations in different pipeline stages within a single instruction. This yields better code efficiency and, at the same time, more compact code. The following example illustrates the code required to continuously perform a 48-symbol sub-block decoding of a Wimax SISO using the butterfly scheme. The first six instructions load the required configuration and initialize the recursion metrics. Then the butterfly loops are initialized using the ZOLB instruction. The first loop (2 instructions) only computes the state metrics for the double binary code (thus 2 max operations are required). The second loop (5 instructions) additionally computes the extrinsic information of the eight-state Wimax code. Finally, the SISO exports the sub-block ending metrics and the program branches to the beginning of the butterfly.
Regarding the execution time, 2*N/2 cycles are needed in the first loop of the butterfly scheme, and 5*N/2 cycles in the second part, where N is the sub-block size. Thus, roughly 3.5 cycles are needed to process one symbol.
Prototyping and Verification Flow
Figure 6 gives an overall representation of the prototyping and verification flows proposed in this work. The prototyping flow is colored in dark gray and the verification flow in light gray.
LISA verification
Once the processor is described in the LISA language, the Processor Designer framework can automatically generate the ASIP software development tools (macro assembler, assembler, linker, simulators and debugger). It is thus possible to simulate and debug the ASIP at a high level of abstraction. Simulations are run on different testbenches (i.e. different initial conditions: frame size, trellis, SNR…), so that the tests cover most of the processor functionality. In order to perform automated simulation of a testbench at LISA level and also at lower levels of abstraction, the initial memory contents of the testbench have to be included in the executable file. Therefore, memory contents are inserted in the assembly files in user-defined sections, which are defined in the linker command file.
From LISA to RTL HDL
The Processor Generator tool included in Processor Designer creates automatically an HDL model of the processor described in the LISA language. In the flow, HDL model is represented by ASIP HDL files and HDL simulation memory files, since memory files are not synthesizable.
The quality of generated RTL code strongly depends on LISA modelling and on the HDL optimization options chosen in Processor Generator. To improve the efficiency of the generated RTL code, LISA code has to be as close as possible to the desired RTL code. So each resource has to be scaled at the right size and the code must exhibit all possible sharing that could not be detected automatically during synthesis. For example, operators have to be shared by all instructions that use these operators. In our ASIP, all instructions containing the word add use the operation containing the adders of processing matrix.
Based on enhanced LISA code, the generated HDL code can be further improved using HDL optimization options (path sharing, decision minimization, hierarchical pattern matching, condition decoupling for group calls and expressions). These options affect the generation of HDL code with respect to area or timing.
HDL verification
In order to verify the generated HDL, the Processor Designer tool suite also provides a utility called exe2txt. It extracts memory contents from the executable of the LISA testbench thanks to a memory-layout file generated by the Processor Generator. Then it generates memory-content files (mmap) that can be read by the generated HDL simulation memories. Hence, each LISA testbench has its corresponding set of mmap files. These mmap files and the HDL files generated by the Processor Generator make it possible to run the same testbench at HDL level using an HDL simulator, such as Modelsim.
From HDL to FPGA
The ASIP has been implemented using the Xilinx ISE tool suite. The HDL generated by Processor Generator is imported in a new Xilinx ISE project, but HDL simulation memories have to be substituted manually by memories that can be mapped onto an FPGA. This substitution implies the creation of IP memories with the correct parameters (depth, width, interfaces) and the adaptation of the generated IP to cope with HDL interfaces of the ASIP.
Inside a Virtex FPGA, two different IP memories exist: distributed RAM and block RAM. Distributed RAMs are zero-wait-state memories and use FPGA logic, whereas block RAMs are embedded blocks of 18Kb of dual port RAM with one wait state.
The choice of appropriate memories depends on LISA modeling. Zero-wait-state memories are easier to model in LISA, because there is no need to handle synchronization and to integrate it into the pipeline. Nevertheless, the logic cost obtained after synthesis may impose the implementation of one-wait-state memories in LISA code. In our case, all memories except cross memories and program memory have been converted into one-wait-state memories.
More generally, synthesis results (mainly logic utilization, frequency and critical path) given by synthesis tool (XST in ISE tool suite) may imply feedback in LISA modeling. No feedback is possible on HDL code due to the lack of readability of this generated code. However a feedback can be performed on HDL generation options.
This feedback on LISA modeling is extremely useful to obtain a balanced processor, that is, a processor whose critical paths are of similar length in the different pipeline stages.
Once iterations between LISA model and synthesis results are done, the last step towards FPGA programming is the place and route. This only requires a user constraint file (ucf) to specify the mapping of the FPGA on the board.
On Chip Verification
Similarly to HDL verification of a testbench, on chip verification requires the memories to be initialized with the right content.
The Xilinx FPGA memory IP used in our flow can be initialized from dedicated memory-content files with the extension .coe. As the memory content is already available in mmap files, we have created a mmap2coe script to convert memory contents to the coe format.
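A conversion of this kind can be sketched in a few lines of Python. The sketch below is not the mmap2coe script used in this work: the function name is hypothetical, and the input layout (one hexadecimal word per line, with blank lines and lines starting with '#' ignored) is an assumption, since the mmap format is not detailed here. Only the output side follows the Xilinx COE conventions, namely a radix declaration followed by a comma-separated initialization vector terminated by a semicolon.

```python
def mmap_to_coe(mmap_text, radix=16):
    """Convert memory contents to Xilinx COE text (illustrative sketch).

    Assumed input: one word per line, written in the given radix.
    """
    words = []
    for line in mmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed conventions)
        int(line, radix)  # raises ValueError on a malformed word
        words.append(line.upper())
    if not words:
        raise ValueError("no memory words found")
    header = (
        f"memory_initialization_radix={radix};\n"
        "memory_initialization_vector=\n"
    )
    return header + ",\n".join(words) + ";\n"


print(mmap_to_coe("0a\n\n# comment\nFF\n"))
```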
To perform on chip verification of the ASIP, the Xilinx ChipScope Pro tools are used.
The ChipScope Pro Core Inserter is a post-synthesis tool used to generate a netlist that includes the user design as well as a customizable logic analyzer and a controller providing a communication path between the JTAG Boundary Scan port of the target FPGA and the logic analyzer. The embedded logic analyzer can capture any internal signal of the design with elaborate trigger events. The trace data information are then stored using on-chip block RAM resources. The ChipScope Definition and Connection file (cdc) contains information about data to monitor.
Then the obtained design is mapped onto the FPGA and monitored data are recovered thanks to the ChipScope Pro Analyzer tool, which interfaces directly to the logic analyzer.
Finally, the recovered data make it possible to verify the functionality of the on-chip ASIP against the same testbenches that were used during LISA and HDL verification.
FPGA Prototyping
System description
The system is prototyped on the XUP Virtex-II Pro development board, which contains a Xilinx XC2VP30 FPGA. This FPGA device provides 13696 slices and 2448 Kb of embedded Block RAM. The slices can be used either as LUT or as distributed RAM (up to 428Kb).
Synthesis results
The ASIP presented in section 3 has been synthesized according to the flow of section 4.
First syntheses have been performed on the basis of a LISA model using memories without read latencies. Thus, memories have to be implemented as distributed RAM on FPGA logic. Consequently, following LISA modeling techniques for optimized HDL code [16], the ASIP utilizes 74% of logic resources with a clock frequency of 98 MHz.
To reduce logic occupation, most of the memories have been remodeled in LISA code as one-wait-state memories, so that they could be implemented as Block RAM. Synthesis results of this new model reveal a 68% logic utilization and a 12% Block RAM utilization. Furthermore, the ASIP frequency reaches 110 MHz. Thus, the modified ASIP both increases the clock frequency and reduces logic utilization.
Based on this model, synthesis is performed on several targets to estimate what our design could achieve on other FPGA-based boards (Table 1) and also as an ASIC implementation (Table 2).
On-chip Evaluation
The ASIP model was then validated with ChipScope. A logic analyzer has been integrated with the design and programmed in order to watch 206 internal signals of the ASIP over 8192 samples. The chosen signals make it possible to debug the processor throughout the test case. The overall design occupies 70% of the logic resources and 88% of the Block RAM.
The results obtained are consistent with the HDL and LISA results. Thus, on this board, with a clock frequency of 110 MHz, the ASIP can process one iteration on a frame of a double binary turbo code application like Wimax at a throughput of 63 Mbps. Consequently, the throughput achieved by this validation prototype is 6.3 Mbps for a 5-iteration turbo decoding.
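These throughput figures can be cross-checked against the cycle counts given in section 3 with a few lines of arithmetic. The sketch below is only a plausibility check: the function names are hypothetical, and the factor of two SISO passes per turbo iteration on the single ASIP (one per component code) is an assumption that is not stated explicitly in the text, although it reproduces the reported 6.3 Mbps.

```python
def cycles_per_symbol(loop1_instr=2, loop2_instr=5):
    """Each butterfly loop runs N/2 times, so the cost per symbol is the mean."""
    return (loop1_instr + loop2_instr) / 2


def siso_throughput_mbps(clock_mhz, bits_per_symbol, cps):
    """Throughput of a single SISO pass, in Mbps."""
    return clock_mhz * bits_per_symbol / cps


def decoder_throughput_mbps(siso_mbps, iterations, passes_per_iteration=2):
    """Decoder throughput; passes_per_iteration=2 is an assumption."""
    return siso_mbps / (iterations * passes_per_iteration)


cps = cycles_per_symbol()                # 3.5 cycles per symbol
siso = siso_throughput_mbps(110, 2, cps) # double binary: 2 bits per symbol
print(round(siso, 1))                    # 62.9
print(round(decoder_throughput_mbps(siso, 5), 1))  # 6.3
```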
Performing on the fly turbo decoding implies the integration of specific IO blocks, which is on-going work. Other prototyping work on the integration of this ASIP in a multiprocessor FPGA prototype is also presented in [17] .
Conclusion
In this paper, a complete design flow is presented from algorithm to prototyping. Through the description of the turbo decoding application, we propose an efficient implementation of a high throughput flexible turbo decoder. This implementation is achieved using an Application-Specific Instruction-set Processor with a SIMD architecture, a specialized and extensible instruction set, and a six-stage pipeline control. The proposed ASIP is developed in the LISA language and generated automatically using the Processor Designer framework from CoWare. In addition, the paper presents a verification and prototyping flow enabling rapid prototyping on FPGA reconfigurable logic and memory resources. With this flow, the ASIP is prototyped on a development board based on a Xilinx Virtex-II Pro FPGA and achieves, with 68% of the FPGA resources, a 6.3 Mbit/s throughput when decoding a double binary turbo code with 5 iterations.
