Abstract: This paper presents a flexible and high-efficiency decoder for turbo product codes based on extended Hamming codes. The supported component codes range from (8, 4) to (128, 120), providing enough flexibility for various communication standards. A novel high-efficiency Chase decoder architecture is developed using a low-complexity algorithm. Moreover, a conflict-free interleaving memory access model for variable code length is provided. Synthesis in a 90 nm standard cell technology shows that the decoder sustains a maximum throughput of 5.6 Gbps and consumes 300 k gates.
Introduction
Turbo product code (TPC) performs close to the Shannon limit and has lower complexity than turbo convolutional code [1]. TPC readily achieves gigabit-per-second throughput by decoding rows and columns in parallel. Numerous wireless and optical communication protocols have adopted TPC, but their code lengths vary. For example, TPC in IEEE 802.16 uses the (16, 11), (32, 26), or (64, 57) extended Hamming code as the component code in both the row and column dimensions. A TPC decoder for future high data rate communication systems therefore needs both high throughput and flexibility.
A TPC decoder is composed of a soft-input soft-output (SISO) decoder and an interleaving resource. Most TPC decoders use the Chase algorithm for SISO decoding of block codes. Register files, memories, and connection networks are candidates for the interleaving resource. Numerous implementations have been proposed, but few achieve high efficiency while remaining flexible. In this paper, a high-efficiency Chase decoder for TPC is designed using a low-complexity algorithm. Moreover, we propose a conflict-free memory access model that uses memory as the interleaving resource and supports variable code length. We implement the TPC decoder using several Chase decoders and interleaving memories in Chartered 90 nm CMOS technology.
TPC decoder design

Low complexity Chase decoding algorithm
The Chase algorithm for block codes includes three steps: least reliable sorting, test vector decoding, and soft output. In least reliable sorting, the input $R = \{r_0, r_1, \ldots, r_{n-1}\}$ is sorted to determine the $p$ least reliable values in $R$, and a hard decision is defined by $Y = \mathrm{sign}(R)$. Then $2^p$ test vectors are generated from $Y$ and the $p$ least reliable positions, traversing all possible bit patterns on those $p$ positions. Algebraic decoding of the test vectors produces a valid code set, in which $D = \{d_0, d_1, \ldots, d_{n-1}\}$ is the code with minimal Euclidean distance to $R$. In the soft output step, for every position $i$ the decoder seeks a candidate code $C$ in the valid code set that has minimal distance to $R$ and satisfies $d_i \neq c_i$. The soft value is then calculated from $C$ and $D$.
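The first two steps can be illustrated with a minimal Python sketch. The function name `chase_test_vectors` and the bit mapping (a non-negative sample decodes to bit 1) are assumptions made for illustration only; they are not taken from the paper.

```python
from itertools import product

def chase_test_vectors(r, p):
    """Return (Y, least reliable positions, test vectors) for soft input r."""
    # Hard decision Y: bit 1 for a non-negative sample, bit 0 otherwise
    # (assumed mapping, chosen only for this sketch).
    y = [1 if v >= 0 else 0 for v in r]
    # The p least reliable positions are those with the smallest |r_i|.
    positions = sorted(range(len(r)), key=lambda i: abs(r[i]))[:p]
    vectors = []
    # Enumerate all 2^p bit patterns on the p least reliable positions.
    for flips in product((0, 1), repeat=p):
        t = list(y)
        for pos, flip in zip(positions, flips):
            t[pos] ^= flip
        vectors.append(t)
    return y, positions, vectors
```

Each test vector would then go through algebraic decoding; the sketch stops before that step because it depends on the component code.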
The most complex parts of the original Chase algorithm are the following. First, decoding each test vector $T_i$ requires a syndrome calculation $S_i = T_i H^{T}$. Second, Euclidean distance computation is difficult to implement in hardware, so a simplified expression should be devised. Third, sorting in the candidate code search is necessary for each position $i$. The complexity of these three tasks grows with the code length $n$, which makes Chase decoding inefficient when the code length varies. Several modifications have been proposed to address these issues. In [2], test vector decoding is done in Gray code order, which reduces its complexity to one vector-matrix multiplication plus $2^p - 1$ vector-vector multiplications in GF(2). In [3], the dot product is applied instead of the Euclidean distance, so only the bits that differ between a valid code and $Y$ contribute to the metric.
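The Gray code idea can be sketched as follows: consecutive test vectors differ in a single bit, so each new syndrome is the previous one combined with one column of $H$. This is one reading of the scheme in [2], not its exact formulation. The parity-check matrix used here, whose column $j$ is the binary index $j$ with an overall-parity bit on top, is a common form for extended Hamming codes of length $2^m$ and is an assumption of this sketch, as are all function names.

```python
def column(j, m):
    """Column j of an illustrative extended Hamming H, packed into an int."""
    return (1 << m) | j  # overall-parity row stacked on the binary index j

def syndrome(bits, m):
    """Full syndrome: XOR of the columns of H where the vector has a 1."""
    s = 0
    for j, b in enumerate(bits):
        if b:
            s ^= column(j, m)
    return s

def gray_order_syndromes(y, positions, m):
    """Yield (test vector, syndrome) for all 2^p patterns in Gray code order."""
    t = list(y)
    s = syndrome(t, m)                    # the only full syndrome computation
    yield list(t), s
    for k in range(1, 1 << len(positions)):
        flip = (k & -k).bit_length() - 1  # bit that changes between Gray codes k-1 and k
        pos = positions[flip]
        t[pos] ^= 1                       # invert one bit of the test vector
        s ^= column(pos, m)               # update the syndrome with one column of H
        yield list(t), s
```

Only the first syndrome touches all $n$ positions; the remaining $2^p - 1$ updates each cost one XOR of an $(m + 1)$-bit value, independent of the code length.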
For a least reliable position $i$, there are at least two test vectors whose $i$th bit is 1 if $p \geq 2$. The algebraic decoder cannot correct the $i$th bit in both of them; thus a code with $c_i = 1$ exists in the valid code set. Similarly, a code with $c_i = 0$ also exists in the valid code set, so candidate codes can always be found for the least reliable positions. The extended Hamming code corrects only one error; thus, a valid code and $Y$ differ in at most $p + 2$ bits and at least $p - 2$ bits. Therefore, the extrinsic information varies within the same range, and averaging $p$ is suitable for estimating the extrinsic information when the maximum metric is used. This approach is useful because candidate codes need to be searched only for the $p$ least reliable positions, not for all bits. This property holds for all code lengths.
Algorithm 1 is the modified algorithm, with lower complexity than the original one and much less dependency on code length. Section 3 shows that the coding gain of the low-complexity algorithm declines only slightly, or even improves for certain code lengths. Figure 1 shows the proposed Chase decoder architecture, which takes input $r_i$ and generates output $\lambda_i$ serially. The architecture consists of three modules, which implement the three parts of the algorithm: least reliable sorting (lines 1-2); test vector decoding (lines 3-9); and soft output (lines 10-19). The latency of the Chase decoder consists of the input time of $n$ cycles and the test vector decoding time of $2^p$ cycles. Output also takes $n$ cycles. The complexity that varies with $n$ is hidden by the input and output latency; only the bit widths of the registers and the combinational logic are relevant.
High-efficiency Chase decoder architecture
Fig. 1. Chase decoder architecture
The least reliable sorting module reads $r_i$ from RAM, incrementing the read address by 1 for each value. The RAM is read again later by the other two modules. Sorting proceeds simultaneously with the input of $r_i$. At the end of input, the hard decision $Y$ with its syndrome, the $p$ least reliable positions with their $r_i$, and the corresponding $p$ columns of $H$ have been obtained and are stored in registers.
The test vector decoding module is divided into three pipeline stages. The first stage inverts one bit of $T_j$ in each cycle to generate a new test vector and its syndrome. Then, algebraic decoding corrects at most 1 bit of the test vector according to the syndrome. If necessary, a RAM read request is sent to the read address generation unit (AGU) to access $r_i$ at the corrected position. In stage 2, the decoding result is compared with $Y$ and the differences are recorded. Stage 3 uses the comparison to compute the relative dot product $M$ by adding $r_i$ or $-r_i$ at the positions where the bits differ.
In the soft output module, sorting is performed as soon as $M_j$ is available from the test vector decoding module. There are $p + 1$ minimum-sort units for $D$ and the unreliable bits, and one maximum-sort unit for the reliable bits. When test vector decoding is complete, the minimum and maximum values in the registers are ready for soft output generation. Thus, $r_i$ can be read from RAM and $\lambda_i$ computed immediately.
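The following sketch illustrates why a metric accumulated only over differing bits can stand in for the Euclidean distance. It assumes antipodal symbols (bit 1 maps to +1, bit 0 to -1) and defines the relative metric as the sum of $|r_i|$ over the positions where a candidate differs from $Y$, so that a smaller value means a closer code; the sign convention of the actual design may be the opposite, and the function names are illustrative.

```python
def hard_decision(r):
    """Bit 1 for a non-negative sample, bit 0 otherwise (assumed mapping)."""
    return [1 if v >= 0 else 0 for v in r]

def squared_distance(r, bits):
    """Squared Euclidean distance between r and the antipodal image of bits."""
    return sum((ri - (1.0 if b else -1.0)) ** 2 for ri, b in zip(r, bits))

def relative_metric(r, y, bits):
    """Sum of |r_i| over the positions where the candidate differs from Y."""
    return sum(abs(ri) for ri, yi, ci in zip(r, y, bits) if yi != ci)
```

Under these assumptions, `squared_distance(r, c) - squared_distance(r, y)` equals `4 * relative_metric(r, y, c)` for any candidate `c`, so ranking candidates by the relative metric gives the same order as ranking them by Euclidean distance while touching only the few differing positions.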
Conflict-free memory access model
In turbo codes, an interleaving resource is necessary to rearrange data in a particular order between iterations. Usually, three kinds of hardware are adopted: register files, memories, and connection networks. A register file is the simplest option, but the hardware overhead caused by a large code matrix is intolerable: the $(128, 120)^2$ code needs $8 \times 128 \times 128$ bits of storage if symbols are quantized to 8 bits. Some studies [4, 5] have addressed the use of connection networks, such as the Omega network. However, connection networks cannot support variable code length. Suppose the SISO decoder contains eight Chase decoders, so that eight rows or columns can be processed in parallel. While decoding the $(8, 4)^2$ code matrix, the output of the row SISO can be delivered through the interleaving connection network directly to the column SISO and used without latency. However, for the $(16, 11)^2$ code, the row SISO needs two subprocedures of eight rows each, and the column SISO has to be divided as well. Because the rows are processed in separate subprocedures, the output of the row SISO does not provide all the elements of a complete column at once. The connection network therefore cannot rearrange data into column order without storage. In our design, we use memory as the interleaving resource. Figure 2 shows a full iteration of TPC.
The memory is divided into eight banks. For a code matrix with $2^{q_1}$ rows and $2^{q_2}$ columns, datum $D_{x,y}$ is stored at address $A_{x,y} = (k, l)$, where $k$ is the bank index and $l$ is the internal address in the bank. More than one subprocedure is needed if $q_1 > 3$ or $q_2 > 3$. For example, when the TPC matrix has 16 rows, the row SISO decoder first processes rows 0-7 and then rows 8-15. In the memory for rows, we place $D_{x,y}$ at $A_{x,y} = (x \bmod 8,\; y + \lfloor x/8 \rfloor \times 2^{q_2})$. Any two data are stored in the same bank only if they are within the same row or belong to two different subprocedures. In the memory for columns, a similar scheme is applied to avoid read conflicts, where $A_{x,y} = (y \bmod 8,\; x + \lfloor y/8 \rfloor \times 2^{q_1})$. This address scheme ensures conflict-free memory reading, because no two data in the same bank are read at the same time.
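The two address formulas can be checked with a short Python sketch. It assumes that $x$ is the row index, $y$ the column index, that the matrix has $2^{q_1}$ rows and $2^{q_2}$ columns, and that integer division stands for the floor in $x/8$; the function names are illustrative.

```python
BANKS = 8  # eight Chase decoders per SISO, hence eight banks

def row_memory_address(x, y, q2):
    """(bank, internal address) of D[x][y] in the memory read by the row SISO."""
    return (x % BANKS, y + (x // BANKS) * (1 << q2))

def column_memory_address(x, y, q1):
    """(bank, internal address) of D[x][y] in the memory read by the column SISO."""
    return (y % BANKS, x + (y // BANKS) * (1 << q1))

def mapping_is_conflict_free(q1, q2):
    """Check unique storage cells and distinct banks within each subprocedure."""
    rows, cols = 1 << q1, 1 << q2
    cells = [(x, y) for x in range(rows) for y in range(cols)]
    row_cells = {row_memory_address(x, y, q2) for x, y in cells}
    col_cells = {column_memory_address(x, y, q1) for x, y in cells}
    if len(row_cells) != len(cells) or len(col_cells) != len(cells):
        return False  # two data would share one storage location
    # Each subprocedure handles 8 consecutive rows (or columns) in parallel;
    # they must fall into 8 different banks.
    for base in range(0, rows, BANKS):
        if len({row_memory_address(x, 0, q2)[0] for x in range(base, base + BANKS)}) != BANKS:
            return False
    for base in range(0, cols, BANKS):
        if len({column_memory_address(0, y, q1)[0] for y in range(base, base + BANKS)}) != BANKS:
            return False
    return True
```

Because the bank index depends only on the row (or column) number modulo 8, the eight decoders of one subprocedure always read from eight different banks, whatever position each of them is currently reading.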
We solve the data rearrangement problem and avoid memory write conflicts in a way inspired by data interleaving with a connection network. The eight outputs of the row SISO are written to the memory for columns according to the address described above. Write conflicts are avoided by having the $i$th Chase decoder generate its output data $\Lambda = \{\lambda_0, \lambda_1, \ldots, \lambda_{n-1}\}$ in rotated order: the first output is $\lambda_i$, and each subsequent output index is the previous index plus 1 modulo $n$, so the output at cycle $t$ is $\lambda_{(i + t) \bmod n}$. For example, the output order of Chase decoder 0 is $\{D_{0,0}, D_{0,1}, \ldots\}$, and the output order of Chase decoder 1 is $\{D_{1,1}, D_{1,2}, \ldots\}$. This setup makes the column indices of the eight row SISO outputs distinct modulo 8; thus, they are stored in different banks. A barrel shifter is inserted to direct each datum to the bank where it belongs. The output order of the column SISO rearranges data similarly.
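The rotated output order can be checked in a few lines of Python. The sketch assumes eight decoders indexed 0-7 within a subprocedure and a code length $n$ that is a multiple of 8; `output_column` and `write_banks` are illustrative names.

```python
BANKS = 8  # eight Chase decoders per SISO, eight banks in the column memory

def output_column(i, t, n):
    """Column index emitted by Chase decoder i at output cycle t."""
    return (i + t) % n

def write_banks(t, n):
    """Bank of the column memory targeted by each of the 8 decoders at cycle t."""
    return [output_column(i, t, n) % BANKS for i in range(BANKS)]
```

At cycle $t$, decoder $i$ targets bank $(i + t) \bmod 8$, so the eight targets are a rotation of banks 0-7 by $t \bmod 8$. That rotation is exactly what a barrel shifter provides.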
Result and comparison
The proposed TPC decoder is implemented in HDL and synthesized in Chartered 90 nm technology with Synopsys Design Compiler. The whole decoder contains four copies of the architecture in Fig. 2, comprising eight SISO decoders (with eight Chase decoders each) and eight interleaving memories (2 * 8 * 2 k bit each). Ignoring the latency of completing four iterations, throughput is calculated as $T = f \times P$, where $f$ is the decoder frequency and $P$ is the number of Chase decoders in a SISO. Area $A$ is expressed as the number of equivalent two-input NAND gates. If the Chase decoder supports a maximum code length of $2^q$, the complexity of decoding a bit is $O(q)$. A fair comparison of TPC decoders with different code lengths is obtained using the efficiency $E = qT/A$. Table I compares the synthesized result at 700 MHz with related TPC decoders. Our proposed decoder shows high efficiency while supporting variable code length compared with other architectures. Coding gain declines only slightly and is even better for the $(32, 26)^2$ code. Throughput can be improved by adding more Chase decoders, which does not affect efficiency.
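As a worked check of the two formulas, the sketch below plugs in the figures quoted in this paper: $f$ = 700 MHz, $P$ = 8, and the approximate gate count of 300 k from the abstract. Taking $q = 7$ for the maximum length $128 = 2^7$ and using the rounded gate count are assumptions of this sketch, so the resulting efficiency is indicative only and need not match the entries of Table I.

```python
def throughput_bps(f_hz, p_decoders):
    """T = f * P: one output bit per Chase decoder per clock cycle."""
    return f_hz * p_decoders

def efficiency(q, throughput, gates):
    """E = q * T / A, in bits per second per equivalent NAND gate."""
    return q * throughput / gates

T = throughput_bps(700_000_000, 8)   # 5.6e9 bit/s, the 5.6 Gbps of the abstract
E = efficiency(7, T, 300_000)        # q = 7 and A = 300 k gates are assumed values
```

With these inputs, `T` reproduces the 5.6 Gbps stated in the abstract.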
Table I. Comparison of related decoders
Conclusions
This paper proposes a high-efficiency Chase decoder using a low-complexity algorithm. Moreover, a conflict-free memory access model is provided to avoid read and write conflicts in the decoder. The TPC decoder maintains high efficiency across variable code lengths. Being flexible and more efficient than the compared designs, our decoder is suitable for future wireless and optical communication.
