Abstract-In this work, a decoder chip for time-invariant tail-biting LDPC convolutional code (TB-LDPC-CC) is proposed. By modifying the layered decoding scheduling, the proposed decoding algorithm converges twice as fast as the conventional flooding scheduling. Furthermore, the storage requirement is reduced by 30.77% owing to the adaptive channel value addressing employed in the memory-based decoder design. The ability to handle multiple frame sizes lowers the power and lets the decoder adapt to multiple applications. By integrating these techniques, a TB-LDPC-CC decoder chip supporting three frame sizes is implemented in UMC 90nm CMOS technology. The decoder containing 4 processors occupies 2.18mm
I. INTRODUCTION
Today's forward error correction codes in emerging wireless communication systems must possess superior error-correcting capability, high decoding throughput, good energy efficiency, and extra flexibility in code rates and frame sizes. Low-density parity-check block codes (LDPC-BCs) perform well in error-correcting performance and throughput [1], but the decoder pays extra cost in order to support multiple code rates and frame sizes. In contrast, turbo codes can provide adjustable code rates and frame sizes through puncturing and proper termination, but their limited parallelism and high hardware cost remain barriers to multi-Gbps throughput. Fortunately, low-density parity-check convolutional codes (LDPC-CCs), proposed in 1999 [2], have the characteristics of both LDPC and convolutional codes, giving the potential for high throughput and flexible code rates and frame sizes.
Even though the previous LDPC-CC decoder chip [3] achieved better energy/hardware efficiency than state-of-the-art designs [4]-[6], it could only decode a continuous stream without proper codeword termination. Therefore, it is not suitable for packet-based transmission schemes such as IEEE 802.11ac/ad and LTE-Advanced applications [7], [8]. To perform codeword termination, we employ the tail-biting technique instead of appending a termination sequence, avoiding the performance degradation due to code-rate loss. Based on the layered normalized min-sum decoding algorithm [9], the proposed TB-LDPC-CC decoder has the following features: 1) three frame sizes supported by an inter-processor permutation network, 2) proper termination via the tail-biting technique without code-rate loss, 3) 4-stage pipelining for high operating frequency, 4) a QC-expansion matrix enabling parallel decoding without memory access collision, and 5) adaptive channel value addressing with an inter-row permutation network to save 30% of the storage requirement. Fig. 1 demonstrates one possible application scenario for the proposed TB-LDPC-CC decoder. In a 4x4 MIMO system with unequal modulation, different coding schemes can be assigned to different spatial streams, which yields three combinations: 1-to-1 with one (4992, 2496) LDPC-CC decoder, 2-to-2 with two (2496, 1248) LDPC-CC decoders, and 4-to-4 with four (1248, 624) codewords decoded individually. With a conventional LDPC block code decoder, four decoders are required to handle four codewords at the same time; otherwise, a large buffer is necessary to store unprocessed data.
The rest of this paper is organized as follows. Section II reviews the encoding algorithm of TB-LDPC-CC and derives the tail-biting property examination. In Section III, the proposed code construction of TB-LDPC-CC as well as its decoding scheduling algorithm is presented. The details of decoder architecture and optimization techniques are discussed and analyzed. Moreover, Section IV reports the implementation results of proposed TB-LDPC-CC code, including bit-error rate (BER) performance and measurement results of the 90-nm chip. Finally, the conclusions are given in Section V.
II. TAIL-BITING LDPC-CC

A. Encoding of Tail-biting LDPC-CC
The tail-biting technique can convert a convolutional code into a quasi-cyclic (QC) block code. The tail-bitten version of the parity-check matrix for LDPC-CC, together with its circular decoder architecture, was proposed in [10]; it is obtained by wrapping the last (c − b) · m_s columns of the syndrome former after time instant t = t_N, where m_s is the memory size of the encoder and the code rate is R = b/c (b < c). The wrapping operation results in a circular Tanner graph. However, encoding a TB-LDPC-CC through the wrapping operation requires matrix multiplication circuitry similar to that used for encoding LDPC-BCs, which is no longer as simple as a convolutional encoder.
Another implementation for encoding the TB-LDPC-CC was proposed in [11], which used state-space variables to calculate the initial state of the encoder. This technique originally came from [12], which presented the encoding of tail-biting codes with a feedback encoder. Let S_t denote the state vector at time instant t. Let u_t be the information sequence and v_t be the code bits. For the case of systematic encoding and a partial syndrome encoder, the correct initial state can be calculated from the following relationship
where A(t) is the state matrix of size m_s(c−b) × m_s(c−b), B(t) denotes the input matrix of size m_s(c−b) × b, C(t) is the output matrix of size c × m_s(c−b), and D(t) denotes the feed-forward matrix of size c × b. Note that the variables in the above equations are functions of time because time-varying codes are considered. The matrices A(t), B(t), C(t), and D(t) can be determined from the state transitions of the partial syndrome encoder.
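For orientation, a state-space model consistent with the matrix dimensions listed above can be written as follows. This is a reconstruction under the assumption that S_t, u_t, and v_t are treated as column vectors over GF(2); the exact form and transposition conventions of (1) may differ.

```latex
\[
\begin{aligned}
  S_{t+1} &= A(t)\,S_t + B(t)\,u_t, \\
  v_t     &= C(t)\,S_t + D(t)\,u_t,
\end{aligned}
\qquad
S_t \in \mathrm{GF}(2)^{m_s(c-b)},\quad
u_t \in \mathrm{GF}(2)^{b},\quad
v_t \in \mathrm{GF}(2)^{c}.
\]
```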
The complete solution of (1) is given by the superposition of the zero-input solution S^{[zi]}_{t_N} and the zero-state solution S^{[zs]}_{t_N}. Both can be derived by applying (1) recursively. The zero-input solution S^{[zi]}_{t_N} is the state reached after t = t_N time instants if the encoding starts in an arbitrary state S_0 and all input bits are zero. The zero-state solution S^{[zs]}_{t_N} is the state reached after t = t_N time instants if the encoding starts in the all-zero state S_0 = 0 and the input is the information sequence u. Then, according to the tail-biting criterion that the state at time t = t_N equals the initial state S_0, the initial state of the encoder is
where
As long as the matrix in (2) is invertible, the encoding procedure for tail-biting LDPC convolutional code can be summarized as the following steps.
1) Determine the zero-state response S^{[zs]}_{t_N}
2) Calculate the initial state S_0
3) Perform actual encoding to obtain the valid code sequence v
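The three steps above can be sketched in Python for a time-invariant encoder over GF(2). This is a minimal illustration, not the encoder of the proposed code: the toy matrices A, B, C, D, the helper names, and the column-vector state update S_{t+1} = A S_t + B u_t are assumptions made for the example. Step 2 solves (I + A^{t_N}) S_0 = S^{[zs]}_{t_N}, which follows from the tail-biting criterion when A does not depend on time.

```python
def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def mat_add(X, Y):
    return [[a ^ b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_mul(X, Y):
    return [[sum(X[i][k] & Y[k][j] for k in range(len(Y))) & 1
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_vec(M, x):
    return [sum(m & xi for m, xi in zip(row, x)) & 1 for row in M]

def vec_add(x, y):
    return [a ^ b for a, b in zip(x, y)]

def solve_gf2(M, y):
    """Solve M s = y over GF(2) by Gauss-Jordan elimination (M square)."""
    n = len(M)
    aug = [row[:] + [yi] for row, yi in zip(M, y)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def run_encoder(A, B, C, D, s0, u):
    """Run the state-space encoder from state s0; return (final state, outputs)."""
    s, out = s0[:], []
    for ut in u:
        out.append(vec_add(mat_vec(C, s), mat_vec(D, ut)))
        s = vec_add(mat_vec(A, s), mat_vec(B, ut))
    return s, out

def tail_biting_encode(A, B, C, D, u):
    n = len(A)
    # Step 1: zero-state response after t_N = len(u) time instants.
    s_zs, _ = run_encoder(A, B, C, D, [0] * n, u)
    # Step 2: solve (I + A^{t_N}) S_0 = S_zs for the initial state.
    a_pow = identity(n)
    for _ in u:
        a_pow = mat_mul(A, a_pow)
    s0 = solve_gf2(mat_add(identity(n), a_pow), s_zs)
    # Step 3: actual encoding; the final state must equal the initial state.
    s_end, v = run_encoder(A, B, C, D, s0, u)
    if s_end != s0:
        raise RuntimeError("tail-biting criterion not met")
    return s0, v

# Toy rate-1/2 example (hypothetical matrices, t_N = 4).
A = [[0, 1], [1, 1]]
B = [[1], [0]]
C = [[0, 0], [1, 1]]
D = [[1], [1]]
u = [[1], [0], [1], [1]]
s0, v = tail_biting_encode(A, B, C, D, u)
```

If I + A^{t_N} is singular, `solve_gf2` raises an error, which corresponds to the case where the code cannot be tail-bitten for that frame length.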
B. Proposed Tail-biting Property Examination
For a time-invariant LDPC convolutional code, the state matrix is independent of time. Consequently, equation (2) can be rewritten as
Assuming the state matrix A is binary, we can find a matrix Z such that
Therefore, the matrix
where α_i ∈ {0, 1}. It is observed that if the matrix I + A is invertible, then the matrix I + A^{t_N} is also invertible at time instant t_N = 2^n. From this relationship alone, we still cannot tell whether the code can be tail-bitten for an arbitrary time duration. Fortunately, if I + A is invertible, then the time-invariant LDPC-CC can surely be tail-bitten when the encoding time duration is t_N = 2^n. In this case, the matrix becomes
Equation (6) can be easily derived by mathematical induction. Once I + A is invertible, the matrix I + A^{2^n}, or equivalently (I + A)^{2^n}, is also invertible. Hence, this kind of time-invariant LDPC-CC has a tail-bitten version when the encoding time duration is equal to 2^n.
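The induction step rests on the fact that, over GF(2), (I + A)^2 = I + A + A + A^2 = I + A^2, so repeated squaring gives (I + A)^{2^n} = I + A^{2^n}. The short Python check below illustrates this numerically. The 2 × 2 matrix A is a hypothetical toy state matrix chosen for the example, and the helper names are not from the paper.

```python
def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]

def mat_add(X, Y):
    return [[a ^ b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def mat_mul(X, Y):
    return [[sum(X[i][k] & Y[k][j] for k in range(len(Y))) & 1
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_pow(X, e):
    result = identity(len(X))
    for _ in range(e):
        result = mat_mul(result, X)
    return result

def is_invertible(M):
    """Invertibility over GF(2) via forward elimination."""
    rows = [r[:] for r in M]
    n = len(rows)
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            return False
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            if rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[col])]
    return True

A = [[0, 1], [1, 1]]          # toy binary state matrix; I + A is invertible
I2 = identity(2)

for n in range(1, 5):
    t_n = 2 ** n
    lhs = mat_add(I2, mat_pow(A, t_n))       # I + A^(2^n)
    rhs = mat_pow(mat_add(I2, A), t_n)       # (I + A)^(2^n)
    assert lhs == rhs and is_invertible(lhs)

# A duration that is not a power of two need not work: here A^3 = I,
# so I + A^3 is the zero matrix.
print(is_invertible(mat_add(I2, mat_pow(A, 3))))  # prints False
```

The last line shows why the guarantee is stated only for t_N = 2^n: for this toy matrix, a frame of three time instants cannot be tail-bitten even though I + A is invertible.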
III. PROPOSED ALGORITHM AND ARCHITECTURE
A. Code construction for tail-biting LDPC-CC
The proposed code construction procedure for TB-LDPC-CC is illustrated in Fig. 2. First, we set the diagonal entries of the information part and the parity part to ones for fast encoding. Second, according to equation (6), the parity part of the polynomial parity-check matrix is forced to be lower triangular to ensure that the code has a tail-bitten version. Next, each entry is filled with a monomial to reduce the number of short cycles. It is worth mentioning that the differences between neighboring coefficients in the same column are greater than one to avoid memory access collision in the hardware. Finally, the quasi-cyclic expansion (6 by 6) is applied to obtain a larger polynomial parity-check matrix with better performance and the possibility of highly parallel decoding.
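The column constraint mentioned above can be checked mechanically. The sketch below assumes one particular reading of the text: each entry of the polynomial parity-check matrix is represented by the exponent of its monomial (or None for a zero entry), and "neighboring" means vertically adjacent non-zero entries in a column. The representation and the function name are hypothetical.

```python
def column_gaps_ok(exponents, min_gap=2):
    """Return True if vertically adjacent non-zero entries in every column
    have exponents that differ by at least min_gap (i.e. by more than one)."""
    for column in zip(*exponents):
        present = [e for e in column if e is not None]
        for upper, lower in zip(present, present[1:]):
            if abs(upper - lower) < min_gap:
                return False
    return True

# Illustrative exponent tables, not taken from the proposed code.
good = [[0, 3],
        [2, None],
        [5, 7]]
bad = [[0, 3],
       [1, 5]]
print(column_gaps_ok(good), column_gaps_ok(bad))
```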
B. Decoder Architecture
The overall block diagram of the proposed decoder is shown in Fig. 3. It is composed of 4 processors and 27 memory banks on a chip. Each processor has 7 sub-variable node unit (SVNU) arrays (one array contains 6 SVNUs), 6 check node units (CNUs) with one pipelining stage inside, as well as an inter-row permutation network and a reverse inter-row permutation network to deal with the different permutation parameters between rows of the parity-check matrix.
The decoding flow is carried out as follows. In the beginning, the check node to variable node (CtoV) messages are read from MEM-G1 to MEM-G8 and delivered to the correct processors through the inter-processor permutation networks. Then the CtoV messages are summed with the channel values in the SVNU array to form the newest variable node to check node (VtoC) messages. One pipelining stage is inserted inside the CNU to balance the critical path. After the CNU update, the output CtoV messages are combined with the original messages, reversely permuted into the form required by the next access, and stored back to the memories. As shown in Fig. 4(a), there are four pipelining stages. Since memory read and write operations occur simultaneously, two-port memories are adopted in our implementation. Note that the memory access addresses are pre-computed according to the parity-check matrix. Once the stored messages are accessed, the corresponding alignment can be performed by adaptively shifting the channel values.
[Fig. 4(a): Pipelining, showing the four pipelining stages.]
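The check-node update performed inside each CNU can be sketched as follows, assuming the standard normalized min-sum rule: each outgoing CtoV message takes the product of the signs and the minimum of the magnitudes of all other incoming VtoC messages, scaled by a normalization factor. The factor 0.75 and the function name are illustrative assumptions; the text does not state the value used on the chip, and fixed-point quantization is ignored here.

```python
def cnu_normalized_min_sum(v2c, alpha=0.75):
    """Normalized min-sum check-node update for one check node.

    v2c   : list of incoming VtoC messages (LLRs), at least two entries
    alpha : normalization factor (assumed value, not from the paper)
    """
    mags = [abs(m) for m in v2c]
    # Track the smallest magnitude, its position, and the second smallest.
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min1_idx)
    # Product of all input signs.
    sign_product = 1
    for m in v2c:
        if m < 0:
            sign_product = -sign_product
    c2v = []
    for i, m in enumerate(v2c):
        # Exclude the edge's own contribution from magnitude and sign.
        mag = min2 if i == min1_idx else min1
        sign = sign_product * (-1 if m < 0 else 1)
        c2v.append(alpha * sign * mag)
    return c2v

print(cnu_normalized_min_sum([2.0, -1.0, 4.0]))  # prints [-0.75, 1.5, -0.75]
```

Keeping only the two smallest magnitudes and the overall sign is what makes this update cheap in hardware, since every output can be formed from those three values.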
C. Error-correcting Performance
Fig. 5 shows the bit-error-rate (BER) performance of the proposed TB-LDPC-CC with three frame sizes, 1248, 2496, and 4992, under the AWGN channel. The performance curves of layered decoding for LDPC-CC [3] and LDPC-BC [13] are also shown for comparison. It is observed that the proposed decoder with 4 iterations achieves performance similar to that of [3] and [13] with 5 iterations. Furthermore, if better performance is required, the non-terminated LDPC-CC decoders, e.g. [3] and [5], need to serially connect as many processors as the required maximum number of iterations, so performance is gained at the cost of linear growth in hardware and power consumption. On the other hand, the proposed decoder can continue iterative decoding without hardware overhead until the maximum number of iterations is reached or the early-termination condition is satisfied. This is a trade-off between error-correcting performance and decoding time. The BER performance with 20 iterations is close to that of the mother code, the (48,24,21,1) LDPC-CC.
IV. IMPLEMENTATION RESULTS
Fabricated in the UMC 90nm 1P9M CMOS process, our test chip with 4 pipelining stages integrates the layered scheduling with adaptive channel value addressing, a parallel architecture for QC expansion, and an inter-processor permutation network dealing with multiple frame sizes. The chip micrograph and the key features of the test chip are shown in Fig. 6. The proposed decoder supports three frame sizes: 1248, 2496, and 4992. Moreover, since the LDPC-CC code is properly tail-bitten, every processor can decode an individual codeword for a different number of iterations. The performance of our test chip is compared with state-of-the-art designs in Table I. The decoder, including 104.8Kb of two-port SRAMs, occupies 2.18 mm^2 of area (excluding three IO buffers). Measurement results in Fig. 7 show that the decoder draws 275mW under a 0.8V supply voltage while operating at 305MHz. Since the expansion factor equals 6, the information throughput of the TB-LDPC-CC decoder reaches 1.83 Gb/s under 4 decoding iterations with an energy efficiency of 37.6pJ/bit/proc. When the supply voltage is scaled to 1.2V, as shown in Fig. 7, the throughput can be enhanced to 2.1 Gb/s at the cost of larger power consumption.
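The reported figures can be cross-checked with simple arithmetic. The formulas below are one reading that reproduces the numbers, not something stated explicitly in the text: throughput is taken as clock frequency × expansion factor × number of processors / number of iterations, and the per-processor energy as power divided by throughput and by the number of processors. The function names are hypothetical.

```python
def info_throughput_bps(clock_hz, expansion, processors, iterations):
    # Assumed model: each processor delivers `expansion` information bits
    # per cycle, and the processors share the iterations of the frames.
    return clock_hz * expansion * processors / iterations

def energy_pj_per_bit_per_proc(power_w, throughput_bps, processors):
    return power_w / throughput_bps / processors * 1e12

tp_4_iter = info_throughput_bps(305e6, 6, 4, 4)   # about 1.83e9 b/s
tp_2_iter = info_throughput_bps(305e6, 6, 4, 2)   # about 3.66e9 b/s
print(round(energy_pj_per_bit_per_proc(0.275, tp_4_iter, 4), 1))  # prints 37.6
print(round(energy_pj_per_bit_per_proc(0.275, tp_2_iter, 4), 1))  # prints 18.8
```

Under this reading, 1.83 Gb/s and 37.6 pJ/bit/proc correspond to 4 iterations at 305 MHz and 275 mW, while 3.66 Gb/s and 18.8 pJ/bit/proc would correspond to 2 iterations at the same operating point.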
Compared with other LDPC-CC decoders [3], [5], this work provides performance similar to [3] with more flexible frame sizes, and higher throughput, less area, and better energy efficiency than [5]. Compared with the LDPC-BC decoder [13], this work has scalable error-correcting performance and 19% less normalized area (2.18 mm^2 vs. 2.68 mm^2) with higher chip utilization (90.2% vs. 73.3%). Compared with the turbo decoder [14], this work achieves much higher throughput with lower power and less die area. In summary, our proposed TB-LDPC-CC decoder outperforms state-of-the-art designs and makes TB-LDPC-CC an excellent error-control code for MIMO-based communication systems.
V. CONCLUSION
In this paper, a TB-LDPC-CC decoder design targeting high throughput, low cost, and better energy efficiency is presented. The test chip supporting three frame sizes is implemented in 90nm CMOS technology. The decoder with 4 processors occupies 2.18 mm^2 and achieves a maximum throughput of 3.66Gb/s under a 0.8V supply with 18.8pJ/bit/proc energy efficiency. The proposed solutions make LDPC-CC more competitive with other error-control codes.
