Abstract-This paper proposes a partial sum generator architecture for constituent-code-based polar code decoders. Such decoders have the advantage of low latency. However, no existing partial sum generator is purpose-built to deliver the timing these decoders require. We first derive a mathematical representation based on the partial sum set $\beta_c$ that corresponds to each constituent code. From this, we derive a shift-register-based partial sum generator. Next, the overall architecture and design details are described, and the overhead compared with a conventional partial sum generator is evaluated. Finally, implementation results for both ASIC and FPGA technologies are presented and discussed.
I. INTRODUCTION
Recently, polar codes [1] have received increasing attention because they are the first codes that provably achieve channel capacity. Their low-complexity encoding and decoding schemes make them very promising for practical applications. There are three widely known algorithms for decoding polar codes. E. Arikan [1] presents the successive cancellation (SC) algorithm, which decodes the bits one after another through recursive cancellation. I. Tal [2] makes the SC algorithm more competitive by exploring more paths in the codeword tree; this method is referred to as list successive cancellation (LSC). In addition, N. Hussami et al. [3] show that belief propagation (BP) can be applied as a decoding algorithm.
Although many efforts have been made on BP decoders [4], [5], [6], BP decoding still suffers from high computational complexity. Thus, SC and LSC attract more studies, especially on their hardware architectures [7]-[12]. SC decoding relies on feedback from the already decoded bits, which is called the partial sum. Each SC decoder needs a partial sum generator (PSG). The partial sum must be calculated in the same clock cycle in which the bits are determined. Thus, the partial sum calculation is on the critical path of the decoder and can limit its maximum frequency. Several PSG designs have been proposed. C. Leroux [7] proposed an indicator-function-based PSG (IF-PSG). C. Zhang [8] proposed a PSG with a feedback part (FB-PSG). J. Lin [13] proposed a hybrid PSG for LSC. G. Berhault proposed a shift-register-based PSG (SR-PSG) [14], [15], which improves the timing performance and reduces the hardware complexity. Y. Fan [16] proposed an architecture similar to SR-PSG but with further simplification.
Both SC and LSC suffer from long latency. Constituent-code-based decoding has been studied recently because it can significantly reduce the decoding latency [17], [18], [19]. All the aforementioned PSGs improve the timing performance of the SC decoder. However, none of them considers constituent-code-based decoding. Since introducing constituent codes into the decoding process can significantly reduce the latency, it is reasonable and necessary to design a constituent-code-compatible PSG. In this paper, we propose an efficient PSG for constituent-code-based SC decoding, which is the first PSG architecture for this kind of decoder. First, we derive the mathematical representation of the constituent-code-based PSG. This derivation is based on the SR-PSG for the conventional SC decoder. Next, the overall hardware architecture and design details are proposed. The timing and hardware complexity are evaluated as well. Finally, the implementation results are presented. The architecture is implemented with both VLSI and FPGA technology, and the relevant discussions are given.
This paper is organized as follows. The relevant background is reviewed in Section II. The proposed design, including the mathematical derivation, is described in Section III. The implementation results and relevant discussions are presented in Section IV. Finally, the paper is concluded in Section V.
II. BACKGROUND

A. Polar Code
As introduced by E. Arikan [1], a polar code can be constructed by successively performing channel polarization. Fig. 1a shows an example of the construction of an 8-bit polar code. Mathematically, polar codes are linear block codes of length $n = 2^m$. The codeword $x = (x_1, x_2, \cdots, x_n)$ is computed by $x = uG$, where $G = F^{\otimes m}$ is the $m$-th Kronecker power of $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Each row of $G$ corresponds to an equivalent polarized channel. For an $(n, k)$ polar code, the $k$ bits of $u$ that carry source information are called information bits and are transmitted over the $k$ most reliable channels. The remaining $n - k$ bits, called frozen bits, are set to zero and placed on the $n - k$ least reliable channels.
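As a minimal sketch of the encoding rule $x = uG$, the following Python builds $F^{\otimes m}$ by repeated Kronecker steps and multiplies over GF(2). The function names `kronecker_power` and `polar_encode` are illustrative and not taken from the paper, and no bit-reversal permutation is applied, which is an assumption.

```python
def kronecker_power(m):
    """Return G = F^(kron m) over GF(2) as a list of rows, with F = [[1, 0], [1, 1]]."""
    g = [[1]]
    for _ in range(m):
        # One Kronecker step with F on the left: [[G, 0], [G, G]].
        top = [row + [0] * len(row) for row in g]
        bottom = [row + row for row in g]
        g = top + bottom
    return g


def polar_encode(u):
    """Compute x = u G over GF(2) for a bit list u whose length is a power of two."""
    n = len(u)
    m = n.bit_length() - 1
    if n == 0 or n != 1 << m:
        raise ValueError("length of u must be a power of two")
    g = kronecker_power(m)
    # Column k of the product is the XOR of the rows selected by u.
    return [sum(u[i] & g[i][k] for i in range(n)) % 2 for k in range(n)]
```

For a length-4 code, `kronecker_power(2)` has the rows 1000, 1100, 1010 and 1111, so setting only the last input bit produces the all-ones codeword.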
Polar codes can be decoded by recursively applying successive cancellation to estimate $\hat{u}_i$ from the channel output $y_m$. $\alpha$ denotes the soft reliability value, typically a log-likelihood ratio (LLR). The LLRs passed to the left and right child nodes are calculated via the $f$ and $g$ functions, respectively [7]. However, evaluating the $g$ function requires the feedback $\beta_l$ from the left child of the same parent node. This feedback is called the partial sum. At stage 0, $\beta$ of a frozen node is always zero, and for an information bit its value is obtained by threshold detection of the soft reliability according to
At intermediate stages, β can be recursively calculated by
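The threshold detection and the recursive combination referred to as Eq. (1) and Eq. (2) can be sketched in Python as below. The sketch follows the form commonly used in the SC literature, which is an assumption here: a non-negative LLR maps to bit 0, the $f$ function uses the min-sum approximation, and the parent partial sums are the left child's sums XORed with the right child's sums, followed by the right child's sums. All function names are hypothetical.

```python
def hard_decision(llr):
    """Threshold detection (assumed convention): LLR >= 0 gives bit 0, otherwise bit 1."""
    return 0 if llr >= 0 else 1


def f_llr(a, b):
    """Min-sum f function: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = 1 if a * b >= 0 else -1
    return sign * min(abs(a), abs(b))


def g_llr(a, b, beta_left):
    """g function: b + (1 - 2 * beta_left) * a, where beta_left is the left child's partial sum."""
    return b + (1 - 2 * beta_left) * a


def combine_partial_sums(beta_left, beta_right):
    """Partial sums of a parent node: [left XOR right, right]."""
    xored = [l ^ r for l, r in zip(beta_left, beta_right)]
    return xored + list(beta_right)
```

With this convention, combining the two-bit partial sums `[u0 ^ u1, u1]` and `[u2 ^ u3, u3]` reproduces the length-4 codeword $uG$, which is why the partial sums at the root equal the re-encoded estimate.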
B. Constituent codes based SC decoding
SC decoding generally suffers from high latency due to its inherently serial nature. The process of obtaining the partial sum from each node significantly constrains the decoding speed. Thus, in order to reduce the latency caused by partial sum calculation, constituent-code-based SC decoding has been proposed [17], [18]. By finding certain patterns in the source code, parts of the codeword and their corresponding partial sums can be estimated immediately without traversing the tree. This method significantly reduces the latency imposed by the partial sums. $N^{0}$, $N^{1}$, $N^{SPC}$ and $N^{REP}$ are the four most common constituent codes. $N^{0}$ and $N^{1}$ codes contain only frozen bits or only information bits, respectively. For $N^{0}$ codes, the corresponding partial sums can be set to 0 immediately. For $N^{1}$ codes, the partial sums can be determined directly via the threshold detection of Eq. (1). $N^{SPC}$ and $N^{REP}$ codes contain both frozen bits and information bits. In an $N^{SPC}$ code, only the first bit is frozen. This makes the length-$n$ constituent code a rate-$(n - 1)/n$ single parity check (SPC) code. Such a code can be decoded by taking hard decisions and, if the parity check fails, flipping the least reliable bit, which is typically the one with the minimum absolute LLR. In an $N^{REP}$ code, only the last bit is an information bit. In this case, all the corresponding partial sums are identical, since each of them is a copy of the last information bit. Thus, the decoding algorithm sums all the input LLRs and obtains the partial sums by a hard decision on the resulting sum.
Fig. 2 shows an example of how constituent codes can simplify the SC decoding tree. According to T. Che's implementation of a constituent-code-based SC decoder [19], the latency of a length-$n$ constituent code can be reduced from $2n - 2$ to $1$, $1$, $\log_2 n + 1$ and $\log_2 n$ for $N^{0}$, $N^{1}$, $N^{SPC}$ and $N^{REP}$ codes, respectively. To further optimize the performance of the constituent-code-based decoder, a PSG designed specifically for it is necessary.
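The four constituent decoders described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the decoder of [19]: a non-negative LLR is assumed to map to bit 0, the SPC decoder assumes even parity, and the names `decode_rate0`, `decode_rate1`, `decode_spc`, `decode_rep` and `constituent_latency` are hypothetical. Each function returns the partial sums of the constituent code directly from its input LLRs.

```python
def decode_rate0(llrs):
    """N0: all bits frozen, so every partial sum is 0."""
    return [0] * len(llrs)


def decode_rate1(llrs):
    """N1: all bits carry information, so each partial sum is a hard decision."""
    return [0 if a >= 0 else 1 for a in llrs]


def decode_spc(llrs):
    """SPC: hard decisions, then flip the least reliable bit if parity is odd."""
    beta = [0 if a >= 0 else 1 for a in llrs]
    if sum(beta) % 2 == 1:
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        beta[weakest] ^= 1
    return beta


def decode_rep(llrs):
    """REP: sum all LLRs and repeat the hard decision on the sum."""
    bit = 0 if sum(llrs) >= 0 else 1
    return [bit] * len(llrs)


def constituent_latency(kind, n):
    """Cycle counts quoted in the text for a length-n constituent code."""
    log_n = n.bit_length() - 1
    return {"N0": 1, "N1": 1, "SPC": log_n + 1, "REP": log_n}[kind]
```

For a length-8 constituent code, the quoted latencies are 4 cycles for SPC and 3 cycles for REP, compared with $2n - 2 = 14$ cycles for conventional SC traversal.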
C. Shift-register-based partial sums generator
Among all the aforementioned PSG designs, the shift-register-based PSG (SR-PSG) performs best in terms of both timing and hardware complexity. For a length-$n$ polar code decoder, it consists of $n$ registers and some simple combinational logic. As each $\hat{u}_i$ is estimated, the registers shift and the partial sums can be read from their corresponding registers. Its architecture is illustrated in Fig. 3. This architecture is built according to the following rule:
where $\cdot$ and $\oplus$ stand for the AND and exclusive-OR operations, respectively. In Fig. 3, $R_k$ denotes the $k$th register, $\hat{u}_i$ denotes the $i$th estimated bit, and $\beta_{i,j}$ denotes the $j$th partial sum in stage $i$. $c_{i,k}$ denotes the element in the $i$th row and $k$th column of the generator matrix $G$. The matrix generation unit is able to generate $c_{i,k}$ with very simple logic. The SC decoder consists of many basic computation parts called processing units (PUs). Each partial sum needs to be fed into the corresponding PU. The shift-register-based architecture guarantees that all partial sums required by a PU are generated in the same register, which avoids any extra routing logic in the circuit.
Such an architecture is able to receive the estimated bit and update the corresponding partial sums in every valid cycle, which matches the SC decoding process well. However, this architecture is not suitable for a constituent-code-based SC decoder, since some partial sums are obtained directly instead of being calculated from estimated bits. Thus, a PSG for a constituent-code-based SC decoder must be able to generate the new partial sums from either the directly obtained intermediate partial sums or the estimated bits, and to keep them coherent.
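The shift-register behaviour can be illustrated with a short Python simulation. The update used here, $R_k \leftarrow R_{k-1} \oplus (\hat{u}_i \cdot c_{i,k})$ with $R_{-1} = 0$, is one reading of the rule that is consistent with the symbol definitions above; it is an assumption and may differ in detail from the form in [14]. The names `g_entry`, `sr_psg_step` and `polar_encode` are hypothetical.

```python
def g_entry(i, k):
    """c_{i,k}: entry of G = F^(kron m); it is 1 iff the set bits of k are a subset of those of i."""
    return 1 if (k & ~i) == 0 else 0


def sr_psg_step(regs, u_hat, i):
    """One clock cycle: shift by one position and XOR in u_hat AND c_{i,k}."""
    shifted = [0] + regs[:-1]  # R_{-1} is taken as 0
    return [shifted[k] ^ (u_hat & g_entry(i, k)) for k in range(len(regs))]


def polar_encode(u):
    """Reference x = u G over GF(2), used only to check the register contents."""
    n = len(u)
    return [sum(u[j] & g_entry(j, k) for j in range(n)) % 2 for k in range(n)]
```

Under this rule, after the last bit of any aligned block of $2^s$ bits has been shifted in, the first $2^s$ registers hold the partial sums of that block in reversed order, so register $R_k$ holds partial sum $2^s - 1 - k$.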
III. PROPOSED DESIGN
In this section, we first derive the mathematical representation of the constituent-code-based partial sum from Eq. (3). Then, the overall hardware architecture and subsequent design details are presented.
A. Mathematical Representation
For $k \geq n$ and $k \in [a \cdot n, (a + 1) \cdot n - 1]$, $a = 1, 2, \ldots$, according to Eq. (3), we have
As we know, $c_{i,k}$ is an element of the generator matrix $G$, which is a Kronecker power of $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Combining this property with our observations on the matrix, we deduce the following rule, which is also illustrated in Fig. 4a.
According to the definition of the generator matrix and the concept of constituent codes, when $c_{i,k} = 0$, the right-hand side of Eq. (5) is an all-zero vector, and when $c_{i,k} = 1$, the right-hand side of Eq. (5) is the $(n - (k \bmod n) - 1)$th column of the generator matrix of the length-$n$ polar code. According to the definition of the partial sum and Eq. (2), we get
where $p(k) = n - (k \bmod n) - 1$.
Now we apply the above observation back to Eq. (4). We define the vector $R_a = [R_{a \cdot n}, \cdots, R_{a \cdot n + n - 1}]$ and
For consistency with Eq. (3), we rewrite Eq. (7) as follows:
where $\&$ stands for the bitwise AND operation.
For $0 \leq k < n$, similar to Eq. (4), we have
According to the definition of $G$ and of constituent codes, we can conclude that, for any length-$n$ constituent code, the first $n$ columns of its corresponding rows in $G$ also form a generator matrix $G_n$ of a length-$n$ polar code. As shown in Fig. 4b, each diagonal cyclic shift is identical to the corresponding column; considering also that $G_n$ is a lower triangular matrix, we get
Thus, Eq. (9) can be rewritten as:
Combining Eq. (11) and Eq. (8), we derive the mathematical representation of the partial sums for the constituent-code-based polar decoder as follows:
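The update derived in this subsection can be exercised with a small Python model. The sketch assumes the following reading of the rule, built from the definitions of $p(k)$, $c_{i,k}$ and $\&$ given above: for $0 \leq k < n$, $R_k \leftarrow \beta_c[p(k)]$, and for $k \geq n$, $R_k \leftarrow R_{k-n} \oplus (c_{i,k} \,\&\, \beta_c[p(k)])$, where $i$ is the index of the last bit of the constituent code. The names `cb_psg_step`, `g_entry` and `polar_encode` are hypothetical, and the model keeps one register per code bit so that the final state can be compared against the full codeword.

```python
def g_entry(i, k):
    """c_{i,k}: entry of G = F^(kron m); it is 1 iff the set bits of k are a subset of those of i."""
    return 1 if (k & ~i) == 0 else 0


def polar_encode(u):
    """Reference x = u G over GF(2); stands in for a constituent decoder's partial sums."""
    n = len(u)
    return [sum(u[j] & g_entry(j, k) for j in range(n)) % 2 for k in range(n)]


def cb_psg_step(regs, beta_c, last_index):
    """Absorb a length-n constituent code in one step by shifting n positions."""
    n = len(beta_c)
    out = []
    for k in range(len(regs)):
        p = n - (k % n) - 1                        # p(k) from the text
        fresh = beta_c[p] & g_entry(last_index, k)
        prev = regs[k - n] if k >= n else 0        # zeros enter for k < n
        out.append(prev ^ fresh)
    return out
```

A usage example: for a 16-bit input split into aligned constituent blocks of lengths 4, 2, 2 and 8, calling `cb_psg_step` once per block leaves the registers holding the reversed codeword, the same state that sixteen single-bit shifts would reach.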
B. Proposed architecture
According to Eq. (12), the shift-register constituent-code-based partial sum generator (SR-CB-PSG) is proposed as shown in Fig. 5. Compared with Fig. 3, there are three differences. The first difference is the input. For SR-PSG, only the current estimated bit is fed in, which means the input comes only from the PU at stage 0. For SR-CB-PSG, however, the inputs may come from the PUs of any stage, depending on the length of the constituent code. Thus, a multiplexing network is needed to route all the input values to the right registers. The second difference is the shift function. According to Eq. (12), instead of shifting by just one bit, the shifter must be able to shift by $n$ bits, where $n$ is the length of the constituent code. By the definition of constituent codes, $n$ is a power of 2. Thus, a specifically designed $(2^m - 1)$-bit shifter is proposed. The control signals for both the multiplexing network and the shifter come from the Control Signal Generator (CSG), which uses simple logic. The last difference is the matrix generation unit. For each constituent code, the corresponding $c_{i,j}$ is taken from the $i$th row of the generator matrix, where $i$ is the index of the last bit in the constituent code. Due to the irregularity of the constituent codes, it is unnecessary to build an online generator for these values. Thus, a pre-calculated ROM is used instead. This is a trade-off between design complexity and hardware resources. The ROM can be replaced by a reconfigurable memory device such as a RAM for flexibility.
The multiplexing network routes the partial sums to the correct registers. For a length-$n$ polar code, there are $\log_2 n$ stages in the decoder and $n/2$ registers in the SR-CB-PSG. If the multiplexing network is built from basic 2-to-1 MUXs, each register is assigned an identical MUX network made of $(\log_2 n - 1)$ MUXs. All the networks share the same control signals. In this architecture, the control signals are the direct binary mapping of the stage index.
In total, $n/2 \cdot (\log_2 n - 1)$ MUXs are needed. Since the multiplexer network must wait for each PU to finish computing before its inputs are valid, it is on the critical path of the decoder. Thus, it causes an additional delay of $\lceil \log_2(\log_2 n) \rceil \cdot \Delta(MUX)$, where $\Delta(MUX)$ is the delay of a single MUX.
For the $(2^m - 1)$ shifter, we propose a barrel-shifter-based architecture. For a length-$n$ polar code, $m \leq \log_2 n - 1$. The shifter performs a logical right shift. For $k < n$, where $k$ is the index of the register and $n$ is the length of the current constituent code, zeros are shifted in from the left. For $k \geq n$, the register contents are shifted. These behaviors satisfy the first and second cases in Eq. (12). Fig. 7 shows an example of the $(2^m - 1)$ shifter for a 16-bit polar code decoder. All the MUXs in the same row can share the same control signal. These signals are generated by a $k$-to-$2^k$ decoder, where $k = \lceil \log_2(\log_2 n) \rceil$ for a length-$n$ polar code. For a length-$n$ polar code, $(n/2 - 1) \cdot (\log_2 n - 1)$ MUXs are needed for the shifter. Since the shifter can start shifting data without waiting for the PUs to finish computing, it is not on the critical path. Thus, it should not degrade the timing performance of the decoder.
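To make the MUX counts and the extra delay concrete, the expressions quoted above can be evaluated for a given code length. The Python below only restates those formulas; the function names are hypothetical and the delay is reported in units of $\Delta(MUX)$.

```python
def log2_exact(n):
    """Return log2(n) for a power of two n, and reject any other value."""
    m = n.bit_length() - 1
    if n <= 0 or n != 1 << m:
        raise ValueError("n must be a power of two")
    return m


def mux_network_muxes(n):
    """MUXs in the multiplexing network: n/2 * (log2 n - 1)."""
    return (n // 2) * (log2_exact(n) - 1)


def shifter_muxes(n):
    """MUXs in the shifter: (n/2 - 1) * (log2 n - 1)."""
    return (n // 2 - 1) * (log2_exact(n) - 1)


def mux_network_delay_levels(n):
    """Extra critical-path delay in MUX delays: ceil(log2(log2 n))."""
    stages = log2_exact(n)
    # (stages - 1).bit_length() equals ceil(log2(stages)) for stages >= 1.
    return (stages - 1).bit_length()
```

For $n = 1024$, these expressions give 4608 MUXs for the multiplexing network, 4599 MUXs for the shifter, and an added delay of 4 MUX levels.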
IV. IMPLEMENTATION RESULTS AND DISCUSSIONS
To the best of our knowledge, the proposed design is the first PSG designed specifically for constituent-code-based SC decoders. Thus, there is no reference design that we can compare with directly. In this section, we list all the results we have and present some relevant discussions.
Table I shows the critical path comparison between the proposed PSG and the PSG in [14]. The delay overhead comes from the multiplexing network. In theory, the maximum frequency of the constituent-code-based SC decoder is therefore lower than that of a conventional SC decoder. However, after taking the latency reduction into account, as shown in Table II, the constituent-code-based SC decoder is able to achieve much higher throughput. The conventional SC decoder is taken from [9], which is the lowest-latency conventional SC decoder to the best of our knowledge.
Table III shows the estimated resource consumption of the proposed SR-CB-PSG for a length-$n$ polar code decoder and the comparison with two other conventional PSGs. The MUXs consume the most resources, since they are used in both the multiplexer network and the shifter. The estimate of the ROM size is based on an average, since the decoding latency changes with the code rate.
The proposed design can target either ASIC or FPGA. We synthesized it for both the Nangate FreePDK 45 nm process and the Xilinx Kintex-7 FPGA on the KC705 evaluation board. Table IV shows the hardware resources of the SR-CB-PSG for a length-1024 polar code decoder on both of them. Notably, the architecture discussed in this paper is designed for the worst case, in which the maximum length of a constituent code can be $n/2$. In practical applications, however, the maximum constituent code length is fixed for a given code rate and usually does not approach $n/2$. In those cases, the logic of both the multiplexer network and the shifter can be even simpler, which results in better timing and silicon area.
V. CONCLUSION
This paper proposed an efficient PSG hardware design for constituent-code-based SC decoders. A conventional PSG is not compatible with such a decoder, because it can only take estimated bits one at a time, whereas the constituent-code-based decoder generates intermediate partial sums directly. To solve this problem, we first derived the mathematical representation of the constituent-code-based PSG from the SR-PSG for the conventional SC decoder. Then, the overall hardware architecture and design details were proposed. Finally, the implementation results with both VLSI and FPGA technology were presented and discussed.
