ABSTRACT Polar codes have drawn much research attention in the last ten years for their capacity-achieving property. However, their conventional successive cancellation decoding performs poorly at short or moderate code lengths. Concatenation with other error-correction codes has proved to be an effective way to improve the performance, but current concatenation schemes based on the rate-optimized method are complex to implement and suffer from long decoding latency. In this paper, we propose a critical set protected BCH-polar code together with its corresponding decoding architecture. In the proposed concatenation scheme, we provide extra protection only to the information bits in the critical set, which is constructed based on the channel reliability. For the decoding architecture, we redesign some components and adopt the look-up table decoding method for BCH codes, which substantially reduces the decoding latency. Compared with existing decoders, the hardware implementation shows low decoding latency and high throughput-area efficiency.
I. INTRODUCTION
Polar codes have attracted much attention since their invention [1] for their capacity-achieving property and have been selected for the control channel in the 5G enhanced Mobile BroadBand (eMBB) scenario [2]. For the other scenarios, such as ultra-reliable low-latency communications (URLLC) and massive machine-type communications (mMTC), polar codes are considered as one of the candidate channel codes. The URLLC scenario calls for channel codes with high reliability at small to moderate code lengths and low decoding latency. However, at these lengths, the conventional successive cancellation (SC) decoding of polar codes falls short in error-correction performance when compared with Turbo or low-density parity-check (LDPC) codes. One reason is that SC decoding is susceptible to error propagation; the other is that the polar codes themselves are weak at these lengths because of incomplete polarization.
The associate editor coordinating the review of this manuscript and approving it for publication was Donatella Darsena.
To improve the error-correction performance of polar codes, many different decoding methods have been proposed. The successive cancellation list (SCL) decoding proposed in [3] and its variants [4]-[6] were shown to outperform current LDPC codes with the aid of a cyclic redundancy check (CRC). However, the better performance comes at the cost of high complexity and long decoding latency. Its derivatives, the successive cancellation stack (SCS) decoding [7] and the successive cancellation hybrid (SCH) decoding [8], were proposed with reduced decoding complexity, but they still have long decoding latency because of their sequential decoding nature. At the same time, the belief propagation (BP) decoding of polar codes was proposed in [9], with the intrinsic advantage of parallel processing. Therefore, compared with SC-based decoding, BP decoding is more attractive for low-latency scenarios. Despite its high throughput and low latency, the BP decoder suffers from performance degradation in the high-SNR regime when compared with the SC decoder. The soft cancellation (SCAN) decoding, a variant of BP decoding with a sequential message schedule, was proposed in [10] with better performance. However, its performance is still inferior to that of SCL decoding. Aiming at improving the performance of polar codes with low complexity, the successive cancellation flip (SCF) decoding was proposed in [11]; it can provide error-correction performance close to that of SCL decoding with a small list size. However, its decoding latency is not fixed, which makes it unsuitable for time-sensitive scenarios. Another approach to improve performance is to concatenate polar codes with other error-correction codes.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
With careful design, concatenated polar codes can inherit the low encoding complexity of the inner polar code while achieving improved error-correction performance. In current research, the decoding method for the inner polar code differs between concatenation schemes. With SC decoding as the inner decoder, concatenated polar codes with outer Reed-Solomon (RS) codes were proposed in [12], and it was proved that the error rate decreases nearly exponentially with the code length. However, the cardinality of the RS codes increases exponentially with the length of the polar codes. To solve this problem, an improved RS-polar concatenation scheme was proposed in [13] and [14]. Nevertheless, owing to the non-binary nature of the outer RS codes, the performance improvement is limited; the reason is analyzed in Section II-B. In [15] and [16], polar codes concatenated with BCH codes and convolutional codes were proposed, with better performance than RS-polar codes. However, for the concatenation with convolutional codes, the decoding latency is long, and the performance improvement is limited by the constraint length. To shorten the length of the outer code, concatenation with Spinal codes was proposed in [17], but the complex decoding of the outer Spinal codes results in long overall decoding latency. With BP decoding as the inner decoder, polar codes concatenated with LDPC codes were proposed in [18], taking advantage of their similar message-passing schemes. In this concatenation, LDPC codes are used to protect the intermediate subchannels of polar codes. The effectiveness of the protection was improved in [19] and [20] with explicitly designed interleavers and irregular LDPC codes. However, the decoding latency varies with the channel condition.
Although many concatenated polar codes already exist, they are still unsatisfactory for the URLLC scenario, which needs stable low decoding latency and high reliability. Considering the requirements of URLLC, it is necessary to design a low-latency concatenated polar code with balanced error-correction performance. In this work, we present an improved concatenation scheme of BCH-polar codes and its corresponding decoding architecture. BCH codes are selected as the outer codes since they are the binary codes with the largest bounded-distance error-correction capability at a specified code rate. Instead of using the rate-optimized scheme widely adopted in SC-based concatenated codes [15], [16], we protect only some unreliable subchannels of the polar code, which leads to better performance than the rate-optimized method.
The remainder of this work is organized as follows. In Section II, an overview of the construction and SC decoding of polar codes is presented, together with the decoding method of BCH codes. In Section III, the encoding and decoding methods for the proposed concatenated BCH-polar codes are detailed. The corresponding low-latency decoding architecture is described in Section IV. Section V reports the simulation results and the comparison of hardware implementations. Conclusions are drawn in Section VI.
II. PRELIMINARY
In this section, we review the construction procedure and the successive cancellation decoding algorithm of polar codes, together with the decoding method of BCH codes.
A. CONSTRUCTION OF POLAR CODES
Polar codes characterized by (N, K, I) can achieve channel capacity via the phenomenon of channel polarization [1]. As the channel polarization theorem states, each subchannel becomes either a noiseless channel or a pure-noise channel as the blocklength N goes to infinity, meaning that its error probability approaches 0 or 0.5, respectively. By transmitting information bits over the noiseless subchannels and transmitting frozen bits, which are known by both transmitter and receiver, over the noisy subchannels, polar codes can achieve the channel capacity. Hence, constructing a polar code is equivalent to finding the K most reliable subchannels over which the information bits are transmitted, with a set I indicating these subchannel positions. Many construction methods have been proposed to calculate the reliabilities of subchannels. In this paper, we adopt the Gaussian approximation (GA) of density evolution (DE) proposed in [21] to calculate the reliabilities for its good tradeoff between complexity and performance.
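As an illustration of GA-based construction, the following minimal sketch computes a mean LLR for every subchannel and picks the K most reliable ones. It is not the exact procedure of [21]: the two-piece closed-form approximation of the function phi, the bisection-based inverse, and all function names are assumptions made for this example, and BPSK over AWGN with unit symbol energy is assumed.

```python
import math

def phi(x):
    """Two-piece closed-form approximation of the GA function phi(x) (assumed form)."""
    if x <= 0.0:
        return 1.0
    if x < 10.0:
        return math.exp(-0.4527 * x ** 0.86 + 0.0218)
    return math.sqrt(math.pi / x) * math.exp(-x / 4.0) * (1.0 - 10.0 / (7.0 * x))

def phi_inv(y, lo=0.0, hi=1.0e4):
    """Numerically invert phi by bisection, treating phi as decreasing in x."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if phi(mid) > y:
            lo = mid   # phi is still too large: the solution lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ga_mean_llrs(n_stages, ebn0_db, rate):
    """Return the GA mean LLR of all N = 2**n_stages subchannels (0-based order)."""
    sigma2 = 1.0 / (2.0 * rate * 10.0 ** (ebn0_db / 10.0))
    means = [2.0 / sigma2]            # mean channel LLR for BPSK over AWGN
    for _ in range(n_stages):
        nxt = []
        for m in means:
            nxt.append(phi_inv(1.0 - (1.0 - phi(m)) ** 2))  # degraded subchannel
            nxt.append(2.0 * m)                              # upgraded subchannel
        means = nxt
    return means

def information_set(means, k):
    """Indices (0-based) of the k subchannels with the largest mean LLR."""
    order = sorted(range(len(means)), key=lambda i: means[i], reverse=True)
    return sorted(order[:k])

means = ga_mean_llrs(3, 2.0, 0.5)      # N = 8, E_b/N_0 = 2 dB, rate 1/2
info = information_set(means, 4)
print(info)
```

A larger mean LLR corresponds to a more reliable subchannel, so the information set is simply the K largest entries. The numerical values depend on the chosen approximation of phi and need not match those reported in the paper.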
The error probabilities of the subchannels at $E_b/N_0 = 2$ dB obtained by the GA method for polar codes with N = 128 and K = 64 are shown in Fig.1. As the figure depicts, channel polarization is incomplete at short or moderate lengths, so unreliable subchannels have to be used to transmit information bits. Moreover, as the code rate of the polar code increases, more unreliable subchannels are used, which degrades the error-correction performance. After the selection of information subchannels, the encoding process of a polar code can be expressed as the matrix multiplication

$$x = u\,G_N, \qquad G_N = B\,F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad n = \log_2 N, \tag{1}$$

where the vector u, holding the information bits and the frozen bits, denotes the source bit sequence to be encoded, while the vector x denotes the encoded codeword. $G_N$ is the generator matrix, and ⊗ denotes the Kronecker product. Moreover, B is the bit-reversal permutation matrix. More details about this encoding process can be found in [1].
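The encoding step can be sketched in a few lines. This is a minimal illustration that assumes the standard form $G_N = B F^{\otimes n}$ with the kernel F = [[1, 0], [1, 1]] from [1]; the function names are hypothetical and bits are represented as Python integers 0 and 1.

```python
def bit_reverse_permute(u):
    """Apply the bit-reversal permutation B: output[j] = u[bit-reversed j]."""
    n = len(u).bit_length() - 1
    return [u[int(format(j, "0{}b".format(n))[::-1], 2)] for j in range(len(u))]

def kronecker_transform(v):
    """Multiply v by the n-fold Kronecker power of F over GF(2) using XOR butterflies."""
    x = list(v)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x

def polar_encode(u):
    """Compute x = u B F^(n-fold Kronecker); len(u) must be a power of two, at least 2."""
    return kronecker_transform(bit_reverse_permute(u))

print(polar_encode([0, 1, 0, 0]))   # [1, 0, 1, 0], the second row of G_4
```

The butterfly structure needs N/2 XOR operations per stage and log2(N) stages, which is where the low encoding complexity of polar codes comes from.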
B. SUCCESSIVE CANCELLATION DECODING
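As for the polar decoding, we denote by y the data received from the channel detection and use them as the decoder inputs. The outputs of the decoder are denoted by the vector $\hat{u}_1^N$, where $\hat{u}_i$ is the estimate of the bit $u_i$ obtained by hard decision. This hard decision is made according to its corresponding log-likelihood ratio (LLR)

$$L_i = \ln\frac{\Pr(y, \hat{u}_1^{i-1} \mid u_i = 0)}{\Pr(y, \hat{u}_1^{i-1} \mid u_i = 1)}$$

and the function h:

$$\hat{u}_i = h(L_i) = \begin{cases} 0, & i \notin I, \\ \dfrac{1 - \mathrm{sign}(L_i)}{2}, & i \in I, \end{cases} \tag{2}$$

where $\mathrm{sign}(L_i) = \pm 1$. For the SC decoding method, the i-th LLR of decoding stage l can be computed recursively by the following functions:

$$L_{l,i} = \begin{cases} f(L_{l+1,i},\, L_{l+1,i+2^l}), & \lfloor i/2^l \rfloor \bmod 2 = 0, \\ g(u_s,\, L_{l+1,i-2^l},\, L_{l+1,i}), & \text{otherwise}. \end{cases} \tag{3}$$

Here the stage index satisfies $0 \le l < n$, positions are counted from 0 to N − 1, and $L_{n,i}$ are the channel LLRs. Moreover, in the LLR domain, the functions f and g perform the following calculations on the input LLRs $L_a$ and $L_b$:

$$f(L_a, L_b) = 2\tanh^{-1}\!\left(\tanh\frac{L_a}{2}\,\tanh\frac{L_b}{2}\right) \approx \mathrm{sign}(L_a)\,\mathrm{sign}(L_b)\,\min(|L_a|, |L_b|), \tag{4}$$

$$g(\hat{s}, L_a, L_b) = (-1)^{\hat{s}} L_a + L_b. \tag{5}$$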
In (3) and (5), $\hat{s}$ or $u_s$ denotes the partial sum of $\hat{u}_1^{i-1}$, the bits that have been decoded previously. Because the partial sum $u_s$ must be available before the function g can be evaluated, the SC decoder has to decode sequentially. This decoding dependency also makes it susceptible to error propagation. For this reason, preventing errors from propagating by applying outer codes at each step is the main principle of current SC-based concatenated polar codes. With non-binary outer codes, there may be error propagation within each decoded symbol, and the outer protection is weakened and delayed because the information bits in polar codes are not distributed contiguously. This is why the performance improvement of non-binary outer codes is limited. It is therefore better to use binary outer codes to provide extra protection for polar codes.
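The sequential dependency on the partial sums can be seen in a compact recursive sketch of SC decoding. Several assumptions are made here for illustration: f uses the min-sum form, g is taken as $(-1)^{\hat{s}} L_a + L_b$, frozen bits are zero, and the code uses the natural-order transform x = u F^{⊗n}, which differs from $G_N = B F^{\otimes n}$ only by a fixed permutation of the codeword. The frozen pattern and LLR magnitudes in the usage lines are hypothetical.

```python
import math

def f(la, lb):
    """Min-sum approximation of the check-node update."""
    return math.copysign(1.0, la) * math.copysign(1.0, lb) * min(abs(la), abs(lb))

def g(s, la, lb):
    """Variable-node update; s is the partial sum of already decoded bits."""
    return (1 - 2 * s) * la + lb

def h(llr, is_frozen):
    """Hard decision: frozen bits are 0, otherwise decide by the LLR sign."""
    return 0 if is_frozen or llr >= 0 else 1

def sc_decode(llr, frozen):
    """Return (decoded bits, re-encoded partial sums) for one sub-block."""
    if len(llr) == 1:
        bit = h(llr[0], frozen[0])
        return [bit], [bit]
    half = len(llr) // 2
    upper = [f(llr[i], llr[i + half]) for i in range(half)]
    u1, s1 = sc_decode(upper, frozen[:half])
    # g cannot start before the partial sums s1 of the first half are known
    lower = [g(s1[i], llr[i], llr[i + half]) for i in range(half)]
    u2, s2 = sc_decode(lower, frozen[half:])
    return u1 + u2, [a ^ b for a, b in zip(s1, s2)] + s2

def encode_natural(u):
    """Natural-order polar transform x = u F^(n-fold Kronecker) over GF(2)."""
    x = list(u)
    step = 1
    while step < len(x):
        for start in range(0, len(x), 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x

frozen = [True, True, True, False, True, False, False, False]  # hypothetical pattern
u = [0, 0, 0, 1, 0, 1, 1, 0]
x = encode_natural(u)
llr = [4.0 if bit == 0 else -4.0 for bit in x]   # noiseless BPSK LLRs
u_hat, x_hat = sc_decode(llr, frozen)
print(u_hat == u)
```

In a concatenated decoder, the outer code would correct the decided bits before they enter the partial sums, so that a wrong decision does not contaminate later g computations.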
Besides, for the SC-based concatenated polar codes, the efficiency of the outer decoder and partial sum calculation makes a big difference in the overall decoding latency of concatenated codes. The decoding schedule of concatenated polar codes is depicted in Fig.2 . As shown in the figure, the decoding of outer codes and the calculation of partial sum lie before the function g, which constitutes the critical path in the decoding architecture. 
C. BCH CODES
The BCH code characterized by (m, k, t) is an important class of multiple-error-correcting linear cyclic codes, which can be constructed using polynomials over a Galois field $GF(2^m)$. One of the essential features of BCH codes is the precise control over the number t of correctable errors during construction. In our work, we adopt primitive BCH codes as the outer codes. These BCH codes are called primitive since they are built using a primitive element of $GF(2^m)$, with code length $M = 2^m - 1$. BCH codes can also be built using non-primitive elements, but the block length M is then typically less than $2^m - 1$, and the error-correction performance is inferior to that of the primitive ones.
Another advantage of BCH codes is their explicit decoder. Although maximum-likelihood (ML) decoding achieves good performance, it is too complex for general applications. Among the bounded-distance (BD) decoding algorithms for BCH codes, Berlekamp's iterative algorithm combined with Chien's search algorithm is the most efficient one. In this decoding procedure, the first step is to compute the syndrome from the received vector. The second step is to determine the coefficients of the error-location polynomial from the syndrome components by Berlekamp's algorithm. Finally, Chien's search algorithm is used to find the error locations and correct the errors. Among these three steps, the calculation of the coefficients is the most complicated. Although Berlekamp's algorithm can be carried out by repeating relatively simple calculations on the syndromes, the control of these calculations is too complicated for concatenated polar codes and leads to long decoding latency.
Based on the general decoding procedure above, we can conclude that as long as we know the syndromes, we can get the exact error locations. For BCH codes with a short length and limited error-correction capability, there is a low-latency decoding method called look-up table (LUT) decoding. As shown in Fig.3, the LUT decoder of a BCH code is mainly composed of three parts. The syndrome generator is a combinational circuit whose inputs are the received vector and whose outputs are the addresses of the error pattern table. The table storing the error patterns outputs the error pattern once the address is given. The error pattern is then XORed with the received vector to produce the corrected BCH codeword. By using this decoding method for BCH codes, the decoding latency of the outer code can be greatly reduced.
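The three parts of the LUT decoder (syndrome generator, error pattern table, XOR correction) can be illustrated in software for the shortest case, the length-7 single-error-correcting BCH code. This is a sketch under stated assumptions: the generator polynomial g(x) = x^3 + x + 1, the systematic encoder, and the integer bit-vector representation (bit i holds the coefficient of x^i) are choices made for this example and are not taken from the hardware design in the paper.

```python
G_POLY = 0b1011    # assumed generator polynomial g(x) = x^3 + x + 1
CODE_LEN = 7       # M = 2^3 - 1
PARITY_LEN = 3     # degree of g(x)

def syndrome(word):
    """Remainder of word(x) divided by g(x) over GF(2)."""
    for shift in range(CODE_LEN - 1, PARITY_LEN - 1, -1):
        if word & (1 << shift):
            word ^= G_POLY << (shift - PARITY_LEN)
    return word

def build_error_table():
    """Map every syndrome to the single-bit error pattern that causes it."""
    table = {0: 0}                       # zero syndrome: no correction
    for pos in range(CODE_LEN):
        pattern = 1 << pos
        table[syndrome(pattern)] = pattern
    return table

ERROR_TABLE = build_error_table()

def bch_encode(message):
    """Systematic encoding of a 4-bit message into a 7-bit codeword."""
    shifted = message << PARITY_LEN
    return shifted ^ syndrome(shifted)

def lut_decode(received):
    """Look up the error pattern by syndrome and XOR it onto the received word."""
    return received ^ ERROR_TABLE[syndrome(received)]

codeword = bch_encode(0b1010)
corrupted = codeword ^ (1 << 5)          # flip one bit
print(lut_decode(corrupted) == codeword)
```

Decoding is a single table access followed by an XOR, with no iterative control flow, which is why this approach suits short outer codes with small t. The table size grows quickly with the syndrome width, so it is practical only for short codes.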
III. CRITICAL SET PROTECTED CONCATENATED BCH-POLAR CODES
In this section, we first introduce the concept of critical set based on the investigation of the error distribution. Then we propose the encoding process of our concatenated BCH-polar codes. At last, the corresponding decoding schedule and complexity are described.
A. CONSTRUCTION OF CRITICAL SET
In conventional concatenated polar codes [15]-[17], a rate-optimized method is always adopted to design the code rates of the inner and outer codes in order to achieve the minimum frame error rate for a fixed overall rate. However, the rate-optimized method may not be the best approach to reduce the frame error rate. As described in Section II-A, the reliabilities of different subchannels differ greatly after the channel polarization process, and outer codes with limited error-correction capability cannot compensate for this difference to realize the equal block error rate design rule. Consequently, if outer codes with low error-correction capability are used to provide extra protection to some of the better subchannels, the code rate of the inner polar code increases, which results in a worse frame error rate. Thus, in the design of concatenated polar codes, we should reduce the code rate of the inner polar code and match the error-correction capability of the outer code to the error probability of the inner subchannels.
Inspired by the approach used in [22], to improve the performance of concatenated polar codes, we only need to correct the errors induced by channel noise. First, we use a genie-aided decoder, called the SC-Oracle decoder, to analyze the distribution of errors caused by channel noise. It can be seen from Fig.4 that the information bits have different error probabilities, and a subset of them has much higher error probabilities than the rest. Therefore, not all subchannels in I require extra protection from outer codes. Based on this observation of the channel-induced error distribution, we propose to construct a set $S_c$ that includes the subchannels with a high probability of being corrupted by channel noise. For ease of exposition, this set $S_c$ is referred to as the critical set in the rest of this paper. By constraining the protection range, our proposed concatenated polar codes provide extra protection only to the subchannels in the critical set, in order to reduce the code rate of the inner polar code at a fixed overall code rate.
In the following, we describe the construction of the critical set based on the reliability of each subchannel for the target channel condition. We adopt the GA method to calculate the mean value $m_i$ of the LLR $L_i$ to represent the reliability of each subchannel in the construction. A larger mean value $m_i$ represents a more reliable subchannel. The construction process is illustrated with a concatenated BCH-polar code with inner length N = 8, outer length M = 7, and overall code rate $R_c = 1/2$.
1) Calculate the mean value $m_i$ of each subchannel by the GA method for the target channel condition.
2) Calculate the number $L_{info}$ of subchannels needed to hold the information bits without extra outer protection. For inner length N = 8 and overall rate $R_c = 1/2$, $L_{info}$ is 4, which means the four most reliable subchannels are used to transmit information.
3) Take the reliability of the subchannel ranked just below the least reliable information subchannel as the basic reliability $Q_b$. Here, $Q_b = 1.54$.
4) Starting from the basic subchannel, include the subchannels whose mean value $m_i$ is less than $\alpha \cdot Q_b$ ($\alpha > 1$) in the critical set $S_c$. Here we use α = 3.05 and obtain the critical set $S_c = \{4, 5\}$.
5) Use the outer code with the most powerful error-correction capability $t_{max}$ to protect the subchannels in the critical set $S_c$, and calculate the number of information bits $K^*$ that can be loaded. For BCH codes with M = 7, the most powerful code is BCH(3,4,1) with $t_{max} = 1$, and the number of information bits that can be loaded is $K^* = 2 \times 4 + 3 \times 7 = 29$.
6) If $K^* \ge K_c$, the construction process ends; otherwise, move the basic reliability to the next less reliable subchannel and return to 3). Here, the number of information bits to be transmitted is $K_c = 28$, so $K^* \ge K_c$ and the process ends with $L_{info} = 5$ and $S_c = \{4, 5\}$.
To constrain the protection scope, we introduce the scale parameter α, which influences the size of the critical set $S_c$. When a large α is adopted, more subchannels with low reliability are used; when a small α is adopted, the protection capability of the outer BCH codes cannot be fully utilized. Both cases degrade the overall performance. In practice, we use Monte-Carlo simulation to approach the optimal value; in the simulations later, we adopt α = 3.05. With this construction method, the available overall code rate $R^*$ of the concatenated code can be calculated as
$$R^* = \frac{|S_c| \cdot R_B + \left(|I| - |S_c|\right)}{N}, \tag{6}$$

where $R_B$ denotes the code rate of the outer BCH codes, and |·| denotes the size of a set. There are two reasons for adopting outer codes with the same code rate. One is that, based on the analysis of the error distribution, low rate codes may not provide enough protection to the unreliable channels. The other is that codes with the same rate can greatly reduce the decoding latency by adopting the LUT decoding method, as described in Section II-C. The concept of the critical set is similar to the concept of intermediate channels used in [17], [18], which are BP-based concatenated codes. However, their criterion for selecting intermediate channels is vague and cannot be used in our scheme. The position indices and the average performance of critical set protected BCH-polar codes are shown in Fig.5. In the figure, the blue points represent the error probabilities of information subchannels without protection, while the golden and violet ones represent the rate-optimized protected [15], [16] and critical set protected subchannels, respectively. Moreover, the horizontal lines in different colors denote the corresponding average error probabilities and the locations of the protected subchannels. We can observe that the critical set protection method uses fewer unreliable subchannels and has a lower average error probability than the rate-optimized method at the same overall code rate.
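The critical set construction of this subsection can be sketched as a short loop. This is one possible reading of the steps, not a definitive implementation: the critical set is taken as the subchannels with $Q_b \le m_i < \alpha Q_b$, the unprotected information subchannels as those with $m_i \ge \alpha Q_b$, and the function name is hypothetical. The mean values in the usage lines are made-up numbers, chosen only so that the basic reliability equals the value $Q_b = 1.54$ quoted in the N = 8 example.

```python
import math

def construct_critical_set(means, m_len, k_bch, rate, alpha):
    """Return (critical, unprotected, k_star) with 0-based subchannel indices.

    means : mean LLR m_i of each inner subchannel (larger is more reliable)
    m_len : outer BCH code length M
    k_bch : information bits per BCH codeword with the largest t
    rate  : overall code rate R_c
    alpha : scale parameter (> 1)
    """
    n = len(means)
    k_c = math.ceil(n * m_len * rate)          # information bits to transmit
    order = sorted(range(n), key=lambda i: means[i], reverse=True)
    basic = math.ceil(n * rate)                # rank just below the L_info best
    while basic < n:
        q_b = means[order[basic]]              # basic reliability Q_b
        critical = sorted(i for i in range(n) if q_b <= means[i] < alpha * q_b)
        unprotected = sorted(i for i in range(n) if means[i] >= alpha * q_b)
        k_star = len(critical) * k_bch + len(unprotected) * m_len
        if k_star >= k_c:
            return critical, unprotected, k_star
        basic += 1                             # move to a less reliable subchannel
    raise ValueError("target rate cannot be reached with these parameters")

# Hypothetical mean LLRs for N = 8 (only the value 1.54 comes from the text).
means = [0.05, 0.4, 0.9, 4.2, 1.54, 6.0, 9.5, 20.0]
critical, unprotected, k_star = construct_critical_set(means, 7, 4, 0.5, 3.05)
print([i + 1 for i in critical], k_star)   # 1-based critical set and K*
```

With these inputs the sketch reproduces the figures of the worked example: the 1-based critical set {4, 5}, three unprotected subchannels, and K* = 29, which exceeds K_c = 28, so the loop ends after the first pass.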
B. ENCODING PROCESS FOR BCH-POLAR CODES
The encoding of the proposed concatenated BCH-polar codes is determined by the length parameters N and M and the overall code rate $R_c$. For the encoder, the number of overall information bits to be transmitted is $K_c = N \cdot M \cdot R_c$ (note that $R_c \le R^*$). The reliability of each subchannel is then obtained by the GA method for the target channel condition. Based on the reliabilities, the critical set of subchannels that need to be protected is constructed with the method above, and the number of information bits that can be protected is $|S_c| \cdot R_B \cdot M$. As illustrated in Fig.6, the primary information vector u is divided into two parts $(u_b, u_p)$. First, the information part $u_b$ is encoded by the BCH encoder. Then the obtained BCH codewords $x_b$, combined with the remaining information bits $u_p$, are put into the interleaver to obtain the inputs $u_c$ of the polar encoder. Finally, $u_c$ is sent into the polar encoder to get the concatenated BCH-polar codeword $x_c$.
In the example depicted in Fig.6, the inner polar code length is N = 16, the outer BCH code length is M = 7, and the overall code rate is $R_c = 1/2$, so the length of the primary information vector is 56 bits. The locations of the critical set are obtained by the method above for the 16-bit polar code at $E_b/N_0 = 2$ dB, and the size of the critical set is 6. These subchannels are protected by BCH codes with parameters (3,4,1). The last 21 bits of the information vector are selected to be protected. Since the BCH encoder input needs to be 24 bits long, 3 zero bits are padded to the information bits. The 42 BCH-coded bits output by the BCH encoder, combined with the remaining 35 information bits, are put into the interleaver. After interleaving, the 77 bits are sent into the polar encoder. After polar encoding, the length of the concatenated BCH-polar codeword is 112 bits. In this encoding procedure, the code rate of the polar codes is 11/16, while the effective overall code rate is 1/2.
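The bit counts in this example follow from a few multiplications, which the following sketch makes explicit. The function name and dictionary keys are hypothetical, and the number of unprotected information subchannels (5) is inferred here from the 35 remaining information bits divided by M = 7 rather than stated in the text.

```python
from fractions import Fraction

def bit_budget(n, m_len, k_bch, n_critical, n_unprotected, rate):
    """Bookkeeping of bit counts for one concatenated BCH-polar frame."""
    codeword_bits = n * m_len                       # M parallel polar codewords
    k_c = int(Fraction(rate) * codeword_bits)       # primary information bits
    unprotected_bits = n_unprotected * m_len        # sent without BCH protection
    protected_bits = k_c - unprotected_bits         # information bits given to BCH
    bch_input_bits = n_critical * k_bch             # capacity of the BCH encoders
    return {
        "k_c": k_c,
        "protected_bits": protected_bits,
        "padding_bits": bch_input_bits - protected_bits,
        "bch_output_bits": n_critical * m_len,
        "polar_input_bits": n_critical * m_len + unprotected_bits,
        "codeword_bits": codeword_bits,
        "polar_rate": Fraction(n_critical + n_unprotected, n),
        "overall_rate": Fraction(k_c, codeword_bits),
    }

budget = bit_budget(n=16, m_len=7, k_bch=4, n_critical=6,
                    n_unprotected=5, rate=Fraction(1, 2))
for name, value in budget.items():
    print(name, value)
```

The gap between the polar code rate 11/16 and the overall rate 1/2 is the redundancy spent on the outer BCH parity bits and the zero padding.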
C. JOINT DECODING METHOD
As mentioned in Section II-B, the SC decoding procedure is sensitive to error propagation because of its sequential nature. Therefore, conventional concatenated polar codes based on SC decoding adopt a joint iterative decoding algorithm to make sure that the bits previously decoded by SC decoding can be corrected by the outer codes. However, existing joint decoding procedures perform poorly when the impact of the outer decoding algorithm on the overall decoding performance and latency is considered. For non-binary outer codes, joint decoding requires additional storage to keep the uncorrected state of the decoded bits, since the information bits in polar codes are not contiguous. Concatenation with non-binary outer codes therefore degrades the error-correction performance and increases the decoding latency. For this reason, we choose binary BCH codes as the outer codes.
Since the previously decoded bits would be used in the sequential calculation of function g, the decoding complexity and latency of outer code have a significant influence on the overall decoding latency. As mentioned in Section II-C, the current BD decoding method for the rate-optimized 
D. DECODING COMPLEXITY
In this section, the decoding complexity of the proposed concatenated BCH-polar codes is analyzed. Since the SC decoding method is adopted for the inner polar codes, the complexity of decoding the parallel polar codes is O(MN log N), where M denotes the length of the outer BCH code. For the BCH code, the complexity of the LUT decoding method is O(2t), where 2t denotes the address width of the error pattern table. Hence, the overall complexity of decoding the proposed BCH-polar codes is O(MN log N + 2t), which is much less than the complexity
i ) of rate-optimized BCH-polar codes with BD decoding.
IV. ARCHITECTURE OF BCH-POLAR DECODER
In this section, we will introduce the corresponding hardware architecture of the decoder for our proposed concatenated BCH-polar codes. Aiming at designing a decoder with low latency and high throughput-area efficiency for concatenated BCH-polar, we redesign some components to reduce decoding latency and share the common parts between parallel inner polar decoders to lower area cost.
A. GENERAL ARCHITECTURE
Fig.7 depicts an overview of the concatenated BCH-polar decoder architecture, which comprises five main units: the SC decoders, the frozen bit ROM, the matrix generator, the controller, and the BCH decoder. The concatenated decoder contains only one controller, which is responsible for generating all control signals, as well as one matrix generator, which is part of the partial sum network unit and is described in more detail in the next subsection. Regarding the polar decoders, there are $M_{max}$ SC decoders working in parallel. For these SC decoders, we choose the architecture described in [23] as our basic inner polar code decoder and redesign several of its components.
In each SC decoder, the channel LLRs received from the detector are first stored in the channel buffer. These inputs are then used by the processing elements to calculate the intermediate LLRs. As described in Section II-B, the intermediate LLRs are calculated recursively, and the results are stored in the intermediate LLR RAM or directly bypassed to the processing elements for the calculation in the next cycle. In the hard decision steps, the frozen-bit pattern determined by the set I is read from the frozen bit ROM, where a 1 or 0 indicates that the current decoding bit corresponds to a frozen bit or an information bit, respectively. Since all SC decoders work in parallel, the decision results of the SC decoders in the same cycle can be combined into one BCH codeword for the subsequent BCH decoding. To support outer codes of different lengths, the BCH decoder contains several error pattern tables, as shown in Fig.7. The outputs of the BCH decoder are then sent back to each SC decoder to calculate the corresponding partial sum $u_s$. With this design, the overall decoding latency can be reduced, and the error-correction capability of the outer code can be fully utilized.
B. PARTIAL SUM NETWORK
The partial sum network (PSN) unit, which works after the outer BCH decoder, also lies on the critical path of the overall concatenated decoder, as shown in Fig.2. To obtain a low-latency PSN unit, the calculation circuit of $u_s$ in our design is a completely combinational circuit, as depicted in Fig.8. Different from the combinational feedback structure described in [24], the path delay of each partial sum bit in our design is the same, which is important when a large number of partial sums $u_s$ must be calculated. For the control signals of this calculation, we adopt the matrix generator proposed in [25] for its hardware efficiency. Moreover, since all SC decoders work in parallel, we implement only one matrix generator in our design.
V. SIMULATION AND HARDWARE IMPLEMENTATION
In this section, the error-correction performance of our proposed concatenated BCH-polar codes is evaluated by comparison with other concatenated polar codes. Then its corresponding hardware implementation results are presented as well.
A. THE ERROR-CORRECTION PERFORMANCE OF THE PROPOSED BCH-POLAR CODES
The error-correction performance of our proposed concatenated BCH-polar codes is evaluated via Monte-Carlo simulations. We carry out the simulations with the AFF3CT [26] software, extended with our proposed concatenated codes and joint decoding method. In these simulations, all transmissions use binary phase-shift keying (BPSK) modulation over an additive white Gaussian noise (AWGN) channel. All polar codes are constructed for the target channel condition $E_b/N_0 = 2.0$ dB, and the locations of the frozen bits are obtained by the GA construction method. The CRC-aided polar codes are concatenated with a 16-bit CRC whose generator polynomial is $g(D) = D^{16} + D^{12} + D^5 + 1$. First, we compare the BER performance of our proposed BCH-polar codes with stand-alone polar codes and the rate-optimized concatenated BCH-polar codes proposed in [15], [16]. The stand-alone polar codes are the mother codes of the punctured polar codes with the same code rate and length as the concatenated codes. In Fig.9, our proposed critical set protected BCH-polar codes are denoted by CP-BCH-polar for simplicity, while RO-BCH-polar denotes the rate-optimized BCH-polar codes. It can be observed that the proposed CP-BCH-polar codes perform better than the RO-BCH-polar codes in the moderate and high $E_b/N_0$ regimes. Notably, at a BER of $10^{-3}$, the proposed concatenation scheme has about 0.5 dB gain over the rate-optimized scheme. However, the proposed scheme performs poorly in the low $E_b/N_0$ regime, even worse than stand-alone polar codes at high code rates. This is because, at a high code rate, the proposed concatenation scheme has to use more unreliable subchannels, and even the outer BCH codes with the strongest error-correction capability cannot provide enough protection to the inner polar codes, especially at low $E_b/N_0$. Besides, the performance of concatenated polar codes with different code rates tends to converge under good channel conditions, since the number of errors is then always within the error-correction capability of the outer BCH codes.
FIGURE 9. BER performance of stand-alone polar codes, rate-optimized BCH-polar codes and our proposed critical set protected BCH-polar codes, for inner length N = 64, outer length M = 15 and overall length N* = 960 with different code rates. The stand-alone polar codes are of length N = 1024.
Then we compare the FER performance of our concatenated BCH-polar codes with that of the SCF decoding method, which also aims to provide improved performance with low complexity. At each decoding attempt of SCF, the concatenated CRC is verified to check the effectiveness of bit-flipping. For a fair comparison, the stand-alone polar codes used in SCF decoding are punctured using the method proposed in [27] so that the code rate of the punctured polar codes equals that of the concatenated polar codes. As depicted in Fig.10, the codes concatenated with longer outer codes have better error-correction performance than those with shorter ones because of their more powerful error-correction capability. Besides, the gap between the stand-alone polar codes and the concatenated polar codes grows as the inner polar code length increases, since more good subchannels are used for longer polar codes and their error probabilities are within the error-correction capability of the outer codes.
B. HARDWARE IMPLEMENTATION RESULTS
In this section, we present the ASIC implementation results of the decoder for our proposed concatenated BCH-polar codes. The ASIC designs are implemented and optimized for 65nm CMOS technology. We use registers instead of RAM to implement the memories in the design, as we do not have access to a RAM compiler, which leads to a slightly larger memory area. Besides, the intermediate LLR values in our design are quantized to 6 bits with two fractional bits, the same as in [28].
The layout of the decoder for the concatenated BCH-polar code (64-15), with a 64-bit polar code and a 15-bit BCH code, is shown in Fig.11. In the layout, the control unit, the BCH decoder, and the other common parts, colored in deep purple and orange, are placed in the center to reduce the path delay. The 15 polar decoders, each in a different color, are placed around the center. With this layout, the path delay between the common parts and the parallel polar decoders is greatly reduced. For our design, the silicon area increases linearly with the number of inner polar decoders. This can be verified both visually in Fig.11 and numerically in Table 1. As shown in Table 1, by comparing our designs with different concatenation parameters (64-15), (128-15) and (128-31), we can conclude that the polar decoders occupy the majority of the area, which increases linearly with the polar code length, while the controller unit and the frozenbit_rom unit remain nearly constant. As for the BCH decoder, its area changes greatly with the code length and correction capability. In general, the total area of our design is determined by the length of the inner polar code and the level of parallelism of the polar decoders, i.e., the length of the BCH code.
We then compare our design with the state-of-the-art architectures in [28]-[32]. For a fair comparison, the code lengths and code rates of the other polar decoders are the same as those of our concatenated code. The results in [29]-[31] have been converted to 65 nm technology. The comparison results are provided in Table 2. BP decoder implementations are not included in the table, since their performance varies greatly with the channel condition, which may lead to substantial performance degradation and an unfair comparison.
In the table, the decoding throughput of the decoder is calculated as

$$\text{Throughput} = \frac{N}{\text{Latency}},$$

and the area efficiency is defined as

$$\text{Area Efficiency} = \frac{\text{Decoding Throughput}}{\text{Area}}.$$
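The two metrics can be computed as in the following minimal sketch. The function names, the units (bits, microseconds, mm²), and the numbers in the usage example are hypothetical and are not taken from Table 2; N is assumed here to be the code length in bits.

```python
def throughput_mbps(code_length_bits: int, latency_us: float) -> float:
    """Throughput = N / Latency.

    With N in bits and latency in microseconds (assumed units),
    the result is in Mbps, since 1 bit/us equals 1 Mbps.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return code_length_bits / latency_us


def area_efficiency(throughput: float, area_mm2: float) -> float:
    """Area Efficiency = Decoding Throughput / Area (Mbps per mm^2)."""
    if area_mm2 <= 0:
        raise ValueError("area must be positive")
    return throughput / area_mm2


if __name__ == "__main__":
    # Hypothetical decoder: 1024-bit code, 2 us latency, 2 mm^2 area.
    tp = throughput_mbps(1024, 2.0)
    print(tp, area_efficiency(tp, 2.0))
```

Because both metrics are simple ratios, any change in latency (for example, from a higher average iteration count in an iterative decoder) carries over directly to the throughput and, at a fixed area, to the area efficiency.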
From Table 2, we observe that our design achieves much lower latency and higher throughput-area efficiency. Given the short length of the inner code, the semi-parallel decoding method proposed in [33] is not adopted, since it would slightly increase the decoding latency. The designs in [28] and [29], which use different SCL methods, achieve high error-correction performance at the cost of higher decoding latency and lower throughput. Our design, owing to its parallel architecture and the LUT decoding method for the outer BCH codes, has the lowest decoding latency and a comparably high throughput. Moreover, its area efficiency is comparable to that of the design in [29], since some common components are shared among the inner polar decoders.
The RCSC and RLSC designs in [30] are hardware implementations of the SCAN decoding method, modified for low area or low latency. As shown in Table 2, their implementation results were obtained at $E_b/N_0 = 4.7$ dB with an average of 1.025 iterations, which accounts for the low latency. However, the average number of iterations increases markedly in the low-SNR regime, which yields an exponential reduction in throughput. In other words, the area efficiency would then be much lower than the results shown in Table 2.
The design in [31] is the only hardware implementation of SCF decoding reported so far. The implementation results shown in Table 2 were obtained at $E_b/N_0 = 4$ dB. As with SCAN decoding, its decoding latency increases as the SNR degrades, which also leads to a lower area efficiency. Finally, we compare our design with that in [32], which is the optimal hardware implementation of the basic SC decoding method. That design adopts complex combinational logic to improve throughput. However, this approach leads to either a lower frequency or a higher area. Compared with that design, ours has 40% lower latency and comparable area efficiency.
VI. CONCLUSION
In this paper, we propose critical-set-protected BCH-polar codes together with a corresponding decoding architecture. Compared with the conventional rate-optimized method, our proposed concatenation scheme achieves better error-correction performance owing to the lower code rate of the inner polar codes. Extra protection is provided only for the subchannels in the critical set, which is constructed based on the reliabilities of the subchannels. By adopting a parallel decoding architecture and the LUT decoding method for the outer BCH codes, the overall decoding latency is greatly reduced, making the scheme more suitable for time-sensitive application scenarios. By redesigning some components on the critical path and sharing the common parts among the parallel inner polar decoders, our reference ASIC implementation achieves a high throughput-area efficiency.

He was a Professor with the School of Computer, National University of Defense Technology, one of the Academic Leaders of high performance computer architecture and micro-electronics and solid electronics, and a Doctoral Tutor. He has been engaged in teaching and research on computer science for over twenty years and has been responsible for or taken part in over 20 important projects, including Galaxy and TH-1/1A/2 series high-performance supercomputers design and FT series high-performance general-purpose CPU design, the National Natural Science Foundation, National Defense Pre-research funds, and so on. His research interests include microprocessor architecture design, 5G wireless communications, and VLSI architecture design for communication.

VOLUME 7, 2019
