Abstract-This work is on fast encoding and decoding of polar codes. We propose and detail 8-bit and 16-bit parallel decoders that can be used to reduce the decoding latency of the successive-cancellation decoder. These decoders are universal and can decode flexible-rate and flexible-length polar codes. We also present fast encoders that can be used to increase the throughput of serially-implemented polar encoders.
I. INTRODUCTION
Polar codes are capacity-achieving block codes that were recently introduced by Arıkan [1]. Due to their provable capacity-achieving performance with low-complexity encoding and decoding, they have gained significant interest [2], [3]. In particular, polar codes can be encoded and decoded in a recursive fashion, which results in encoding and decoding complexities of $O(N \log_2 N)$, where $N$ is the code length [1].
One of the main challenges associated with polar codes is their high decoding latency and low throughput [2], [4]. Since the serial nature of decoding is a bottleneck for fast implementation of polar coding, researchers have introduced novel ways to reduce the decoding time [2]-[6]. For example, [5] introduces the notion of rate-zero and rate-one nodes to reduce the decoding depth of the successive cancellation (SC) decoder. The resulting simplified SC (SSC) decoder reduces the decoding latency by up to 20 times [6]. The decoding latency can be further improved by identifying single-parity-check (SPC), repetition (REP) and REP-SPC nodes in the decoding tree of polar codes and implementing their fast decoders [2].
The key idea behind the above-mentioned strategies is to increase the decoding speed by reducing the decoding depth and implementing fast parallel decoders for some particular frozen-bit sequences. As such, these schemes work only on specific codes. In particular, the decoding tree of a given rate and length polar code is first constructed, and then the afore-mentioned nodes are identified and implemented in hardware. Changing the code length or rate will necessitate reidentification of these nodes. As such, these schemes are not suitable for variable-rate and/or variable-length polar codes.
The new radio access technology will use a variable rate and length polar code for downlink control information to fully utilize the physical resources [7]-[9]. As such, implementing fast polar decoders that work for any rate and length is of practical interest. One such decoder, which reduces the decoding depth by one, was proposed in [10]. In particular, the authors proposed to decode two bits in parallel by implementing four decoders corresponding to all frozen-bit sequences. The same authors then extended this concept to larger power-of-2 block sizes in [11], [12], but the extension results in huge hardware cost as the number of all frozen-bit sequences grows exponentially with the block size. Secondly, their decoding methodology does not identify and utilize the code structure for different frozen-bit sequences to reduce the hardware area or computational complexity.

M. Hanif and M. Ardakani are with the Department of Electrical and Computer Engineering, University of Alberta, AB, Canada (email: mhanif@uvic.ca, ardakani@ualberta.ca).
In this paper, we present fast decoders for variable-rate and/or variable-length polar codes. In particular, we borrow the idea presented in [10]-[12] of implementing $R$-bit ($R = 2, 4, 8, 16$) parallel decoders at the last stage. But, unlike [10]-[12], we do not implement $2^R$ parallel decoders for decoding each contiguous block of $R$ bits. Rather, we rely on a key characteristic of properly-designed polar codes, domination contiguity of the set of good bit-channel indexes [3], to significantly reduce the required number of parallel decoders for each block. Additionally, we use the minimum distance of the polar code corresponding to each domination-contiguous set to further reduce the number of required decoders. For example, we implement only 21 instead of $2^{16}$ parallel decoders for $R = 16$, which implies that the required number of decoders is reduced by 99.97% compared to a simple application of the ideas presented in [11], [12].
Achieving hardware-area reduction is not the only aim of our proposed decoding strategy. We also reduce the decoding complexity by relying on specific structure of the polar code corresponding to each bit-channel index set. We aim to minimize computationally-intensive operations, such as check-node operations, while ensuring that the proposed parallel decoders do not tangibly alter the performance of the SC decoder.
Unlike decoding, which is serial in nature, encoding of polar codes can be done in parallel. In fact, the seminal work on polar codes presented a fully-parallel encoding architecture for non-systematic polar codes [1]. Although very high encoding speed can be achieved by implementing a fully-parallel encoder, such an implementation requires a large memory and a large number of XOR gates, especially for long polar codes. As such, folded or serial implementations of polar-code encoders are used to reduce the hardware area [13]. Moreover, systematic polar codes, as proposed in [14], are serial by nature.
Like decoding, the serial implementation of a polar encoder results in higher encoding latency. Similar to our proposed decoding strategy, we can increase the encoding speed by implementing $R$-bit parallel encoders at the last stage. Such an implementation is particularly helpful for flexible-rate and/or flexible-length systematic polar codes, as they are non-trivial to parallelize due to their bidirectional information transfer [2], [3], [14].
Our main contributions are summarized in the following.
1) We present fast parallel decoders for variable-rate and variable-length polar codes. In particular, we detail 9 (instead of $2^8$) decoders for 8-bit parallel decoding and 21 (instead of $2^{16}$) decoders for 16-bit parallel decoding of polar codes. These decoders accommodate all frozen-bit sequences that can occur in a block of 8 or 16 bits.
2) Our scheme improves the encoding speed of serially-implemented variable-rate or variable-length polar codes. It is particularly useful for flexible-rate or flexible-length systematic polar codes, as their encoding is hard to parallelize.
In the following, we first provide a background on polar codes in Section II, primarily to establish some notation and explain the challenges associated with variable-rate polar codes. We also review the domination contiguity of the set of good bit-channel indexes in Section II. We then present our proposed $R$-bit parallel encoders/decoders in Section III. Afterwards, we present some numerical results in Section IV to corroborate our proposed scheme, followed by concluding remarks in Section V.
Although we will be focussing mainly on the systematic polar codes (due to their superior performance and difficulty in parallelization [3] , [14] ), similar results and conclusions can be drawn for non-systematic polar codes with little or no modifications.
II. BACKGROUND

A. Encoding Polar Codes
Polar codes defined on the binary field, $\mathbb{F}_2$, are block codes, which can be mathematically described as

$$x = u G_N, \tag{1}$$

where $x \in \mathbb{F}_2^N$ is the codeword of length $N = 2^n$, $u \in \mathbb{F}_2^N$ is the input vector comprising information and frozen bits, and

$$G_N = \begin{bmatrix} G_{N/2} & 0 \\ G_{N/2} & G_{N/2} \end{bmatrix} \tag{2}$$

denotes the generator matrix. The matrix $G_N$ for non-reversed polar codes$^1$ is $G_N = F^{\otimes n}$, where $F^{\otimes n}$ is the $n$th tensor power of $F$ defined as

$$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \tag{3}$$

with $F^{\otimes 0} = 1$. Denoting the left and right halves of $u$ ($x$) by $u_0$ and $u_1$ ($x_0$ and $x_1$), respectively, (2) implies $x_1 = u_1 G_{N/2}$, and $x_0 = u_0 G_{N/2} + x_1$. Consequently, a message vector of length $N$ can be encoded by encoding two message vectors of length $N/2$ each. Repeating this process $n$ times results in encoding $N$ message bits individually. As such, polar codes have a low encoding complexity of $O(N \log_2 N)$.
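The recursion $x_1 = u_1 G_{N/2}$, $x_0 = u_0 G_{N/2} + x_1$ can be sketched in a few lines of Python. This is a minimal illustration for the non-systematic, non-reversed code only; the function names `polar_encode` and `generator_matrix` are hypothetical and not taken from the paper.

```python
def polar_encode(u):
    """Non-systematic polar encoding x = u G_N over F_2 (len(u) is a power of 2)."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    x1 = polar_encode(u[half:])                                # x_1 = u_1 G_{N/2}
    x0 = [a ^ b for a, b in zip(polar_encode(u[:half]), x1)]   # x_0 = u_0 G_{N/2} + x_1
    return x0 + x1


def generator_matrix(n):
    """Rows of G_N = F^{(tensor) n}, built with the block recursion of (2)."""
    g = [[1]]
    for _ in range(n):
        zeros = [0] * len(g)
        g = [row + zeros for row in g] + [row + row for row in g]
    return g


# Example: N = 8 with u_3 = u_5 = u_6 = u_7 = 1 and all other inputs zero.
x = polar_encode([0, 0, 0, 1, 0, 1, 1, 1])
```

Each level of the recursion performs $N/2$ XORs and there are $\log_2 N$ levels, which is where the $O(N \log_2 N)$ complexity comes from.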
B. Constructing Polar Codes
Polar codes rely on the phenomenon of channel polarization, which constructs $N$ polarized bit channels out of $N$ independent copies of a given binary memoryless channel, $W$. In particular, $N$ copies of $W$ are combined and then split into a set of $N$ binary-input channels, $W_N^{(i)}$, where $i = 0, 1, \cdots, N-1$, such that the symmetric capacity of $W_N^{(i)}$ tends towards either 0 or 1 as $N$ becomes large. Bit channels having near-unity symmetric capacity are identified as 'good' channels, whereas the others are classified as 'bad' channels and are frozen to zero [1]. Mathematically, denoting the set of 'good' and 'bad' bit-channel indexes by $A$ and $A^c$, respectively, $u_i = 0$ for all $i \in A^c$.
C. Decoding Polar Codes
Since a polar code can be encoded recursively, it can be represented by a binary tree, where each node represents a codeword [2], [5]. Fig. 1(a) shows such a tree corresponding to a polar code of length 16, where the white and black leaves correspond to frozen and information bits, respectively. We explain different decoding algorithms using the tree representation as follows.

In the SC decoder [1], the root node receives $y$, the channel log-likelihood ratios (LLR). Denoting the left and right halves of $y$ by $y_0$ and $y_1$, respectively, the root node sends the outputs of check-node operations between $y_0$ and $y_1$ to its left child. Mathematically, the left child receives $y'_0 = y_0 \boxplus y_1$, a vector of $N/2$ real numbers whose $i$th element, $y'_{0i}$, is computed as

$$y'_{0i} = 2 \tanh^{-1}\!\left( \tanh\frac{y_{0i}}{2} \, \tanh\frac{y_{1i}}{2} \right),$$

where $y_{0i}$ and $y_{1i}$ are the $i$th elements of $y_0$ and $y_1$, respectively. Afterwards, the left child performs decoding on $y'_0$ (which we will explain later) and returns a binary vector, $\hat{x}_0$, to the root node. The root node then computes $y'_1$ and sends it to the right child. The $i$th element of $y'_1$, $y'_{1i}$, is computed as

$$y'_{1i} = y_{1i} + (1 - 2\hat{x}_{0i})\, y_{0i},$$

where $\hat{x}_{0i}$ is the $i$th element of $\hat{x}_0$. The right child then performs decoding similar to the left child and returns a binary vector, $\hat{x}_1$, to the root node. The root node then returns an estimate of the codeword based on $\hat{x}_0$ and $\hat{x}_1$ as

$$\hat{x} = [\, \hat{x}_0 + \hat{x}_1 \;\; \hat{x}_1 \,],$$

where the vector addition is performed in $\mathbb{F}_2^{N/2}$. For non-systematic polar codes, the left and right children also return $\hat{u}_0$ and $\hat{u}_1$, respectively, to the root node. These binary vectors are estimates of the left and right halves of the input, $u$. The root node, in addition to computing $\hat{x}$, computes an estimate of $u$ as $\hat{u} = [\, \hat{u}_0 \;\; \hat{u}_1 \,]$.
Since each child of the root node can be considered as the root node of a subtree, each child performs exactly the same operations as its parent node. This process continues until the leaf nodes receive real-valued messages from their parents. Since leaf nodes do not have any children, they send either 0 or hard-decision estimates based on the received LLRs to their parents, depending on the frozen-bit sequence.
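The message passing described above can be sketched as a short recursive routine. This is only an illustrative sketch: it assumes the exact (tanh-based) check-node operation, the convention that a non-negative LLR favours bit 0, and hypothetical names `check_node` and `sc_decode`; it is not a hardware-oriented description.

```python
import math


def check_node(a, b):
    """Check-node operation on two LLRs (product clipped so atanh stays finite)."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    p = max(min(p, 1 - 1e-12), -(1 - 1e-12))
    return 2.0 * math.atanh(p)


def sc_decode(y, frozen):
    """Return (x_hat, u_hat) for LLRs y and frozen flags (1 = frozen to zero)."""
    if len(y) == 1:
        # Leaf: frozen bits are 0, information bits get a hard decision.
        bit = 0 if frozen[0] or y[0] >= 0 else 1
        return [bit], [bit]
    h = len(y) // 2
    y0, y1 = y[:h], y[h:]
    left_llr = [check_node(a, b) for a, b in zip(y0, y1)]
    x0, u0 = sc_decode(left_llr, frozen[:h])
    # Right-child message uses the left child's hard decisions.
    right_llr = [b + (1 - 2 * s) * a for a, b, s in zip(y0, y1, x0)]
    x1, u1 = sc_decode(right_llr, frozen[h:])
    return [a ^ b for a, b in zip(x0, x1)] + x1, u0 + u1
```

Because the right child cannot start before the left child has returned, the recursion is inherently sequential, which is the source of the latency discussed next.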
The decoding latency of the SC decoder depends on the computation time of the check-node operation and the decoding-tree depth. The SSC decoder [5] improves the latency by identifying and removing descendants of rate-0 and rate-1 nodes in the code tree as shown in Fig. 1 (b) . Here, instead of traversing (which involves performing check-node operations) the subtree rooted in a rate-0 node, an all-zero vector is sent to its parent node. Similarly, a rate-1 node computes and sends the binary vector(s) to its parent node without traversing its descendants.
The fast-SSC decoder [2] further prunes the decoding tree by identifying the SPC, REP and REP-SPC nodes in the tree. For example, in Fig. 1 (c) , the subtrees of two SPC nodes are removed from the tree resulting in the decoding depth of only two.
As is clear from Fig. 1, both the SSC and fast-SSC decoders require identification of some special nodes based on the frozen-bit sequence, which can be used to eliminate the corresponding subtrees in the code tree. In polar codes with flexible rate or length, the frozen-bit sequence changes with the code rate and length (see Fig. 2). Hence, these algorithms are not directly applicable.
For variable-rate or variable-length polar codes, $R$-bit parallel decoders as proposed in [10]-[12] can be used to improve the decoding speed, where $R = 2, 4, 8, \cdots$. Fig. 2 shows such a case where a 16-bit parallel decoder is implemented to decode a variable-rate polar code of length $N = 64$. Here, the frozen-bit sequence is represented in hexadecimal notation. For example, when the frozen-bit sequence is FFFE, all bits except the last one are frozen to zero in a block of 16 bits.
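As a small illustration of the hexadecimal notation, the following sketch expands a frozen-bit sequence into per-bit flags. It assumes the leftmost hexadecimal digit covers the first four bit channels of the block, which is consistent with the FFFE example above; the helper name `frozen_bits_from_hex` is hypothetical.

```python
def frozen_bits_from_hex(f_hex, r=16):
    """Expand a hexadecimal frozen-bit sequence into r flags (1 = frozen)."""
    value = int(f_hex, 16)
    # Bit channel 0 is taken from the most-significant bit of the hex string.
    return [(value >> (r - 1 - i)) & 1 for i in range(r)]


flags = frozen_bits_from_hex("FFFE")   # only the last bit channel carries information
```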
Observe that, unlike Fig. 1 (b) and (c), the code-tree structure remains the same regardless of the code rate. Secondly, the code tree is also complete, i.e., all nodes except the leaf nodes have two children. A similar observation can be made about Fig. 1 (d) and (e), where 8-bit and 16-bit parallel decoders are used to improve the decoding speed, respectively.
The $R$-bit parallel decoders must be able to decode all the frozen-bit sequences that can occur in a block of $R$ bits. One way to satisfy this requirement is to implement $2^R$ parallel sub-decoders and select the appropriate one corresponding to the frozen-bit sequence as proposed in [10]-[12]. Unfortunately, this solution becomes impractical even for small values of $R$. However, we can reduce the required number of decoders by using a key characteristic of properly-designed polar codes, domination contiguity, as described below.
D. Domination Contiguity
Let $\{\!\{N\}\!\}$ denote the set $\{0, 1, \cdots, N-1\}$, where $N = 2^n$, and $n$ is a positive integer. For $i \in \{\!\{N\}\!\}$, we use $(i)_2$ to represent the $n$-bit binary representation of $i$, i.e., $(i)_2 = i_n i_{n-1} \cdots i_1$, where $i_n$ is the most-significant bit. For $i, j \in \{\!\{N\}\!\}$, $i$ is said to binary dominate $j$, denoted by $i \succeq j$, if and only if $i_k \geq j_k$ for all $1 \leq k \leq n$, where $i_k$ and $j_k$ are the $k$th least-significant bits in the binary representations of $i$ and $j$, respectively. A set, $S \subseteq \{\!\{N\}\!\}$, is domination contiguous if $h, j \in S$ and $h \succeq i \succeq j$ implies $i \in S$.
For a properly-designed polar code, the set of good bit-channel indexes, $A$, must be domination contiguous [3]. Since not all subsets of $\{\!\{N\}\!\}$ are domination contiguous, the cardinality of the set of all possible frozen-bit sequences of a properly-designed polar code is less than $2^N$. In the following, we will explain how this key characteristic can be used to significantly reduce the number of required decoders from $2^R$ for $R$-bit parallel decoders.
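The two definitions above translate directly into bit operations. The sketch below is a brute-force check written for clarity rather than speed; the names `dominates` and `is_domination_contiguous` are hypothetical.

```python
def dominates(i, j):
    """True if i binary dominates j, i.e., every 1 bit of j is also set in i."""
    return (i & j) == j


def is_domination_contiguous(indexes, n_bits):
    """Check: h, j in S and h >= i >= j (in the domination order) implies i in S."""
    s = set(indexes)
    for h in s:
        for j in s:
            if not dominates(h, j):
                continue
            for i in range(1 << n_bits):
                if dominates(h, i) and dominates(i, j) and i not in s:
                    return False
    return True
```

For instance, with $n = 3$ the set $\{3, 5, 6, 7\}$ passes the check, whereas $\{1, 7\}$ fails because $7 \succeq 3 \succeq 1$ but 3 is missing.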
III. PROPOSED ENCODER/DECODER STRUCTURE
The proposed encoders/decoders rely on parallel encoding/decoding of $R$ bit channels simultaneously, where $R = 2^t$, and $t$ is a positive integer. Specifically, we divide the $N$-bit $u$ into $N/R$ consecutive groups each containing $R$ bits, i.e., $u = [u_0 \cdots u_{N/R-1}]$. Then we encode/decode the $R$ bits in each group simultaneously to reduce the encoding/decoding depth of the code tree by $t$ levels.
Quite intuitively, the greater the value of $R$, the faster the encoder/decoder will be. On the downside, we require more hardware to implement all the encoders/decoders for a block size of $R$. Theorem 1 describes the inheritance of the domination contiguity of $A$ by each block of $R$ bits, which will be used later to optimize the number of required encoders/decoders for an $R$-bit parallel decoder.

Theorem 1. Let $\tilde{A}_i$ denote the set of good bit-channel indexes corresponding to $u_i$, $i = 0, 1, \cdots, N/R-1$. For a properly-designed polar code, $\tilde{A}_i$ is domination contiguous.
Proof: Observe that all elements of each $\tilde{A}_i$ have the same $n - t$ most-significant bits, where $n = \log_2(N)$, and $t = \log_2(R)$. Further, the $n - t$ most-significant bits corresponding to each $\tilde{A}_i$ differ in at least one bit from those of $\tilde{A}_j$, where $j = 0, 1, \cdots, N/R-1$, and $j \neq i$. Domination contiguity of $A$ implies that if $h, j \in \tilde{A}_i$ and $h \succeq k \succeq j$ then $k \in A$. But $h$ and $j$ have the same $n - t$ most-significant bits. As such, $h \succeq k \succeq j$ implies that $k$ also has the same $n - t$ most-significant bits. Therefore, $k \in \tilde{A}_i$.
An immediate consequence of the domination contiguity of the $\tilde{A}_i$'s is that the number of required encoders/decoders can be reduced from $2^R$. The following theorem shows that the number of all domination-contiguous sets becomes significantly less than $2^R$ as $R$ grows to infinity.

Theorem 2. Let $U^i_R$ denote the universal set containing all the possible sets of good bit-channel indexes of the $i$th $R$-bit block, $u_i$, for $i = 0, 1, \cdots, N/R-1$. Denoting the cardinality of a set $S$ by $|S|$, the ratio $|U^i_R|/2^R$ is a decreasing function of $R$, and in the limit as $R$ goes to infinity, the ratio $|U^i_R|/2^R$ goes to zero.
Proof: Without loss of generality, we consider the first $R$-bit block ($i = 0$). Furthermore, we let $P_R = |U^0_R|$ and split the elements of any $\tilde{A}_0 \in U^0_R$ into two sets, $S_0$ and $S_1$, where $S_0$ contains those elements of $\tilde{A}_0$ that are less than $R/2$, and $S_1$ contains the remaining.

Theorem 1 asserts that the domination contiguity of $\tilde{A}_0$ begets domination contiguity of $S_0$ and $S_1$. Consequently, $U^0_R \subseteq \mathcal{B}$, where $\mathcal{B} = \{B : B = S_0 \cup S_1\}$ collects all unions of an admissible $S_0$ and an admissible $S_1$, and as such, $|U^0_R| \leq |\mathcal{B}| = P_{R/2}^2$. Moreover, $\tilde{A}_0 \neq \{\}$ implies $S_1 \neq \{\}$. Conversely, if $S_1 = \{\}$ then $\tilde{A}_0 = \{\}$ and $S_0 = \{\}$. But $\mathcal{B}$ contains those sets which correspond to $S_1 = \{\}$ and $S_0 \neq \{\}$. Therefore, $U^0_R \subset \mathcal{B}$, and $P_R < P_{R/2}^2$. Let $a_t = P_R/2^R$, where $t = \log_2(R)$, and $t = 1, 2, \cdots$. Using the fact that $0 < a_t < 1$ and $a_{t+1} < a_t^2$, we get $\lim_{t \to \infty} a_t = \lim_{R \to \infty} P_R/2^R = 0$.

Theorem 2 shows that the ratio $P_R/2^R$ is a decreasing function of $R$. But it does not imply that $P_R$ does not increase rapidly. In fact, $P_R$ can be shown to be equal to 2, 3, 6, 20, 168, and 7581 for $R = 1, 2, 4, 8, 16$, and 32, respectively. Clearly, implementing $P_R$ parallel decoders to increase the decoding speed becomes impractical even for moderate block lengths. Fortunately, we can reduce the number of required decoders further by eliminating some frozen-bit sequences depending on the minimum distance of the corresponding code and the position of frozen bits in the sequence, as explained below.

Note that the minimum required number of encoders/decoders corresponding to each block of $R$ bits is $R + 1$ because the number of frozen bits in a block of $R$ bits can vary from 0 to $R$. Hence, at least $R + 1$ encoders/decoders are needed. A solution based on $R + 1$ encoders/decoders, if it exists, minimizes the hardware area.
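The values of $P_R$ quoted above can be reproduced by brute force for small $R$. The sketch below rests on an assumption that should be stated plainly: it counts the subsets of $\{0, \cdots, R-1\}$ that are closed upward under binary domination (if $j$ is in the set and $i \succeq j$, then $i$ is in the set), which is the reading under which the quoted counts are obtained. The function name is hypothetical, and the enumeration is only practical up to $R = 16$.

```python
def count_upward_closed_sets(r):
    """Count subsets S of {0..r-1} with: j in S and i dominates j  =>  i in S."""
    n_bits = r.bit_length() - 1                 # r = 2 ** n_bits
    # covers[j]: bitmask of the indexes obtained by setting one extra bit of j.
    covers = []
    for j in range(r):
        m = 0
        for k in range(n_bits):
            if not (j >> k) & 1:
                m |= 1 << (j | (1 << k))
        covers.append(m)
    count = 0
    for mask in range(1 << r):                  # every subset as a bitmask
        ok = True
        for j in range(r):
            if (mask >> j) & 1 and (mask & covers[j]) != covers[j]:
                ok = False
                break
        count += ok
    return count


counts = [count_upward_closed_sets(r) for r in (1, 2, 4, 8)]
```

Checking closure only against the one-bit-larger neighbours is sufficient, because closure under single-bit additions implies closure under domination in general.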
In the following, we present encoders for all the 20 cases corresponding to $\tilde{A}_i$ for $R = 8$. Later, we reduce this number to the minimum, i.e., 9, based on the position of frozen bits and the minimum distance of the polar code corresponding to each frozen-bit sequence. To do so, we define the notion of the max-frozen set, which will be used to eliminate some candidate sets.
Let $\pi_n : \{\!\{N\}\!\} \to \{\!\{N\}\!\}$ denote a bit-permutation function that maps $j \in \{\!\{N\}\!\}$ to $\pi_n(j)$ such that the bits in $(\pi_n(j))_2$ are a permutation of the bits in $(j)_2$. Further, for $D \subseteq \{\!\{N\}\!\}$, we define $\pi_n(D) = \{\pi_n(d) : d \in D\}$. For example, the bit-reversal permutation [1] for $N = 4$ maps $j$ to $\pi_2(j)$, where $\pi_2(0) = 0$, $\pi_2(1) = 2$, $\pi_2(2) = 1$, and $\pi_2(3) = 3$.

Also, we observe that binary domination is invariant to bit permutations, i.e., $h \succeq k \succeq j$ implies $\pi_n(h) \succeq \pi_n(k) \succeq \pi_n(j)$. Let $P_N = [e_{\pi_n(0)} \; e_{\pi_n(1)} \cdots e_{\pi_n(N-1)}]$ denote the permutation matrix corresponding to the bit-permutation function $\pi_n(\cdot)$. Here, $e_l$ denotes a column vector whose $l$th element is 1 and the remaining $N - 1$ elements are 0. It was shown in [15], [16] that $P_N G_N = G_N P_N$. Therefore, $u P_N G_N = x P_N$. Denoting $u P_N = u^{\pi_n}$ and $x P_N = x^{\pi_n}$, we get $u^{\pi_n} G_N = x^{\pi_n}$. Consequently, if $x$ is a polar codeword for input $u$, then permuting the bit positions of $u$ permutes $x$ in the same manner. We refer to these codes as the conjugates of the original code. Observe that the set of good bit-channel indexes of a conjugate polar code is $\pi_n(A)$.
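The statement that permuting the bit positions of $u$ permutes $x$ in the same manner can be checked exhaustively for small $N$. The sketch below assumes the non-systematic encoder $x = u G_N$ with $G_N = F^{\otimes n}$ and represents a bit permutation as a tuple `perm`, where output bit `k` is taken from input bit `perm[k]`; all function names are hypothetical.

```python
import itertools


def polar_encode(u):
    """x = u G_N over F_2 via the recursion x_1 = u_1 G, x_0 = u_0 G + x_1."""
    if len(u) == 1:
        return list(u)
    half = len(u) // 2
    x1 = polar_encode(u[half:])
    x0 = [a ^ b for a, b in zip(polar_encode(u[:half]), x1)]
    return x0 + x1


def bit_permute(j, perm):
    """Index obtained from j by moving bit perm[k] of j to bit position k."""
    return sum(((j >> perm[k]) & 1) << k for k in range(len(perm)))


def permute_vector(v, perm):
    """Entry j of the result is v[pi(j)], with pi the bit permutation."""
    return [v[bit_permute(j, perm)] for j in range(len(v))]


def encoding_commutes(n_bits):
    """Check encode(permute(u)) == permute(encode(u)) for all u and all pi."""
    size = 1 << n_bits
    for perm in itertools.permutations(range(n_bits)):
        for value in range(1 << size):
            u = [(value >> i) & 1 for i in range(size)]
            lhs = polar_encode(permute_vector(u, perm))
            rhs = permute_vector(polar_encode(u), perm)
            if lhs != rhs:
                return False
    return True
```

With `perm = (1, 0)` the helper `bit_permute` reproduces the bit-reversal example for $N = 4$ given above.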
Lemma 1 confirms that, similar to $A$, $\pi_n(A)$ is domination contiguous. Further, when $\pi_n(A)$ differs from $A$, the SC decoding shows worse performance [16]. However, the SC decoder can be modified to decode in the permuted order [15], [16] to achieve the same performance. Consequently, for a given decoding order, only one set of good bit-channel indexes will show the best performance. Quite intuitively, the set which results in early 'decoding' of the most frozen bits will outperform the others [17, Section 7.4.3]. We call this set the max-frozen set. As such, for a given decoding order, the number of required decoders can be reduced by implementing decoders only for the max-frozen sets and ignoring their distinct bit-permuted sets.
In the following, we present encoders and decoders for $R = 8$. For the sake of clarity, we will drop the subscript $i$ from $\tilde{A}_i$. Also, the set of frozen bit-channel indexes will be represented by $\tilde{A}^c$. Lastly, we will consider the natural-order decoding of polar codes to eliminate all the bit-permuted sets of the max-frozen set.
A. Block Size 8 Encoders/Decoders
Table I lists all the 20 frozen-bit sequences, $f$, that can occur in a block of 8 consecutive bit channels of a properly-designed polar code. The $i$th component of $f$ is 1 if the $i$th bit channel is frozen and is 0 otherwise. These sequences are grouped into 9 different cases depending on the number of information bits in the block of 8 bits. The corresponding sets of good bit-channel indexes are also tabulated. Observe that all the sets are domination contiguous. Lastly, the systematic polar code, $x$, corresponding to each frozen-bit sequence is also given along with its minimum distance, $d_{\min}$. As we explain below, the code structure and its corresponding frozen-bit sequence and $d_{\min}$ will be used to further reduce the number of possible cases from 20 to 9. For the sake of brevity, we have used the notation $x_{abc}$ to denote $x_a + x_b + x_c$.
In the following, we discuss each individual case and explain why a particular frozen-bit sequence is kept in each case.
1) Case 0: This case corresponds to the rate-0 node introduced in [5], and the optimal decoder assigns an all-zero vector to the output.
2) Case 1: This is an (8, 1) repetition code, and the optimal maximum-likelihood (ML) decoder will add the LLRs of all the channel outputs and perform threshold detection on the sum [18] . The same decoder is used in [2] , where they have outlined some low-latency decoding strategies for improving the decoding speed.
3) Case 2: All the three cases are (8, 2) repetition codes and are conjugates of one another. For the SC decoder with natural-order decoding, the first case is the max-frozen set. Consequently, the other cases will not occur. The optimal decoder, like in Case 1, will add the LLRs of four outputs to estimate $x_7$ and the other four LLRs to estimate $x_6$.
4) Case 3: These codes are concatenated (8,4) repetition and (4,3) single parity-check (SPC) codes and are conjugates of one another. Since $\tilde{A} = \{5, 6, 7\}$ is the max-frozen set, only the first case will occur in practice. The decoding can be carried out by first adding the LLRs of the outputs corresponding to the same bits. As such, we are left with the LLRs of a (4,3) SPC code. The optimal ML decoder of the SPC codes, the Wagner decoder [19], makes hard-decision estimates of the $x_i$'s and flips the least-reliable bit if the parity check is not satisfied.
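The Case 3 procedure (combine the repeated outputs, then run a Wagner decoder) can be sketched as follows. Two assumptions are made here and are not spelled out in the text above: the repeated pairs are taken to be $(x_i, x_{i+4})$, which follows from rows 5, 6, and 7 of $G_8 = F^{\otimes 3}$ having identical entries in columns $i$ and $i+4$, and a non-negative LLR is taken to favour bit 0. The function names are hypothetical.

```python
def wagner_decode(llrs):
    """ML decoding of an even-parity SPC code from LLRs (non-negative = bit 0)."""
    bits = [0 if l >= 0 else 1 for l in llrs]
    if sum(bits) % 2 == 1:
        # Parity violated: flip the least-reliable hard decision.
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1
    return bits


def decode_case3(y):
    """Decode the 8-bit block for A = {5, 6, 7} from its 8 channel LLRs."""
    combined = [y[i] + y[i + 4] for i in range(4)]   # repetition: add paired LLRs
    half = wagner_decode(combined)                   # (4,3) SPC on the combined LLRs
    return half + half                               # x_i = x_{i+4}
```

No check-node operation appears anywhere in this decoder; only real additions, sign tests, and one minimum search are needed.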
5) Case 4: There are four candidate sets of good bit-channel indexes, $\tilde{A}$, when $k = 4$. Amongst them, three correspond to (8, 4) repetition codes. As such, their $d_{\min} = 2$. In only one case, $\tilde{A} = \{3, 5, 6, 7\}$, the minimum distance turns out to be 4. Since code performance heavily depends on $d_{\min}$, only this case will occur in practice. In fact, this is an extended Hamming code [20] and is equivalent to the repetition-SPC code introduced in [2]. Although the optimal ML decoder for such a code can be implemented easily [18], a low-complexity decoder of this code was mentioned in [2]. Furthermore, the bit-error rate (BER) performance of the code is not considerably altered by implementing the low-complexity decoder instead of the optimal decoder [2]. For completeness, we briefly describe the low-complexity decoder below.
First, observe that $(x_0 + x_4, x_1 + x_5, x_2 + x_6, x_3 + x_7)$ constitutes a (4,1) repetition code, while $(x_4, x_5, x_6, x_7)$ is a (4,3) SPC code. The repetition code can easily be decoded by adding the LLRs, resulting in $\hat{x}_8$, a hard-decision estimate of $x_8 = x_3 + x_7$. Afterwards, additional LLRs for $(x_4, x_5, x_6, x_7)$ are trivially computed by either keeping or switching the sign of the LLRs of $(x_0, x_1, x_2, x_3)$, depending on the value of $\hat{x}_8$. After adding the LLRs of $(x_4, x_5, x_6, x_7)$, we are left with a (4,3) SPC code, which can be decoded by the Wagner decoder.
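The low-complexity Case 4 decoder described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the exact tanh-based check-node operation for the LLRs of $x_i + x_{i+4}$, a non-negative LLR favouring bit 0, and hypothetical function names.

```python
import math


def check_node(a, b):
    """Check-node operation on two LLRs (product clipped so atanh stays finite)."""
    p = math.tanh(a / 2.0) * math.tanh(b / 2.0)
    p = max(min(p, 1 - 1e-12), -(1 - 1e-12))
    return 2.0 * math.atanh(p)


def wagner_decode(llrs):
    """Hard decisions, then flip the least-reliable bit if parity is odd."""
    bits = [0 if l >= 0 else 1 for l in llrs]
    if sum(bits) % 2 == 1:
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1
    return bits


def decode_case4(y):
    """Decode the 8-bit block for A = {3, 5, 6, 7} from its 8 channel LLRs."""
    # Four check-node operations (independent, so they can run in parallel).
    rep_llr = sum(check_node(y[i], y[i + 4]) for i in range(4))
    x8 = 0 if rep_llr >= 0 else 1                      # (4,1) repetition decision
    sign = 1 - 2 * x8
    # Keep or switch the sign of the first-half LLRs, then add to the second half.
    spc_llr = [y[i + 4] + sign * y[i] for i in range(4)]
    right = wagner_decode(spc_llr)                     # (x_4, ..., x_7)
    left = [b ^ x8 for b in right]                     # x_i = x_{i+4} + x_8
    return left + right
```

A lower-latency variant, in the spirit of the two-decoder arrangement mentioned for Case 5, would run `wagner_decode` for both values of `x8` at once and select the result when `x8` becomes available.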
6) Case 5: Observe that all the codes are conjugates of one another, and $\tilde{A} = \{3, 4, 5, 6, 7\}$ is the max-frozen set. Thus, only the first case will occur in practice. By introducing an additional node, $x_8 = x_3 + x_7$, a cycle-free Tanner graph of the code can be obtained as shown in Fig. 3. As such, a non-iterative optimal maximum-a-posteriori (MAP) decoder can be implemented [21].
Like in Case 4, a low-complexity sub-optimal decoder can also be implemented by first making a hard decision about $x_8$. Afterwards, depending on $\hat{x}_8$, the LLRs of $(x_0, x_1, x_2, x_3)$ can be added to or subtracted from those of $(x_4, x_5, x_6, x_7)$. Hard decisions of $(x_4, x_5, x_6, x_7)$ can then be carried out, which along with $\hat{x}_8$ can be used to find estimates of $(x_0, x_1, x_2, x_3)$. The decoding latency can be reduced by implementing two parallel decoders assuming $x_8$ equals 0 or 1 and selecting the appropriate output depending on the actual value of $\hat{x}_8$. Also, as verified by the simulation results, the BER performance is not tangibly degraded by implementing the sub-optimal decoder instead of the optimal MAP decoder.
7) Case 6: All of the codes are conjugates of one another. But only the first case will occur in practice as $\tilde{A} = \{2, 3, 4, 5, 6, 7\}$ is the max-frozen set. Since this code is just two (4,3) SPC codes put together, two Wagner decoders can be used to optimally decode it.
8) Case 7: This is an (8,7) SPC code, and the Wagner decoder can be used to optimally decode it. Note that this code is equivalent to the one corresponding to the SPC node mentioned in [2], which was introduced to increase the decoding speed, especially for high-rate polar codes.
9) Case 8: This case is equivalent to the rate-1 node introduced in [5], and a hard decision of the observed outputs gives the decoded output.
Remark 1. For non-systematic codes, exactly the same decoders can be used to find a hard-decision estimate of x, which can be used to decode the input, u. For example, for f = (1, 1, 1, 1, 1, 1, 0, 0 ), x = (u 6 + u 7 , u 7 , u 6 + u 7 , u 7 , u 6 + u 7 , u 7 , u 6 + u 7 , u 7 ), which is an (8,2) repetition code. Like in Case 2, we can find hard-decision estimates of u 6 + u 7 and u 7 , which can be used to find a hard-decision estimate of u 6 .
Remark 2. For permuted polar codes, similar conclusions can be drawn as the set of good bit-channel indexes is also domination contiguous for permuted (systematic/non-systematic) polar codes. In particular, the max-frozen sets for an individual case will be determined according to the permuted decoding order. Also, the presented low-complexity decoders can trivially be modified to decode the polar codes corresponding to the max-frozen sets.
Remark 3. Implementing the proposed low-complexity decoders reduces the number of check-node operations significantly. In particular, the SC decoder uses 12 check-node operations for decoding 8 bits for each of the afore-mentioned cases. In our case, however, barring Case 4 and Case 5, check-node operations are not used. Furthermore, the decoders of Case 4 and Case 5 use only 4 check-node operations, which are implemented in parallel. As such, compared with the SC decoder, where 12 check-node operations are used and are carried out sequentially, the proposed decoders require fewer check-node operations and are faster as these operations can be implemented in parallel.
B. Block Size 16 Encoders/Decoders
Having discussed all the possible cases for $R = 8$, we now consider polar codes for $R = 16$. Similar to the $R = 8$ case, we can significantly reduce the required number of parallel encoders/decoders (from 168 to just 21). Table II lists these 21 cases along with the $d_{\min}$ of the corresponding codes. Here, we have used hexadecimal indexes for the sake of brevity. The appendix provides a detailed description of the proposed decoders for the cases tabulated in Table II.
It is worth noting that at least 17 encoders/decoders are required for encoding/decoding flexible-rate polar codes. So only four extra encoders/decoders are needed to ensure that the polar codes designed for any rate, length and channel can be encoded/decoded. The following theorems assert that the encoders/decoders for $f$ = FFC0, $f$ = FF80, $f$ = FCC0, and $f$ = C0C0 are not required when polar codes are designed for a binary-erasure channel (BEC) or by the Huawei formula [7].

Proof: This assertion can be proved by noting that, regardless of the value of the erasure probability, the aforementioned four cases do not occur when $N = 16$. The result immediately follows by noting that the frozen-bit sequence for each 16-bit block is generated for a BEC [1].
Recently, in the 3GPP RAN1 #87 meeting, an agreement was reached to use variable-rate polar codes for the uplink control channel [7], [9]. Since polar-code design is channel dependent, and the location of frozen bits varies with the channel conditions, Huawei presented a channel-independent reliability metric for constructing polar codes [7]. In particular, each polarized bit channel, $W_N^{(j)}$, is assigned a reliability metric, $Q_j$, computed as

$$Q_j = \sum_{k=1}^{n} j_k \, 2^{(k-1)/4},$$

where $j_k$ is the $k$th least-significant bit in the $n$-bit binary representation of $j$, i.e., $(j)_2 = j_n j_{n-1} \cdots j_1$.

TABLE II (partial)

| Case | $f$ | $x$ | $d_{\min}$ |
|---|---|---|---|
| | | $x_{7BC}, x_{7BD}, x_{7BE}, x_{7BF}, x_{7CF}, x_{7DF}, x_{7EF}, x_7, x_{BCF}, x_{BDF}, x_{BEF}, x_B, x_C, x_D, x_E, x_F$ | 4 |
| 7 | FF80 | $x_{9ABCDEF}, x_9, x_A, x_B, x_C, x_D, x_E, x_F, x_{9ABCDEF}, x_9, x_A, x_B, x_C, x_D, x_E, x_F$ | |
| | | $\cdots, x_A, x_B, x_C, x_D, x_E, x_F$ | 2 |
| | C0C0 | $x_{246}, x_{357}, x_2, x_3, x_4, x_5, x_6, x_7, x_{ACE}, x_{BDF}, x_A, x_B, x_C, x_D, x_E, x_F$ | 2 |
| 13 | E000 | $x_{3478BCF}, x_{3579BDF}, x_{367ABEF}, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_A, x_B, x_C, x_D, x_E, x_F$ | 2 |
| 14 | C000 | $x_{2468ACE}, x_{3579BDF}, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_A, x_B, x_C, x_D, x_E, x_F$ | 2 |
| 15 | 8000 | $x_{123456789ABCDEF}, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_A, x_B, x_C, x_D, x_E, x_F$ | 2 |
| 16 | 0000 | | |
The following theorem asserts that the extra cases are not required when polar codes are constructed by the Huawei formula.

Proof: Observe that

$$Q_j = \underbrace{\sum_{k=1}^{4} j_k \, 2^{(k-1)/4}}_{T_1} + \underbrace{\sum_{k=5}^{n} j_k \, 2^{(k-1)/4}}_{T_2}.$$

Next, we partition $\{\!\{N\}\!\}$ into $N/16$ consecutive groups, each containing 16 numbers. We denote them by $G_i$, where $G_i = \{16i, 16i+1, \cdots, 16i+15\}$ and $i = 0, 1, \cdots, N/16-1$.
Further, observe that the $Q_j$ of all the elements in $G_i$ have exactly the same value of $T_2$. As such, the inclusion of $j \in G_i$ in $\tilde{A}_i$ depends solely on $T_1$, or the last four bits of $(j)_2$. Letting $\tilde{j}$ denote the decimal number corresponding to the last 4 bits of $j$, we observe that the value of $T_1$ decreases monotonically when $\tilde{j}$ takes on values from $Q_0^{15} = (15, 14, 13, 11, 7, 12, 10, 9, 6, 5, 3, 8, 4, 2, 1, 0)$ left to right. As such, for Case $k$ ($k = 0, 1, \cdots, 16$), the first $k$ bit-channel indexes in $Q_0^{15}$ are selected for information transfer, and the rest are frozen to zero. Consequently, only 17 unique cases will occur, and the afore-mentioned four extra cases will not occur.
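The ordering $Q_0^{15}$ and the resulting 17 frozen-bit sequences can be reproduced numerically. The sketch below assumes that the reliability metric has the polarization-weight form $Q_j = \sum_k j_k 2^{(k-1)/4}$ and that the hexadecimal sequence lists bit channel 0 first; both are assumptions made for illustration, and the helper names are hypothetical.

```python
def t1(j):
    """Contribution of the four least-significant bits of j to the metric."""
    return sum(((j >> k) & 1) * 2 ** (k / 4) for k in range(4))


# Indexes 0..15 sorted from most to least reliable.
ORDER = sorted(range(16), key=t1, reverse=True)


def frozen_hex(k):
    """Frozen-bit sequence (hex) when the k most reliable indexes carry information."""
    bits = ["1"] * 16
    for j in ORDER[:k]:
        bits[j] = "0"
    return format(int("".join(bits), 2), "04X")


# One sequence per case k = 0, 1, ..., 16.
SEQUENCES = [frozen_hex(k) for k in range(17)]
```

Under these assumptions, none of FFC0, FF80, FCC0, and C0C0 appears in `SEQUENCES`, in line with the claim that the four extra cases do not occur.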
IV. RESULTS
In this section, we compare the proposed decoding strategy with the SC decoder in terms of the bit-error-rate and decoding latency performances.

A. BER Performance

Fig. 4 compares the BER performance of the proposed 8-bit parallel decoders with that of the SC decoder. Both optimal and low-complexity sub-optimal decoders are implemented for the proposed decoders. Here, the polar codes are constructed for the binary-erasure channel with the erasure probability of $e^{-1} \approx 0.37$ [22], [23]. Some interesting observations can be made by analyzing the figure. First, the proposed decoders do not deteriorate the performance of the SC decoder. Second, the performance gap between the optimal and the sub-optimal decoders is negligibly small, which implies that the low-complexity decoders can be used instead of the optimal ones. Last, the proposed schemes can be used for both systematic and non-systematic polar codes.
B. Decoding Latency
One way to approximate the decoding latency of different polar decoders is as follows. We presume that bit operations and addition/subtraction of real numbers can be carried out in one clock cycle, whereas a check-node operation takes $T_c$ clock cycles and finding the minimum of a list takes $T_m$ clock cycles. Note that finding a minimum requires significantly fewer computations than performing a check-node operation [24].
For a block of $R$ bits, the SC decoder performs $R/2$ check-node operations in each of $\log_2(R)$ stages. All the check-node operations can be performed in parallel in the first stage, whereas the last stage requires all check-node operations to be performed sequentially. In general, the number of parallel check-node operations performed in the $m$th stage is $R/2^m$, where $m = 1, 2, \ldots, \log_2(R)$. Consequently, a decoding latency of $T_c (R/2 + R/4 + \cdots + 1) = (R-1) T_c$ cycles is incurred in performing check-node operations for the SC decoder. Similarly, the numbers of binary and real additions involved in decoding $R$ bits can be shown to be 1 and $R-1$, respectively. Therefore, an SC decoder takes $R + (R-1) T_c$ clock cycles to decode a block of $R$ bits. Note that this decoding latency is fixed regardless of the frozen-bit sequence.
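The stage-by-stage count can be reproduced with a short sketch. It assumes the reading that stage $m$ processes its $R/2$ check-node operations in groups of $R/2^m$, so that the stage needs $2^{m-1}$ sequential steps of $T_c$ cycles each; the function names are illustrative.

```python
def sc_check_node_latency(R, Tc):
    """Cycles spent on check-node operations, summed over the log2(R) stages."""
    stages = R.bit_length() - 1          # log2(R) for R a power of two
    total = 0
    for m in range(1, stages + 1):
        sequential_steps = 2 ** (m - 1)  # R/2 operations, R/2**m of them at a time
        total += sequential_steps * Tc
    return total

def sc_latency(R, Tc):
    """Total SC latency: R cycles of additions plus the check-node cycles."""
    return R + sc_check_node_latency(R, Tc)

if __name__ == "__main__":
    for R in (8, 16):
        print(R, sc_latency(R, Tc=3))
```

For example, with $T_c = 3$, a block of $R = 8$ bits takes $8 + 7 \cdot 3 = 29$ cycles, in agreement with the closed form $R + (R-1) T_c$.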
For the proposed schemes, the decoding latency varies depending on the frozen-bit sequence. For example, for $R = 8$, Case 0 can be executed instantaneously, whereas Case 4 incurs the highest decoding latency of $1 + \max\{T_c, T_m\}$ cycles, which is calculated as follows. The check-node operations are performed in parallel, so they take only $T_c$ cycles for execution. Meanwhile, two outputs can be generated corresponding to $x_8 = 0$ and $x_8 = 1$ by two Wagner decoders ($T_m + 1$ clock cycles), and, depending on the value of $x_8$ (computation time is 1 cycle), one of the outputs is selected. Therefore, the decoding latency for Case 4 is $\max\{1 + T_c, 1 + T_m\}$.
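As a small illustration of this calculation, the sketch below evaluates the two concurrent paths of Case 4 ($R = 8$) and compares the result with the SC latency $R + (R-1) T_c$. The function names and the sample values of $T_c$ and $T_m$ are assumptions made for illustration.

```python
def case4_latency_r8(Tc, Tm):
    """Worst-case latency of the proposed R = 8 decoder (Case 4)."""
    check_path = Tc + 1    # parallel check nodes, then 1 cycle to decide x_8
    wagner_path = Tm + 1   # two Wagner decoders running side by side
    return max(check_path, wagner_path)

def sc_latency(R, Tc):
    """SC decoder latency for a block of R bits."""
    return R + (R - 1) * Tc

if __name__ == "__main__":
    Tc, Tm = 3, 1  # hypothetical cycle counts
    print("Case 4:", case4_latency_r8(Tc, Tm), "SC:", sc_latency(8, Tc))
```

With the hypothetical values $T_c = 3$ and $T_m = 1$, Case 4 needs 4 cycles, whereas the SC decoder needs 29 cycles for the same block.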
Observe that even the highest decoding latency of the proposed decoders is much smaller than that of the SC decoder. Hence, our proposed scheme will significantly improve the decoding speed of variable-rate polar codes. Table III shows the decoding latency, $L$, of the proposed low-complexity decoders for blocks of length $R = 8$ and $R = 16$ bits. In every case, the proposed decoders have a significantly lower decoding latency than the SC decoder, which requires $R + (R-1) T_c$ cycles regardless of the case.
V. CONCLUSION
In this work, we presented fast 8-bit and 16-bit parallel decoders that can reduce the depth of the decoding tree of variable-rate and variable-length polar codes. They can reduce both the decoding latency and hardware complexity without deteriorating the bit-error-rate performance of the successive-cancellation decoder.
APPENDIX
BLOCK SIZE 16 DECODERS
This appendix details the proposed low-complexity parallel decoders for the cases tabulated in Table II.
1) Case 0: The optimal decoder assigns an all-zero vector to the output.
2) Case 1: This is a repetition code, and the optimal decoder makes a hard decision on the sum of the LLRs of the received bits.
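A minimal sketch of this repetition decoder is given below. It assumes the common convention that a positive LLR favors bit 0; the function name is illustrative.

```python
def decode_repetition(llrs):
    """Hard decision on the LLR sum; the decided bit is copied to every position."""
    total = sum(llrs)
    bit = 1 if total < 0 else 0   # ASSUMPTION: positive LLR means bit 0
    return [bit] * len(llrs)

if __name__ == "__main__":
    received = [-2.0] + [0.1] * 15   # one strong vote for 1, fifteen weak votes for 0
    print(decode_repetition(received))
```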
3) Case 2: This is a (16,2) repetition code, and the optimal ML estimates of $x_E$ and $x_F$ can be found by making hard decisions on the LLR sums of even-indexed and odd-indexed bits, respectively.
4) Case 3: This is a (4,3) SPC code concatenated with a (16,4) repetition code. The optimal decoder will first add the LLRs of the received bits corresponding to $x_{DEF}$, $x_D$, $x_E$, and $x_F$ before finding their hard estimates with the Wagner decoder.
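Cases 2 and 3 can be sketched as follows. The sketch assumes that a positive LLR favors bit 0 and that, in Case 3, the bits equal to $x_{DEF}$, $x_D$, $x_E$, and $x_F$ sit at indices congruent to 0, 1, 2, and 3 modulo 4, respectively (this grouping is an assumption based on a natural-order Kronecker construction and is not stated in this appendix). The Wagner rule used here is the usual one: take hard decisions and, if the parity check fails, flip the least reliable bit.

```python
def hard(llr):
    """Hard decision (ASSUMPTION: positive LLR means bit 0)."""
    return 1 if llr < 0 else 0

def wagner(llrs):
    """Wagner decoding of an even-parity (SPC) code."""
    bits = [hard(v) for v in llrs]
    if sum(bits) % 2:                       # parity violated
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1                  # flip the least reliable position
    return bits

def decode_case2(y):
    """Case 2: one decision for even-indexed bits, one for odd-indexed bits."""
    even, odd = hard(sum(y[0::2])), hard(sum(y[1::2]))
    return [even if i % 2 == 0 else odd for i in range(16)]

def decode_case3(y):
    """Case 3: sum LLRs per residue class mod 4 (assumed grouping), then Wagner."""
    sums = [sum(y[r::4]) for r in range(4)]
    return wagner(sums) * 4                 # repeat the 4-bit SPC word four times

if __name__ == "__main__":
    word = [1, 0, 1, 0] * 4
    y = [-2.0 if b else 2.0 for b in word]
    y[0] = 0.5                              # one unreliable, wrong-signed LLR
    print(decode_case3(y) == word)
```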
5) Case 4: This is an (8,4) extended Hamming code concatenated with a (16,8) repetition code. Decoders mentioned in Case 4 for $R = 8$ can be used after adding the LLRs of the first half to those of the second half.
6) Case 5: Although an exhaustive-search-based ML decoder can be implemented, we can reduce the decoding complexity by introducing a new variable, $z = x_7 + x_F$. By noting that $x_0 + x_8 = x_1 + x_9 = \cdots = x_7 + x_F = z$, we first find $\hat{z}$, a hard-decision estimate of $z$. Specifically, we add $y_0 \boxplus y_8, y_1 \boxplus y_9, \ldots, y_7 \boxplus y_F$, where $\boxplus$ denotes the check-node operation, to get an LLR for $z$ and make a hard decision on this LLR to compute $\hat{z}$.
Afterwards, the decoder computes $y_8 \pm y_0, y_9 \pm y_1, \ldots, y_F \pm y_7$, where addition is performed when $\hat{z} = 0$, and subtraction otherwise. These values are then input to one of the decoders of Case 4 for $R = 8$ to get estimates of $x_8, x_9, \ldots, x_F$. Lastly, $\hat{z}$ is added to the estimates of $x_8, x_9, \ldots, x_F$ to compute the estimates of $x_0, x_1, \ldots, x_7$.
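The Case 5 procedure can be sketched as below. Several choices are assumptions made for illustration: a positive LLR favors bit 0, the check-node operation is replaced by its min-sum approximation, and an exhaustive ML search over the 16 codewords stands in for the Case 4 ($R = 8$) decoder, which is not reproduced in this appendix. The combined LLRs are formed as $y_{i+8} \pm y_i$ so that they refer to $x_{i+8}$.

```python
from itertools import product

def hard(llr):
    return 1 if llr < 0 else 0            # ASSUMPTION: positive LLR means bit 0

def check_node(a, b):
    """Min-sum approximation of the check-node operation (a stand-in)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))

def polar_encode(u):
    """x_i is the XOR of all u_j whose index j contains the set bits of i."""
    n = len(u)
    return [sum(u[j] for j in range(n) if j & i == i) % 2 for i in range(n)]

# Codewords of the (8,4) code with information bits u_3, u_5, u_6, u_7.
HAMMING_8_4 = []
for info in product([0, 1], repeat=4):
    u8 = [0] * 8
    for idx, b in zip((3, 5, 6, 7), info):
        u8[idx] = b
    HAMMING_8_4.append(polar_encode(u8))

def decode_hamming_ml(llrs):
    """Exhaustive ML search: maximise sum((1 - 2*c_i) * llr_i) over codewords."""
    return max(HAMMING_8_4,
               key=lambda c: sum((1 - 2 * b) * v for b, v in zip(c, llrs)))

def decode_case5(y):
    z_llr = sum(check_node(y[i], y[i + 8]) for i in range(8))
    z_hat = hard(z_llr)
    sign = -1.0 if z_hat else 1.0         # add when z_hat = 0, subtract otherwise
    combined = [y[i + 8] + sign * y[i] for i in range(8)]
    upper = decode_hamming_ml(combined)   # estimates of x_8 .. x_F
    lower = [b ^ z_hat for b in upper]    # estimates of x_0 .. x_7
    return lower + upper

if __name__ == "__main__":
    u = [0] * 16
    for j in (7, 13, 15):
        u[j] = 1
    x = polar_encode(u)
    y = [-2.0 if b else 2.0 for b in x]
    y[3] *= -0.25                         # inject one weak error
    print(decode_case5(y) == x)
```

Running the two hypotheses $\hat{z} = 0$ and $\hat{z} = 1$ through `decode_hamming_ml` concurrently and selecting one result afterwards would mirror the latency reduction described for this case.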
Implementing two parallel decoders corresponding to $z$ equalling 0 and 1 and choosing the appropriate output after computing $\hat{z}$ can reduce the decoding latency.
7) Case 6: For the first case, $f$ = FFC0, an optimal decoder can be implemented by noting that the codeword comprises two concatenated (4,3) SPC and (8,4) repetition codes. Two separate Wagner decoders can be used to find hard estimates of the transmitted bits after adding the LLRs of the repeated bits.
For the second case, f = FEE0, the decoder presented in Case 5 (R = 16) can be used to decode the received LLRs with the exception that the decoder of Case 4 (R = 8) is replaced with the low-complexity decoder of Case 5 (R = 8). 
8) Case 7: The first case, $f$ = FF80, corresponds to a concatenated (8,7) SPC and (16,8) repetition code. Therefore, the code can be decoded optimally by adding the LLRs of $x_0, x_1, \ldots, x_7$ to those of $x_8, x_9, \ldots, x_F$ and finding hard estimates by the Wagner decoder.
For the second case, f = FEC0, the decoder presented in Case 5 (R = 16) can be used to decode the received LLRs with the exception that the low-complexity decoder of Case 6 (R = 8) replaces the decoder of Case 4 (R = 8).
9) Case 8: For the first case, f = FE80, the decoder presented in Case 5 (R = 16) can be used to decode the received LLRs with the exception that the Wagner decoder is used instead of the decoder of Case 4 (R = 8).
For the second case, $f$ = FCC0, the even-indexed and the odd-indexed bits constitute two separate (8,4) extended Hamming codes. Therefore, the decoders for Case 4 ($R = 8$) can be used to decode the received code vector.
10) Case 9: A low-complexity decoder can be implemented by defining $z_0 = x_6 + x_E$ and $z_1 = x_7 + x_F$. Observe that $x_0 + x_8 = x_2 + x_A = x_4 + x_C = x_6 + x_E = z_0$, and $x_1 + x_9 = x_3 + x_B = x_5 + x_D = x_7 + x_F = z_1$. The proposed decoder makes hard decisions on $y_0 \boxplus y_8 + \cdots + y_6 \boxplus y_E$ and $y_1 \boxplus y_9 + \cdots + y_7 \boxplus y_F$, where $\boxplus$ denotes the check-node operation, and assigns the results to $\hat{z}_0$ and $\hat{z}_1$, the hard estimates of $z_0$ and $z_1$, respectively. Afterwards, additional LLRs for $x_8, x_9, \ldots, x_F$ are computed by using $\hat{z}_0$ and $\hat{z}_1$. For example, the additional LLR of $x_8$ is $y_0$ when $\hat{z}_0$ is 0 and $-y_0$ otherwise. After adding the additional LLRs to $y_8, y_9, \ldots, y_F$, the Wagner decoder is used to find hard estimates of $x_8, x_9, \ldots, x_F$, which along with $\hat{z}_0$ and $\hat{z}_1$ are used to estimate $x_0, x_1, \ldots, x_7$.
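The Case 9 steps can be sketched as follows, assuming that a positive LLR favors bit 0 and using the min-sum approximation in place of the exact check-node operation; the function names are illustrative.

```python
def hard(llr):
    return 1 if llr < 0 else 0            # ASSUMPTION: positive LLR means bit 0

def check_node(a, b):
    """Min-sum approximation of the check-node operation (a stand-in)."""
    sign = -1.0 if (a < 0) != (b < 0) else 1.0
    return sign * min(abs(a), abs(b))

def wagner(llrs):
    """Hard decisions; flip the least reliable bit if even parity fails."""
    bits = [hard(v) for v in llrs]
    if sum(bits) % 2:
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1
    return bits

def decode_case9(y):
    # z_0 is shared by the even-indexed pairs, z_1 by the odd-indexed pairs.
    z0_llr = sum(check_node(y[i], y[i + 8]) for i in (0, 2, 4, 6))
    z1_llr = sum(check_node(y[i], y[i + 8]) for i in (1, 3, 5, 7))
    z_hat = [hard(z0_llr), hard(z1_llr)]
    # Additional LLR for x_{i+8}: +y_i if the matching z estimate is 0, else -y_i.
    updated = [y[i + 8] + (-y[i] if z_hat[i % 2] else y[i]) for i in range(8)]
    upper = wagner(updated)                            # x_8 .. x_F
    lower = [upper[i] ^ z_hat[i % 2] for i in range(8)]  # x_0 .. x_7
    return lower + upper

if __name__ == "__main__":
    upper = [1, 0, 0, 1, 0, 0, 0, 0]      # even-parity word for x_8 .. x_F
    z = (1, 0)
    x = [upper[i] ^ z[i % 2] for i in range(8)] + upper
    y = [-2.0 if b else 2.0 for b in x]
    y[9] = -0.3                           # one weak error
    print(decode_case9(y) == x)
```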
The decoding latency can be reduced by implementing four Wagner decoders (corresponding to the four possible values of $(\hat{z}_0, \hat{z}_1)$) and using the computed value of $(\hat{z}_0, \hat{z}_1)$ to select the output of the corresponding Wagner decoder.
11) Case 10: A low-complexity decoder can be implemented by defining four variables: $z_0 = x_0 + x_8 = x_4 + x_C$, $z_1 = x_1 + x_9 = x_5 + x_D$, $z_2 = x_2 + x_A = x_6 + x_E$, and $z_3 = x_3 + x_B = x_7 + x_F$. The decoder will first compute $\hat{z}_i$, the hard estimates of $z_i$, where $i = 0, 1, 2, 3$, from the LLRs obtained by adding the outputs of the check-node operations. For example, the LLR of $z_0$ is computed by adding $y_0 \boxplus y_8$ and $y_4 \boxplus y_C$, where $y_i$ denotes the LLR of $x_i$ and $\boxplus$ denotes the check-node operation. Depending on the values of $\hat{z}_i$, additional LLRs for $x_8, x_9, \ldots, x_F$ are obtained from $y_0, y_1, \ldots, y_7$. For example, the additional LLR for $x_8$ is $y_0$ when $\hat{z}_0 = 0$ and $-y_0$ when $\hat{z}_0 = 1$. After adding the additional LLRs to the received LLRs, the decoder finds hard estimates of $x_8, x_9, \ldots, x_F$ using the Wagner decoder. Finally, these hard estimates are used along with $\hat{z}_i$ to estimate $x_0, x_1, \ldots, x_7$.
12) Case 11: A low-complexity decoder can be implemented by observing that $z_0, z_1, \ldots, z_7$ constitute an (8,4) extended Hamming code, where $z_0 = x_0 + x_8$, $z_1 = x_1 + x_9$, $\ldots$, $z_7 = x_7 + x_F$. As such, the decoders of Case 4 for $R = 8$ can be used to find $\hat{z}_i$, the estimates of $z_i$ for $i = 0, 1, \ldots, 7$. Then, depending on the values of $\hat{z}_i$, additional LLRs for $x_8, x_9, \ldots, x_F$ are obtained from $y_i$, where $i = 0, 1, \ldots, 7$. For example, the additional LLR for $x_8$ is $y_0$ if $\hat{z}_0 = 0$ and $-y_0$ otherwise. After adding the LLRs, the Wagner decoder is used to compute estimates of $x_8, x_9, \ldots, x_F$. The decoded bits along with $\hat{z}_0, \hat{z}_1, \ldots, \hat{z}_7$ are then used to compute estimates of $x_0, x_1, \ldots, x_7$.
13) Case 12: For the first case, f = E800, the decoder of Case 11 can be used to decode the received LLRs, with the exception that the Wagner decoder is not used at all. Rather, hard decisions are made on the updated LLRs of x 8 , x 9 , · · · , x F .
In the second case, f = C0C0, the code word consists of four separate (4,3) SPC codes, which can be individually decoded by the Wagner rule.
14) Case 13: Introducing a new variable, $z = x_3 + x_7 + x_B + x_F$, results in a cycle-free Tanner graph as shown in Fig. 5. As such, a non-iterative optimal MAP decoder can easily be implemented for this case.
A low-complexity decoder can also be constructed for this code. For example, making a hard decision on z results in four separate SPC codes, which can be decoded by the Wagner rule.
15) Case 14: This code consists of two (8,7) SPC codes, which can be optimally decoded by two Wagner decoders.
16) Case 15: This is a (16,15) SPC code, and the Wagner decoder can be used to decode the received LLRs optimally.
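Cases 14 and 15 both reduce to Wagner decoding of SPC codes, which can be sketched as below. The sketch assumes that a positive LLR favors bit 0. For Case 14, the two (8,7) SPC codes are taken to be the even-indexed and the odd-indexed bits, as suggested by the entries $x_{2468ACE}$ and $x_{3579BDF}$ in Table II.

```python
def wagner(llrs):
    """Wagner rule: hard decisions, then flip the least reliable bit on parity failure."""
    bits = [1 if v < 0 else 0 for v in llrs]   # ASSUMPTION: positive LLR means bit 0
    if sum(bits) % 2:
        weakest = min(range(len(llrs)), key=lambda i: abs(llrs[i]))
        bits[weakest] ^= 1
    return bits

def decode_case15(y):
    """Case 15: a single (16,15) SPC code over all received LLRs."""
    return wagner(y)

def decode_case14(y):
    """Case 14: separate (8,7) SPC codes on even- and odd-indexed bits."""
    out = [0] * 16
    out[0::2] = wagner(y[0::2])
    out[1::2] = wagner(y[1::2])
    return out

if __name__ == "__main__":
    y = [1.0] * 16
    y[4], y[7] = -0.1, -0.3   # one weak error in each SPC code
    print(decode_case14(y))
```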
17) Case 16: The optimal decoder will make hard decisions on the LLRs of the received bits.
