Abstract-Arıkan has shown that systematic polar codes (SPC) outperform nonsystematic polar codes (NSPC). However, the performance gain comes at the price of elevated encoding complexity, i.e., compared to NSPC, the available encoding methods for SPC require more memory and computation. In this letter, we propose an efficient encoding algorithm requiring only $N$ bits of memory and $\frac{N}{2}\log_2 N$ XOR operations. Moreover, the auxiliary variables in the algorithm can share the memory to reduce the extra memory requirement. Furthermore, a parallel 2-bit encoding algorithm is also presented to improve the encoding throughput. Remarkably, we show that parallel encoding can be implemented with the same number of XOR operations and memory bits. Finally, the proposed encoding algorithm can be directly used for NSPC with the same complexity.
I. INTRODUCTION
Polar codes, originally proposed by Arıkan in [1], have attracted enormous interest due to a number of distinctive features. For instance, polar codes have an explicit coding structure and can achieve the capacity of symmetric binary memoryless channels (S-BMC). Moreover, polar codes of finite length yield competitive performance when compared to LDPC [2] and Turbo codes [3], in addition to having low encoding and decoding complexity.
The standard polar codes are in nonsystematic form, where both frozen bits and information bits (also referred to as user bits) are placed on the polarized bit-channels of the polarization structure and the user bits do not appear in the polar codeword. However, some scenarios require the information bits to appear as part of the codeword, such as the famous Turbo codes [4], whose component codes are systematic codes that can exchange information between modules in turbo decoding. To construct systematic polar codes (SPC), Arıkan proposed the idea of shifting the user bits from polarized bit-channels to unpolarized bit-channels [5], which makes the frozen and user bits lie on two different extremes of the polarization structure. Arıkan showed that systematic polar codes outperform nonsystematic polar codes (NSPC) in terms of bit error ratio (BER), and the performance has also been investigated in [6]. Recently, SPC as component codes of concatenated codes have been investigated in [7] and [8].

Compared to the NSPC with the same polarization structure, SPC is inherently more complex. Hence, to facilitate the application of SPC, the key challenge is to find an efficient encoding method. In [5], Arıkan presented a recursive method for SPC encoding with $\alpha \frac{N}{2}\log_2 N$ ($\alpha > 1$) XOR operations, and also suggested using a successive cancellation (SC) decoder as an encoder for SPC. Following this suggestion, an SPC encoder which facilitates easy parallelization was proposed in [10], [11], with the limitations of executing the SC algorithm twice and of constrained frozen bits. Another SPC encoding algorithm, a recursive implementation with an elimination method, was presented in [12].

TABLE I

Algorithm      | […]    | Memory (bits)       | XOR operations
---------------|--------|---------------------|--------------------------
EncoderA [9]   | No     | $N(1+\log_2 N)$     | $\frac{N}{2}\log_2 N$
EncoderB [9]   | Yes    | $2N-1$              | $N(1+\log_2 N)$
EncoderC [9]   | Yes    | $N$                 | $N(1+2\log_2 N)$
NSPC [9]       | Yes/No | $2N$                | $\frac{N}{2}\log_2 N$
Proposed SPC   | No     | $N$                 | $\frac{N}{2}\log_2 N$
Most recently, the authors of [9] proposed three encoding algorithms for SPC with memories of $N(1+\log_2 N)$, $2N-1$ and $N$ bits and with $\frac{N}{2}\log_2 N$, $N(1+\log_2 N)$ and $N(1+2\log_2 N)$ XOR operations, respectively. However, the major drawback of the encoding methods discussed above is their high requirement on memory or computation, which may not suit devices with small size and limited power. Motivated by this, in this letter, we propose a new efficient encoding algorithm for SPC requiring only $N$ bits of memory (excluding the input/output) and $\frac{N}{2}\log_2 N$ XOR operations. To the best of the authors' knowledge, the proposed algorithm requires the minimum memory as well as the minimum number of XOR operations among the known encoding methods, as illustrated in Table I. In addition, to further improve the encoding throughput, a parallel 2-bit encoding algorithm is also discussed, which shows that parallel encoding can be accomplished without incurring additional cost in terms of XOR operations or memory bits.
II. NONSYSTEMATIC AND SYSTEMATIC POLAR CODES
For polar codes with codeword length $N = 2^n$ ($n \ge 1$) and kernel matrix $F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, the polarization transformation matrix $G$ can be written as $G = F^{\otimes n}$, where $\otimes n$ denotes the $n$-th Kronecker power. Let $u = (u_0, u_1, \cdots, u_{N-1})$ and $x = (x_0, x_1, \cdots, x_{N-1})$ be the bit vectors on the left and right side of the encoder shown in Fig. 1, respectively; then we have

$$x = uG. \qquad (1)$$
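As a small illustration of the transform just defined, the following Python sketch builds $G = F^{\otimes n}$ by repeated Kronecker products and multiplies a bit vector by it over GF(2). The function names `kron_power_F` and `polar_transform` are hypothetical helpers chosen for this example, not part of the letter.

```python
def kron_power_F(n):
    """Return G = F^{(x)n} as a list of rows, with F = [[1, 0], [1, 1]]."""
    F = [[1, 0], [1, 1]]
    G = [[1]]
    for _ in range(n):
        # Kronecker product F (x) G: rows are ordered by (row of F, row of G),
        # columns by (column of F, column of G).
        G = [[f * g for f in f_row for g in g_row]
             for f_row in F for g_row in G]
    return G


def polar_transform(u, G):
    """Compute x = u G with all arithmetic modulo 2."""
    N = len(u)
    return [sum(u[i] * G[i][j] for i in range(N)) % 2 for j in range(N)]


G4 = kron_power_F(2)
x = polar_transform([0, 0, 0, 1], G4)
```

This dense product costs on the order of $N^2$ bit operations and is only meant to make the definition concrete; the butterfly structure of Fig. 1 is what brings the count down to $\frac{N}{2}\log_2 N$ XOR operations.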
For NSPC, u is the only input of the encoder, i.e., u includes both the frozen and user bits, and the encoding is performed from left to right according to the coding structure shown in Fig.1 , where A denotes the index set of the user bits.
However, for SPC, both $u$ and $x$ are inputs of the encoder. To construct systematic polar codes, Arıkan proposed to place the user bits on the right side of the encoder and keep the same indices for the bits, as illustrated in Fig. 1 with $N = 2^3$ and $A = \{1, 3, 5, 6, 7\}$, where a left extreme node with a hollow arrow denotes a frozen bit while a right extreme node with a solid arrow denotes a user bit.
The index set for the frozen bits is the complement of $A$, i.e., $A^c = \{0, 1, \cdots, N-1\} \setminus A$. Now, denote by $u_A$ and $u_{A^c}$ the bit vectors with elements $u_i$, $i \in A$ and $i \in A^c$, respectively, and define $x_A$ and $x_{A^c}$ similarly. Then, Equation (1) can be rewritten as

$$x_A = u_A G_{AA} + u_{A^c} G_{A^cA}, \qquad x_{A^c} = u_A G_{AA^c} + u_{A^c} G_{A^cA^c}, \qquad (2)$$
where $G_{AA^c}$ is the sub-matrix of $G$ with elements $G_{i,j}$, $i \in A$ and $j \in A^c$, and $G_{AA}$, $G_{A^cA}$ and $G_{A^cA^c}$ are defined in the same fashion. The objective of SPC encoding is to obtain $x_{A^c}$ given the inputs $u_{A^c}$ and $x_A$.
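Since the matrix $G_{AA}$ is invertible, $x_{A^c}$ can be computed by [5]:

$$x_{A^c} = \left(x_A + u_{A^c} G_{A^cA}\right) G_{AA}^{-1} G_{AA^c} + u_{A^c} G_{A^cA^c}, \qquad (3)$$

where all operations are over GF(2).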
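For readers who want the intermediate step, the computation of $x_{A^c}$ can be derived from the block partition of $x = uG$ as sketched below. The only facts used are that addition and subtraction coincide over GF(2) and that $G_{AA}$ is invertible (it is a principal sub-matrix of the lower-triangular matrix $G$, whose diagonal entries are all 1).

```latex
\begin{aligned}
x_A &= u_A G_{AA} + u_{A^c} G_{A^cA}
  &&\Longrightarrow\quad u_A = \left(x_A + u_{A^c} G_{A^cA}\right) G_{AA}^{-1}, \\
x_{A^c} &= u_A G_{AA^c} + u_{A^c} G_{A^cA^c}
  &&\Longrightarrow\quad x_{A^c} = \left(x_A + u_{A^c} G_{A^cA}\right) G_{AA}^{-1} G_{AA^c} + u_{A^c} G_{A^cA^c}.
\end{aligned}
```

When the frozen bits are all zero ($u_{A^c} = 0$), this reduces to $x_{A^c} = x_A G_{AA}^{-1} G_{AA^c}$.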
As discussed in the Introduction, the known methods in the literature for computing $x_{A^c}$ require relatively large memory and heavy computation. Motivated by this, the main objective of this letter is to find an efficient algorithm to compute $x_{A^c}$.
III. EFFICIENT ENCODING ALGORITHM FOR SPC
The proposed encoding algorithm is similar to the algorithm EncoderA in [9], where the encoding is implemented from the bottom horizontal connection to the top horizontal connection, and the calculation for each horizontal connection starts from a known node (i.e., one of the elements of $x_A$ or $u_{A^c}$) and moves from one node to the next until it reaches the other side of the polarization structure, as shown in Fig. 1. Because each node is allocated one bit of memory in EncoderA, the total memory requirement of EncoderA is $N(1+\log_2 N)$ bits. The key feature of the proposed encoding algorithm is that it reduces the memory requirement from $N(1+\log_2 N)$ bits to $N$ bits while maintaining the same computation load, i.e., $\frac{N}{2}\log_2 N$ XOR operations.
A. Encoding algorithm
To exploit the recursive nature of polar codes, the polarization structure with $N = 2^n$ is divided into $n$ layers labeled $0, 1, \cdots, n-1$ from right to left, as shown in Fig. 1. For layer $\lambda$ ($\lambda = 0, 1, \cdots, n-1$), the $N$ nodes (including the corresponding operations) are separated into $2^{(n-1)-\lambda}$ blocks from top to bottom, and each block contains $2^{\lambda+1}$ elements. As an illustration, the dashed box in the bottom-right corner of Fig. 1 represents one of the blocks in layer 0.
To analyze the memory requirement of the encoding algorithm, let us first consider the blocks in the same layer. The key observation is that the blocks in the same layer are independent: no information is exchanged between them, so the memory used for one block can be recycled for the other blocks in the same layer as the encoding proceeds from the bottom up. In addition, a close inspection reveals that, in each block, only the lower half of the elements needs to be stored in memory for the encoding process, and the outcomes of the XOR operations in the upper half can be stored in the associated lower half. It follows that only $2^{\lambda}$ bits of memory are required for the encoding process in layer $\lambda$. Hence, the total memory required over all layers is $2^n - 1 = N - 1$ bits.

To corroborate the above argument, let us consider the following illustrative example. For notational convenience, we define $D_{a_\lambda,\lambda}$ as the memory address used for layer $\lambda$, where $a_\lambda$ ($a_\lambda = 0, 1, \cdots, 2^{\lambda}-1$) is the index of the bit within that layer's memory. We focus on the bottom block in layer 0, highlighted by the dashed box in Fig. 1. In this block, $x_6$ and $x_7$ are the known bits, and $x_7$ is processed first. The first step is to copy $x_7$ to $D_{0,0}$, and then to $D_{1,1}$ and $D_{3,2}$. Since the value stored in $D_{0,0}$ is the same as $x_7$, the XOR operation between $x_6$ and $x_7$ can be replaced with $x_6 \oplus D_{0,0}$, and the outcome can be stored in $D_{0,0}$, since the previous value in $D_{0,0}$ is obsolete. The next step is to copy the value of $D_{0,0}$ to $D_{0,1}$. Once this is done, $D_{0,0}$ is released, since its value is no longer required in the remaining process, and it can be recycled for the next block in layer 0. As such, only $1 = 2^0$ bit of memory is required for the encoding process in layer 0. Finally, the above process extends to the other layers.
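The per-layer count adds up as a geometric series, which can be written out as follows (the $N = 8$ instance matches the structure of Fig. 1):

```latex
\sum_{\lambda=0}^{n-1} 2^{\lambda} = 2^{n} - 1 = N - 1,
\qquad \text{e.g. } n = 3:\quad 2^{0} + 2^{1} + 2^{2} = 1 + 2 + 4 = 7 = N - 1 .
```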
In the encoding process of SPC, there exist two different operations, namely, direct copying and the XOR operation. Hence, it is of significant interest to have a fast method for determining the proper operation. Here, we present a simple method to address this issue. Let $\phi$ ($\phi = 0, 1, \cdots, N-1$) denote the index of the current horizontal connection, counted from top to bottom, and denote the binary expression of $\phi$ as $b_{n-1} \cdots b_0$. Then, direct copying is performed in layer $\lambda$ when $b_\lambda = 1$, and the XOR operation occurs when $b_\lambda = 0$, as illustrated in Fig. 1.
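The selection rule can be sketched in a few lines of Python. The function name `layer_operations` and the string labels are hypothetical and chosen only for this illustration.

```python
def layer_operations(phi, n):
    """List the operation used in layers 0..n-1 for horizontal connection phi.

    Bit b_lam of phi selects the operation in layer lam:
    1 means direct copying, 0 means XOR.
    """
    return ["copy" if (phi >> lam) & 1 else "xor" for lam in range(n)]


ops_7 = layer_operations(7, 3)  # phi = 7 = (111)_2
ops_6 = layer_operations(6, 3)  # phi = 6 = (110)_2
```

For $N = 8$, connection 7 is copied through all three layers, while connection 6 is XORed in layer 0 and copied in layers 1 and 2, which agrees with the walk-through of $x_7$ and $x_6$ given earlier.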
For the propagation from left to right, the case $b_\lambda = 0$ is more complex. Let us take the information propagation of $u_0$ as an example. Since both $u_0$ and $D_{0,2}$ are needed at the same time for the XOR operation $u_0 \oplus D_{0,2}$, we cannot copy $u_0$ to $D_{0,2}$. To circumvent this problem, we introduce a temporary variable $t$ and set $t = u_0$. The XOR operation $u_0 \oplus D_{0,2}$ can then be replaced with $t \oplus D_{0,2}$, and the outcome can be assigned to $t$ as well, i.e., $t = t \oplus D_{0,2}$. When $b_\lambda = 1$, the direct copying operation copies $t$ into $D_{a_\lambda,\lambda}$, so $t$ and $D_{a_\lambda,\lambda}$ share the same value after the copy, and $t$ can be used in the following operations instead of $D_{a_\lambda,\lambda}$. Owing to the introduction of the temporary variable, the total memory required by the proposed encoding process is $N$ bits. It is also worth emphasizing that the number of XOR operations in the proposed encoding process is only $\frac{N}{2}\log_2 N$.

The pseudocode of the proposed SPC encoding is listed in Algorithm 1, where $\leftarrow$ is the assignment operator. Lines 6-15 are for the propagation from right to left, and lines 17-27 are for the propagation from left to right. After each propagation, $a_\lambda$ is updated for the next propagation, as shown in lines 28-31.
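The following Python sketch is written from the prose description of the encoding process, not from Algorithm 1 itself, so it should be read as one possible rendering under stated assumptions. It assumes that the $2^{\lambda}$ bits of layer $\lambda$ are packed at offset $2^{\lambda}-1$ of a single $(N-1)$-bit array `D`, that the temporary bit `t` is used in both propagation directions (the text uses $D_{0,0}$ as scratch when moving right to left), and that frozen bits default to zero. The name `spc_encode` and its signature are hypothetical.

```python
def spc_encode(n, info_set, user_bits, frozen_bits=None):
    """Systematic polar encoding with N-1 stored bits plus one temporary bit.

    Returns (x, u, xor_count); x and u are the output/input vectors and are
    not counted as working memory.
    """
    N = 1 << n
    info = sorted(info_set)
    frozen = [i for i in range(N) if i not in set(info)]
    if frozen_bits is None:
        frozen_bits = [0] * len(frozen)  # assumption: all-zero frozen bits
    x, u = [0] * N, [0] * N
    for i, b in zip(info, user_bits):
        x[i] = b                         # user bits sit on the right side
    for i, b in zip(frozen, frozen_bits):
        u[i] = b                         # frozen bits sit on the left side

    D = [0] * (N - 1)                    # layer lam uses D[2^lam - 1 : 2^(lam+1) - 1]
    xor_count = 0
    info_lookup = set(info)
    for phi in range(N - 1, -1, -1):     # bottom connection first
        right_to_left = phi in info_lookup
        t = x[phi] if right_to_left else u[phi]
        layers = range(n) if right_to_left else range(n - 1, -1, -1)
        for lam in layers:
            addr = (1 << lam) - 1 + (phi & ((1 << lam) - 1))  # offset + a_lam
            if (phi >> lam) & 1:         # b_lam = 1: direct copying
                D[addr] = t
            else:                        # b_lam = 0: XOR with the stored lower-half bit
                t ^= D[addr]
                xor_count += 1
        if right_to_left:
            u[phi] = t
        else:
            x[phi] = t
    return x, u, xor_count


x, u, xors = spc_encode(3, [1, 3, 5, 6, 7], [1, 0, 1, 1, 0])
```

The stored bit read at a given address is never overwritten before it is used, because two connections that share the same low $\lambda$ bits are at least $2^{\lambda}$ apart. Each connection performs one XOR per zero bit of $\phi$, which sums to $\frac{N}{2}\log_2 N$ over all $N$ connections.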
Comparing the pseudocode of EncoderA in [9] with Algorithm 1, it can be seen that both algorithms work in the same serial fashion and proceed from horizontal connection $N-1$ to horizontal connection 0 one by one, which indicates that both algorithms perform the same number of XOR and direct copying operations. However, our proposed algorithm repeatedly reuses the $N$-bit memory, while EncoderA requires $N(1+\log_2 N)$ bits of memory. Moreover, the XOR operation in the proposed algorithm requires only two operands and is performed in place, while the XOR operation in EncoderA has three operands (one destination and two sources), which may incur extra computation. We will show in the next subsection that updating $b_\lambda$ and $a_\lambda$ does not need extra computation. Therefore, the efficiency of the proposed encoding algorithm is not degraded in comparison to EncoderA in [9].
It is also worth pointing out that the proposed algorithm can be used as an encoding algorithm for NSPC, where the encoding only involves the propagation from left to right based on a polarization structure similar to Fig. 1. This indicates that the minimum requirement of NSPC encoding is also $\frac{N}{2}\log_2 N$ XOR operations and $N$ bits of memory.
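For comparison, the NSPC operation count can be checked with the standard in-place butterfly over the same polarization structure. This is a generic sketch rather than the letter's Algorithm 1: it overwrites a single $N$-bit array layer by layer, from layer $n-1$ down to layer 0, and counts the XOR operations. The name `nspc_encode_in_place` is hypothetical.

```python
def nspc_encode_in_place(u):
    """Compute x = u G on one N-bit array; returns (x, number of XORs)."""
    x = list(u)            # the only working storage: N bits
    N = len(x)
    xors = 0
    h = N // 2             # h = 2^lam, starting at layer n-1 (leftmost)
    while h >= 1:
        for i in range(N):
            if not i & h:  # upper half of a block: XOR with its lower partner
                x[i] ^= x[i + h]
                xors += 1
        h //= 2
    return x, xors


codeword, xor_count = nspc_encode_in_place([0, 0, 0, 1])
```

Each of the $\log_2 N$ layers contributes $N/2$ XORs, so `xor_count` equals $\frac{N}{2}\log_2 N$ for any power-of-two length.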
B. Simplification for SPC encoder
One might think that extra memory bits are needed for $b_\lambda$ and $a_\lambda$, and extra computation for updating $a_\lambda$. In fact, $\phi$, $b_\lambda$ and $a_\lambda$ can share the same memory, and updating $\phi$ alone suffices. We show this in the following.
In a hardware implementation, $\phi$ is expressed in binary as $(b_{n-1} \cdots b_0)_2$, and $b_\lambda$ can be visited bitwise as a switch that selects the corresponding operation for the propagation from right to left (or vice versa). In layer $\lambda$, $2^{\lambda}$ bits of memory are required, which means that $a_\lambda$ must be a $\lambda$-bit number. Note that $(b_{\lambda-1} \cdots b_0)_2$ has the same value (in decimal) as $a_\lambda$ ($\lambda > 0$). Thus, $a_\lambda$ can be obtained by selecting $b_{\lambda-1} \cdots b_0$ from $\phi$ without extra memory, as shown in Fig. 2. In layer 0, $a_0$ is fixed to 0 because only one bit is required.
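In software, the same sharing amounts to reading two bit fields out of the single counter $\phi$. The sketch below uses hypothetical helper names, `switch_bit` and `layer_address`, to make the two selections explicit.

```python
def switch_bit(phi, lam):
    """b_lam: bit lam of phi, the copy/XOR switch for layer lam."""
    return (phi >> lam) & 1


def layer_address(phi, lam):
    """a_lam: the low lam bits of phi, i.e. (b_{lam-1} ... b_0)_2; 0 for lam = 0."""
    return phi & ((1 << lam) - 1)


addresses_for_7 = [layer_address(7, lam) for lam in range(3)]
```

For $\phi = 7$ the addresses are 0, 1 and 3 in layers 0, 1 and 2, matching the cells $D_{0,0}$, $D_{1,1}$ and $D_{3,2}$ used in the example with $x_7$.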
In Algorithm 1, if the full word $u$ is not required, line 15 can be deleted.

IV. PARALLEL 2-BIT ENCODING ALGORITHM

The encoding algorithm described in Algorithm 1 works in a serial fashion. To further improve the encoding throughput, we discuss the implementation of a parallel encoding with 2 bits at a time in this section.
Similar to the previous algorithm, the encoding is processed from bottom to top, as shown in Fig. 1. Let $2\psi$ and $2\psi+1$ denote the indices of the two horizontal connections being processed at a time, where $\psi \in \{0, 1, \cdots, \frac{N}{2}-1\}$. From Fig. 1, it can be found that there are four different cases of information propagation for two known bits in the encoding process, as depicted in Fig. 3. However, it turns out that case (d) never happens for polar codes constructed on symmetric binary memoryless channels (S-BMC).
Proposition 1. For $N = 2^n$ ($n \ge 1$) polar codes with coding structure similar to Fig. 1, the $(2\psi)$ […]

We now elaborate on the parallel encoding algorithm and the corresponding memory requirements, where $\phi = 2\psi$ and $\psi$ decreases from ( […] to denote the two-bit vectors $(D_{a_\lambda,\lambda}, D_{a_\lambda+1,\lambda})$, $(u_\phi, u_{\phi+1})$ and $(x_\phi, x_{\phi+1})$, respectively.
As shown in Fig.1 , in layer 0 of case (a), with parallel processing, the direct copying operation and XOR operation are performed simultaneously. As such, one additional memory bit, denoted as D 1,0 , is required and then the above operations are done as D +1 0,0 ← (x φ ⊕ x φ+1 , x φ+1 ). In layer λ(> 0), it can be noticed that the operation types of the φ-th and (φ + 1)-th user bits are the same and will be decided according to b λ , where the operations of D Consider polar codes with N = 1024 and code rate 1/2 constructed at signal-to-noise ratio 2dB under additive white Gaussian noisy channel as an illustration, the number of case (a), (b) and (c) are 135, 135 and 242 respectively, that is, 754 horizontal propagations will be implemented with our proposed parallel encoding, which can obtain about 36% gain in throughput with comparison to Algorithm 1.
From the above description, it can be noted that a new memory bit, i.e., $D_{1,0}$, is introduced in the parallel 2-bit encoding algorithm, while the temporary variable $t$ in Algorithm 1 is no longer required. Hence, the memory requirement of the parallel 2-bit encoding algorithm is the same as that of Algorithm 1. Also, it can be easily verified that the number of XOR operations remains unchanged.
Because frozen bits and user bits propagate in two opposite directions in SPC encoding, parallel encoding of more than two bits at a time is more complex for SPC, and the storage memory and computation will also increase. This is beyond the scope of this letter and is left for future work.
V. CONCLUSIONS
We have presented an efficient encoding algorithm for SPC which needs $N$ bits of memory and $\frac{N}{2}\log_2 N$ XOR operations, the minimum among the available encoding methods. Our analysis shows that the algorithm can be further simplified by sharing memory. To improve the encoding throughput, a parallel 2-bit encoding algorithm has also been discussed, which shows that parallel encoding can be achieved with the same number of memory bits and XOR operations.
