Rapid progress in digital circuit design has broadened the range of applications in the very large scale integration (VLSI) era. Encoders are among the digital circuits found in all communication systems. Polar encoding is valued mainly for its capacity-achieving property and finds application in communications, sensing and information theory. The code, proposed by Erdal Arikan, is significant because it exhibits no error floor and has a simple architecture for hardware implementation. In this paper, a folded polar encoder is designed starting from the fully parallel architecture and proceeding through its data flow graph, delay requirement calculation, lifetime analysis and register allocation, which results in a VLSI architecture with minimum hardware utilization. The results are simulated for 4-parallel and 8-parallel folded 32-bit polar encoders using Xilinx 14.6 ISIM and implemented on a Virtex 5 field programmable gate array. The fully parallel architecture and the folded architectures are compared on the basis of their resource utilization.
Introduction
The polar code belongs to the class of linear block codes, and its encoding process is characterized by a generator matrix. The generator matrix $G_N$ for code length $N = 2^n$ is obtained as the $n$-th Kronecker power of the kernel matrix $F$ [1]:

$$G_N = F^{\otimes n}, \qquad F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$$
Given the generator matrix, the code word x is computed as $x = u \cdot G_N$, where u denotes the information vector. The information vector u is arranged in natural order, whereas the code vector x is in bit reversed order. The encoding complexity of the straightforward fully parallel encoder architecture is $O(N \log N)$ for a polar code of length N, and the encoder takes $n = \log_2 N$ stages. For $N = 2^5 = 32$, the polar code is implemented with 80 ex-or gates (16 per stage) and processed in five stages, as shown in Figure 1.
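As a minimal sketch of the encoding described above, the following Python builds $G_N$ by repeated Kronecker products and also walks the n stages of the fully parallel structure, counting ex-or operations. It assumes the kernel F = [[1, 0], [1, 1]] and applies no bit-reversal permutation inside the encoder; the function names are illustrative and do not come from the paper.

```python
import numpy as np

F = np.array([[1, 0], [1, 1]], dtype=np.uint8)  # assumed 2x2 kernel


def generator_matrix(n):
    """n-th Kronecker power of the kernel, with entries in GF(2)."""
    G = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        G = np.kron(G, F)
    return G


def encode_butterfly(u):
    """Encode with log2(N) stages of N/2 ex-or operations each.

    Returns the code word and the number of ex-or operations used.
    """
    x = np.array(u, dtype=np.uint8)
    N = x.size
    xor_count = 0
    span = 1  # index distance between the two inputs of an ex-or
    while span < N:
        for start in range(0, N, 2 * span):
            for k in range(start, start + span):
                x[k] ^= x[k + span]
                xor_count += 1
        span *= 2
    return x, xor_count
```

For N = 32 the loop performs 16 ex-or operations in each of five stages, which agrees with the 80 ex-or gates quoted above, and the result equals the product of u and the generator matrix modulo 2.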
In VLSI architecture design, reduction aims at minimizing the number and size of the components. Many techniques are involved in the minimization process; notable ones are the K-map based Boolean expression method and the block optimization method. Pipelining is widely used in architecture design [2] - [4] . The pipelining transformation reduces the effective critical path by introducing pipeline latches along the data path, and this reduction can be exploited either to increase the clock speed or sample speed, or to reduce the power consumption at the same speed. Pipelined structures can be broadly classified into feed forward and feedback types. The feed forward pipelined encoder structure consists of a 2D commutator followed by ex-or and pass gates to achieve high throughput [5] .
The feedback pipelined polar encoder favors high hardware efficiency over high throughput. The number of ex-or gates is equal to the number of processing stages, whereas the number of delay elements is reduced. In parallel processing, multiple outputs are computed in parallel in a clock period, so the effective sampling speed is increased by the level of parallelism [6] . The hardware is replicated so that several inputs can be processed in parallel and several outputs can be produced at the same time.
The polar code architecture is discussed in [7] in terms of a channel combining phase and a channel splitting phase. It incorporates punctured encoding to shorten the length of polar codes, and it is observed that the memory requirements can be reduced in practical applications. Polar codes and the polarization phenomenon have been applied successfully to various problems such as wiretap channels [8] , multiple access channels [9] [10], data compression [11] , and broadcast channels [12] . In addition to the capacity-achieving capability, polar codes have interesting properties such as good error floor performance [13] . This suggests that combining polar coding with other coding schemes could eliminate the shortcomings of both, resulting in a powerful coding paradigm. Furthermore, concatenated codes have many applications, such as deep-space communications, optical transport systems, and magnetic recording channels [14] .
In this paper, parallelism and pipelining are combined to achieve an effective encoder structure with minimum registers. The implementation starts from the conventional fully parallel 32 bit architecture and proceeds through the data flow graph (DFG), delay requirement table, linear lifetime chart and register allocation [15] . The paper is organized as follows. Section 2 describes the design of the four folded 32 bit polar encoder architecture, with each step in detail. Section 3 discusses the architectural design of the eight folded polar encoder. Finally, the comparative results for the fully parallel, four and eight folded architectures with respect to resource utilization are discussed.
Four Parallel Folded 32 Bit Polar Encoder
The polar encoder relies on the principle of channel polarization, a recursive method used to define polar codes. Polar codes are a class of linear codes that can provably achieve the capacity of several classes of channels. The phenomenon of channel polarization includes channel combining and channel splitting. The channel $W_N$ can be characterized by two parameters: the mutual information, which defines the information capacity, and the Bhattacharyya parameter, which measures the reliability of the channel.
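The polarization effect can be illustrated numerically for one simple case. The sketch below assumes a binary erasure channel with erasure probability eps, for which the Bhattacharyya parameter equals eps and one combining and splitting step maps Z to 2Z - Z^2 (the less reliable channel) and Z^2 (the more reliable channel). The choice of channel, the thresholds 0.1 and 0.9, and the function name are assumptions made for illustration; the paper does not fix a channel model.

```python
def bec_bhattacharyya(n, eps):
    """Bhattacharyya parameters of the 2**n synthesized channels of a BEC(eps)."""
    z = [eps]
    for _ in range(n):
        nxt = []
        for v in z:
            nxt.append(2 * v - v * v)  # "minus" channel: less reliable
            nxt.append(v * v)          # "plus" channel: more reliable
        z = nxt
    return z


z = bec_bhattacharyya(5, 0.5)               # 32 synthesized channels
good = sum(1 for v in z if v < 0.1)         # nearly noiseless channels
bad = sum(1 for v in z if v > 0.9)          # nearly useless channels
```

The average of the parameters stays equal to eps at every step, while the individual values drift toward 0 or 1, which is the behavior that channel combining and splitting are meant to produce.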
In synthesizing DSP architectures, it is important to minimize the silicon area of the integrated circuits, which is achieved by reducing the number of functional units, multiplexers and interconnection wires. This in turn may lead to an architecture that uses a large number of registers, so various techniques are used to minimize the register count. The folding transformation reduces the hardware utilization by time multiplexing several operations onto a single functional unit [16] . The DFG of the 32 bit fully parallel polar encoder is shown in Figure 2 .
Folding Transformation
The DFG of the 32 bit polar code is similar to that of the Fast Fourier Transform (FFT), with the kernel matrix operation in place of the butterfly operation. The 4-parallel folded architecture can be realized by placing 2 functional units in each stage, since each functional unit computes two bits at a time. Consider four parallel input sequences in natural order. The initial folding sets for stage 1 are $\{P_0, P_2, P_4, P_6, P_8, P_{10}, P_{12}, P_{14}\}$, $\{P_1, P_3, P_5, P_7, P_9, P_{11}, P_{13}, P_{15}\}$. Here the two functional units of stage 1 execute $P_0$ and $P_1$ simultaneously in the first cycle, and $P_2$ and $P_3$ in the next cycle. A stage whose index s is less than or equal to $\log_2 P$, where P is the level of parallelism, has the same folding set as the previous stage. Stage 2 therefore has the same order as stage 1, since it operates within the same four inputs. For later stages, the folding sets are determined by exploiting the property that a functional unit processes a pair of inputs whose indices differ by $2^{s-1}$ [15] . Thus the folding set of stage 2 is $\{Q_0, Q_2, Q_4, Q_6, Q_8, Q_{10}, Q_{12}, Q_{14}\}$, $\{Q_1, Q_3, Q_5, Q_7, Q_9, Q_{11}, Q_{13}, Q_{15}\}$. In stage 3, the indices of the two data differ by four.
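The initial folding sets listed above follow a simple interleaving pattern, which can be sketched as below. The function name and arguments are hypothetical, and the sketch covers only the stages with index s up to log2 P; the reordered folding sets of later stages are not reproduced here.

```python
def initial_folding_sets(code_length, parallelism, label):
    """Folding sets for a stage s <= log2(parallelism).

    Each stage has code_length // 2 operations shared among
    parallelism // 2 functional units; unit k executes operations
    k, k + units, k + 2 * units, ... in successive cycles.
    """
    operations = code_length // 2
    units = parallelism // 2
    return [[f"{label}{i}" for i in range(k, operations, units)]
            for k in range(units)]


four_parallel_stage1 = initial_folding_sets(32, 4, "P")
four_parallel_stage2 = initial_folding_sets(32, 4, "Q")
```

For a 32 bit code with P = 4 this yields two sets of eight operations, so each functional unit is reused over eight cycles.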
Delay Requirement Calculation
The number of delay elements required in the folded architecture [3] is computed using Equation (2), where $W_{ij}$ is an edge from functional unit S to functional unit T with delay d, and t and s denote the positions in the folding set corresponding to T and S respectively. The delay requirements of the four folded 32 bit polar encoder are shown in Table 1 , where some edges have negative delays. For a feasible folded architecture, the delay must be greater than or equal to zero for all edges, so re-timing or pipelining is applied to the fully parallel structure to ensure non-negative delays: each negative delay is compensated by inserting at least one delay element so that the value of Equation (2) becomes non-negative. Because the polar encoder uses the ex-or operation, its two inputs must pass through the same number of delay elements; if they differ, additional delays are inserted to match them. The DFG is pipelined by inserting delay elements, and the red line indicates the pipeline cut-set associated with the 4-folded architecture. The recalculated delays are shown in Table 2 .
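For reference, the standard folding equation from the folding-transformation literature has the form sketched below. It is offered as an assumed reading of Equation (2), not a reproduction of it: the symbol $N_f$ for the folding factor (the number of operations time multiplexed onto one functional unit, which would be $N/P$ here) is introduced only for illustration, and the functional units are assumed to have no internal pipeline stages.

```latex
D(W_{ij}) = N_f \, d + t - s
```

Under this reading, every edge of the unpipelined DFG has $d = 0$, so an edge with $s = 3$ and $t = 1$ gets $D = -2$. Inserting one delay on that edge with $N_f = 8$ gives $D = 8 + 1 - 3 = 6$, which is non-negative, and this is the effect that pipelining along the cut-set is meant to achieve.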
The 32 bit polar encoder with four parallel folded structure can be implemented with 10 functional units and 28 delay elements.
Lifetime Analysis
The number of delay elements can be reduced by applying lifetime analysis to the folded architecture [16] . The linear lifetime chart represents the lifetime of every variable, as in Figure 3 . All the edges starting from stage 1 have zero delay; therefore only $W_{2j}$, $W_{3j}$ and $W_{4j}$ are presented. For example, $W_{3,0}$ is alive for two cycles: it is produced at time 1 and consumed at time 3. The $W_{2j}$ variables start at time 0 and are taken in the same order as in the DFG. The $W_{3j}$ variables start in the next cycle, at time 1, and proceed as j = 0, 1, 4, 5, etc. The $W_{4j}$ variables start at time 2 and proceed as j = 0, 1, 8, 9, etc. in the forward manner. The number of variables alive in each cycle is given at the right side of the chart. The maximum number of live variables is 28, which implies that the four folded 32 bit polar encoder can be implemented with 28 delay elements.
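The counting step of lifetime analysis can be sketched as follows. The sketch assumes the common convention that a variable is not live in the cycle in which it is produced but is live up to and including the cycle in which it is consumed; the function name and the toy lifetimes are hypothetical and are not taken from Figure 3.

```python
def max_live_variables(lifetimes):
    """Peak number of simultaneously live variables.

    lifetimes: iterable of (t_produced, t_consumed) pairs; a variable
    is counted as live in cycles t_produced < t <= t_consumed.
    """
    live = {}
    for born, dies in lifetimes:
        for t in range(born + 1, dies + 1):
            live[t] = live.get(t, 0) + 1
    return max(live.values(), default=0)


# Toy example: three variables with two-cycle lifetimes and one
# variable that is consumed in the cycle it is produced.
toy = [(0, 2), (1, 3), (2, 4), (3, 3)]
peak = max_live_variables(toy)
```

The peak value is the minimum number of registers that any allocation needs, which is how the chart leads to the register count.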
Register Allocation
In computing the minimum number of registers required, each variable is allocated to a register. The register allocation table [17] is used to verify the allocation across all 28 registers, and every row describes how the registers are allocated in each cycle. Entries in bold font indicate that the variable is consumed at that stage. The register allocation for the 32 bit four parallel folded polar encoder is shown in Table 3 .
Proposed Architecture
The four parallel folded pipelined structure for the 32 bit polar encoder is shown in Figure 4 . It consists of 10 functional units and 28 delay elements, with two functional units in each stage. Stages 1 and 2 include no delay elements. Stages 3, 4 and 5 have several multiplexers placed in front of each functional unit to configure its inputs. The proposed architecture continuously processes four samples/cycle according to the folding sets and the register allocation table. The inputs are in natural order and the outputs are in bit reversed order.
Eight Parallel Folded 32 Bit Polar Encoder
The eight parallel design considers eight inputs at a time, so each stage is split into four folding sets. The same procedure is applied as for the four parallel design, with the folding sets given below.

Stage 1: $\{P_0, P_4, P_8, P_{12}\}$, $\{P_1, P_5, P_9, P_{13}\}$, $\{P_2, P_6, P_{10}, P_{14}\}$, $\{P_3, P_7, P_{11}, P_{15}\}$

The corresponding cut-set is shown in Figure 2 with blue connected lines. The delay requirement table $D(W_{ij})$ is filled in using Equation (2) . This table contains negative delays, which are corrected in the recalculated delay requirements for the eight parallel design shown in Table 4 .
The linear lifetime chart is drawn only for $W_{3j}$ and $W_{4j}$, since the $W_{2j}$ edges have no crossover with the inputs of other stages. The chart reduces the register count to 24, as shown in Figure 5 .
This register count is used to perform register allocation as illustrated in Table 5 .
In the folded architecture, stages 1 and 2 have zero delay and hence need no registers. Stage 3 requires eight registers, as shown in the linear lifetime chart, and stage 4 requires sixteen registers to obtain the encoded output. Figure 6 depicts the pipelined implementation of the proposed architecture.
Results and Discussion
The above designs of the 32 bit polar encoder for the fully parallel, eight and four folded architectures are simulated using Xilinx 14.6 ISE and implemented on a Virtex 5 FPGA, and the corresponding outputs are obtained.
The simulation of the 32 bit fully parallel polar encoder with an input of 32'hFFFFFFFF using Xilinx 14.6 ISIM results in an output of 32'h80008000, as depicted in Figure 7 .
The four parallel folded architecture is simulated with the same input stream as the fully parallel architecture and produces the same result, as shown in Figure 8 . With the four folded structure, the 32 bit polar encoder completes one code word in eight successive executions on the same architecture, and the synthesized design utilizes 224 registers.
The 8-parallel folded architecture is simulated with the same input stream as the fully parallel architecture and produces the same output, verifying the functionality of the polar code as in Figure 9 . With the eight folded structure, the 32 bit polar encoder completes one code word in four successive executions on the same architecture, and the synthesized design utilizes 96 registers.
The partially parallel implementation in [15] concludes that a P-parallel (N, K) polar encoder requires a number of ex-or gates given by

$$\frac{P}{2} \log_2 N$$

In addition, it requires N − P delay elements and achieves a throughput of P bits/cycle. The four and eight folded 32 bit polar encoder designs match these figures exactly. The comparison of resource utilization for the 32 bit polar encoder for the fully parallel, four and eight folded architectures is given in Table 6 .
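These counts can be checked with a few lines of arithmetic. The sketch below assumes that the ex-or count is (P/2) log2 N, which agrees with the 10 functional units of the four folded design and the 80 gates of the fully parallel design quoted earlier; the value it gives for the eight folded design is derived from that assumption and is not stated in the text. The function name and dictionary keys are illustrative.

```python
from math import log2


def folded_resources(code_length, parallelism):
    """Resource estimate for a P-parallel folded polar encoder."""
    stages = int(log2(code_length))
    return {
        "xor_gates": (parallelism // 2) * stages,    # assumed (P/2) * log2(N)
        "delay_elements": code_length - parallelism,  # N - P, as stated above
        "throughput_bits_per_cycle": parallelism,
    }


fully_parallel = folded_resources(32, 32)
four_folded = folded_resources(32, 4)
eight_folded = folded_resources(32, 8)
```

Setting P = N recovers the fully parallel case with no delay elements, which makes the trade-off between ex-or gates and delay elements easy to see across folding levels.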
Conclusion and Future Scope
This paper focuses on minimizing the hardware resources for the 32 bit polar encoder. Several optimization techniques are applied in steps to arrive at the proposed architecture for various folding levels. The simulation results show that the folded structures preserve the polar encoder functionality. The implementation on a Virtex 5 FPGA shows that folding decreases the number of functional blocks (ex-or operations) but requires a trade-off in the number of delay elements, registers and speed of execution.
