Abstract: Polar codes, discovered by Arıkan, are the first error-correcting codes with an explicit construction to provably achieve channel capacity, asymptotically. However, their error-correction performance at finite lengths tends to be lower than that of existing capacity-approaching schemes. Using the successive-cancellation algorithm, polar decoders can be designed for very long codes, with low hardware complexity, leveraging the regular structure of such codes. We present an architecture and an implementation of a scalable hardware decoder based on this algorithm. This design is shown to scale to code lengths of up to $N = 2^{20}$ on an Altera Stratix IV FPGA, limited almost exclusively by the amount of available SRAM.
I. INTRODUCTION
Since their introduction in 2008, polar codes [1] have attracted a lot of attention from the information theory community, as they are the first codes to provably achieve channel capacity, asymptotically in code length.
Although initially only defined for the binary erasure channel (BEC), they were later extended to other models, such as the additive white Gaussian noise (AWGN) channel [2].
Their recursive construction was shown to support low-complexity implementations of the successive-cancellation (SC) algorithm in hardware [3], [4]. Those low-complexity decoders can in turn be used as components in more complex schemes, such as list decoding [5], [6] and concatenated coding [7], which improve the error-correction performance of polar codes at finite lengths.
The remainder of this paper is structured as follows. Section II provides background information on polar codes and SC decoding. Then, Section III details the proposed architecture. Section IV analyzes FPGA implementation results, while Section V concludes this work.
II. BACKGROUND
Polar codes are a class of linear block codes based on a recursive definition. They are constructed using a generator matrix $G$, obtained from the base matrix $F_2 = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$, using
$$G = F_2^{\otimes n}, \tag{1}$$
where $N = 2^n$ is the code length, and $\otimes$ represents the Kronecker product. In this paper, we use $u$ to denote an information vector, $x$ for a codeword, $y$ for a received vector, and $\hat{u}$ for the information vector estimated by the decoder. These codes can be decoded using a recursive, multistage structure featuring $n$ stages of $N/2$ nodes, yielding a complexity of $O(N \log N)$ [1].
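As an illustration of the construction described above, the following minimal sketch builds the generator matrix as a repeated Kronecker product of the base matrix and encodes a vector over GF(2). The function names (`kron`, `generator_matrix`, `encode`) are hypothetical, and the sketch assumes no bit-reversal permutation is applied to the generator matrix. It uses a dense matrix for clarity only, not as a model of any hardware.

```python
F2 = [[1, 0], [1, 1]]  # base matrix from the text


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def generator_matrix(n):
    """n-th Kronecker power of F2 (assumed form of G, without bit reversal)."""
    g = [[1]]
    for _ in range(n):
        g = kron(g, F2)
    return g


def encode(u):
    """Compute x = u * G over GF(2) for a vector whose length is a power of two."""
    n = len(u).bit_length() - 1
    if len(u) != 1 << n:
        raise ValueError("length of u must be a power of two")
    g = generator_matrix(n)
    size = len(u)
    return [sum(u[i] * g[i][j] for i in range(size)) % 2 for j in range(size)]


if __name__ == "__main__":
    print(generator_matrix(2))
    print(encode([0, 0, 0, 1]))
```

The dense product costs on the order of $N^2$ operations; a butterfly-style implementation would reach the $O(N \log N)$ complexity mentioned in the text.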
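To simplify their implementation in hardware, decoding can be carried out in the log-likelihood-ratio (LLR) domain, where the SC equations become the standard sum-product algorithm (SPA) equations, which can be approximated using the well-known min-sum algorithm (MSA) [8]:
$$\lambda_f(\lambda_a, \lambda_b) \approx \operatorname{sign}(\lambda_a)\,\operatorname{sign}(\lambda_b)\,\min(|\lambda_a|, |\lambda_b|), \qquad \lambda_g(\hat{s}, \lambda_a, \lambda_b) = (-1)^{\hat{s}}\lambda_a + \lambda_b, \tag{2}$$
where $\hat{s}$ designates a partial sum. This approximation yields a performance degradation of approximately 0.1 dB over SPA, as illustrated in Figure 1, although this gap tends to shrink for higher-rate ($R = k/N$) codes.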
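The min-sum node functions referred to above can be sketched as follows. This uses the form commonly given for SC decoding in the LLR domain; the argument names `a`, `b`, and `s` are assumptions, and the exact SPA version is included only for comparison.

```python
import math


def sign(x):
    """Sign of an LLR, treating zero as positive (an assumption)."""
    return -1 if x < 0 else 1


def lambda_f(a, b):
    """Min-sum approximation of the f node."""
    return sign(a) * sign(b) * min(abs(a), abs(b))


def lambda_f_spa(a, b):
    """Exact sum-product f node; may fail for very large |a|, |b| in floating point."""
    return 2.0 * math.atanh(math.tanh(a / 2.0) * math.tanh(b / 2.0))


def lambda_g(s, a, b):
    """g node: the partial sum s (0 or 1) selects addition or subtraction."""
    return b - a if s else b + a


if __name__ == "__main__":
    print(lambda_f(3.0, -5.0), lambda_f_spa(3.0, -5.0))
    print(lambda_g(0, 3.0, -5.0), lambda_g(1, 3.0, -5.0))
```

The min-sum output never has a smaller magnitude than the exact value, which is one way to see where the small performance loss comes from.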
III. ARCHITECTURE
The architecture presented in this paper is based on the semi-parallel decoder of [3], and introduces modifications aiming to improve its scalability with respect to code length. This decoder uses a fixed datapath, and operates under resource constraints, where only $P < N/2$ processing elements (PEs) are implemented. This limitation, however, only impacts throughput minimally [8]. Figure 2 provides a top-level overview of the redesigned decoder architecture, while its various changes are discussed in the following sections.
A. Memory Improvements
Unlike [3], which makes use of a single SRAM to store all LLRs, this improved architecture relies on two separate types of memories: channel and internal. This separation allows full-throughput operation of the decoder by supporting the loading of a subsequent frame into the channel memory, without write contention, while the previous one is still being processed. This is made possible by the fact that, per the structure of the decoding graph, channel LLRs are not directly required in the second half of the decoding process, i.e. after bit $i = N/2$ and stage $l = n - 1$.
Furthermore, the improved design does away with asymmetric read/write ports in its SRAMs. Those memories are replaced by pairs of $P$-LLR wide SRAMs, whose outputs are concatenated into $2P$-LLR words consumed by the processing elements, whose own $P$-LLR outputs are written to each SRAM in sequence. Note that the & operator used in Figure 2 symbolizes concatenation, with sign extension if needed.
B. Quantization
The separation of the channel and internal memories, described in Section III-A, also makes it possible to use distinct quantization levels for each memory. This enhancement is suggested by a characteristic of the successive-cancellation algorithm, namely that (2) affects the range of the computations in each successive stage, while their precision remains unchanged by both operations. It follows that the values processed by lower-indexed stages require more range than those in the higher ones. Since the decoder must retain an entire $N$-LLR frame in memory, the channel SRAMs account for nearly half of the decoder's soft information storage requirements [3]. A lower quantization for this memory therefore reduces the decoder area significantly.
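To make the storage argument concrete, the rough estimate below compares channel and internal LLR storage. It assumes the internal memory holds $N - 1$ LLRs (that is, $N/2 + N/4 + \dots + 1$), which is an assumption drawn from the shape of the decoding graph and not a figure stated in the text. The function name and the bit widths in the example are hypothetical.

```python
def llr_memory_bits(n_len, q_internal, q_channel):
    """Return (channel_bits, internal_bits) for a code of length n_len.

    Assumption: N channel LLRs and N - 1 internal LLRs are stored.
    """
    channel_bits = n_len * q_channel
    internal_bits = (n_len - 1) * q_internal
    return channel_bits, internal_bits


if __name__ == "__main__":
    n_len = 1 << 15
    same = llr_memory_bits(n_len, 5, 5)     # equal quantization for both memories
    reduced = llr_memory_bits(n_len, 5, 4)  # one bit less for channel LLRs
    print("channel share:", same[0] / sum(same))
    print("bits saved:", sum(same) - sum(reduced))
```

Under this assumption, each bit removed from the channel quantization saves $N$ bits of SRAM.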
Quantization is denoted using the shorthand $(Q_i, Q_{ic}, Q_f)$, which indicates the number of integer bits for internal LLRs, integer bits for channel LLRs, and fractional bits for both types, respectively; $Q = Q_i + Q_f$ and $Q_c = Q_{ic} + Q_f$ are also used to refer to the total number of quantization bits in each case.
Simulations showed that full-range quantization does not benefit error-correction performance; much lower levels can match a floating-point implementation. Specifically, we carried out those simulations for codes of length $N = 2^{15}$, with $R \in \{0.25, 0.50, 0.75, 0.90\}$; results are summarized in Table I. We found that, for those codes, 6-8 bits of quantization suffice for good error-correction performance (within approximately 0.1 dB of floating-point MSA), depending on their rate, as shown in Figure 1. We also noticed that higher-rate codes tend to require fewer bits of fractional precision and integer range for internal LLRs, but more bits of integer range for channel ones.
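The shorthand above can be illustrated with a small saturating fixed-point quantizer. This is a sketch under stated assumptions: the integer bit count is taken to include the sign bit, the range is symmetric, and rounding is to nearest. The text does not specify any of these, and the function name `quantize` is hypothetical.

```python
def quantize(llr, int_bits, frac_bits):
    """Quantize an LLR to int_bits + frac_bits total bits with saturation.

    Assumptions: int_bits includes the sign bit, and the representable
    range is symmetric around zero.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    max_code = (1 << (total_bits - 1)) - 1
    code = round(llr * scale)
    code = max(-max_code, min(max_code, code))  # saturate instead of wrapping
    return code / scale


if __name__ == "__main__":
    # Hypothetical setting: 4 integer bits and 1 fractional bit.
    for value in (2.3, 100.0, -100.0):
        print(value, "->", quantize(value, 4, 1))
```

Calling `quantize` with different integer bit counts for channel and internal LLRs mimics the distinct quantization levels described for the two memories.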
C. Chained PE
This architecture makes use of a chained PE in stage 0, carrying out functions $\lambda_f$ and $\lambda_g$ in a single clock cycle (CC). The concept behind this improvement was introduced in [9], while the restricted implementation used in this paper, targeting only stage 0, was independently proposed in [4].
This chained PE relies on the specific schedule of the polar decoding graph, in which stage 0 is always activated twice in a row, using the same operands: first for function $\lambda_f$, and then for function $\lambda_g$, using the result of $\lambda_f$. By chaining both operations in a special PE, we can output two decoded bits $\hat{u}_{\{i,i+1\}}$ at once, yielding an $(N/2)$-CC reduction in decoding latency.
This behavior is illustrated in Table III, specifically in clock cycles {3, 6, 13, 16}. In those cases, the computations of functions $\lambda_f$ and $\lambda_g$ are performed in the same clock cycle, yielding two decoded bits simultaneously.
The chained PE does not incur any overhead over the regular PE. The data dependency present between functions $\lambda_f$ and $\lambda_g$, satisfied by the sign of $\lambda_f$, occurs late in the processing of $\lambda_g$, and can be computed very rapidly.
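A behavioral sketch of the chained stage-0 operation follows. It assumes the min-sum node functions in their commonly used form, a hard decision that maps a negative LLR to bit 1, and frozen bits forced to 0; the function name and the `frozen_*` parameters are hypothetical. Both candidate $\lambda_g$ results are computed up front and one is selected by the first decoded bit, mirroring the late data dependency described above.

```python
def chained_pe(a, b, frozen_first=False, frozen_second=False):
    """Return two decoded bits from one pair of stage-0 LLRs.

    Assumptions: negative LLR decodes to 1; frozen bits are 0.
    """
    sign_a = -1 if a < 0 else 1
    sign_b = -1 if b < 0 else 1
    lam_f = sign_a * sign_b * min(abs(a), abs(b))

    # Both g candidates do not depend on lam_f, so they can be prepared early.
    g_if_zero = b + a
    g_if_one = b - a

    u_first = 0 if frozen_first else int(lam_f < 0)
    lam_g = g_if_one if u_first else g_if_zero  # late selection by the first bit
    u_second = 0 if frozen_second else int(lam_g < 0)
    return u_first, u_second


if __name__ == "__main__":
    print(chained_pe(3.0, -5.0))
    print(chained_pe(3.0, -5.0, frozen_first=True))
```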
D. Semi-Parallel Partial-Sum Encoder
The main factor limiting the scalability of [3] is the growing complexity of its partial-sum update logic. In this paper, we introduce an encoder-based alternative inspired by the design of [9] , which proposed a fully-parallel partial-sum computation module. Our implementation extends this encoder, adapting it to a novel semi-parallel architecture. This architecture operates over multiple clock cycles and uses a fixed datapath, removing it from the decoder's critical path altogether.
This encoder is triggered after decoding stage 0, and processes two decoded bits at a time. Figure 3 illustrates its structure, a mirrored version of the decoding graph, in which the $\hat{f}$ nodes are defined as binary additions (XOR), and the $\hat{g}$ nodes, as pass-through connections:
$$\hat{f}(\hat{s}_a, \hat{s}_b) = \hat{s}_a \oplus \hat{s}_b, \qquad \hat{g}(\hat{s}_b) = \hat{s}_b. \tag{3}$$
As in the decoding graph, the nodes are associated into $N/2$ pairs per stage. Those pairs are processed by the $P/2$ encoding PEs. In order to make the design scalable, a semi-parallel architecture was chosen for the encoder. Since the encoding graph mirrors the decoding graph, their schedules are very similar. The encoding schedule is illustrated in Table III, where $e$ denotes the activation of encoding stage $l_{enc}$. Due to the semi-parallel nature of the encoder, stages which are handled in multiple clock cycles are denoted using a subscript, e.g. $e_0$. In Figure 3, the subgraph highlighted in bold illustrates the nodes activated to calculate partial sums $\hat{s}_{1,0}$ and $\hat{s}_{1,2}$. Those two values are subsequently used to evaluate $\lambda_g$ nodes in stage $l = 1$ of the decoding graph.
The partial-sum encoder follows a schedule similar to that of the decoding, although with half as many processing elements; those processing elements produce two values instead of one, since they are not restricted by a data dependency as the decoding PEs are. The encoder thus increases latency by $\left(\frac{N}{P}(P-1) + \frac{N}{P}\log_2\left(\frac{N}{4P}\right) - \log_2 P + 2\right)$ CC, or ∼67% for $P = 64$, but allows higher operating frequencies, for a net throughput gain.
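The latency expression above can be evaluated numerically. The sketch below reads it as (N/P)(P − 1) + (N/P)·log2(N/(4P)) − log2(P) + 2; this fraction grouping is a reconstruction from the extracted text and should be treated as an assumption. The function names are hypothetical, and both N and P are assumed to be powers of two with N ≥ 4P.

```python
def ilog2(value):
    """Exact base-2 logarithm of a positive power of two."""
    if value <= 0 or value & (value - 1):
        raise ValueError("value must be a positive power of two")
    return value.bit_length() - 1


def encoder_latency_increase(n_len, p):
    """Additional clock cycles attributed to the semi-parallel encoder.

    Assumed reading: (N/P)(P - 1) + (N/P) log2(N / 4P) - log2(P) + 2.
    """
    if n_len < 4 * p:
        raise ValueError("expected N >= 4P")
    blocks = n_len // p
    return blocks * (p - 1) + blocks * ilog2(n_len // (4 * p)) - ilog2(p) + 2


if __name__ == "__main__":
    for n in (15, 17, 20):
        print(f"N = 2^{n}, P = 64:", encoder_latency_increase(1 << n, 64), "CC")
```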
Using $P/2$ encoding PEs, the encoder can make use of $P$-bit wide words in the $\hat{s}$ SRAMs, allowing the decoder to retrieve $P$ partial sums simultaneously during decoding, in a single clock cycle. Furthermore, because of the specific structure of the encoding graph, the values stored in memory are properly aligned for direct consumption by the decoding PEs, via a fixed datapath.
Note that the internal partial sums $\hat{s}_{0,j}$ correspond to $\hat{u}_i$, where $i$ is the bit-reversed [1] value of $j$. Furthermore, $\hat{s}_{n,j}$ yields an estimation of codeword value $\hat{x}_i$, where $i$ is again bit-reversed $j$. As part of the encoding process resulting in this estimated codeword $\hat{x}$, the encoder creates internal estimations $\hat{s}_{l,j}$, which are required by $\lambda_g$ during the decoding process.
In a non-systematic polar decoder, it is not necessary to evaluate $\hat{x}$ completely, which saves a final encoding stage after $\hat{u}_{N-2}$ and $\hat{u}_{N-1}$ are decoded. However, in a systematic decoder [10], those extra steps could be carried out to obtain $\hat{x}$, which is required to retrieve the original information vector, while avoiding the need for extra hardware to perform the additional encoding step.
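The bit-reversed index mapping mentioned above can be sketched as follows. The helper name `bit_reverse` is hypothetical; it reverses the `n` least-significant bits of `j`, with `n` being the base-2 logarithm of the code length.

```python
def bit_reverse(j, n):
    """Reverse the n least-significant bits of the index j."""
    i = 0
    for _ in range(n):
        i = (i << 1) | (j & 1)  # shift the lowest bit of j into i
        j >>= 1
    return i


if __name__ == "__main__":
    n = 3  # code length N = 8
    print([bit_reverse(j, n) for j in range(1 << n)])
```

Since bit reversal is its own inverse, the same helper maps in both directions between the two index orders.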
IV. EXPERIMENTAL RESULTS
The various characteristics of this architecture, explored in Section III, are summarized in Table IV. In this table, latency takes into account the semi-parallel schedule, as well as the chained PE and partial-sum encoder. The LLR SRAMs entry combines both the channel and the internal LLR memories. The $\hat{s}$ SRAMs store internal encoding estimations, but not the whole estimated codeword $\hat{x}$. Finally, throughput is estimated for $P = 64$, a common value, to simplify its representation.
[TABLE III: SCHEDULE OF THE PROPOSED [...]. Columns: Stage / CC, clock cycles 0 to 22. Rows include encoding stages $l_{enc}$ with activations $e$, $e_0$ to $e_3$, and the output $\hat{u}_0$ to $\hat{u}_7$.]
[TABLE IV: rows include decoding latency (CC) and $\hat{s}$ SRAMs (bits).]
Table II then presents implementation results targeting an Altera Stratix IV FPGA. Maximum frequencies are reported for the slow 900 mV, 85 °C timing model. This table starts by presenting implementation results for the four $N = 2^{15}$ codes described in Section III-B. It then explores the scalability of our design with respect to two parameters: code length and quantization. Finally, it compares this work with [3].
Note that, as in [3] , our decoder architecture is not affected by code rate, as the choice of a specific code only modifies the contents of a ROM. Code rate is thus only reported in the first section of this table.
Those results show that the improved architecture retains a high clock frequency over a wide variety of code lengths, due to its fixed datapaths; the decreases observed are mostly due to routing delays, as more SRAM elements are used on the FPGA. Compared to [3], this new design scales much better with respect to all parameters; its higher memory use could be compensated, in an actual decoder, by $Q_{ic} < Q_i$, while it is set to the same value here, for fair comparison.
The register, logic and memory use of the decoder targeting the $N = 2^{20}$ code amount to 0.5%, 2%, and 72% of the resources available on the selected FPGA, respectively. Additionally, register and logic use grow roughly linearly in the number of PEs and quantization bits, but are mostly unaffected by code length. Therefore, we can state that this architecture will scale to extremely long codes, limited almost exclusively by the amount of SRAM available on the FPGA.
At $N = 2^{17}$, the largest code length supported by our previous-generation decoder, this improved architecture uses 81 times fewer look-up tables (LUTs) and 104 times fewer flip-flops (FFs), has a maximum operating frequency 16 times higher, and achieves a throughput 11 times greater, using the same parameters.
V. CONCLUSION
In this paper, we presented a scalable architecture for SC decoding of polar codes. This decoder features a semi-parallel, encoder-based partial-sum update module. This module utilizes SRAM for storage, and makes use of a fixed datapath. Additionally, this architecture leverages a multi-level quantization scheme for LLRs, decreasing memory use and decoder area. This state-of-the-art decoder was synthesized for an Altera Stratix IV FPGA target up to $N = 2^{20}$, limited almost exclusively by the amount of available SRAM.
