Abstract-This is the first in a two-part series of papers on unrolled polar decoders. In this paper (Part I), we present a family of architectures for hardware polar decoders using a reduced-complexity successive-cancellation decoding algorithm. The resulting fully-unrolled architectures are capable of achieving a coded throughput in excess of 400 Gbps on an FPGA, two orders of magnitude greater than current state-of-the-art polar decoders. Moreover, the proposed architectures are flexible in a way that makes it possible to explore the trade-off between resource usage and throughput.
The fastest of these architectures is implemented on an FPGA for a code of length 2048. Moreover, we present a family of architectures that offer a flexible trade-off between throughput and area.
In Part II, a new list decoding algorithm is proposed and an unrolled software list decoder is implemented and shown to offer an order-of-magnitude improvement in decoding speed.
We start this paper with Section II, in which we briefly review polar codes, their construction, their representation as graphs and trees, and the successive-cancellation decoding algorithm. Section III provides the necessary background on the simplified successive-cancellation and Fast-SSC decoding algorithms. Section IV briefly reviews state-of-the-art polar decoder architectures. Section V presents the proposed family of architectures, along with the operations and processing nodes used. Section VI discusses the implementation and presents FPGA results for various code lengths and rates. Results are compared against state-of-the-art polar decoders. The coded throughput of our decoders is shown to be in excess of 250 Gbps for a (1024, 512) polar code and of 400 Gbps for a (2048, 1024) polar code, two orders of magnitude over the current state of the art. Finally, Section VII concludes this paper.
Preliminary results of this work will appear as a letter [15] . In this paper, we generalize the architecture into a family of architectures offering a flexible trade-off between throughput and area, give more details on the unrolled architecture and provide more results. We also significantly improve the (1024, 512) fully-unrolled deeply-pipelined polar decoder implementation results on all metrics.
II. Polar Codes

A. Construction
Polar codes exploit the channel polarization phenomenon by which the probability of correctly estimating codeword bits tends to either 1 (completely reliable) or 0.5 (completely unreliable). These probabilities get closer to their limit as the code length increases when a recursive construction such as the one shown in Fig. 1a is used, where ⊕ represents a modulo-2 addition (XOR). Under successive-cancellation decoding, polar codes were shown to achieve the symmetric capacity of memoryless channels as their code length N tends to infinity [1] .
An (N, k) polar code of length N carries k information bits. The remaining N − k bits are called frozen bits and are usually set to zero. The grayed $u_i$'s, where $i \in \{0, 1, 2, 4\}$, in Fig. 1a are frozen bits. Depending on the type of channel and its conditions, the optimal location of the frozen bits varies and can be determined using, for example, the method described in [16]. Fig. 1a illustrates the construction of a non-systematic polar code. In [17], a method to construct systematic polar codes was introduced. Systematic polar codes offer an improved bit-error rate over their non-systematic counterparts without affecting the frame-error rate. In this paper, we use systematic polar codes.
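As an illustration of the recursive construction, the following minimal Python sketch encodes a non-systematic polar code of length 8 with frozen indices {0, 1, 2, 4}. It assumes natural bit ordering (no bit-reversal permutation) and zero-valued frozen bits; the function names `polar_encode` and `build_u` are hypothetical and are not taken from the paper.

```python
def polar_encode(u):
    """Non-systematic polar encoding over GF(2), natural bit order (assumption).

    Recursion: encode([a, b]) = encode(a XOR b) ++ encode(b),
    which mirrors the XOR structure of the graph representation.
    """
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    upper = [a ^ b for a, b in zip(u[:half], u[half:])]
    return polar_encode(upper) + polar_encode(u[half:])


FROZEN = {0, 1, 2, 4}  # frozen indices of the (8, 4) example


def build_u(info_bits, n=8, frozen=FROZEN):
    """Place information bits on non-frozen positions; frozen bits are 0."""
    it = iter(info_bits)
    return [0 if i in frozen else next(it) for i in range(n)]


u = build_u([1, 0, 1, 1])
x = polar_encode(u)
```

Because the 2x2 kernel is its own inverse over GF(2), applying `polar_encode` twice returns the original vector, which gives a convenient sanity check.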
B. Decoder Tree Representation
Encoding and decoding of polar codes is often represented as a graph, as in Fig. 1a. As polar codes are built recursively, it was proposed in [6] to represent them as binary trees. Fig. 1b shows such a representation, called a decoder tree, for the same polar code illustrated as a graph in Fig. 1a. In the decoder tree, white and black leaves represent frozen and information bits, respectively. Moving up in the decoder tree corresponds to the concatenation of constituent codes. For example, the concatenation operation circled in blue in Fig. 1a corresponds to the node labeled v in Fig. 1b.
C. Successive-Cancellation Decoding Algorithm
The SC decoding algorithm works by doing a depth-first traversal of the decoder tree like the one of Fig. 1b. Using log-likelihood ratios (LLRs) for soft inputs and the min-sum approximation, all operations are carried out using additions, subtractions or logic operators [18]. Traversing the tree requires four operations in total.
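Moving down along a left edge, the F operation generates the LLRs $\alpha_l$ for the left child l using the LLRs from the root node:

$$\alpha_l[i] = \operatorname{sgn}(\alpha_v[i])\,\operatorname{sgn}(\alpha_v[i + N_v/2])\,\min\left(|\alpha_v[i]|, |\alpha_v[i + N_v/2]|\right), \quad i \in [0, N_v/2), \tag{1}$$

where $\alpha_v$ are LLRs from the root node (or input LLRs to the node), and $N_v$ is the node input length.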
Once a leaf node is reached, the bit estimate is calculated using the input LLR $\alpha_v$:

$$\beta_v = \begin{cases} 0, & \text{when } u_i \text{ is a frozen bit;} \\ 1, & \text{when } \alpha_v < 0; \\ 0, & \text{otherwise.} \end{cases}$$
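Moving down the tree along a right edge, the G operation generates the LLRs $\alpha_r$ for the right child r using the bit estimates from the left one:

$$\alpha_r[i] = \begin{cases} \alpha_v[i + N_v/2] + \alpha_v[i], & \text{when } \beta_l[i] = 0; \\ \alpha_v[i + N_v/2] - \alpha_v[i], & \text{otherwise,} \end{cases} \quad i \in [0, N_v/2), \tag{2}$$

where $\beta_l$ is the bit estimate vector from the left-hand-side sibling subtree. Going up the decoder tree, bit estimates are combined to form the bit estimate vector $\beta_v$:

$$\beta_v[i] = \begin{cases} \beta_l[i] \oplus \beta_r[i], & \text{when } i < N_v/2; \\ \beta_r[i - N_v/2], & \text{otherwise,} \end{cases} \quad i \in [0, N_v), \tag{3}$$

where $\beta_l$ and $\beta_r$ are bit estimates emanating from the left and right child nodes, respectively. Regardless of the frozen bit locations, the SC decoding algorithm fully traverses the decoder tree.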
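The depth-first traversal described in this section can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware architecture proposed later: it uses the standard min-sum form of the F operation, treats an LLR of exactly zero as positive, and returns the bit estimate vector at the root. The names `f_op`, `g_op`, `combine` and `sc_decode` are hypothetical.

```python
def f_op(alpha):
    """F operation (min-sum): LLRs sent to the left child."""
    half = len(alpha) // 2
    out = []
    for a, b in zip(alpha[:half], alpha[half:]):
        sign = -1 if (a < 0) != (b < 0) else 1
        out.append(sign * min(abs(a), abs(b)))
    return out


def g_op(alpha, beta_l):
    """G operation: LLRs sent to the right child, given left bit estimates."""
    half = len(alpha) // 2
    return [alpha[i + half] - alpha[i] if beta_l[i] else alpha[i + half] + alpha[i]
            for i in range(half)]


def combine(beta_l, beta_r):
    """Combine operation: bit estimates passed up to the parent node."""
    return [l ^ r for l, r in zip(beta_l, beta_r)] + list(beta_r)


def sc_decode(alpha, frozen):
    """Recursive SC decoding; frozen[i] is truthy when leaf i is frozen."""
    if len(alpha) == 1:
        if frozen[0]:
            return [0]
        return [1 if alpha[0] < 0 else 0]
    half = len(alpha) // 2
    beta_l = sc_decode(f_op(alpha), frozen[:half])
    beta_r = sc_decode(g_op(alpha, beta_l), frozen[half:])
    return combine(beta_l, beta_r)


# (8, 4) example with frozen leaves at positions 0, 1, 2 and 4.
FROZEN_8_4 = [1, 1, 1, 0, 1, 0, 0, 0]
# All-ones codeword sent; the last LLR is weakly wrong (illustrative values).
estimate = sc_decode([-4, -4, -4, -4, -4, -4, -4, 1], FROZEN_8_4)
```

In this example the single unreliable LLR is corrected and the all-ones codeword is recovered.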
III. Simplified Successive-Cancellation and Fast-SSC Decoding Algorithms
A. Simplified Successive-Cancellation (SSC)
As mentioned above, a polar code is the concatenation of smaller constituent codes. Instead of using the successive-cancellation algorithm on all constituent codes, the location of the frozen bits can be taken into account to use more efficient, lower-complexity algorithms on some of these constituent codes. In [6], decoder tree nodes are split into three categories: Rate-0, Rate-1, and Rate-R nodes.
1) Rate-0 Nodes: are subtrees whose leaf nodes all correspond to frozen bits. We do not need to use the SC algorithm to decode such a subtree as the exact decision, by definition, is always the all-zero vector.
2) Rate-1 Nodes: are subtrees where all leaf nodes carry information bits; none are frozen. The maximum-likelihood decoding rule for these nodes is to take a hard decision on the input LLRs:

$$\beta_v[i] = \begin{cases} 0, & \text{when } \alpha_v[i] \geq 0; \\ 1, & \text{otherwise.} \end{cases} \tag{4}$$

With a fixed-point representation, this operation amounts to copying the most significant bit of the input LLRs. With a systematic polar code, no further action is required. In the non-systematic case, the $\beta_v$ vector above is an intermediate result that needs to go through an encoder of length $N_v$ to generate the final bit estimate vector $\beta_v$.
3) Rate-R Nodes: lastly, Rate-R nodes, where 0 < R < 1, are subtrees such that leaf nodes are a mix of information and frozen bits. These nodes are decoded using the conventional SC algorithm until a Rate-0 or Rate-1 node is encountered.
As a result of this categorization, the SSC algorithm trims the decoder tree of Fig. 1b into the one illustrated in Fig. 2a . Rate-1 and Rate-0 nodes are shown in black and white, respectively. Gray nodes represent Rate-R nodes. Trimming the decoder tree leads to a lower decoding latency and an increased decoder throughput.
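The trimming step can be illustrated with a short Python sketch that classifies subtrees from the frozen-bit pattern alone. The tuple-based tree representation and the names `trim` and `count_nodes` are assumptions made for illustration; the node counts below are computed by this sketch and are not read from Fig. 2a.

```python
def trim(frozen):
    """Build an SSC decoder tree from a frozen-bit pattern (truthy = frozen).

    Rate-0 and Rate-1 subtrees become leaves; Rate-R nodes keep two children.
    """
    n = len(frozen)
    if all(frozen):
        return ("Rate-0", n)
    if not any(frozen):
        return ("Rate-1", n)
    half = n // 2
    return ("Rate-R", n, trim(frozen[:half]), trim(frozen[half:]))


def count_nodes(tree):
    """Count the nodes of a tree built by trim()."""
    return 1 + sum(count_nodes(child) for child in tree[2:])


# (8, 4) example with frozen leaves at positions 0, 1, 2 and 4.
tree = trim([1, 1, 1, 0, 1, 0, 0, 0])
full_tree_nodes = 2 * 8 - 1          # untrimmed binary tree with 8 leaves
trimmed_nodes = count_nodes(tree)
```

For this frozen-bit pattern, the sketch visits 11 nodes instead of the 15 nodes of the full binary tree.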
B. Fast-SSC
Some Rate-R nodes, corresponding to specific frozen-bit locations, can be decoded using algorithms with lower latency than SC decoding. Many of the nodes and operations introduced in [10] take advantage of this information. The subset of nodes and operations used in our proposed family of architectures is briefly reviewed in this section.
1) F Operations:
The F operation is among the functions used in the conventional SC decoding algorithm. It calculates the soft message (LLR), denoted α, to be sent along a left edge to a child node l in a decoder tree. It is calculated using (1) .
2) G and G_0R Operations: Like the F operation, G is also a function taken from the conventional SC decoding algorithm. It calculates the soft message to be sent along a right edge to a child node r in a decoder tree. It is calculated using (2) .
G_0R, named G0R in Figs. 5 and 6, is a special case of the G operation where the left child is a frozen node. The G_0R operation is thus similar to the G operation, except that $\beta_l$ is an all-zero vector of length $N_v/2$.
3) Combine and Combine_0R Operations: As defined by (3), the Combine operation generates the bit estimate vector at the root of a subtree by combining the bit estimate vectors from its children nodes.
A Combine_0R operation, called C0R in Figs. 5 and 6, is a special case of the Combine operation (3). It is used when the left-hand-side constituent code, β l , is a Rate-0 node.
4) Repetition Node:
In this node, all leaf nodes are frozen bits, with the exception of the node at the most significant position. At encoding time, the only information bit gets repeated over the $N_v$ outputs. The information bit can be estimated by using threshold detection over the sum of the input LLRs $\alpha_v$:

$$\beta_v = \begin{cases} 0, & \text{when } \sum_{i=0}^{N_v-1} \alpha_v[i] \geq 0; \\ 1, & \text{otherwise.} \end{cases}$$
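5) Single-parity-check (SPC) Node: An SPC node is a node such that all leaf nodes are information bits with the exception of the node at the least significant position. To decode an SPC code, we start by calculating the parity of the hard decisions taken on the input LLRs:

$$\text{parity} = \bigoplus_{i=0}^{N_v-1} \beta_v[i], \quad \text{with } \beta_v[i] = \begin{cases} 0, & \text{when } \alpha_v[i] \geq 0; \\ 1, & \text{otherwise.} \end{cases}$$

The estimated bit vector is then generated by reusing the $\beta_v$ calculated above unless the parity constraint is not satisfied, i.e., the parity differs from zero. In that case, the estimated bit corresponding to the input with the smallest LLR magnitude is flipped:

$$\beta_v[i] = \beta_v[i] \oplus \text{parity}, \quad \text{where } i = \arg\min_j |\alpha_v[j]|.$$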
Fig. 2b presents the decoder tree where the Repetition and SPC nodes presented above are used. The Repetition node is shown in striped green while the SPC node is in cross-hatched orange.
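The Repetition and SPC decoding rules described above can be sketched as follows. This is a minimal software model, assuming that an LLR of zero maps to bit 0 and that ties in LLR magnitude are broken in favor of the lowest index; the function names are hypothetical.

```python
def hard_decision(alpha):
    """Hard decision on LLRs: 1 for negative values, 0 otherwise."""
    return [1 if a < 0 else 0 for a in alpha]


def decode_repetition(alpha):
    """Repetition node: threshold detection on the sum of the input LLRs."""
    bit = 1 if sum(alpha) < 0 else 0
    return [bit] * len(alpha)


def decode_spc(alpha):
    """SPC node: hard decision, then flip the least reliable bit if parity fails."""
    beta = hard_decision(alpha)
    parity = 0
    for b in beta:
        parity ^= b
    if parity:
        # Index of the smallest LLR magnitude (lowest index on ties).
        weakest = min(range(len(alpha)), key=lambda i: abs(alpha[i]))
        beta[weakest] ^= 1
    return beta


rep_out = decode_repetition([3.0, -1.0, -1.0, 0.5])   # sum is +1.5
spc_ok = decode_spc([2.0, -3.0, 1.0, -0.5])           # parity already even
spc_fixed = decode_spc([2.0, -3.0, 1.0, 0.5])         # parity odd, flip index 3
```

In the last call, the hard decisions have odd parity, so the bit attached to the LLR of magnitude 0.5 is flipped.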
Our proposed decoder borrows from the Fast-SSC algorithm in that it uses the specialized nodes and operations described above to reduce the decoding latency. However, the family of architectures we propose differs greatly from the processor-like architecture of [10]. Moreover, [10] proposes hybrid node types combining the ones above in order to further reduce the decoding latency. We do not use those hybrid nodes in this paper.
IV. State-of-the-Art Architectures
A. Decoders Focusing on Resource Minimization
Most hardware polar decoder architectures presented in the literature, [3]-[5], [7]-[10], use the SC decoding algorithm or an SC-based algorithm. These decoders mainly focus on minimizing logic area (CMOS) or resource usage (FPGA). As an example, the fastest of these SC-based decoders, the Fast-SSC decoder of [10], utilizes a processor-like architecture where the different units are used one to many times over the decoding of a frame. With the algorithmic improvements reviewed in Section III-B, the Fast-SSC decoder was shown to be capable of achieving a 1.1 Gbps information throughput at 108 MHz on an FPGA for a polar code with a length $N = 2^{15}$.
B. Decoders Focusing on Throughput Maximization
Recently, two polar decoders capable of achieving an information throughput greater than 1 Gbps with a short (1024, 512) polar code were proposed. An iterative belief propagation (BP) fully-parallel decoder achieving an information throughput of 2.34 Gbps at 300 MHz on a 65 nm CMOS application-specific integrated-circuit (ASIC) was proposed in [19] . More recently, a fully-combinational, SC-based decoder with input and output registers was proposed in [20] . That decoder reaches an information throughput of 1.43 Gbps at 2.79 MHz on a 90 nm CMOS ASIC and of 600 Mbps at 596 kHz on a 40 nm CMOS Xilinx Virtex 6 FPGA.
While these results are a significant improvement, their information throughput, less than 3 Gbps, is still an order of magnitude less than the projected peak throughput for future 5G communications [11]-[13]. Therefore, in this paper we propose a family of architectures capable of achieving one to two orders of magnitude greater throughput than the current state-of-the-art polar decoders.
V. Architecture, Operations and Processing Nodes

Similar to the decoders focusing on throughput maximization presented in the previous section, in order to significantly increase decoding throughput, our family of architectures does not focus on logic reuse but fully unrolls and pipelines the required calculations. A fully-unrolled decoder is a decoder where each and every operation or node required in estimating a codeword is instantiated with dedicated hardware. As an example, if a decoder for a specific polar code requires two executions of an F operation with a length of 8, a fully-unrolled decoder for that code will feature two F modules with inputs of size 8 instead of reusing the same block twice.
The idea of fully unrolling a decoder has previously been applied to decoders for other families of error-correcting codes. Notably, in [21], [22], the authors propose a fully-unrolled deeply-pipelined decoder for an LDPC code. Polar codes are better suited to unrolling as they do not feature a complex interleaver like LDPC codes.
In this section, we provide details on the proposed family of architectures, and describe the operations and processing nodes used.
A. Fully Unrolled (Basic Scheme)
Building upon the work done on software polar decoders [14] , [23] , we propose fully-unrolled hardware decoder architectures built for a specific polar code using a subset of the low-complexity Fast-SSC algorithm.
In the fully-unrolled architecture, all the nodes of a decoder tree exist simultaneously. Fig. 3 shows a fully-unrolled decoder for the (8, 4) polar code illustrated as a decoder tree in Fig. 2b. White blocks represent operations in the Fast-SSC algorithm and the subscripts of their labels correspond to their input length $N_v$. Rep denotes a Repetition node, and C stands for the Combine operation. Grayed rectangles are registers. The clock and enable signals for those blocks are omitted for clarity. As will be shown in Section V-C, even with the multi-cycle paths, the enable signals for that decoder may always remain asserted without affecting the correctness as long as the input to the decoder remains stable for 3 clock cycles. This constitutes our basic scheme. It takes advantage of the fact that registers are available right after LUTs in FPGA logic blocks, meaning that adding a register after each operation does not require any additional logic block.
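To make the notion of unrolling concrete, the sketch below lists, in execution order, one operation instance per node visit for a given frozen-bit pattern. It is a software illustration under stated assumptions and not a description of the actual hardware generator: one operation per pipeline stage, Repetition and SPC nodes limited to lengths 8 and 4, and operation labels loosely following the figure conventions (the `R0` label and the function name `unroll` are hypothetical).

```python
def unroll(frozen, max_rep=8, max_spc=4):
    """Return the ordered list of operation instances of an unrolled decoder.

    frozen[i] is truthy when leaf i is a frozen bit.
    """
    n = len(frozen)
    if all(frozen):
        return [f"R0_{n}"]                       # Rate-0 node
    if not any(frozen):
        return [f"I_{n}"]                        # Rate-1 (Information) node
    if n <= max_rep and all(frozen[:-1]):
        return [f"Rep_{n}"]                      # only the last leaf carries information
    if n <= max_spc and not any(frozen[1:]):
        return [f"SPC_{n}"]                      # only the first leaf is frozen
    half = n // 2
    left, right = frozen[:half], frozen[half:]
    if all(left):                                # left child is Rate-0
        return ([f"G0R_{n}"] + unroll(right, max_rep, max_spc) + [f"C0R_{n}"])
    return ([f"F_{n}"] + unroll(left, max_rep, max_spc)
            + [f"G_{n}"] + unroll(right, max_rep, max_spc)
            + [f"C_{n}"])


# (8, 4) example with frozen leaves at positions 0, 1, 2 and 4.
ops = unroll([1, 1, 1, 0, 1, 0, 0, 0])
# Stage (1-based) at which the root G operation last reads the channel LLRs.
root_g_stage = ops.index("G_8") + 1
```

For this pattern the sketch yields five operation instances, and the channel LLRs are last read in the third stage, which is consistent with the requirement that the decoder input remain stable for 3 clock cycles.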
The code rate and frozen bit locations both affect the structure of the decoder tree and, in turn, the number of operations performed in a Fast-SSC decoder. However, the growth in logic usage is expected to remain on the order of $N \log_2 N$, where N is the code length.
B. Deeply Pipelined
In a deeply-pipelined architecture, a new frame is loaded into the decoder at every clock cycle. Therefore, a new estimated codeword is output at each clock cycle as each register is active at each rising edge of the clock (no enable signal required). In that architecture, at any point in time, there are as many frames being decoded as there are pipeline stages. This leads to a very high throughput at the cost of high memory requirements. Some pipeline stage paths do not contain any processing logic, only memory. They are added to ensure that the different messages remain synchronized. These added memories yield register chains, or SRAM blocks, as will be shown in Section VI-A.
The unrolled decoder of Fig. 3 can be transformed into a deeply-pipelined decoder by adding four registers. Two registers are needed to retain the channel LLRs, denoted $\alpha_c$ in the figure, during the 2nd and 3rd clock cycles. Similarly, two registers have to be added for the persistence of the hard-decision vector $\beta_1$ over the 4th and 5th clock cycles. Making these modifications results in the fully-unrolled deeply-pipelined decoder shown in Fig. 4. Fig. 5 shows another example of a fully-unrolled deeply-pipelined decoder, but for a (16, 14) polar code featuring more operations and node types compared to Fig. 4, where I denotes a Rate-1 node.
For this architecture, the amount of memory required is quadratic in the code length and, similarly to resource usage, is affected by the rate and frozen bit locations. As will be shown in Section VI, this growth in memory usage limits the proposed deeply-pipelined architecture to codes of moderate lengths, under 4096 bits, at least for implementations using the target FPGA.
Information throughput is defined as P f R bps, where P is the width of the output bus in bits, f is the execution frequency in Hz and R is the code rate. In a deeply-pipelined architecture, P is assumed to be equal to the code length N. The decoding latency depends, among other factors, on the constrained maximum width for all processing nodes, but is less than $N \log_2 N$. In our experiments, with the operations and optimizations described below, the decoding latency never exceeded N/2 clock cycles.
C. Partially Pipelined
In a deeply-pipelined architecture, a significant amount of memory is required for data persistence. That memory quickly increases with the code length N. Instead of loading a new frame into the decoder and estimating a new codeword at every cycle, we propose a compromise where the unrolled decoder can be partially pipelined to reduce the required memory. Let I be the initiation interval, where a new estimated codeword is output every I clock cycles. The case where I = 1 translates to a deeply-pipelined architecture.
Setting I > 1 leads to a significant reduction in the memory requirements. To illustrate the possible savings, consider the longest register chain in a polar decoder, used for the persistence of the channel LLRs $\alpha_c$. An initiation interval of I translates to an effective required register chain length of ⌈L/I⌉ instead of L, where L is the length of the register chain in the deeply-pipelined decoder. Using I = 2 leads to a ∼50% reduction in the amount of memory required for that section of the circuit. This reduction also applies to the other register chains present in the decoder.
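The register chain saving can be checked with a few lines of Python. The helper name `chain_length` is hypothetical, and the chain lengths used below are illustrative values rather than figures taken from the implementation results.

```python
def chain_length(L, I):
    """Effective register chain length, ceil(L / I), using integer arithmetic."""
    return -(-L // I)


def saving(L, I):
    """Fraction of the registers of a chain removed by using interval I."""
    return 1 - chain_length(L, I) / L


# Illustrative chain of length 6 under increasing initiation intervals.
lengths = [chain_length(6, I) for I in (1, 2, 3, 6)]
```

With an odd chain length the ceiling matters: a chain of length 7 still needs 4 registers at I = 2, so the saving is slightly below 50%.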
The unrolled decoder of Fig. 3 can be seen as a partially-pipelined decoder with an initiation interval I = 3. A partially-pipelined decoder with I = 2 can be obtained for a (16, 14) polar code by removing the dotted registers in Fig. 5, leading to the decoder shown in Fig. 6.
The initiation interval I can be increased further in order to reduce the memory requirements, but only up to a certain limit (corresponding to the basic scheme). We call that limit the maximum initiation interval I max , and its value depends on the decoder tree. By definition, the longest register chain in a fully-unrolled decoder is used to preserve the channel LLRs α c . Hence, the maximum initiation interval corresponds to the number of clock cycles required for the decoder to reach the last operation in the decoder tree that requires α c , G N , the operation calculated when going down the right edge linking the root node to its right child. Once that G N operation is completed, α c is no longer needed and can be overwritten.
As an example, consider the (8, 4) polar decoder illustrated in Fig. 3 . As soon as the switch to the right-hand side of the decoder tree occurs, i.e. when G 8 is traversed, the register containing the channel LLRs α c can be updated with the LLRs for the new frame without affecting the remaining operations for the current frame. Thus the maximum initiation interval for that decoder is 3.
Like the deeply-pipelined architecture, the resulting information throughput is P f R/I bps, where I is the initiation interval. Note that this new definition can also be used for the deeply-pipelined architecture. The decoding latency remains unchanged compared to the deeply-pipelined architecture.
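The throughput definition can be written down as a small helper. This is a sketch under stated assumptions: coded throughput is taken to be the information throughput divided by the code rate R, and the 250 MHz clock frequency used below is a hypothetical value chosen for round numbers, not a measured result.

```python
def information_throughput(P, f_hz, R, I=1):
    """Information throughput in bps: P * f * R / I."""
    return P * f_hz * R / I


def coded_throughput(P, f_hz, I=1):
    """Coded throughput in bps, assumed to be information throughput / R."""
    return P * f_hz / I


# (1024, 512) code, output bus width P = N, hypothetical 250 MHz clock.
info_deep = information_throughput(1024, 250e6, 0.5)          # I = 1
info_partial = information_throughput(1024, 250e6, 0.5, I=2)  # I = 2
coded_deep = coded_throughput(1024, 250e6)
```

At a fixed clock frequency, doubling the initiation interval halves the throughput; in practice the achievable frequency also changes with I.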
This architecture requires a controller not present in the deeply-pipelined architecture. That controller is a counter with maximum value of (I−1) which generates the I enable signals for the registers. An enable signal is asserted only when the counter reaches its value, otherwise it remains deasserted. Each register uses an enable signal corresponding to its location in the pipeline modulo I. As an example, let us consider the decoder of Fig. 6 , i.e. I is set to 2. In that example, two enable signals are created and a simple counter alternates between 0 and 1. The registers storing the channel LLRs α c are enabled when the counter is equal to 0 because their input resides on the even (0, 2 and 4) stages of the pipeline. On the other hand, the two registers holding the α 1 LLRs are enabled when the counter is equal to 1 because their inputs are on odd (1 and 3) stages. The other registers follow the same rule.
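The behavior of this controller can be modeled in software as follows. This is a minimal sketch, assuming that the counter starts at 0 and that a register is identified by the pipeline stage on which its input resides; the names `enable_signals` and `simulate` are hypothetical.

```python
def enable_signals(counter, interval):
    """The I enable signals: signal k is asserted only when counter == k."""
    return [counter == k for k in range(interval)]


def simulate(interval, stages, cycles):
    """Return, for each clock cycle, the pipeline stages whose registers load."""
    counter = 0
    trace = []
    for _ in range(cycles):
        enables = enable_signals(counter, interval)
        # A register uses the enable signal given by its stage modulo I.
        trace.append([s for s in range(stages) if enables[s % interval]])
        counter = (counter + 1) % interval   # counts 0 .. I-1, then wraps
    return trace


# I = 2 with five pipeline stages (0 to 4), observed over three clock cycles.
trace = simulate(interval=2, stages=5, cycles=3)
```

With I = 2, registers fed from even stages load when the counter is 0 and registers fed from odd stages load when it is 1, matching the example above.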
The required memory resources could be further reduced by performing the decoding operations in a combinational manner, i.e. by removing all the registers except the ones labeled α c and β c , as in [20] . However, the resulting reachable frequency is too low for the desired throughput level.
D. Operations and Processing Nodes
In order to keep the critical paths as short as possible, only a subset of the operations and processing nodes proposed in the original Fast-SSC algorithm are used. Furthermore, for some nodes, the maximum processing node length N v is constrained to smaller values than the ones used in [10] .
Notably, the Repetition and SPC nodes are limited to $N_v = 8$ and 4, respectively. Although not required in the original Fast-SSC algorithm, our architecture includes a Rate-1 processing node, implementing (4). That Rate-1 node is not constrained in length either. In order to reduce latency and resource usage, improvements were made to some operations and nodes. They are detailed below.
1) Combine_0R Operations:
A Combine_0R operation is a special case of a Combine operation (3) where the left-hand-side constituent code, $\beta_l$, is a Rate-0 node. Thus, the operation is equivalent to copying the estimated hard values from the right-hand-side constituent code over to the left-hand side.
In other words, a Combine_0R does not require any logic, it only consists of wires. All occurrences of that operation were thus merged with the following modules, saving a clock cycle without negatively impacting the maximum clock frequency.
2) Rate-1 or Information Nodes: With a fixed-point number representation, a Rate-1 (or Information) node amounts to copying the most significant bit of the input LLRs. Similarly to the Combine_0R operation, the Information node does not require any logic and is equivalent to wires.
Contrary to the Combine_0R operation though, we do not save a clock cycle by prepending the Information node to its consumer node. Instead, the register storing LLRs at the output of its producer is removed and the Information node is appended, along with the register used to store its hard decisions. Not only is the decoding latency reduced by a clock cycle, but a register storing LLR values is also removed.
3) Repetition Nodes: The output of a Repetition node is a single bit estimate. The systematic polar decoder of [10] copies that estimated information bit N v times to form the estimated bit vector, before storing it. In our implementation, we store only one bit that is later expanded just before a consumer requires it. This reduces the width of register chains carrying bit estimates generated by Repetition nodes, thus decreasing resource usage.
VI. Implementation and Results
A. Replacing Register Chains with SRAM Blocks
As the code length N grows, long register chains start to appear in the decoder, especially with a deeply-pipelined architecture. In order to reduce the number of registers required, register chains can be converted into SRAM blocks.
Consider the register chain of length 6 used for the persistence of the channel LLRs $\alpha_c$ in the fully-unrolled deeply-pipelined (16, 14) decoder shown in the top row of Fig. 5. That register chain can be replaced by an SRAM block with a depth of 6 along with a controller to generate the appropriate read and write addresses. Similar to a circular buffer, if the addresses are generated to increase every clock cycle, the write address is set to be one position ahead of the read address.
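The equivalence between a register chain and a memory used as a circular buffer can be illustrated with a small behavioral model. Note the simplifying assumption: instead of separate read and write addresses offset by one position, this sketch uses a single address with read-before-write semantics, which gives the same delay in a software model. Class and method names are hypothetical.

```python
class RegisterChain:
    """Shift register: the output is the input delayed by `length` cycles."""

    def __init__(self, length):
        self.regs = [0] * length

    def step(self, value):
        out = self.regs[-1]
        self.regs = [value] + self.regs[:-1]
        return out


class SramDelayLine:
    """Memory of the same depth used as a circular buffer (read, then write)."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.addr = 0

    def step(self, value):
        out = self.mem[self.addr]          # oldest entry, written depth cycles ago
        self.mem[self.addr] = value        # overwrite it with the new entry
        self.addr = (self.addr + 1) % len(self.mem)
        return out


chain, sram = RegisterChain(6), SramDelayLine(6)
samples = list(range(1, 21))
chain_out = [chain.step(s) for s in samples]
sram_out = [sram.step(s) for s in samples]
```

Both models output each sample six steps after it was presented, so the memory-based version needs only an address counter in place of six registers per bit.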
SRAM blocks can replace register chains in a partially-pipelined architecture as well. In both architectures, the SRAM block depth has to be equal to or greater than the register chain length. The same constraint applies to the width.
In scenarios where narrow SRAM blocks are not desirable, register chains can be merged to obtain a wider SRAM block even if the register chains do not have the same length. If the lengths of 2 register chains to be merged differ, the first registers in the longest chain are preserved, and only the remaining registers are merged with the other chain.
We have successfully used these techniques to convert register chains to SRAM blocks in cases where CAD tools did not support automatic conversion.
B. Methodology
In our experiments, decoders are built with sufficient memory to accommodate storing an extra frame at the input, and to preserve an estimated codeword at the output. As a result, the next frame can be loaded while a frame is being decoded. Similarly, an estimated codeword can be read while the next frame is being decoded. To facilitate comparison between the fully and partially-pipelined architectures, we define decoding latency to only include the time required for the decoder to decode a frame; loading channel LLRs and offloading estimated codewords are excluded from the calculations.
The quantization used was determined by running fixed-point simulations with bit-true models of the decoders. A different number of bits is used to store the channel LLRs compared to that of the other LLRs used in the decoder. All LLRs share the same number of fractional bits. We denote quantization as $Q_i.Q_c.Q_f$, where $Q_c$ is the total number of bits used to store a channel LLR, $Q_i$ is the total number of bits used to store internal LLRs and $Q_f$ is the number of fractional bits in both. Fig. 7 shows that, for a (1024, 512) polar code, using $Q_i.Q_c.Q_f$ equal to 5.4.0 results in a 0.1 dB performance degradation at a bit-error rate of $10^{-6}$. Thus we used that quantization for the hardware results.
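A fixed-point quantizer of this kind can be sketched as follows. The number format is an assumption: saturating two's complement with round-half-up, which is one common choice and is not stated in the text. The function name `quantize` is hypothetical.

```python
import math


def quantize(value, total_bits, frac_bits):
    """Saturating fixed-point quantization (two's complement assumed)."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    q = math.floor(value * scale + 0.5)       # round half up
    return max(lowest, min(highest, q)) / scale


Q_I, Q_C, Q_F = 5, 4, 0                       # the 5.4.0 quantization

channel = [quantize(v, Q_C, Q_F) for v in (9.7, -20.0, 2.4)]
# An internal value formed by adding two channel LLRs, stored on Q_I bits.
internal = quantize(channel[0] + channel[0], Q_I, Q_F)
```

Under this assumption, channel LLRs saturate to the range [-8, 7], and the sum of any two of them lies in [-16, 14], which fits in 5 bits without further saturation.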
All decoders were implemented on an Altera Stratix IV EP4SGX530KH40C2 FPGA to facilitate comparison against existing decoders. Better results are to be expected if more recent FPGAs were to be targeted.
C. Effect of the Initiation Interval
In this section, we explore the effect of the initiation interval on the implementation of the fully-unrolled architecture on an FPGA. The decoders are built for the same (1024, 512) polar code used in [15], although many improvements have been made since the submission of that work (see Section V-D). Regardless of the initiation interval, the decoders use 5.4.0 quantization and have a decoding latency of 364 clock cycles.

Table I shows the results for various initiation intervals. Besides the effect on coded throughput, increasing the initiation interval causes a significant reduction in the FPGA resources required. While the throughput is approximately cut in half, using I = 2 reduces the number of required lookup tables (LUTs), registers and RAM bits by 9%, 12% and 88%, respectively, compared to the deeply-pipelined decoder. With an information throughput over 25 Gbps, using an initiation interval as small as 4 removes the need for any SRAM blocks, while the usage of LUTs and registers decreases by 20% and 23%, respectively. Finally, if an information throughput of 750 Mbps is sufficient for the application, I = 167 will result in savings of 32%, 77% and 100% in terms of LUTs, registers and RAM bits, compared to the deeply-pipelined architecture (I = 1).
As expected, increasing the initiation interval I offers a diminishing return as it gets closer to the maximum of 167. Table I also shows that increasing I first reduces the maximum execution frequency but, eventually, the frequency increases again, almost back to the value it had with I = 1. Inspection of the critical paths reveals that this frequency increase is a result of shorter wire delays. As the number of LUTs and registers decreases with an increasing I, at some point, it becomes easier to use resources that are close to each other.
D. Comparison with State-of-the-Art Decoders
In this section, we compare our work with that of the fastest state-of-the-art polar decoders of [10], [19], [20]. In [20], results are provided for both ASIC and FPGA implementations. Table II shows that regardless of the implementation technology, our family of architectures can deliver an order of magnitude or two greater coded throughput. The decoding latency is similar to that of the Fast-SSC decoder of [10], 28 to 33 times lower than that of the iterative BP decoder of [19] and 3.75 to 4.5 times greater than that of the ASIC implementation of the SC decoder of [20].

Table III compares our proposed fully-unrolled partially-pipelined architecture, with the maximum initiation interval $I_{max} = 167$, against the fastest FPGA implementations of [10], [20]. The work of [10] is marked with an asterisk (*) to indicate that the decoder has been resynthesized to only accommodate a polar code of length N = 1024 in order to obtain a fair comparison. The work of [20] is marked with a dagger (†) as these results are for a different FPGA than the one we use. Note however that the Xilinx Virtex-6 XC6VLX550T used in [20] is also implemented in 40 nm CMOS technology and features 6-input LUTs, like the Altera Stratix IV FPGA.
It can be seen that a decoder built with one of our proposed architectures can achieve approximately 3 times the throughput of both [10] and [20] with a slightly lower latency. In terms of resources, compared to the decoder of [10] , our decoder requires almost 5 and 9 times the number of LUTs and registers, respectively. Note however that we do not require any RAM for the proposed implementation, while the other decoder uses 36 kbits. Compared to the SC decoder of [20] , our decoder requires less than half of the LUTs, but needs more than 7 times the number of registers. It should be noted that the decoder of [20] does not contain the necessary memory to load the next frame while a frame is being decoded, nor the necessary memory to offload the previously estimated codeword as decoding is taking place.
E. Effect of the Code Length and Rate
Results for other polar codes are presented in this section where we show the effect of the code length and rate on performance and resource usage. Tables IV and V show the effect of the code length on resource usage, coded throughput, and decoding latency for polar codes of short to moderate lengths. Table IV contains results for the fully-unrolled deeply-pipelined architecture (I = 1) and the code rate R is fixed to 1/2 for all polar codes. Table V contains results for the fully-unrolled partially-pipelined architecture where the maximum initiation interval ($I_{max}$) is used and the code rate R is fixed to 5/6.
As shown in Table IV, with a deeply-pipelined architecture, logic usage grows almost as $N \log_2 N$, whereas memory requirements are closer to being quadratic in the code length N. The decoders for the three longest codes of the table are capable of a coded throughput greater than 100 Gbps. Notably, the N = 2048 code reaches 400 Gbps.

Table V shows that for a partially-pipelined decoder where the initiation interval is set to $I_{max}$, it is possible to fit a code of length N = 4096 on the Stratix IV GX 530. The amount of RAM required is not shown in the table as none of the decoders use any of the available RAM. Also note that no LUTs are used as memory. In other words, for pipelined decoders using $I_{max}$ as the initiation interval, registers are the only memory resources needed. Table V also shows that these maximum initiation intervals lead to a much more modest throughput. In the case of the (4096, 3413) polar code, we can see a major latency increase compared to the shorter codes. This latency increase can be explained by the drop in maximum clock frequency, which in turn can be explained by the fact that 94% of the total available logic resources in that FPGA were required to implement this decoder.
At some point, a fully-unrolled architecture is no longer advantageous over a more compact architecture like the one of [10]. With the Stratix IV GX 530 as an FPGA target, a fully-unrolled decoder for a polar code of length N = 4096 is too complex to provide good throughput and latency. Even with the maximum initiation interval, 94% of the logic resources are required for a coded throughput under 1 Gbps. By comparison, a decoder built with the architecture of [10] would result in a coded throughput in the vicinity of 1 Gbps at 110 MHz. Targeting a more recent FPGA may lead to different results and conclusions.
The effect of using different code rates for a polar code of length N = 1024 is shown in Table VI. Results illustrate a phenomenon that was also observed in [10], [23]: with the Fast-SSC decoding algorithm, increasing the code rate leads to a smaller decoder tree. Hence, for a fixed code length, as the code rate increases, latency decreases and fewer resources are required. Coded throughput remains approximately the same, as the critical paths are similar in length among all polar codes.
Fig. 8 gives a graphical overview of the maximum resource usage requirements for a given achievable coded throughput. The fully-unrolled partially- and deeply-pipelined decoders were taken from Tables I and IV, respectively. The resynthesized polar decoder of [10] is also included for reference. The red asterisks show that with a deeply-pipelined decoder architecture (initiation interval I = 1), the coded throughput increases at a higher rate than the maximum resource usage as the code length N increases. The blue diamonds illustrate the effect of various initiation intervals for the same (1024, 512) polar code. We see that decreasing I leads to increasingly attractive implementation alternatives, as the gains in throughput are obtained at the expense of a smaller increase in the maximum resource usage.
When it can be afforded and the FPGA input data rate is sufficient, an extra 2.9% in maximum resource usage allows doubling the throughput by going from I = 2 to I = 1.
Fig. 8: Overview of the maximum resource usage (%) and coded throughput (Gbps) for some partially-pipelined (Table I) and deeply-pipelined (Table IV) polar decoders. The resynthesized polar decoder of [10] is also included for reference.
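The relationship between the initiation interval and throughput discussed above can be sketched as follows. The sketch assumes that a fully-unrolled decoder accepts one new N-bit frame every I clock cycles, so that the coded throughput is N f / I. The function names and the 200 MHz clock frequency are illustrative assumptions, not measured values from the tables.

```python
def coded_throughput_bps(n, f_hz, initiation_interval=1):
    """Coded throughput in bit/s, assuming one N-bit frame starts every I cycles."""
    if initiation_interval < 1:
        raise ValueError("initiation interval must be at least 1")
    return n * f_hz / initiation_interval


def information_throughput_bps(n, k, f_hz, initiation_interval=1):
    """Information throughput: coded throughput scaled by the code rate k/n."""
    return coded_throughput_bps(n, f_hz, initiation_interval) * k / n


if __name__ == "__main__":
    F_HZ = 200e6  # hypothetical clock frequency, chosen only for illustration
    for interval in (1, 2, 4):
        tp = coded_throughput_bps(1024, F_HZ, interval)
        print(f"I = {interval}: coded throughput = {tp / 1e9:.1f} Gbps")
```

With the clock frequency held constant, halving I doubles the coded throughput, which is the trade-off the blue diamonds of Fig. 8 explore. In practice the achievable clock frequency also varies with I and N, so this expression should be read as a first-order estimate.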
F. I/O Bounded Decoding
The family of architectures that we propose requires tremendous throughput at the input of the decoder, especially with a deeply-pipelined architecture. For example, if a quantization of Q_c = 4 bits is used for channel LLRs, then for every estimated bit, 4 times as many bits have to be loaded into the decoder. In other words, the total I/O data rate (input plus output) is 5 times that of the output alone. This can be a significant challenge.
If four fifths of the 48 high-speed transceivers featured on a Stratix IV GX are used to load the channel LLRs and the remainder to output the estimated codewords, the maximum theoretical input data rate achievable is 323 Gbps. On the more recent Stratix V GX, using 53 of the 66 transceivers at their peak data rate of 14.1 Gbps adds up to 747 Gbps available for input. However, the fully-unrolled deeply-pipelined (1024, 512) and (2048, 1024) polar decoders discussed above require an input data rate that is over 1 Tbps.
If only for that reason, partially-pipelined architectures are certainly more attractive, at least using current FPGA technology. Notice, however, that data rates in the vicinity of 1 Tbps are expected to be reachable in the upcoming Xilinx UltraScale [24] and Altera Generation 10 [25] families of FPGAs.
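The I/O budget argument of this section can be reproduced with a few lines of arithmetic. In the sketch below, the function names are hypothetical. The Stratix IV GX configuration of 38 input transceivers at 8.5 Gbps each is an assumption inferred from the quoted 323 Gbps total (four fifths of 48, rounded down); it is not stated explicitly in the text. The coded throughput values of 250 and 400 Gbps are the approximate figures quoted for the deeply-pipelined (1024, 512) and (2048, 1024) decoders.

```python
def required_input_rate_gbps(coded_throughput_gbps, q_c=4):
    """Input data rate needed when each estimated bit requires a Q_c-bit channel LLR."""
    return q_c * coded_throughput_gbps


def available_input_rate_gbps(n_input_transceivers, lane_rate_gbps):
    """Aggregate theoretical input rate over the transceivers reserved for input."""
    return n_input_transceivers * lane_rate_gbps


def max_coded_throughput_gbps(n_input_transceivers, lane_rate_gbps, q_c=4):
    """Largest coded throughput that the input transceivers could feed."""
    return available_input_rate_gbps(n_input_transceivers, lane_rate_gbps) / q_c


def is_io_feasible(coded_throughput_gbps, n_input_transceivers, lane_rate_gbps, q_c=4):
    """True if the available input rate covers the decoder's required input rate."""
    needed = required_input_rate_gbps(coded_throughput_gbps, q_c)
    return needed <= available_input_rate_gbps(n_input_transceivers, lane_rate_gbps)


if __name__ == "__main__":
    # (input transceivers, per-lane rate in Gbps); Stratix IV values are assumed.
    devices = {"Stratix IV GX": (38, 8.5), "Stratix V GX": (53, 14.1)}
    for name, (lanes, rate) in devices.items():
        for tp in (250.0, 400.0):
            ok = is_io_feasible(tp, lanes, rate)
            print(f"{name}, {tp:.0f} Gbps coded: feasible = {ok}")
```

Under these assumptions, neither device can sustain the input rate required by the deeply-pipelined decoders, which is the reason partially-pipelined architectures with a larger initiation interval are the better match for current transceiver budgets.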
VII. Conclusion
In this paper, we presented a new family of architectures for fully-unrolled polar decoders. With an initiation interval that can be adjusted, these architectures make it possible to find a trade-off between resource usage and achievable throughput without affecting decoding latency. We showed that a fully-unrolled deeply-pipelined decoder implemented on an FPGA can achieve a throughput two orders of magnitude greater than state-of-the-art polar decoders while maintaining a good latency. With an achievable coded throughput in excess of 400 Gbps, we believe that these architectures make polar codes a promising candidate for future 5G communications.
