Abstract-Implementing forward error correction (FEC) for modern long-haul fiber-optic communication systems is a challenge, since these high-throughput systems require FEC circuits that can combine high coding gains and energy-efficient operation. We present very large scale integration (VLSI) decoder architectures for product-like codes for systems with strict throughput and power dissipation requirements. To reduce energy dissipation, our architectures are designed to minimize data transfers in and out of memory blocks, and to use parallel noniterative component decoders. Using a mature 28-nm VLSI process technology node, we showcase different product and staircase decoder implementations that have the capacity to exceed 1-Tb/s information throughputs with energy efficiencies of around 2 pJ/b.
I. INTRODUCTION

FORWARD error correction (FEC) is indispensable to meet the increasing capacity demands in fiber-optic communication systems [1]. With throughput demands approaching 1 Tb/s, power and energy dissipation aspects are becoming critical to FEC implementations, especially when error-correction performance in terms of coding gain needs to be high. In a recent roadmap [2], the strict power constraints imposed by future systems for 400 G and beyond are expected to be a major challenge in the implementation of FEC circuitry. It is well known that the highest coding gain is obtained using soft-decision (SD) decoding in which all channel information is harnessed. SD decoding, however, does not as naturally lend itself to high-throughput very-large-scale integration (VLSI) implementations as hard-decision (HD) decoding does [3]. In this paper, we address the challenge of implementing VLSI systems for very-high-throughput decoders and, as a consequence, we choose to focus on HD decoding.
For fundamental reasons HD decoding is somewhat limited in terms of coding gain in comparison to SD decoding, but there are HD schemes that can deliver relatively high coding gains [4] . Consider, e.g., product codes [5] for which two component codes are combined into one longer block-length code, which has higher error-correcting capability and which is decoded iteratively: In each iteration, decoding is successively done for each of the two shorter component codes, and as we increase the number of iterations we improve coding gain. If the coding gain offered by product codes is not sufficient, we can instead consider staircase codes [6] , which are a spatially-coupled generalization of product codes. By virtue of the iterative windowed decoding of the connected component codes, the coding gain can be substantially increased over product codes, at a cost in VLSI implementation complexity.
Since throughput aspects are central in this work, it is worthwhile to note that the structure of product and staircase codes lends itself to block-parallel high-throughput decoding. Because of this property, staircase codes have recently received significant attention within the fiber-optic communication research community as a promising alternative for power-constrained applications: Prior art includes staircase-code optimization, both as stand-alone codes [7] , [8] and in concatenated schemes [9] , [10] , but to the best of our knowledge, no VLSI implementations have been published in the open literature, save for our own previous work [11] .
At the core of our product and staircase decoders we find noniterative component decoders, which realize bounded-distance decoding of shortened binary Bose-Chaudhuri-Hocquenghem (BCH) codes [12] with error-correction capabilities in the range of 2-4. These component decoders are fully parallel and strictly feed-forward, which means that internal state registers can be avoided. As will be shown, this is key to high throughput and low latency. We will, e.g., showcase VLSI implementations of a number of staircase codes that are relevant for fiber-optic communication systems [7]. The implementations are capable of achieving in excess of 1-Tb/s information throughput, which is significantly higher than those of currently published state-of-the-art FEC implementations [13]-[17]. While high throughput requirements typically make power dissipation a serious design concern, this is not the case for our decoders, which dissipate only 1.3-2.4 W (or around 2 pJ/bit), depending on configuration and assumed input bit-error rate.
II. BCH, PRODUCT AND STAIRCASE CODES
Before we describe the VLSI architectures and implementations of product and staircase decoders, we will briefly introduce concepts pertaining to the FEC codes used: As component code, we use BCH(n, k, t), where n is the block length, k is the number of useful information bits, and t is the number of bit errors that the code can correct. Given that GF($2^m$) is the Galois field in which computations are performed, the BCH code parameters are related as $n = 2^m - 1$ and $n - k = mt$. In addition to the definitions above, the code rate, which is the proportion of a block that contains useful information, is defined as $R = k/n$, while the code overhead is defined as $\mathrm{OH} = n/k - 1 = 1/R - 1$.
0733-8724 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
BCH codes can be shortened to allow for more flexibility in terms of code rate, which makes it possible to trade coding gain against information throughput. Shortening means that a number of information-bit positions are fixed to zero and never transmitted, with the result that the code overhead is increased relative to the initial non-shortened BCH code. From here on, we denote the block length and the number of information bits in the shortened codes as $n_s = n - s$ and $k_s = k - s$, respectively, where s is the number of removed bits.
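The parameter relations above are easy to check numerically. The following is a minimal sketch, assuming the helper names `bch_params`, `shorten` and `rate_and_overhead` (all hypothetical, introduced here for illustration only). It derives the BCH(324,297,3) and BCH(432,396,4) component codes, which are used for the staircase decoders, from parent codes over GF($2^9$); the shortening amounts s = 187 and s = 79 follow from $n_s = n - s$.

```python
def bch_params(m: int, t: int) -> tuple[int, int]:
    """Return (n, k) of the parent binary BCH code with n = 2^m - 1, n - k = m*t."""
    n = 2**m - 1
    k = n - m * t
    return n, k


def shorten(n: int, k: int, s: int) -> tuple[int, int]:
    """Remove s information-bit positions: (n_s, k_s) = (n - s, k - s)."""
    return n - s, k - s


def rate_and_overhead(n: int, k: int) -> tuple[float, float]:
    """Code rate R = k/n and overhead OH = 1/R - 1 of a single component code."""
    rate = k / n
    return rate, 1.0 / rate - 1.0


# Parent codes over GF(2^9), then shortening to the staircase component codes.
n3, k3 = bch_params(m=9, t=3)          # (511, 484)
n4, k4 = bch_params(m=9, t=4)          # (511, 475)
code_t3 = shorten(n3, k3, s=511 - 324)  # (324, 297)
code_t4 = shorten(n4, k4, s=511 - 432)  # (432, 396)
```

Note that the number of parity bits, n - k = mt, is unchanged by shortening, which is why the overhead grows as s increases.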
In the general case, the product code is constructed out of BCH($n_{s1}$, $k_{s1}$, $t_1$) and BCH($n_{s2}$, $k_{s2}$, $t_2$). The code rate of the product code is defined as $R = \frac{k_{s1} k_{s2}}{n_{s1} n_{s2}}$, and the resulting minimum distance is the product of the component-code minimum distances, $d_1 \cdot d_2$ with $d_i \geq 2t_i + 1$. In this paper, each product code uses two shortened BCH codes that are identical. We consider decoder implementations with OH = 20-33%, which is achieved by varying the shortening, s. Table I shows the component-code parameters used for our product decoders. The 33%-OH product codes are based on shortened 255-bit BCH codes, whereas the 25% and 20% codes are based on shortened 511-bit BCH codes. Note that the t = 2 code, due to its excessively high error floor (see Section VI-A), is only considered as reference for the 33% case.
For the staircase decoders, we use BCH(324,297,3) and BCH(432,396,4) component codes. In contrast to the less complex product decoders, we will not vary the code overhead in our explorations of staircase decoders because of excessive design and simulation run times. With each block in the staircase representing an n/2-by-n/2 matrix, where n is the block length of the component code, we can define the staircase code rate as $R = 1 - \frac{n - k}{n/2}$ [6]. Here, the shortened BCH codes presented above give an overall staircase code rate R = 0.83 (e.g., 1 - 27/162 for BCH(324,297,3)), which corresponds to an overhead of 20%.
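As a quick numerical check of the rate definitions, the sketch below evaluates the staircase rate for both component codes and gives the product-code rate for two identical components. The function names are hypothetical, and the product-rate formula is the standard $(k_s/n_s)^2$ expression, stated here as an assumption rather than quoted from Table I.

```python
def staircase_rate(n: int, k: int) -> float:
    """Staircase code rate for n/2-by-n/2 blocks: R = 1 - (n - k) / (n / 2)."""
    return 1.0 - (n - k) / (n / 2)


def product_rate(n_s: int, k_s: int) -> float:
    """Rate of a product code built from two identical (n_s, k_s) component codes."""
    return (k_s / n_s) ** 2


def overhead(rate: float) -> float:
    """Code overhead OH = 1/R - 1."""
    return 1.0 / rate - 1.0


for n, k in [(324, 297), (432, 396)]:
    r = staircase_rate(n, k)
    print(f"BCH({n},{k}): R = {r:.4f}, OH = {100 * overhead(r):.1f}%")
```

Both component codes give the same staircase rate of 5/6, since 27/162 = 36/216 = 1/6, which is consistent with the 20% overhead quoted above.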
III. COMPONENT DECODERS
Since the component decoders are central to efficient VLSI implementation of product-like codes, we will first introduce the BCH component decoders, with emphasis on the key-equation solver. Variants of these decoders have been used in our previous work: We described 1- and 2-error-correcting decoders in [18], while more advanced 3- and 4-error-correcting implementations were used (but not described) in [11].
A typical BCH decoder employs syndrome calculation, the Berlekamp-Massey (BM) algorithm (to find the error-location polynomial), and Chien search (to find the errors). Different optimizations of the BM algorithm, such as the simplified inverse-free BM (SiBM) algorithm [19], can improve the implementation. A fundamental problem, however, is that conventional BM implementations are iterative and require at least t clock cycles to complete their operation.
Aiming for high throughput and low latency, we have developed fully parallel BCH component decoders based on the direct-solution Peterson algorithm [20]. Fig. 1 shows a product-decoder architecture in which the BCH component decoders, comprising SYN, KES (see Section III-A) and CHIEN units, are integrated. (The memory block can just as well belong to a staircase decoder.) The component decoders are non-iterative and strictly feed-forward, which simplifies state-machine design and allows for synchronous clock gating of a component decoder's pipeline. The decoders are implemented using bit-parallel polynomial-basis GF($2^m$) multipliers [21]. Error-free component data are detected after the first stage in the decoder pipeline, i.e., the syndrome calculation stage. To save power, we use several techniques: If all syndromes are zero, the pipeline is gated sequentially and a flag is set to allow for memory-block gating. Each column and row in the product/staircase code uses separate syndrome calculation units to reduce logic signal switching (see Section IV), while the KES and Chien search units are shared between a row/column pair.
The component codes are shortened to yield the desired overall code overhead and, since all the removed bits are fixed to zero, the corresponding hardware in the fully parallel syndrome calculation and Chien search can be removed. The number of errors found by the Chien search is compared, in the COMP unit, with the degree of the error-locator polynomial, and the result is discarded if the two are not equal. Component-code miscorrections may induce errors in the non-existing part of the code (which was removed during shortening); this comparison thus reduces the probability of miscorrections, improving overall bit-error rate (BER) performance.
A. Key-Equation Solver (KES)
The key-equation solver (KES) unit calculates the error-locator polynomial from the syndromes obtained in the SYN unit. The error-locator polynomial can be expressed as
$\Lambda(x) = \sum_{n=0}^{t} \Lambda_n x^n$,
where $\Lambda_n$ are the error-locator polynomial coefficients calculated in the KES unit. For the case of t = 2, the polynomial is given as [18] $\Lambda(x) = (S_3 + S \ldots$
where $S_i$ are the syndromes. For the cases of t = 3, 4, we modify the equations given in [22] to remove inversions [23]. The resulting polynomial coefficients for t = 3 are
If all the resulting coefficients are 0, the equation is incorrect, and at most one error has occurred. In this case, the equation is replaced with
For t = 4, the resulting polynomial coefficients are
In this case, if the coefficients are all 0, two possibilities need to be addressed: If S 1 = 0, then no errors have occurred. If not, the equation is replaced with Eq. (2).
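To make the SYN, KES, CHIEN and COMP data flow concrete, the following is a minimal software sketch for t = 2. It is an illustration under stated assumptions, not the VHDL implementation: it uses the small field GF($2^4$) with primitive polynomial $x^4 + x + 1$, and the textbook inversion-free Peterson solution $\Lambda(x) = S_1 + S_1^2 x + (S_1^3 + S_3) x^2$, whose roots are the inverse error locators. All function names are hypothetical. Because the code is linear, the all-zero codeword plus an error pattern is used as the received word.

```python
M = 4                      # field GF(2^4); assumption chosen to keep the example small
N = (1 << M) - 1           # parent block length 15
PRIM_POLY = 0b10011        # x^4 + x + 1

# Exponent/logarithm tables for multiplication in GF(2^M).
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
_x = 1
for _i in range(N):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & (1 << M):
        _x ^= PRIM_POLY
for _i in range(N, 2 * N):
    EXP[_i] = EXP[_i - N]


def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def syndromes(received: list[int]) -> tuple[int, int]:
    """SYN: S1 = r(alpha), S3 = r(alpha^3) for a (possibly shortened) word."""
    s1 = s3 = 0
    for j, bit in enumerate(received):
        if bit:
            s1 ^= EXP[j % N]
            s3 ^= EXP[(3 * j) % N]
    return s1, s3


def kes_t2(s1: int, s3: int) -> list[int]:
    """KES: inversion-free Peterson coefficients [Lambda_0, Lambda_1, Lambda_2]."""
    s1_sq = gf_mul(s1, s1)
    return [s1, s1_sq, gf_mul(s1_sq, s1) ^ s3]


def chien(lam: list[int], length: int) -> list[int]:
    """CHIEN: positions j < length where Lambda(alpha^-j) = 0."""
    roots = []
    for j in range(length):
        x_inv = EXP[(N - j) % N]
        acc, x_pow = 0, 1
        for coeff in lam:
            acc ^= gf_mul(coeff, x_pow)
            x_pow = gf_mul(x_pow, x_inv)
        if acc == 0:
            roots.append(j)
    return roots


def decode(received: list[int]) -> tuple[list[int], bool]:
    """Feed-forward decode; returns (word, success). COMP rejects mismatches."""
    s1, s3 = syndromes(received)
    if s1 == 0 and s3 == 0:
        return list(received), True            # error-free: back-end can be gated
    lam = kes_t2(s1, s3)
    degree = max(i for i, c in enumerate(lam) if c)
    roots = chien(lam, len(received))
    if len(roots) != degree:                   # COMP: discard the result
        return list(received), False
    corrected = list(received)
    for j in roots:
        corrected[j] ^= 1
    return corrected, True
```

For a shortened word, `chien` only searches the transmitted positions, so a root that falls in the removed part of the code makes the root count smaller than the polynomial degree and the result is discarded, which mirrors the COMP check described in Section III.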
IV. DECODER ARCHITECTURE OVERVIEW
After initially moving the received bits into the memory block closest to the channel, product and staircase codes are decoded using an iterative scheme. For product decoders, an iteration consists of two phases: With reference to Fig. 1 , first all rows are decoded and errors are corrected, after which all columns are decoded and errors are corrected. This procedure is straightforwardly repeated for a given number of iterations.
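The row/column schedule can be sketched in a few lines. The example below is a simplified model, not the hardware architecture: it assumes the all-zero codeword was transmitted and replaces the BCH component decoder with an idealized, miscorrection-free bounded-distance decoder (`bdd_all_zero`, a hypothetical helper) that clears a row or column whenever it holds at most t errors.

```python
def bdd_all_zero(word: list[int], t: int) -> list[int]:
    """Idealized bounded-distance decoding of the all-zero codeword.

    Corrects the word if it contains at most t errors, otherwise leaves it
    unchanged (real decoders may also miscorrect; that is not modeled here).
    """
    return [0] * len(word) if sum(word) <= t else list(word)


def decode_product(block: list[list[int]], t: int, iterations: int) -> list[list[int]]:
    """Iterate row decoding followed by column decoding on an error pattern."""
    block = [list(row) for row in block]
    n_rows, n_cols = len(block), len(block[0])
    for _ in range(iterations):
        # Phase 1: all rows are decoded (in parallel in hardware).
        block = [bdd_all_zero(row, t) for row in block]
        # Phase 2: all columns are decoded.
        cols = [bdd_all_zero([block[r][c] for r in range(n_rows)], t)
                for c in range(n_cols)]
        block = [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]
        # All syndromes zero: the codeword is believed correct, stop early.
        if not any(any(row) for row in block):
            break
    return block
```

With t = 1, two errors in the same row cannot be corrected in the row phase but are removed in the column phase, whereas a 2-by-2 square of errors is a stall pattern that no number of iterations resolves.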
In staircase decoding, the iteration scheme is more complex in that each component code covers two spatially coupled blocks, as shown in the simplified decoder configuration in Fig. 2, which operates on a window of 3 blocks. Additional component decoders can be chained over the decoder window, as shown for the more useful 5-block decoder configuration in Fig. 3. Similar to the product decoding above, the staircase algorithm [6] entails decoding of rows followed by columns: After channel data have been moved into memory block 1, parallel decoding of all rows and columns of that block is followed by decoding of all rows and columns in the next memory block. Successively, all memory blocks get decoded and once this is completed, a new decoding phase commences in memory block 1. This process is repeated for a given number of iterations and once all iterations are completed, the data of memory block 1 are moved to memory block 2 and new channel data are moved into memory block 1, thus shifting the decoding window. We now perform syndrome recomputation in order to avoid separate storage and multiplexers to move syndromes with the data block. (Overall, syndrome computation contributes at most 10% of total power dissipation in all considered configurations.)
Since there are no inter-column dependencies nor inter-row dependencies during each half iteration, our staircase decoder architecture first iterates over all rows in the entire window, then over all columns in the window. This reduces iteration time and increases throughput compared to decoders that operate on memory blocks sequentially in the decoder window. No significant difference in error-correction performance of staircase codes was found when comparing MATLAB reference implementations of the two schedules.
As switching power dissipation is proportional to the signal switching, we use replicated syndrome computation units for all decoders. By assigning one syndrome computation unit to rows and another unit to columns, fewer signals switch, since at most t bits are flipped in each row/column per iteration. The majority of the XOR gates in the syndrome computation are thus kept static, which reduces switching power dissipation. Consider the case of correcting one bit: One flipped bit causes at most $\log_2(n)$ toggles in each of the XOR trees for syndrome calculation. The KES and Chien search (the decoder back-end) units are shared between rows and columns.
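The logarithmic bound on toggles can be illustrated with a small model of one balanced XOR tree. This is a sketch under simplifying assumptions: zero-delay gates (no glitches), a single tree over all n inputs (a real syndrome unit contains many such trees over subsets of the inputs), and hypothetical function names.

```python
def xor_tree_levels(bits: list[int]) -> list[list[int]]:
    """Evaluate a balanced XOR tree; return the gate outputs of every level.

    An odd leftover signal is passed to the next level without a gate.
    """
    levels = []
    current = list(bits)
    while len(current) > 1:
        gates = [current[i] ^ current[i + 1] for i in range(0, len(current) - 1, 2)]
        levels.append(gates)
        current = gates + ([current[-1]] if len(current) % 2 else [])
    return levels


def toggled_gates(bits: list[int], flip: int) -> int:
    """Count XOR gates whose output changes when input `flip` is inverted."""
    before = xor_tree_levels(bits)
    flipped = list(bits)
    flipped[flip] ^= 1
    after = xor_tree_levels(flipped)
    return sum(a != b
               for lvl_a, lvl_b in zip(before, after)
               for a, b in zip(lvl_a, lvl_b))
```

Only the gates on the path from the flipped input to the tree root change state, so a 324-input tree with 323 gates sees at most 9 gate toggles per corrected bit, while the remaining gates stay static.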
Row and column syndromes indicate presence of believed errors in corresponding rows and columns. In the product decoder, the codeword is believed to be correct once all syndromes are zero. If this is the case, the memory block is clock gated to save power. The state-based clock gating is somewhat more complex in the staircase memory array: Besides gating each memory block if no errors are found within that block during component-code decoding, the whole staircase array is gated during component-code decoding and only clocked on write-back or window shifting.
Since each full row or column is read out and decoded fully block parallel in these decoder architectures, we obtain decoders with very high throughput and low processing latency. Another advantage of using component decoders without any feedback loops is that the control state-machine implementation is straightforward.
The throughput (given a fixed clock rate) of a fully parallel decoder scales quadratically with component-code length, while the area scales slightly faster due to the arithmetic units. A fully parallel design is thus primarily suitable for shorter block-length, higher-OH codes. Although we only consider fully parallel implementations in this work and, thus, focus on higher overhead codes, it should be noted that the strictly feed-forward decoder pipeline structure is beneficial for decoder sharing between component codes, which is a must for lower overhead codes. Since each step is non-iterative, no pipeline stalling is required and, thus, decoder back-end sharing between L rows (or columns) adds L − 1 cycles per half iteration. However, while the number of component decoder back-ends required is significantly reduced, a more complicated control state machine and multiple-cycle write-back to the decoder memory are required, which would increase power dissipation.
V. EVALUATION METHODOLOGY
The decoders were implemented using a hardware description language (VHDL) and a 28-nm fully depleted silicon-on-insulator (FD-SOI) low-leakage cell library, at the slow process corner, a supply voltage of 0.9 V and an operating temperature of 125 °C. Cadence Genus [24] was used to synthesize the VHDL implementations using physical wire models. The target clock rate varies somewhat with decoder configuration, from 500 MHz to 600 MHz (see Section VI). The synthesized gate netlists of the decoders were logically verified using simulation in Cadence Incisive with a VHDL testbench generating encoded data transmitted over a binary-symmetric channel.
Post-FEC BER was estimated using the implemented VHDL decoders, simulated using a binary-symmetric channel implemented in VHDL in order to estimate error-correction performance. The obtained post-FEC BER was extrapolated down to $10^{-15}$ using berfit in MATLAB. We want to stress that the obtained net coding gain (NCG) values should be seen as estimates, since excessive logic simulation run times limit the accuracy of the obtained low-BER extrapolations, especially for complex staircase decoders. However, the resulting estimates are consistent with earlier published results [7], [8], considering algorithmic differences.
A. Decoder Power Dissipation
Energy in VLSI circuits based on CMOS technology is expended when logic signals are switching state. In FEC decoders, power dissipation will therefore be a function of the probability of an error being corrected by each component decoder, giving rise to a significant dependency on input (pre-FEC) BER. In addition, for staircase decoders, component decoder power dissipation will depend on the spatial location in the decoding window. This dependence on temporal and spatial aspects of logic signal switching makes algorithm complexity a poor metric for power dissipation in VLSI circuits and, thus, rigorous netlist-based simulations are necessary.
In our evaluation, we used delay-annotated netlists which, e.g., enable us to capture power-dissipating signal glitches that are caused by delay imbalances. The netlists were simulated using a VHDL testbench that generates uniformly-distributed encoded data and can emulate varying pre-FEC BERs, ensuring realistic input-signal statistics. The resulting internal switching activity was back-annotated to the netlist and power dissipation was estimated using Cadence Genus, at the typical process corner at a temperature of 25 °C and using physical wire models. Since the topology of pre-FEC and post-FEC buffers would be dependent on the overall DSP architecture, our decoders do not have any separate I/O buffers but they use block-parallel inputs and a separate output-block register. Depending on architectural requirements, additional multiplexing/demultiplexing buffers, which will increase power dissipation, may be required.
By virtue of the non-iterative component decoders, the proposed architectures become intrinsically very fast. This, in turn, enables us to use low-leakage cell libraries for which static power dissipation is next to negligible. For example, leakage power accounts for less than 1.2% of the overall power dissipation in a 7-block staircase decoder, which uses t = 4 component decoders and performs 6 iterations at a pre-FEC BER of $10^{-2}$. With this in mind, we can focus our implementation and analysis on switching power dissipation,
$P_{sw} = f \, C_\alpha \, V_{DD}^2$, (6)
where f is the clock rate, $C_\alpha$ is the switched capacitance, and $V_{DD}$ is the supply voltage. During normal operation of a general FEC circuit, the actual correction of an error is, with respect to the processing of an information block, relatively rarely performed and this has repercussions on $P_{sw}$. In the front-end region of a decoder circuit, FR, the signal switching activity α tends to be high as blocks of erroneous data are moved inside the VLSI circuit and decoding commences. This makes the switched capacitance $C_\alpha = \sum_{i \in \mathrm{FR}} C_i \alpha_i$ high for the circuit nodes of FR. However, the back-end of the decoder is only active if errors are found and need to be corrected and switching activity is, thus, lower. Similarly, for iterative decoders, switching activities are higher for the first iterations than for later iterations, because errors are being successively corrected.
Considering that switching activities impact total power dissipation, stopping the clock (referred to as clock gating) for back-end circuit regions can be a very useful technique. This has an effect both on the clock, which is otherwise toggling in every cycle with α = 1, and the logic signals, and is effective for designing energy-efficient FEC circuits. Clock power was estimated using the clock-tree power estimation feature in Cadence Genus as well as using Synopsys PrimeTime [25].
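The switching-power model and the effect of clock gating can be expressed compactly. The sketch below evaluates $P_{sw} = f \, V_{DD}^2 \sum_i C_i \alpha_i$ for a list of circuit nodes; the node capacitances, activities and function names are hypothetical illustration values, not data extracted from the netlists.

```python
def switched_capacitance(nodes: list[tuple[float, float]]) -> float:
    """C_alpha = sum of C_i * alpha_i over (capacitance [F], activity) pairs."""
    return sum(c * alpha for c, alpha in nodes)


def switching_power(f_hz: float, v_dd: float, nodes: list[tuple[float, float]]) -> float:
    """Switching power P_sw = f * C_alpha * V_DD^2 in watts."""
    return f_hz * switched_capacitance(nodes) * v_dd ** 2


def gate_clock(nodes: list[tuple[float, float]], active_fraction: float) -> list[tuple[float, float]]:
    """Model clock gating: nodes only switch during the active fraction of cycles."""
    return [(c, alpha * active_fraction) for c, alpha in nodes]


# Hypothetical back-end region: one clock node (alpha = 1) and one logic node.
back_end = [(1e-12, 1.0), (2e-12, 0.1)]
p_free_running = switching_power(550e6, 0.9, back_end)
p_gated = switching_power(550e6, 0.9, gate_clock(back_end, active_fraction=0.2))
```

Gating scales the activity of both the clock node and the logic nodes it drives, which is why it is particularly effective for back-end units that are only needed when errors are present.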
VI. RESULTS
Although the decoder memory architecture is somewhat different for our product and staircase decoders, they employ component decoders in the same manner. We will therefore first investigate design and implementation aspects of product decoders, with a focus on variations brought about by choice of component codes. Then we will extend the scope of our exploration to staircase decoders and design aspects of such windowed decoders.
A. Product Decoder Results
We will first present an analysis geared towards VLSI implementation aspects, such as clock gating, for 33%-OH product decoders using different component decoders (t = 2-4). Then we will extend the analysis to the system level and consider three levels of code overhead (20%, 25%, and 33%) for the configurations with the highest coding gains (t = 3, 4). Fig. 4 shows the estimated BER performance, extrapolated from VHDL simulation of 33%-OH product decoder implementations. The error floors of the codes are estimated as in [26] . While the t = 2 code performs relatively well in the waterfall region, it suffers from a high error floor.
1) Varying Error-Correction Capability (t):
Table II summarizes the implementation data for the 33%-OH product decoders. The NCG parameter is defined as the improvement in signal-to-noise ratio (SNR) over an uncoded transmission, to achieve a threshold which is equal to a post-FEC BER of $10^{-15}$. For t = 3, 4, the extrapolated BER is used, whereas in the case of t = 2, the estimated error floor in Fig. 4 is used to determine the $10^{-15}$ threshold. Since the decoders are fully block parallel, all implementations have the same block-decoding latency.
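For reference, NCG can be computed from the pre-FEC BER threshold and the code rate. The sketch below uses the commonly used definition for binary signaling with hard decisions, $\mathrm{NCG} = 20\log_{10} Q^{-1}(\mathrm{BER_{out}}) - 20\log_{10} Q^{-1}(\mathrm{BER_{in}}) + 10\log_{10} R$; this formula and the example threshold of $1.3 \times 10^{-2}$ are assumptions for illustration and are not taken from Table II.

```python
from math import log10
from statistics import NormalDist


def q_inv(p: float) -> float:
    """Inverse Gaussian tail function: Q(x) = p  =>  x = -Phi^{-1}(p)."""
    return -NormalDist().inv_cdf(p)


def net_coding_gain_db(ber_in: float, ber_out: float, rate: float) -> float:
    """NCG in dB for a pre-FEC BER threshold, a target post-FEC BER and code rate."""
    gross_gain = 20 * log10(q_inv(ber_out)) - 20 * log10(q_inv(ber_in))
    return gross_gain + 10 * log10(rate)


# Hypothetical pre-FEC threshold for a 20%-OH code (R = 5/6), target BER 1e-15.
ncg = net_coding_gain_db(ber_in=1.3e-2, ber_out=1e-15, rate=5 / 6)
print(f"NCG = {ncg:.2f} dB")
```

The rate term penalizes the overhead, so a higher-OH code needs a correspondingly higher pre-FEC BER threshold to reach the same NCG.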
Since the input (pre-FEC) BER impacts the switching events of the decoder (as discussed in Section V-A), power was estimated for two different cases in Table II: Either the product decoders operate at the estimated $10^{-15}$ post-FEC threshold or with a 1-dB margin to this threshold. The power dissipation is clearly affected by the input BER, especially in the case of the high-NCG t = 3, 4 implementations. Fig. 5 presents a power dissipation breakdown for three types of clock gating (see Section V-A). All decoders employ Basic gating for which non-active registers are clock gated based on the current product-decoder state. We also consider the case of synchronously-gated component decoders (Comp. code gating) (see Section III) as well as Full gating where, in addition to the component-decoder gating, the memory block (Fig. 1) is gated if all syndromes are zero. As shown in Fig. 5, both memory-block gating and synchronous decoder-pipeline gating are effective in reducing power dissipation. Employing both gives a significant reduction in overall power dissipation and since the different gating schemes entail similar circuit overheads, we will from now on use Full gating for all implementations.
2) Varying Code Overhead: Figs. 6 and 7 show the power dissipation and energy dissipation per information bit as a function of input BER, for the considered 20%-, 25%-, and 33%-OH implementations. The product decoders based on t = 3 achieve sub-pJ/bit operation when the decoder input BER is below the decoder $10^{-15}$ post-FEC threshold. The implementations using t = 4 provide higher NCG, at a slight increase in energy dissipation. All product decoders are capable of sub-pJ/bit operation when operating at a 0.15-dB margin to the $10^{-15}$ threshold. Comparing the power dissipation for the 20%-33% decoders (Fig. 6(a)-6(c)) to the energy per bit (Fig. 7(a)-7(c)), it is clear that while power dissipation decreases with the overhead of the codes, the decrease in energy per bit is not as significant; this is because the information throughput reduces with an increasing code overhead.
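The relation between the two metrics can be written out explicitly. The worked equation below is a sketch using round illustrative numbers (2 W, 1 Tb/s) rather than values from Figs. 6 and 7; the symbols $E_b$, $T_\mathrm{info}$ and $T_\mathrm{coded}$ are introduced here for illustration.

```latex
% Energy per information bit from average power and information throughput
E_b = \frac{P}{T_\mathrm{info}}, \qquad
T_\mathrm{info} = R \, T_\mathrm{coded} = \frac{T_\mathrm{coded}}{1 + \mathrm{OH}}

% Illustrative example: 2 W at 1 Tb/s information throughput
E_b = \frac{2\ \mathrm{W}}{1 \times 10^{12}\ \mathrm{b/s}} = 2\ \mathrm{pJ/bit}
```

Since 1 W per Tb/s equals 1 pJ/bit, sub-pJ/bit operation requires the power in watts to be numerically smaller than the information throughput in Tb/s. A higher overhead lowers $T_\mathrm{info}$ for a given coded throughput, which offsets part of the reduction in power.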
It is also interesting to note that compared to the t = 3 implementations, the power dissipation of the t = 4 implementations is a more sensitive function of the input BER. This is due to the higher energy dissipation per correction of the more complex t = 4 component decoders. Table III shows the implementation data for the considered t = 3, 4 codes. The t = 4 20%- and 25%-OH implementations together with the t = 3 20%-OH implementation all easily attain a throughput in excess of 1 Tb/s. In addition, all product-decoder implementations, even the ones with lower code overhead, have an NCG of >10 dB.
B. Staircase Decoder Results
Since we will present data for different staircase decoder configurations, we will from now on use a "shorthand" notation to define how many blocks (bl) there are in the array/window, what error-correcting capability (t) the component codes have, and how many iterations (it) are used. For example, a decoder employing t = 4 component codes, performing 6 iterations in a 7-block window will be referred to as a 7-bl t = 4 6-it decoder.
All implementations have an overall code overhead of 20% and use a clock rate of 550 MHz.
1) Power Dissipation Distribution: Fig. 8(a) and Fig. 8 threshold). It is clear that the power dissipated and the VLSI circuit area used by the staircase decoder sub-units are not very correlated. A large proportion of the power dissipation can be attributed to the decoder memory array and the clock tree. Estimations indicate that, e.g., in a 5-bl t = 3 5-it decoder, approximately 70% of all power dissipation in memory elements is caused by clocking. The syndrome computation units contribute only to a small part of overall power dissipation; syndrome recomputation with block shifting is, thus, not of any concern for overall power efficiency. The clock tree is estimated to contribute a significant amount of overall power dissipation as it is constantly toggling and, thus, causes large switching power dissipation. The early clock-tree power estimation in Cadence Genus does not provide any area estimate of the clock tree. However, clock buffers in the tree represent less than 1% of the total number of cells even in the design with the largest fraction of clock-power dissipation (which is the 7-bl t = 3 configuration), and thus the area contribution of the clock tree can be assumed to be insignificant.
2) Spatial Coupling Effects: In contrast to the product decoders, the error rate of a processed block in a staircase decoder is not only a function of the input BER and the number of previously performed iterations, but also of its current spatial placement in the staircase window. Being hardware units located far from the channel input, the KES and Chien search units of the component decoders constitute the staircase decoder back-end. Showing the back-end power dissipation for all component decoders, sorted after spatial placement in a 7-bl t = 4 4-it decoder, Fig. 9 illustrates clearly that power dissipation is a sensitive function of where a component decoder is placed and when it is used. The essence of this figure is that the power dissipation is highly dependent on the number of errors occurring in a memory array block: The major part of the power is dissipated in the component decoders that operate on the blocks that are located closest to the channel input. In addition, there is a strong dependency on input BER.
The probability of decoding an error decreases quickly as data blocks are moved through the decoding window. Accordingly, the power dissipation of the component decoders located deeper in the window decreases. It should be noted that the probability of an error propagating all the way through to the final memory array block is vanishingly small. Thus, sharing of component decoder back-ends deeper in the window may significantly reduce the overall area. This design option, however, is not further explored in this paper.
3) Energy Efficiency and Power Dissipation:
Since the component decoder's switching power depends strongly on input BER, the total decoder power dissipation is also heavily dependent on input BER. Fig. 10 shows the power and energy dissipation of a 7-block staircase decoder as a function of input BER. Here, the average power dissipation decreases as the number of iterations increases, since each iteration reduces the number of errors in the blocks; errors which would otherwise activate the component decoders.
Energy and power are terms that sometimes are used interchangeably, but the information in Fig. 10 is a good illustration of how energy efficiency and average power dissipation trends differ. As the information throughput decreases with an increasing number of iterations, the energy per information bit increases, since the loss of throughput dominates over the slightly decreasing power dissipation. Fig. 10 also shows how power dissipation increases rapidly if the input BER is too high to maintain the decoding wavefront within the staircase window. When the wavefront is lost, the entire decoder memory array is filled with data containing errors, causing a rapid increase in switching activity.
Table IV and Table V present the evaluation results for the 5-block and 7-block staircase decoder, respectively. An information throughput in excess of 1 Tb/s is achieved by both 5- and 7-block decoders when performing three iterations, with an estimated NCG of 10.4 and 10.5 dB, respectively. The block-decoding latency ranges from 181.8 to 483.6 ns. Considering that the refractive index of silica is roughly 1.46, the longest latency thus corresponds to adding roughly 100 m of fiber to the link. The tables also show that the two design dimensions to consider regarding the iterative decoding process, namely the number of in-place iterations and the number of blocks in the window which are iterated over, give vastly different results in terms of implementation. Since the clock rate is kept constant, increasing the number of blocks in the memory array (and thus the length of the decoded window) increases the NCG without reducing the throughput. The area usage, power dissipation, and the latency are however increased, since the number of memory elements, as well as the in-memory switching activity due to block movements, increase. Increasing the number of performed iterations increases NCG, at the expense of reduced throughput, increased latency and energy per bit, while area remains constant.
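The latency figures can be cross-checked with two short calculations. The worked equations below are a sketch assuming the 550-MHz clock rate stated for all staircase implementations and simple one-way propagation at $c/n$ with $n \approx 1.46$; the symbols $\tau$, $N_\mathrm{cycles}$ and $\ell$ are introduced here for illustration.

```latex
% Latency expressed in clock cycles at f = 550 MHz
N_\mathrm{cycles} = \tau f:\quad
181.8\ \mathrm{ns} \times 550\ \mathrm{MHz} \approx 100, \qquad
483.6\ \mathrm{ns} \times 550\ \mathrm{MHz} \approx 266

% Equivalent fiber length for the longest latency
\ell = \frac{\tau c}{n}
     = \frac{483.6 \times 10^{-9}\ \mathrm{s} \times 2.998 \times 10^{8}\ \mathrm{m/s}}{1.46}
     \approx 99\ \mathrm{m}
```

Against propagation over a long-haul link of hundreds of kilometers or more, this added delay is small.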
The t = 3 staircase decoders achieve a higher energy efficiency than the t = 4 decoders, although at the expense of a lower NCG. The t = 4 decoders offer, in our opinion, an interesting trade-off in terms of energy dissipation and NCG (assuming that the VLSI area budget is generous), especially considering the throughput.
VII. DISCUSSION
The implemented hard-decision product and staircase decoders can provide very high information throughput at low energy dissipation and are, thus, suitable for future energy-constrained high-throughput systems. A key enabler is the feed-forward component decoders that allow for high throughput, while iteratively operating on a largely static decoder memory. This allows us to avoid the switching activity caused by the data movement in an iteration-unrolled architecture, where the data are constantly moved during processing.
For the configurations and clock rates considered in this work, the staircase decoders can achieve more than 1 Tb/s information throughput, while the product decoders achieve more than 2 Tb/s. While we do not explicitly consider supply voltage scaling in this work, it should be noted that switching power has a quadratic dependency on supply voltage (see Eq. 6) and can be significantly lowered if the supply voltage is reduced. However, this would be at the expense of reducing the maximum clock rate of the circuit and, thus, the throughput.
The decoders are estimated to achieve a coding gain of 10.0-10.6 dB, depending on configuration. Focusing on the estimated NCG of the 20%-OH implementations, the presented staircase decoders are estimated to perform as well as the best performing HD-decoded codes listed in [1] , while our product decoders tie for second best. Note again that our NCG estimations should be seen as approximate since low-BER statistics are limited due to long VHDL simulation run times.
There are very few high-throughput decoders published in the open literature. Compared to a recently published hard-decision product decoder [14], our product decoders achieve more than an order of magnitude higher throughput and much lower latency, at the expense of larger area. In comparison to a recently published high-throughput LDPC decoder [15] (soft decision, 18% OH), our 20%-OH product decoder can achieve more than three times the throughput at comparable areas (assuming a 70% area utilization of library cells), at a higher coding gain ($E_b/N_0$ = 4.6 dB at a post-FEC BER of $10^{-7}$ for our 20%-OH product decoder, compared to 4.95 dB [15]).
The spatial coupling of staircase codes enhances the coding gain compared to product codes, but this comes at a cost of an increase in area, latency, and power dissipation. Nonetheless, our staircase decoders dissipate as little as 1-3 pJ/bit, which makes them highly energy efficient.
VIII. CONCLUSION
We have implemented energy-efficient high-throughput VLSI decoders for product and staircase codes, suitable for future 400 G+ power-constrained fiber-optic communication systems. The decoders have been implemented and evaluated using synthesized gate netlists in a 28-nm VLSI process technology, allowing us to consider aspects of energy efficiency and related tradeoffs. Our decoders achieve more than 10-dB net coding gain and can reach more than 1- and 2-Tb/s information throughput, for staircase and product decoders, respectively. The staircase decoders have a block-decoding latency of <483.6 ns, which corresponds to adding roughly 100 m of fiber to the link, while the product decoder latencies are <64 ns. Effective use of clock gating to inhibit signals from switching is shown to significantly reduce energy dissipation of iterative decoders, both in memory blocks and in component decoders. All considered product and staircase decoders are estimated to dissipate less than 2.4 W, demonstrating the viability of high-throughput hard-decision product and staircase decoders.
