I. INTRODUCTION
ADVANCES in physics (the invention of the laser, low-loss optical fiber, and the optical amplifier) have driven the exponential growth in worldwide data communications. However, as these technologies mature, system designers have increasingly focused on techniques from communication theory, including forward error correction (FEC), to simultaneously increase transmission capacity and decrease transmission costs.
One of the first proposals for FEC in an optical system appeared in [1], which demonstrated a shortened (224, 216) Hamming code implementation at 565 Mbit/s. Since then, ITU-T Recommendations G.975 and G.975.1 have standardized more powerful codes for optical transport networks (OTNs). More recently, low-density parity-check (LDPC) codes [2], [3], which provide the potential for capacity-approaching performance, have been investigated, as aptly summarized in [4], [5]. While implementations exist at 10 Gb/s (for 10GBase-T Ethernet networks), the blocklengths of such implementations (∼500-2000) are too short to provide performance close to capacity; the (2048, 1723) RS-LDPC code is approximately 3 dB from the Shannon Limit at $10^{-15}$ [6]; see also [7]. Another significant roadblock is that fiber-optic communication systems are typically required to provide bit-error-rates below $10^{-15}$. It is well-known that capacity-approaching LDPC codes exhibit error floors [8], and achieving the targeted error rate would likely require concatenation with an outer code (e.g., as in [9]). In this work, we focus on product-like codes (by which we mean any generalized LDPC code with algebraic component codes), since they possess properties that make them particularly suited to providing error-correction in fiber-optic communication systems. In particular, for 100 Gb/s implementations, we argue that syndrome-based decoding of product-like codes is significantly more efficient than message-passing decoding of LDPC codes.
This paper presents a new class of high-rate binary error-correcting codes, staircase codes, whose construction combines ideas from convolutional and block coding. Indeed, staircase codes can be interpreted as having a 'continuous' product-like construction. In the context of wireless communications, related code constructions include braided block codes [10], braided convolutional codes [11], diamond codes [12] and cross parity check convolutional codes [13], each of which is related to the recurrent codes of Wyner-Ash [14]. However, these proposals considered soft decoding of the component codes, which is unsuitable for high-speed fiber-optic communications. Herein, we describe a syndrome-based decoder for staircase codes that provides excellent performance with an efficient decoder implementation.
In Section II, we review the specifications and performance of FEC codes defined in ITU-T Recommendations G.975 and G.975.1. In Section III, we describe the syndrome-based decoder for product-like codes, and argue that it results in a decoder data-flow that is more than two orders of magnitude smaller than that of the message-passing decoder of an LDPC code. Staircase codes are presented in Section IV, and a G.709-compatible staircase code is proposed. In Section V, we present an analytical method for determining the error floor of iteratively decoded staircase codes, and show that the proposed staircase code has an error floor at $4.0 \times 10^{-21}$. Finally, in Section VI, we present FPGA-based simulation results, illustrating that the proposed code provides a 9.41 dB net coding gain (NCG) at an output error rate of $10^{-15}$, an improvement of 0.42 dB relative to the best code from the ITU-T G.975.1 recommendation, and only 0.56 dB from the Shannon Limit.
II. EXISTING PROPOSALS

A. ITU-T Recommendation G.975
The first error-correction code standardized for optical communications was the (255, 239) Reed-Solomon code, with symbols in $\mathbb{F}_{2^8}$, capable of correcting up to 8 symbol errors in any codeword. For an output-error-rate of $10^{-15}$, the NCG of the RS code is 6.2 dB, which is 3.77 dB from capacity.
In order to provide improved burst-error-correction, 16 codewords are block-interleaved, providing correction for bursts of as many as 1024 transmitted bits. A framing row consists of 16 · 255 · 8 bits, 30592 of which are information bits, and the remaining 2048 bits of which are parity. The resulting framing structure (a frame consists of four rows) is standardized in ITU-T Recommendation G.709, and remains the required framing structure for OTNs; as a direct result, the coding rate of any candidate code must be R = 239/255.
B. ITU-T Recommendation G.975.1
As per-channel data rates increased to 10 Gb/s, and the capabilities of high-speed electronics improved, the (255, 239) RS code was replaced with stronger error-correcting codes. In ITU-T recommendation G.975.1, several 'next-generation' coding schemes were proposed; among the many proposals, the common mechanism for increased coding gain was the use of concatenated coding schemes with iterative hard-decision decoding. We now describe four of the best proposals, which will motivate our approach in Section IV.
In Appendix I.3 of G.975.1, a serially concatenated coding scheme is described, with outer (3860, 3824) binary BCH code and inner (2040, 1930) binary BCH code, which are obtained by shortening their respective mother codes. First, 30592 = 8 · 3824 information bits are divided into 8 units, each of which is encoded by the outer code; we will refer to the resulting unit of 30880 bits as a 'block'. Prior to encoding by the inner code, the contents of consecutive blocks are interleaved (in a 'continuous' fashion, similar to convolutional interleavers [15]). Specifically, each inner codeword in a given block involves 'information' bits from each of the eight preceding 'outer' blocks. Note that the interleaving step increases the effective block-length of the overall code, but it necessitates a sliding-window style decoding algorithm, due to the continuous nature of the interleaver. Furthermore, unlike a product code, the parity bits of the inner code are protected by a single component codeword, which reduces their level of protection. For an output-error-rate of $10^{-15}$, the NCG of the I.3 code is 8.99 dB, which is 0.98 dB from capacity.
In Appendix I.4 of G.975.1, a serially concatenated scheme with (shortened versions of) an outer (1023, 1007) RS code and (shortened versions of) an inner (2047, 1952) binary BCH code is proposed. After encoding 122368 bits with the outer code, the coded bits are block interleaved and encoded by the inner BCH code, resulting in a block length of 130560 bits, i.e., exactly one G.709 frame. As in the previous case, the parity bits of the inner code are singly-protected. For an output-error-rate of $10^{-15}$, the NCG of the I.4 code is 8.67 dB, which is 1.3 dB from capacity.
In Appendix I.5 of G.975.1, a serially concatenated scheme with an outer (1901, 1855) RS code and an inner (512, 502) × (510, 500) extended-Hamming product code is described. Iterative decoding is applied to the inner product code, after which the outer code is decoded; the purpose of the outer code is to eliminate the error floor of the inner code, since the inner code has small stall patterns (see Section V). For an output-error-rate of $10^{-15}$, the NCG of the I.5 code is 8.5 dB, which is 1.47 dB from capacity.
Finally, in Appendix I.9 of G.975.1, a product-like code with (1020, 988) doubly-extended binary BCH component codes is proposed. The overall code is described in terms of a 512 × 1020 matrix of bits, in which the bits along both the rows of the matrix as well as a particular choice of 'diagonals' must form valid codewords in the component code. Since the diagonals are chosen to include 2 bits in every row, any diagonal codeword has two bits in common with any row codeword; in contrast, for a product code, any row and column have exactly one bit in common. Note that the I.9 construction achieves a product-like construction (their choice of diagonals ensures that each bit is protected by two component codewords) with essentially half the overall block length of the related product code (even so, the I.9 code has the longest block length among all G.975.1 proposals). However, the choice of diagonals decreases the size of the smallest stall patterns, introducing an error floor above $10^{-14}$. For an output-error-rate of $2 \cdot 10^{-14}$, the NCG of the I.9 code is 8.67 dB, which is 1.3 dB from capacity.
III. LDPC VS. PRODUCT CODES
In this section, we present a high-level view of iterative decoders for LDPC and product codes. Due to the differences in their implementations, a precise comparison of their implementation complexities is difficult. Nevertheless, since the communication complexity of message-passing is a significant challenge in LDPC decoder design, we consider the decoder data-flow, i.e., the rate of routing/storing messages, as a surrogate for the implementation complexity.
A. Decoder-Data-flow Comparison
We consider a system that transmits information at $D$ bits/s, using a binary error-correcting code of rate $R$ (for which hard decisions at $D/R$ bits/s are input to the decoder), and a decoder that operates at a clock frequency $f_c$ Hz.
1) LDPC Code:
We consider an LDPC decoder that implements sum-product decoding (or some quantized approximation) with a parallel-flooding schedule. We assume $q$-bit messages internal to the decoder, an average variable node degree $d_{av}$, and $N$ decoder iterations; typically, $q$ is 4 or 5 bits, $d_{av} \approx 3$, and $N \sim 15$-$25$. Initially, hard-decisions are input to the decoder at a rate of $D/R$ bits/s and stored in flip-flop registers. At each iteration, variable nodes compute and broadcast $q$-bit messages over every edge, and similarly for the check nodes, i.e., $2 q d_{av}$ bits are broadcast per iteration per variable node. Since bits arrive from the channel at $D/R$ bits/s, the corresponding internal data-flow per iteration is then $2 q d_{av} D/R$, and the total data-flow, including initial loading of 1-bit channel messages, is

$$F_{\mathrm{LDPC}} = \frac{D}{R} + \frac{2 N q d_{av} D}{R}.$$
For $N = 20$, $q = 4$, $d_{av} = 3$, $F_{\mathrm{LDPC}} \approx 480 D/R$, which corresponds to a data-flow of more than 48 Tb/s for 100 Gb/s systems.
2) Product Code:
When the component codes of a product code can be efficiently decoded via syndromes (e.g., BCH codes), there exists an especially efficient decoder for the product code. Briefly, by operating exclusively in the 'syndrome domain' (which compresses the received signal) and passing at most $t$ messages per (component) decoding (for $t$-error-correcting component codes), the implementation complexity of decoding is significantly reduced.
The following is a step-by-step description of the decoding algorithm:
1) From the received data, compute and store the syndrome for each row and column codeword. Store a copy of the received data in memory R.
2) Decode those non-zero syndromes corresponding to row codewords. In the event of a successful decoding, set the syndrome to zero, flip the corresponding t or fewer positions in memory R, and update the t or fewer affected column syndromes by a masking operation.
3) Repeat Step 2, reversing the roles of rows and columns.
4) If any syndromes are non-zero, and fewer than the maximum number of iterations have been performed, go to Step 2. Otherwise, output the contents of memory R.
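The four steps above can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the component code is a (7, 4) Hamming code with t = 1 (the text targets triple-error-correcting BCH codes), the all-zero codeword is transmitted, and the function names are hypothetical. What it does show is the bookkeeping of the syndrome domain: after the initial syndrome computation, each correction is communicated to the other dimension as a single mask update.

```python
# Toy product code: every row and every column is a (7, 4) Hamming codeword.
# Assumption: column j of the parity-check matrix is the 3-bit number j + 1,
# so a single error at position j produces the syndrome j + 1 (t = 1).
N = 7

def syndrome(bits):
    """XOR of the parity-check columns selected by the non-zero bits."""
    s = 0
    for pos, bit in enumerate(bits):
        if bit:
            s ^= pos + 1
    return s

def decode_product(received, max_iters=4):
    # Step 1: store the data in memory R and compute all syndromes once.
    R = [row[:] for row in received]
    row_syn = [syndrome(R[i]) for i in range(N)]
    col_syn = [syndrome([R[i][j] for i in range(N)]) for j in range(N)]
    for _ in range(max_iters):
        # Step 2: decode rows with non-zero syndromes.
        for i in range(N):
            s = row_syn[i]
            if s:
                j = s - 1             # error location given by the syndrome
                R[i][j] ^= 1          # flip the bit in memory R
                row_syn[i] = 0        # this row is now a codeword
                col_syn[j] ^= i + 1   # mask update of the affected column
        # Step 3: same procedure with the roles of rows and columns reversed.
        for j in range(N):
            s = col_syn[j]
            if s:
                i = s - 1
                R[i][j] ^= 1
                col_syn[j] = 0
                row_syn[i] ^= j + 1
        # Step 4: stop once every syndrome is zero.
        if not any(row_syn) and not any(col_syn):
            break
    return R

# All-zero codeword with four bit errors, two of them in the same row.
rx = [[0] * N for _ in range(N)]
for i, j in [(0, 0), (0, 3), (1, 2), (4, 5)]:
    rx[i][j] = 1
decoded = decode_product(rx)
print(sum(map(sum, decoded)), "residual errors")
```

In this example, row 0 contains two errors and is first miscorrected by the t = 1 row decoder; the column decodings then repair all three affected positions, which is the behaviour the iteration in Step 4 relies on.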
We quantify the complexity of decoding a product code by its decoder data-flow. At first glance, it may seem that this approach ignores the complexity of decoding the (component) t-error-correcting BCH codewords. However, for relatively small t, the decoding of a component codeword can be efficiently decomposed into a series of look-up table operations, for which the data-flow interpretation is well-justified. In this section, we will ignore the data-flow contribution of the BCH decoding algorithm, but we return to this point in the Appendix, where it is shown that the corresponding data-flow is negligible.
We assume that rows are encoded by a $t_1$-error-correcting $(n_1, k_1 = n_1 - r_1)$ BCH code, and the columns are encoded by a $t_2$-error-correcting $(n_2, k_2 = n_2 - r_2)$ BCH code, for an overall rate $R = R_1 R_2$. We assume each row/column codeword is decoded (on average, over the course of decoding the overall product code) $v$ times, where typically $v$ ranges from 3 to 4.
The hard-decisions from the channel, at $D/R$ bits/s, are written to a data RAM, in addition to being processed by a syndrome computation/storage device. Contrary to the LDPC decoder data-flow, the clock frequency $f_c$ plays a central role, namely in the data-flow of the initial syndrome calculation. Referring to Fig. 2, and assuming that the bits in a product code are transmitted row-by-row, the input bus-width (i.e., the number of input bits per decoder clock cycle) is $D/(R f_c)$ bits. Now, assuming these bits correspond to a single row of the product code, each non-zero bit corresponds to some $r_1$-bit mask (i.e., the corresponding column of the parity-check matrix of the row code), the modulo-2 sum of these is performed by a masking tree, and the $r_1$-bit output is masked with the current contents of the corresponding (syndrome) flip-flop register. That is, each clock cycle causes an $r_1$-bit mask to be added to the contents of the corresponding row in the syndrome bank. Of course, each received bit also impacts a distinct column syndrome; however, the same $r_2$-bit mask is applied (when the corresponding received bit is non-zero) to each of the involved column syndromes, so the corresponding data-flow is $r_2$ bits per clock cycle.
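Once the syndromes are computed from the received data, iterative decoding commences. To perform a row decoding, an $r_1$-bit syndrome is read from the syndrome bank. Since there are $n_2$ row codewords, and each row is decoded on average $v$ times, the corresponding data-flow from the syndrome bank to the row decoder is $r_1 n_2 v D/(R n_1 n_2) = r_1 v D/(R n_1)$ bits/s. For each row decoding, at most $t_1$ positions are corrected, each of which is specified by $\lceil \log_2 n_1 \rceil + \lceil \log_2 n_2 \rceil$ bits. Therefore, the data-flow from the row decoder to the data RAM is

$$\frac{t_1 n_2 \left(\lceil \log_2 n_1 \rceil + \lceil \log_2 n_2 \rceil\right) v D}{R n_1 n_2} = \frac{t_1 \left(\lceil \log_2 n_1 \rceil + \lceil \log_2 n_2 \rceil\right) v D}{R n_1}$$

bits/s. Furthermore, for each corrected bit, an $r_2$-bit mask must be applied to the corresponding column syndrome, which yields a data-flow from the row decoder to the syndrome bank of $t_1 n_2 r_2 v D/(R n_1 n_2) = t_1 r_2 v D/(R n_1)$ bits/s. A similar analysis can be applied to column decodings. In total, the decoder data-flow is

$$F_P = \frac{D}{R} + (r_1 + r_2) f_c + \frac{vD}{R n_1}\left[ r_1 + t_1\left(\lceil \log_2 n_1 \rceil + \lceil \log_2 n_2 \rceil\right) + t_1 r_2 \right] + \frac{vD}{R n_2}\left[ r_2 + t_2\left(\lceil \log_2 n_1 \rceil + \lceil \log_2 n_2 \rceil\right) + t_2 r_1 \right].$$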
In this work, we will focus on codes for which $n_1 = n_2 \approx 1000$, $r_1 = r_2 = 32$, $t_1 = t_2 = 3$, and the decoder is assumed to operate at $f_c \approx 400$ MHz. For $v = 4$, we then have a data-flow of approximately 293 Gb/s. Note that this is more than two orders of magnitude smaller than the corresponding data-flow for LDPC decoding. Intuitively, the advantage arises from two facts. First, when $R_1 > 1/2$ and $R_2 > 1/2$, syndromes provide a compressed representation of the received signal. Second, the algebraic component codes admit an economical message-passing scheme, in the sense that message updates are only required for the small fraction of bits that are corrected by a particular (component code) decoding.
Fig. 3. Data-flow in a product-code decoder.
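As a rough numerical check of this comparison, the following Python sketch adds up the data-flow terms itemized in this section. The function and argument names are illustrative only. Two assumptions are made: the itemized product-code flows (channel input, initial syndrome computation, syndrome reads, data-RAM corrections and syndrome mask updates, for rows and for columns) simply sum, and R = 239/255 is used for both decoders.

```python
from math import ceil, log2

def ldpc_dataflow(D, R, N, q, d_av):
    """1-bit channel input plus 2*q*d_av message bits per variable node per iteration."""
    return D / R + 2 * N * q * d_av * D / R

def product_dataflow(D, R, f_c, n1, n2, r1, r2, t1, t2, v):
    """Sum of the syndrome-domain flows described in the text (bits/s)."""
    addr_bits = ceil(log2(n1)) + ceil(log2(n2))   # bits to address one position
    channel = D / R                                # hard decisions into the data RAM
    initial_syndromes = (r1 + r2) * f_c            # row and column masks per clock cycle
    rows = v * D / (R * n1) * (r1 + t1 * addr_bits + t1 * r2)
    cols = v * D / (R * n2) * (r2 + t2 * addr_bits + t2 * r1)
    return channel + initial_syndromes + rows + cols

D = 100e9          # information rate, bits/s
R = 239 / 255      # assumed code rate

f_ldpc = ldpc_dataflow(D, R, N=20, q=4, d_av=3)
f_prod = product_dataflow(D, R, f_c=400e6, n1=1000, n2=1000,
                          r1=32, r2=32, t1=3, t2=3, v=4)

print(f"LDPC:    {f_ldpc / 1e12:.1f} Tb/s")
print(f"Product: {f_prod / 1e9:.0f} Gb/s")
print(f"Ratio:   {f_ldpc / f_prod:.0f}x")
```

Under these assumptions the product-code total rounds to the 293 Gb/s quoted above, and the ratio to the LDPC figure exceeds 100.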
IV. STAIRCASE CODES
The staircase code construction combines ideas from recursive convolutional coding and block coding. Staircase codes are completely characterized by the relationship between successive matrices of symbols. Specifically, consider the (infinite) sequence $B_0, B_1, B_2, \ldots$ of $m$-by-$m$ matrices $B_i$, $i \in \mathbb{Z}^+$. Herein, we restrict our attention to $B_i$ with elements in $\mathbb{F}_2$, but an analogous construction applies in the non-binary case.
Block B 0 is initialized to a reference state known to the encoder-decoder pair, e.g., block B 0 could be initialized to the all-zeros state, i.e., an m-by-m array of zero symbols. Furthermore, we select a conventional FEC code (e.g., Hamming, BCH, Reed-Solomon, etc.) in systematic form to serve as the component code; this code, which we henceforth refer to as C, is selected to have block length 2m symbols, r of which are parity symbols.
Encoding proceeds recursively on the $B_i$. For each $i$, $m(m - r)$ information symbols (from the streaming source) are arranged into the $m - r$ leftmost columns of $B_i$; we denote this sub-matrix by $B_{i,L}$. Then, the entries of the rightmost $r$ columns (this sub-matrix is denoted by $B_{i,R}$) are specified as follows:
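1) Form the $m \times (2m - r)$ matrix $A = [B_{i-1}^T \; B_{i,L}]$, where $B_{i-1}^T$ is the matrix-transpose of $B_{i-1}$.
2) The entries of $B_{i,R}$ are then computed such that each of the rows of the matrix $[B_{i-1}^T \; B_{i,L} \; B_{i,R}]$ is a valid codeword of $C$.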
That is, the elements in the $j$th row of $B_{i,R}$ are exactly the $r$ parity symbols that result from encoding the $2m - r$ 'information' symbols in the $j$th row of $A$. Generally, the relationship between successive blocks in a staircase code satisfies the following relation: for any $i \ge 1$, each of the rows of the matrix $[B_{i-1}^T \; B_i]$ is a valid codeword of $C$. This row-and-column structure suggests their connection to product codes. However, staircase codes are naturally unterminated (i.e., their block length is indeterminate), and thus admit a range of decoding strategies with varying latencies. Most importantly, we will see that they outperform product codes. The rate of a staircase code is

$$R_s = 1 - \frac{r}{m},$$

since encoding produces $r$ parity symbols for each set of $m - r$ 'new' information symbols. However, note that the related product code has rate

$$R_p = \left(\frac{2m - r}{2m}\right)^2 = 1 - \frac{r}{m} + \frac{r^2}{4m^2},$$

which is greater than the rate of the staircase code. Nevertheless, for sufficiently high rates, the difference is small, and staircase codes outperform product codes of the same rate.
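To make the size of the rate difference concrete, the short Python sketch below evaluates both rates exactly, taking the staircase rate to be 1 − r/m and the rate of the related product code (built from the same length-2m component code with r parity symbols) to be ((2m − r)/(2m))². The function names are illustrative, and m = 510, r = 32 are the values used later for the G.709-compatible code.

```python
from fractions import Fraction

def staircase_rate(m, r):
    """r parity symbols are produced for every m - r new information symbols."""
    return 1 - Fraction(r, m)

def product_rate(m, r):
    """Product of two length-2m component codes, each with r parity symbols."""
    return Fraction(2 * m - r, 2 * m) ** 2

m, r = 510, 32
Rs = staircase_rate(m, r)
Rp = product_rate(m, r)
print(Rs, float(Rs))      # staircase rate, reduces to 239/255
print(float(Rp))          # slightly larger product-code rate
print(float(Rp - Rs))     # the gap equals r^2 / (4 m^2)
```

For these parameters the gap is about one part in a thousand, which is the sense in which the difference is small at high rates.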
From the standpoint of transmitter latency, which includes encoding latency and frame-mapping latency, staircase codes have the advantage (relative to product codes) that the effective rate (i.e., the ratio of 'new' information symbols, $m - r$, to the total number of 'new' symbols, $m$) of a component codeword is exactly the rate of the overall code. Therefore, the encoder produces parity at a 'regular' rate, which enables the design of a frame-mapper that minimizes the transmitter latency.
We note that staircase codes can be interpreted as generalized LDPC codes with a systematic encoder and an indeterminate block-length, which admits decoding algorithms with a range of latencies.
Using arguments analogous to those used for product codes, it can be shown that for a $t$-error-correcting component code $C$ with minimum distance $d_{\min}$, the Hamming distance between any two staircase codewords is at least $d_{\min}^2$.
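The recursive encoding rule of this section can be sketched in Python as follows. This is a toy illustration, not the proposed code: the component code is assumed to be a (12, 8) shortened Hamming code (so m = 6 and r = 4), chosen only because its systematic parity is easy to compute, and all function names are hypothetical. The sketch builds each block from the transpose of the previous one and then checks the defining property that every row of [B_{i-1}^T B_i] is a component codeword.

```python
import random

M, R = 6, 4   # toy block size m and parity count r; component length 2m = 12

# Parity-check columns: non-powers of two for the 2m - r information
# positions, powers of two for the r parity positions (shortened Hamming).
INFO_COLS = [c for c in range(1, 16) if c & (c - 1)][: 2 * M - R]
PARITY_COLS = [1 << k for k in range(R)]

def parity(info):
    """Parity bits that force the syndrome of (info, parity) to zero."""
    s = 0
    for bit, col in zip(info, INFO_COLS):
        if bit:
            s ^= col
    return [(s >> k) & 1 for k in range(R)]

def is_codeword(word):
    s = 0
    for bit, col in zip(word, INFO_COLS + PARITY_COLS):
        if bit:
            s ^= col
    return s == 0

def transpose(B):
    return [list(col) for col in zip(*B)]

def encode_block(prev, info_left):
    """prev is B_{i-1} (m x m); info_left is B_{i,L} (m x (m - r))."""
    prev_t = transpose(prev)
    block = []
    for j in range(M):
        a_row = prev_t[j] + info_left[j]             # row j of A = [B_{i-1}^T  B_{i,L}]
        block.append(info_left[j] + parity(a_row))   # row j of B_i = [B_{i,L}  B_{i,R}]
    return block

def encode_stream(info_blocks):
    blocks = [[[0] * M for _ in range(M)]]           # B_0: all-zeros reference state
    for info_left in info_blocks:
        blocks.append(encode_block(blocks[-1], info_left))
    return blocks

random.seed(1)
info = [[[random.randint(0, 1) for _ in range(M - R)] for _ in range(M)]
        for _ in range(3)]
blocks = encode_stream(info)
ok = all(is_codeword(transpose(prev)[j] + cur[j])
         for prev, cur in zip(blocks, blocks[1:]) for j in range(M))
print("staircase constraint satisfied:", ok)
```

Because each block depends only on its predecessor, the encoder needs to retain a single m-by-m block of state, in keeping with the streaming nature of the construction.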
A. Decoding Algorithm
Staircase codes are naturally unterminated (i.e., their block length is indeterminate), and thus admit a range of decoding strategies with varying latencies. That is, decoding can be accomplished in a sliding-window fashion, in which the decoder operates on the received bits corresponding to $L$ consecutively received blocks $B_i, B_{i+1}, \ldots, B_{i+L-1}$. For a fixed $i$, the decoder iteratively decodes as follows: First, those component codewords that 'terminate' in block $B_{i+L-1}$ (i.e., whose parity bits are in $B_{i+L-1}$) are decoded; since every symbol is involved in two component codewords, the corresponding syndrome updates are performed, as in Section III-A2. Next, those codewords that terminate in block $B_{i+L-2}$ are decoded. This process continues until those codewords that terminate in block $B_i$ are decoded. Now, since decoding those codewords terminating in some block $B_j$ affects those codewords that terminate in block $B_{j+1}$, it is beneficial to return to $B_{i+L-1}$ and to repeat the process. This iterative process continues until some maximum number of iterations is performed, at which time the decoder outputs its estimate for the contents of $B_i$, accepts in a new block $B_{i+L}$, and the entire process repeats (i.e., the decoding window slides one block to the 'right').
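The visiting order implied by this description can be written down in a few lines of Python. The sketch below covers only the schedule (which block's terminating codewords are decoded at each step), not the component decoding itself; the function name and the fixed iteration count are assumptions made for illustration.

```python
def window_schedule(i, L, iterations):
    """Block indices whose terminating codewords are decoded, in order,
    for the window B_i, ..., B_{i+L-1}."""
    order = []
    for _ in range(iterations):
        # Sweep from the newest block back to the oldest, then repeat.
        order.extend(range(i + L - 1, i - 1, -1))
    return order

def sliding_windows(num_blocks, L, iterations):
    """Yield (output block, schedule) as the window slides one block at a time."""
    for i in range(1, num_blocks - L + 2):
        yield i, window_schedule(i, L, iterations)

for out_block, order in sliding_windows(num_blocks=6, L=4, iterations=2):
    print(f"output B_{out_block} after visiting {order}")
```

Larger L or more iterations lengthen the schedule per output block, which is one way to read the latency trade-off mentioned above.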
B. Multi-edge-type Interpretation
Staircase codes have a simple graphical representation, which provides a multi-edge-type [3] interpretation of their construction. The term 'multi-edge-type' was originally applied to describe a refined class of irregular LDPC codes, in which variable nodes (and check nodes) are classified by their degrees with respect to a set of edge types. Intuitively, the introduction of multiple edge types allows degree-one variable nodes, punctured variable nodes, and other beneficial features that are not admitted by the conventional irregular ensemble. In turn, better performance for finite blocklengths and fixed decoding complexities is possible.
In Fig. 5, we present the factor graph representation of a decoder that operates on a window of $L = 4$ blocks; the graph for general $L$ follows in an obvious way. Dotted variable nodes indicate symbols whose value was decoded in the previous stage of decoding. The key observation is that when these symbols are correctly decoded (which is essentially always the case, since the output BER is required to be less than $10^{-15}$), the component codewords in which they are involved are effectively shortened by $m$ symbols. Therefore, the most reliable messages are passed over those edges connecting variable nodes to the shortened (component) codewords, as indicated in Fig. 5. On the other hand, the rightmost collection of variable nodes are (with respect to the current decoding window) only involved in a single component codeword, and thus the edges to which they are connected carry the least reliable messages. Due to the nature of iterative decoding, the intermediate edges carry messages whose reliability lies between these two extremes.
C. A G.709-compatible Staircase Code
The ITU-T Recommendation G.709 defines the framing structure and error-correcting coding rate for OTNs. For our purposes, it suffices to know that an optical frame consists of 130560 bits, 122368 of which are information bits, and the remaining 8192 are parity bits, which corresponds to error-correcting codes of rate $R = 239/255$. Since $(510 - 32)/510 = 239/255$, we will consider a component code with $m = 510$ and $r = 32$. Specifically, the binary $(n = 1023, k = 993, t = 3)$ BCH code with generator polynomial $(x^{10} + x^3 + 1)(x^{10} + x^3 + x^2 + x + 1)(x^{10} + x^8 + x^3 + x^2 + 1)$ is adapted to provide an additional 2-bit error-detecting mechanism, resulting in the generator polynomial $g(x) = (x^{10} + x^3 + 1)(\ldots$
In order to provide a simple mapping to the G.709 frame, we first note that $2 \cdot 130560 = 510 \cdot 512$. This leads us to define a slight generalization of staircase codes, in which the blocks $B_i$ consist of 512 rows of 510 bits. The encoding rule is modified as follows:
1) Form the $512 \times (512 + 478)$ matrix $A = [B_{i-1}^T \; B_{i,L}]$, where $B$ …
2) The entries of $B_{i,R}$ are then computed such that each of the rows of the matrix $[B_{i-1}^T \; B_{i,L} \; B_{i,R}]$ is a valid codeword of $C$.
That is, the elements in the $j$th row of $B_{i,R}$ are exactly the 32 parity symbols that result from encoding the 990 'information' symbols in the $j$th row of $A$. Here, $C$ is the code obtained by shortening the code generated by $g(x)$ by one bit, since our overall codeword length is $510 + 512 = 1022$.
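The frame arithmetic behind this choice of parameters can be checked mechanically. The Python sketch below uses only numbers quoted in the text (frame size, information bits, block dimensions, parity count); the variable names are illustrative.

```python
from fractions import Fraction

FRAME_BITS = 130560            # one G.709 frame
INFO_BITS = 122368             # information bits per frame
PARITY_BITS = FRAME_BITS - INFO_BITS

ROWS, COLS, R_PAR = 512, 510, 32   # block B_i: 512 rows of 510 bits, 32 parity per row

# A frame is four rows of 16 interleaved (255, 239) codewords of 8-bit symbols.
assert FRAME_BITS == 4 * 16 * 255 * 8
assert PARITY_BITS == 8192
assert Fraction(INFO_BITS, FRAME_BITS) == Fraction(239, 255)

# One staircase block maps onto exactly two frames, parity included.
assert ROWS * COLS == 2 * FRAME_BITS
assert ROWS * R_PAR == 2 * PARITY_BITS
assert Fraction(COLS - R_PAR, COLS) == Fraction(239, 255)

# Component codeword: 512 bits from the previous block plus 510 from the current one.
codeword_len = ROWS + COLS
info_per_row = codeword_len - R_PAR
print(codeword_len, info_per_row)   # 1022 and 990, one bit short of n = 1023
```

The last line is consistent with shortening the length-1023 code by a single bit.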
V. ERROR FLOOR ANALYSIS
For iteratively decoded codes, an error floor (in the output bit-error-rate) can often be attributed to error patterns that 'confuse' the decoder, even though such error patterns could easily be corrected by a maximum-likelihood decoder. In the context of LDPC codes, these error patterns are often referred to as trapping sets [8] . In the case of product-like codes with an iterative hard-decision decoding algorithm, we will refer to them as stall patterns, due to the fact that the decoder gets locked in a state in which no updates are performed, i.e., the decoder stalls, as in Fig. 6 .
Definition 1: A stall pattern is a set s of codeword positions, for which every row and column involving positions in s has at least t + 1 positions in s. We note that this definition includes stall patterns that are correctable, since an incorrect decoding may fortuitously cause one or more bits in s to be corrected, which could then lead to all bits in s eventually being corrected. In this section, we obtain an estimate for the error floor by overbounding the probabilities of these events, and pessimistically assuming that every stall pattern is uncorrectable (i.e., if any stall pattern appears during the course of decoding, it will appear in the final output). The methods presented for the error floor analysis apply to a general staircase code, but for simplicity of the presentation, we will focus on a staircase code with m = 510 and doubly-extended triple-error-correcting component codes.
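Definition 1 translates directly into a membership test. The Python sketch below assumes that codeword positions are given as (row, column) pairs in an array in which every row and every column is a component codeword; the function name is hypothetical.

```python
from collections import Counter

def is_stall_pattern(positions, t):
    """True if every row and column touched by `positions` holds at least
    t + 1 of them, so that no t-error-correcting component decoder can act."""
    positions = set(positions)
    if not positions:
        return False
    row_counts = Counter(row for row, _ in positions)
    col_counts = Counter(col for _, col in positions)
    return (all(n >= t + 1 for n in row_counts.values())
            and all(n >= t + 1 for n in col_counts.values()))

# For t = 3, a full 4 x 4 sub-grid (16 positions) is a stall pattern ...
grid = [(r, c) for r in (2, 5, 7, 11) for c in (0, 3, 4, 9)]
print(is_stall_pattern(grid, t=3))        # True
# ... but removing any one position leaves a row and a column with only 3.
print(is_stall_pattern(grid[1:], t=3))    # False
```

The 16-position example is the kind of pattern counted in the analysis that follows for triple-error-correcting component codes.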
A. A Union Bound Technique
Due to the streaming nature of staircase codes, it is necessary to account for stall patterns that span (possibly multiple) consecutive blocks. In order to determine the bit-error-rate due to stall patterns, we consider a fixed block $B_i$, and the set of stall patterns that include positions in $B_i$. Specifically, we 'assign' to $B_i$ those stall patterns that include symbols in $B_i$ (and possibly additional positions in $B_{i+1}$) but no symbols in $B_{i-1}$. Let $S_i$ represent the set of stall patterns assigned to $B_i$. By the union bound, we then have
Therefore, bounding the error floor amounts to enumerating the set S i , and evaluating the probabilities of its elements being in error.
B. Bounding the Contribution Due to Minimal Stalls
Definition 2: A minimal stall pattern has the property that there are only t + 1 rows with positions in s, and only t + 1 columns with positions in s.
The minimal stall patterns of a staircase code can be counted in a straightforward manner; the multiplicity of minimal stall patterns that are assigned to B i is
and we refer to the set of minimal stall patterns by $S_{\min}$. The probability that the positions in some minimal stall pattern $s$ are received in error is $p^{16}$. Next, we consider the case in which not all positions in some minimal stall pattern $s$ are received in error, but that due to incorrect decoding(s), all positions in $s$ are, at some point during decoding, simultaneously in error. For some fixed $s$ and $l$, $1 \le l \le 16$, there are $\binom{16}{l}$ ways in which $16 - l$ positions in $s$ can be received in error. For the moment, let us assume that erroneous bit flips occur independently with some probability $\zeta$, and that $\zeta$ does not depend on $l$. Then we can overbound the probability that a particular minimal stall $s$ occurs by

$$\sum_{l=0}^{16} \binom{16}{l} p^{16-l} \zeta^{l} = (p + \zeta)^{16}.$$
In order to provide evidence in favor of these assumptions, Table I presents empirical estimates, for $l = 0$, $l = 1$ and $l = 2$, of the probability that a minimal stall pattern $s$ occurs during iterative decoding, given that $16 - l$ positions in $s$ are (intentionally) received in error. Note that even if a minimal stall is received, there exists a non-zero probability that it will be corrected as a result of erroneous decodings; we will ignore this effect in our estimation, i.e., we make the worst-case assumption that any minimal stall persists. Furthermore, from the results for $l = 1$ and $l = 2$, it appears that our stated assumptions regarding $\zeta$ hold true, and $\zeta \approx 5.8 \times 10^{-4}$. For $l > 2$, we did not have access to sufficient computational resources for estimating the corresponding probabilities. Nevertheless, based on the evidence presented in Table I, the error floor contribution due to minimal stall patterns is estimated as
where $\zeta = 5.8 \times 10^{-4}$ when $p = 4.8 \times 10^{-3}$.
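The per-pattern bound used for minimal stalls follows from the binomial theorem: summing, over l, the number of ways that 16 − l positions are received in error (each with probability p) while the remaining l are flipped by erroneous decodings (each with probability ζ) gives (p + ζ)^16. The Python sketch below checks this identity numerically for the quoted values of p and ζ; it covers only the per-pattern probability, not the multiplicity of stall patterns, and the function name is illustrative.

```python
from math import comb, isclose

def minimal_stall_probability(p, zeta, size=16):
    """Sum over l of C(size, l) * p^(size - l) * zeta^l."""
    return sum(comb(size, l) * p ** (size - l) * zeta ** l
               for l in range(size + 1))

p, zeta = 4.8e-3, 5.8e-4
bound = minimal_stall_probability(p, zeta)
print(f"sum over l : {bound:.3e}")
print(f"(p+zeta)^16: {(p + zeta) ** 16:.3e}")
print(f"p^16 alone : {p ** 16:.3e}")
```

Comparing the last two printed values shows how much the erroneous-decoding term ζ inflates the bound relative to counting only patterns that are received entirely in error.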
C. Bounding the Contribution Due to Non-minimal Stalls
We now wish to account for the error floor contribution of non-minimal stalls, e.g., the stall pattern illustrated in Fig. 7 . In the general case, a stall pattern s includes codeword positions in K rows and L columns, K ≥ 4, L ≥ 4; we refer to these as
where the lower bound follows from the fact that every row and column (in the stall) includes at least 4 positions. Note that there are
ways to select the involved rows and columns.
Fig. 7. A non-minimal stall pattern for a staircase code with a triple-error correcting component code.
For a fixed $(K, L) \neq (4, 4)$ and a fixed choice of rows and columns, we now proceed to overbound the contributions of candidate stall patterns. Without loss of generality, we assume that $K \ge L$, and note that there are $\binom{L}{4}^K$ ways of choosing $l = 4K$ elements (in the $L \cdot K$ 'grid' induced by the choice of rows and columns) such that each column includes exactly four elements, and that every stall pattern 'contains' at least one of these. Now, since a stall pattern includes $l$ elements, the number of stall patterns with $l$ elements is overbounded as
For a general $(K, L) \neq (4, 4)$, it follows that the number of stall patterns with $l$ elements,
Finally, over the choice of the K rows and L columns, there are
For a fixed K and L, the contribution to the error floor can be estimated as
and in Table II, we provide values for various $K$ and $L$, when $\zeta = 5.8 \times 10^{-4}$ and $p = 4.8 \times 10^{-3}$. Note that the dominant contribution to the error floor is due to minimal stall patterns (i.e., $K = L = 4$), and that the overall estimate for the error floor of the code is $3.8 \times 10^{-21}$. Finally, we note that by a similar (but more cumbersome) analysis, the error floor of the G.709-compliant staircase code is estimated to occur at $4.0 \times 10^{-21}$.
VI. SIMULATION RESULTS
In Fig. 8, simulation results (generated in hardware on an FPGA implementation) are provided for the G.709-compatible staircase code, for $L = 7$. We also present the bit-error-rate curves for the G.975 RS code, as well as the G.975.1 codes described in Section II. For an output error rate of $10^{-15}$, the staircase code provides approximately 9.41 dB net coding gain, which is within 0.56 dB of the Shannon Limit, and an improvement of 0.42 dB relative to the best G.975.1 code.
VII. CONCLUSIONS
We proposed staircase codes, a class of product-like FEC codes that provide reliable communication for streaming sources. Their construction admits low-latency encoding and variable-latency decoding, and a decoding algorithm with an efficient hardware implementation. For $R = 239/255$, a G.709-compatible staircase code was presented, and performance within 0.56 dB of the Shannon Limit at $10^{-15}$ was demonstrated via an FPGA-based simulation.
APPENDIX
This section briefly describes known techniques for efficiently decoding triple-error-correcting binary BCH codes, and discusses the data-flow associated with a lookup-table-based decoder architecture.
For a syndrome $S = (S_1, S_3, S_5)$, $S_i \in \mathbb{F}_{2^m}$, we first compute $D_3 = S$ … It remains to determine the roots of $\sigma(x)$. For $v = 1$, it is trivial to determine the error location. For $v = 2$ or $v = 3$, lookup-based methods for solving the corresponding quadratic and cubic equations are described in [17], [18]. In the remainder of this section, we briefly describe these methods, and discuss their data-flow.
For a quadratic equation $f_X(x) = x^2 + ax + b$ with $a \neq 0$, substitute $x = ay$ to obtain $f_Y(y) = a^2(y^2 + y + b/a^2)$.
If $f_Y(r) = 0$ then $f_X(ar) = 0$. Thus the problem of finding roots of $f_X(x)$ reduces to the problem of finding roots of the suppressed quadratic $f_Y(y)$, which can be solved by lookup using a table with $2^m$ entries, each of which is a pair of elements in $\mathbb{F}_{2^m}$. Therefore, when $v = 2$, decoding requires $2m$ bits to be read from a lookup-table memory.
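The quadratic case can be illustrated with a small Python sketch. To keep the table tiny, it assumes the field $\mathbb{F}_{2^4}$ built from the primitive polynomial x^4 + x + 1, rather than the degree-10 field relevant to the codes in this paper; the helper names are hypothetical. The table is indexed by the constant term of the suppressed quadratic y^2 + y + c and stores its pair of roots, when they exist.

```python
FIELD_DEGREE = 4
POLY = 0b10011   # x^4 + x + 1, primitive over GF(2) (assumed toy field)

def gf_mul(a, b):
    """Multiply two field elements (carry-less multiply, reduced modulo POLY)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> FIELD_DEGREE:
            a ^= POLY
    return result

def gf_inv(a):
    """Inverse of a non-zero element, computed as a^(2^m - 2)."""
    result = 1
    for _ in range(2 ** FIELD_DEGREE - 2):
        result = gf_mul(result, a)
    return result

# TABLE[c] holds the two roots (r, r + 1) of y^2 + y + c, or None if there are none.
TABLE = [None] * (1 << FIELD_DEGREE)
for y in range(1 << FIELD_DEGREE):
    c = gf_mul(y, y) ^ y
    if TABLE[c] is None:
        TABLE[c] = (y, y ^ 1)

def quadratic_roots(a, b):
    """Roots of x^2 + a x + b for a != 0, via the substitution x = a y."""
    c = gf_mul(b, gf_inv(gf_mul(a, a)))      # constant term b / a^2
    entry = TABLE[c]                          # single table read: two field elements
    if entry is None:
        return None
    return tuple(gf_mul(a, r) for r in entry)

print(quadratic_roots(3, 0))   # b = 0 always gives the roots 0 and a
```

Each lookup returns two field elements, which corresponds to the 2m bits per decoding counted in the text.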
Similarly, for a cubic equation $f_X(x) = x^3 + ax^2 + bx + c$, substitute $x = y + a$ to obtain $f_Y(y) = y^3 + (a^2 + b)y + ab + c$.
Note that $y f_Y(y)$ is a linearized polynomial with respect to $\mathbb{F}_2$ and hence the set of zeros of $y f_Y(y)$ is a vector space over $\mathbb{F}_2$. In particular, the roots of $y f_Y(y)$, if distinct, are of the form $\{0, r_1, r_2, r_1 + r_2\}$. Thus only $r_1$ and $r_2$ need to be stored in the lookup table.
Two cases arise, depending on the value of $a^2 + b = D_5/D_3$. If $D_5 = 0$, so that $a^2 + b = 0$, then $f_Y(y) = y^3 + ab + c$, and the roots can be found by finding the cube roots of $ab + c = D_3$, which requires lookup using a table with $2^m$ entries, each of which is a pair of elements in $\mathbb{F}_{2^m}$. Otherwise, a change of variable reduces $f_Y(y)$ to a suppressed cubic $f_Z(z)$, whose roots can be found by lookup using a table with $2^m$ entries, each of which is a pair of elements in $\mathbb{F}_{2^m}$. Therefore, in either case, decoding requires $2m$ bits to be read from a lookup-table memory.
Finally, for $n = n_1 = n_2$, the data-flow contribution of the lookup-table-based decoding architecture is $4mvD/(nR)$. For $n = 1000$, $m = 10$, $v = 4$, $R = 239/255$ and $D = 100$ Gb/s, the corresponding data-flow is 17.1 Gb/s, which is small relative to the data-flow that arises due to those effects considered in Section III-A2.
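The quoted figure is easy to reproduce. The Python sketch below reads the expression as 2m bits per component decoding, v decodings per codeword, with rows and columns contributing equally; that reading is an interpretation of the formula, and the function name is illustrative.

```python
def lookup_dataflow(m, v, D, n, R):
    """Lookup-table reads in bits/s: 2m bits per decoding, rows plus columns."""
    per_dimension = 2 * m * v * D / (n * R)
    return 2 * per_dimension

flow = lookup_dataflow(m=10, v=4, D=100e9, n=1000, R=239 / 255)
print(f"{flow / 1e9:.1f} Gb/s")
```

This is well under a tenth of the roughly 293 Gb/s syndrome-domain data-flow derived in Section III.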
