Abstract-Digital back propagation (DBP) is often proposed and implemented offline for the mitigation of nonlinear impairments in long-haul fiber communications. However, its complexity in terms of chip area and power consumption in a realistic application-specific integrated circuit (ASIC) implementation is yet to be determined. Here, we implement time-domain DBP (TD-DBP) in a 28-nm fully-depleted silicon-on-insulator process technology, considering digital implementation aspects such as limited-resolution arithmetic and finite-length filters. As an example, we choose a coherent optical transmission system, viz. a single-channel, single-polarization, 20-GBd 16-QAM system, for which DBP is known to perform well. For the considered system, we find that TD-DBP can enable a reach increase from 3400 to 5300 km, at a power dissipation of <20 W (or, equivalently, an energy dissipation of <230 pJ/bit), at a pre-FEC BER of $10^{-2}$.
I. INTRODUCTION
THIS letter presents our efforts to increase the reach of long-haul fiber-optic links by mitigating the nonlinear distortion caused by the fiber Kerr nonlinearity. Digital signal processing (DSP) for nonlinearity mitigation is recognized as a computationally overwhelming problem, and consequently recent work approaches this problem from a complexity perspective [1]-[4], showing significant reductions in algorithmic complexity. While perturbation-based nonlinearity compensation has been demonstrated in real time [5], papers that address how to implement nonlinearity mitigation algorithms in real-time DSP circuits are largely missing.
Digital back propagation (DBP) has been proposed as an approach to fiber nonlinearity mitigation. However, a major obstacle in real-time DBP implementation is the repeated use of fast Fourier transforms, which not only leads to very complex circuits but also introduces rounding errors in the limited-resolution arithmetic required in an application-specific integrated circuit (ASIC) implementation. Assuming a split-step algorithm [6] for DBP, the impulse-response lengths of the dispersive steps are small and, thus, a time-domain implementation of these steps can be competitive when considering limited-resolution implementations [7]. We previously investigated how to design a DBP algorithm for real-time DSP, showing that the time-domain DBP (TD-DBP) algorithm [8] is suitable for limited-resolution ASIC arithmetic. We later presented a method for co-designing the quantized time-domain dispersive steps [9], which further reduces requirements on ASIC arithmetic resources. While our previous work on TD-DBP focused on limited-resolution aspects, this letter focuses on the ASIC implementation of TD-DBP. We present a circuit implementation together with power dissipation and chip area results, demonstrating that TD-DBP enables nonlinearity compensation with a limited power dissipation overhead when compared to linear chromatic-dispersion compensation.
II. THE TD-DBP ALGORITHM
The nonlinear Schrödinger equation [10] describes light propagation in an optical fiber.
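For orientation, one commonly used operator form of this equation is sketched here. The field symbol $E$ and the parameter names $\alpha$ (attenuation), $\beta_2$ (group-velocity dispersion), and $\gamma$ (nonlinear coefficient) are assumed for illustration, and sign conventions vary between references:

```latex
\frac{\partial E}{\partial z} = \left(\hat{D} + \hat{N}\right) E, \qquad
\hat{D} = -\frac{\alpha}{2} - j\,\frac{\beta_2}{2}\,\frac{\partial^2}{\partial t^2}, \qquad
\hat{N} = j\,\gamma\,\lvert E \rvert^2
```

In this form, $\hat{D}$ collects the linear effects (loss and dispersion) and $\hat{N}$ the Kerr nonlinearity.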
No analytic solutions exist in the general case, so numerical approximation is necessary. Simulations in the nonlinear regime generally use the split-step Fourier method [6]: under the assumption that the fiber is split into short propagation steps, which are cascaded to cover the entire fiber link length, the dispersive ($\hat{D}$) and nonlinear ($\hat{N}$) steps can be approximated as independent. DBP then uses the split-step Fourier method to estimate the transmitted signal by simulating backward propagation of a received signal through a fiber with inverted parameters [10]. However, because split-step DBP uses many short steps, time-domain techniques can be competitive: in contrast to frequency-domain techniques, they avoid the overhead of repeated transformations between the time and frequency domains. The TD-DBP algorithm uses finite impulse-response (FIR) filters to implement per-step dispersion compensation. In discrete time, a single TD-DBP step can be formulated as follows:
where $h_{\mathrm{CDC}}(z)$ is the impulse response of a discrete-time filter capable of compensating for the low accumulated chromatic dispersion (CD) corresponding to the step length in DBP.
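The structure of such a step can be illustrated with a minimal floating-point sketch. The function names, the ordering (dispersive filter first, nonlinear phase rotation second), the sign of `gamma_z`, and the toy filter taps are all assumptions made for illustration; the taps are placeholders, not an LS-CO design.

```python
import cmath

def fir_filter(samples, taps):
    """Causal FIR filter y[n] = sum_k taps[k] * samples[n - k], zero initial state."""
    out = []
    for n in range(len(samples)):
        acc = 0j
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out

def td_dbp_step(samples, h_cdc, gamma_z):
    """One hypothetical TD-DBP step: dispersive FIR filter, then phase rotation.

    gamma_z is the signed product of nonlinear coefficient and step length.
    Its sign and the D-then-N ordering are assumptions of this sketch.
    """
    dispersed = fir_filter(samples, h_cdc)
    # Rotate each sample by a phase proportional to its instantaneous power.
    return [s * cmath.exp(1j * gamma_z * abs(s) ** 2) for s in dispersed]

# Toy received samples and a 3-tap placeholder filter (illustrative values only).
rx = [1 + 0j, 0.5j, -1 + 0j, 0.25 - 0.25j]
h_toy = [0.1j, 1.0, 0.1j]
out = td_dbp_step(rx, h_toy, gamma_z=-0.05)
```

Cascading one such call per span, each with its own filter, would give a one-step-per-span back-propagation chain.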
III. FIXED-POINT IMPLEMENTATION OF TD-DBP
Prior to hardware implementation, we establish the limited-resolution (fixed-point) requirements on TD-DBP.
A. System Context
We use the same simulation setup as in our previous work [8], [9], i.e., single-channel, single-polarization, 16-QAM transmission. Here, we use a one-step-per-span (1-StPS) TD-DBP algorithm. Each $\hat{D}$ uses the least-squares constrained-optimization (LS-CO) filter [11] to compensate for the accumulated dispersion corresponding to the step length, with the in-band response optimized with respect to the pulse-shaped spectrum.
B. Filter Coefficient Selection
As described above, TD-DBP uses a short FIR filter to compensate for the dispersive behavior for each step of the optical fiber. Ideal compensation would result in perfect reversal of fiber dispersion, but could be approached only at the cost of large filter lengths and high-accuracy filter coefficients, each of which increases chip area and power dissipation.
We have found a filter length of 25 taps to give sufficient performance for our 1-StPS system with floating-point coefficient values. However, quantization to fixed-point coefficients introduces impulse-response and therefore frequency-response deviations from the floating-point case. The cascade structure of the TD-DBP algorithm causes the deviations to accumulate, such that, e.g., a 0.1-dB passband peak results in a 1-dB peak after ten spans. We use the overall filter gain as a design parameter to give us an extra degree of freedom when choosing the quantized tap values [9], and select the fixed-point impulse response that best approximates the perfect reversal. For a further improvement, we alternate two different fixed-point filters in consecutive steps, selecting the best pair [9] of responses from all combinations of two filters of a given coefficient resolution. These well-chosen quantized filter versions help us achieve near-floating-point performance with short coefficient word lengths.
Fig. 2. BER as a function of number of spans for signal resolutions of 8 or 9 bits and 7-9 bit coefficients. For reference, we include 1) a floating-point implementation of TD-DBP and 2) floating-point linear CD compensation (CDC). Each case makes use of its optimal launch power.
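The idea of using the overall filter gain as a free parameter during coefficient quantization can be sketched as follows. This is a simplified illustration, not the method of [9]: it scores candidates with a plain time-domain squared error rather than a frequency-response criterion, and the tap values, gain range, and function names are hypothetical.

```python
import math

def quantize(value, bits):
    """Round to nearest on a signed `bits`-bit grid covering [-1, 1)."""
    scale = 2 ** (bits - 1)
    code = math.floor(value * scale + 0.5)
    code = max(-scale, min(scale - 1, code))  # saturate to the representable range
    return code / scale

def quantize_taps(taps, bits, gain):
    """Scale the taps by `gain`, then quantize real and imaginary parts."""
    return [complex(quantize(gain * t.real, bits), quantize(gain * t.imag, bits))
            for t in taps]

def shape_error(taps, bits, gain):
    """Squared error between the gain-normalized quantized taps and the ideal taps."""
    quantized = quantize_taps(taps, bits, gain)
    return sum(abs(q / gain - t) ** 2 for q, t in zip(quantized, taps))

def best_gain(taps, bits, gains):
    """Pick the candidate gain whose quantized response best matches the ideal shape."""
    return min(gains, key=lambda g: shape_error(taps, bits, g))

# Hypothetical symmetric 5-tap response; real LS-CO taps would come from the filter design.
toy_taps = [0.11 + 0.32j, -0.27 + 0.45j, 0.61 - 0.08j, -0.27 + 0.45j, 0.11 + 0.32j]
gains = [1.0] + [0.80 + 0.01 * i for i in range(41)]
g = best_gain(toy_taps, 7, gains)
```

Because a common gain only scales the signal, a gain that happens to place the taps closer to the quantization grid reduces the shape error at no hardware cost.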
C. Signal Resolution and Rounding
Due to the large number of cascaded steps in a DBP implementation, the choice of signal resolution and type of rounding becomes very important. While truncation carries no cost in terms of hardware, it rounds in the direction towards negative infinity (e.g., 0.8 is rounded to 0, while -0.8 is rounded to -1). This behavior imparts a bias on the signal, which causes the DBP algorithm to fail. A well-performing (but still low-complexity) rounding method is to add half a unit of least precision to the result before truncation. This choice results in rounding to the closest representable number, with the rare ties broken in the direction towards positive infinity.
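The difference between the two rounding modes can be made concrete with a small sketch. The helper names and the test grid are chosen for illustration only. It measures the mean rounding error over values spread symmetrically around zero.

```python
import math

def truncate(x, frac_bits):
    """Drop low-order bits of a two's-complement value: rounds toward -infinity."""
    scale = 2 ** frac_bits
    return math.floor(x * scale) / scale

def round_half_up(x, frac_bits):
    """Add half an LSB, then truncate: nearest value, ties toward +infinity."""
    scale = 2 ** frac_bits
    return math.floor(x * scale + 0.5) / scale

def mean_error(rounder, values, frac_bits):
    """Average signed rounding error; a nonzero mean is a bias on the signal."""
    return sum(rounder(v, frac_bits) - v for v in values) / len(values)

# Fine grid in [-2, 2), rounded to 4 fractional bits (LSB = 1/16).
values = [i / 1024 for i in range(-2048, 2048)]
bias_truncate = mean_error(truncate, values, 4)
bias_round = mean_error(round_half_up, values, 4)
```

On this grid, truncation leaves a mean error of about minus half an LSB, whereas adding half an LSB before truncation leaves only a small positive residue due to the ties.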
In the fixed-point implementations in this work, a first-order Taylor expansion is used for implementing the $\hat{N}$ complex exponential. Launch power was swept in 1-dB increments at 32 spans of propagation, and the optimal launch power was found to be approximately 0 dBm when employing the 1-StPS TD-DBP algorithm, -1 dBm for TD-DBP with first-order Taylor expansion, and -4 dBm when only floating-point CD compensation (CDC) is performed. The placement of the nonlinear operator within the span was also investigated, and optimal placement was found to be at 63 % of the span for the Taylor-expanded nonlinear operator (Fig. 1), while it was at 66 % of the span for the full exponential.
Based on their performance as compared to CDC only, we chose internal signal resolutions of 8 and 9 bits, along with best-pair optimized coefficient sets of 7-9 bits. Fig. 2 shows bit-error rate (BER) as a function of number of spans for the fixed-point cases considered for implementation, at the respective optimal launch power and nonlinear operator placement. Since longer word lengths increase area and power dissipation, word lengths should be as short as possible.
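To give a feel for the approximation, the sketch below compares $e^{j\varphi}$ with its first-order Taylor expansion $1 + j\varphi$ for a few phase values. The phase values are arbitrary illustrations, not values taken from the simulated system.

```python
import cmath

def taylor_error(phi):
    """Magnitude of the difference between exp(j*phi) and 1 + j*phi."""
    return abs(cmath.exp(1j * phi) - (1 + 1j * phi))

# The first-order Taylor remainder is bounded by phi^2 / 2.
for phi in (0.01, 0.05, 0.1, 0.2):
    print(f"phi={phi:4.2f} rad  error={taylor_error(phi):.2e}  bound={phi ** 2 / 2:.2e}")
```

The error grows quadratically with the per-step nonlinear phase, which is consistent with the approximation favoring a somewhat lower launch power than the full exponential.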
IV. IMPLEMENTATION AND EVALUATION METHODOLOGY
The $\hat{D}$ block in Fig. 1 is implemented using parallel-input parallel-output FIR filters, with coefficient symmetry exploited to reduce complexity. Fig. 3 shows a block diagram of the TD-DBP step implementation, and the placement of pipelining registers. The first-order Taylor expansion is used for the $\hat{N}$ complex exponential since it has a low implementation complexity. Compared to the full exponential, this hardware-friendly solution gives a limited performance degradation, as shown by the "FP, Taylor" curve in Fig. 2. The calculation of instantaneous power is fairly insensitive to rounding noise, and rounding is therefore performed before squaring of the magnitude in order to reduce the word length of the multipliers. The resulting output is multiplied with $z\gamma$ (see Eq. 2), and the first-order Taylor estimation of the exponential function is performed.
The strict DSP throughput requirement that results from the 20-GBd transmission target (Sec. III-A) means that the circuit implementation needs to use many parallel DSP lanes in order to operate at a reasonable clock rate. At the specified 2 samples per symbol, a 64-parallel operation yields a clock-rate requirement of 625 MHz, which is reasonable for the ASIC process technology used here.
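The data flow of the nonlinear block described above can be sketched in a few lines. The number of fractional bits used for the power calculation, the function names, and the choice to apply the rotation to the unrounded sample are assumptions of this sketch rather than details of the implemented circuit.

```python
import math

def round_half_up(x, frac_bits):
    """Add half an LSB, then truncate to `frac_bits` fractional bits."""
    scale = 2 ** frac_bits
    return math.floor(x * scale + 0.5) / scale

def nonlinear_block(sample, gamma_z, power_frac_bits):
    """Taylor-expanded nonlinear step with a reduced-resolution power calculation."""
    # Round real and imaginary parts first, so the squarers see short operands.
    re = round_half_up(sample.real, power_frac_bits)
    im = round_half_up(sample.imag, power_frac_bits)
    power = re * re + im * im
    # Scale the instantaneous power by the (signed) product z * gamma.
    phi = gamma_z * power
    # exp(j*phi) ~ 1 + j*phi: one complex multiply instead of a sine/cosine lookup.
    return sample * complex(1.0, phi)

# Example: 4 fractional bits in the power path, illustrative gamma_z.
y = nonlinear_block(0.53 + 0.26j, -0.1, 4)
```

Here the rounded operands 0.5 and 0.25 give a power of 0.3125 instead of the exact 0.3485, a small phase deviation that illustrates why this path tolerates coarse rounding.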
We also consider 96-parallel implementations in which the clock rate can be reduced to 416.7 MHz. The switching power dissipation of digital circuits can be described as $P_{\mathrm{sw}} = f C_\alpha V_{DD}^2$, where $f$ is the clock rate, $V_{DD}$ is the supply voltage, and $C_\alpha$ is the switched capacitance. This equation shows that increasing parallelism can be used to reduce power dissipation: for a given $f$, increased parallelism raises the circuit throughput. This throughput slack is then eliminated by reducing $f$ via a reduced $V_{DD}$. While increasing parallelism leads to a linearly increasing $C_\alpha$, $P_{\mathrm{sw}}$ depends linearly on $f$ and quadratically on $V_{DD}$. The efficacy of this tradeoff depends on context; as will be demonstrated below, here it is beneficial.
The 64- and 96-parallel TD-DBP steps are implemented using a hardware description language (VHDL), for signal resolutions of 8-9 bits and coefficient word lengths of 7-9 bits. The designs are synthesized using Cadence Genus and a 28-nm low-power fully-depleted silicon-on-insulator (FD-SOI) cell library, which was characterized at a $V_{DD}$ of 0.8 V (64-parallel) or 0.6 V (96-parallel), 125 °C, and worst-case transistor delays.
The synthesized TD-DBP step configurations are simulated with input data from the fiber-optic system simulation setup (Sec. III-A). From these simulations, the per-node circuit switching activity, which is necessary for calculation of an aggregated $C_\alpha$, is extracted and back-annotated to the circuit netlist. The ensuing power estimation is performed at 0.8 V (64-parallel) or 0.6 V (96-parallel), 25 °C, and typical-case transistor delays, again using Cadence Genus. Tables I and II show the estimated power dissipation and area for one 100-km span for the 64- and 96-parallel implementations, respectively. Increasing parallelism and decreasing supply voltage give significant power reductions, at the expense of increased chip area. While there is a relatively minor power increase when increasing coefficient resolution from 7 to 8 bits, there is a rather large cost for 9 bits. This is likely caused by an increasing delay of critical signals, which requires additional logic circuits to meet timing constraints.
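The clock rates and the idealized effect of the parallelism/voltage tradeoff follow from simple arithmetic, sketched below. The switched capacitance is normalized to 1 for the 64-parallel case and assumed to scale exactly with the lane count; leakage and synthesis effects are ignored, so the ratio is a first-order illustration, not a prediction of the synthesized results.

```python
def clock_rate_hz(symbol_rate_bd, samples_per_symbol, lanes):
    """Clock rate needed to process the full sample stream on `lanes` parallel lanes."""
    return symbol_rate_bd * samples_per_symbol / lanes

def switching_power(f_hz, c_alpha, vdd):
    """Switching power P_sw = f * C_alpha * V_DD^2."""
    return f_hz * c_alpha * vdd ** 2

f64 = clock_rate_hz(20e9, 2, 64)   # 625 MHz
f96 = clock_rate_hz(20e9, 2, 96)   # about 416.7 MHz

c64 = 1.0                # normalized switched capacitance (assumption)
c96 = c64 * 96 / 64      # assumed to grow linearly with parallelism

ratio = switching_power(f96, c96, 0.6) / switching_power(f64, c64, 0.8)
print(f"f64 = {f64 / 1e6:.1f} MHz, f96 = {f96 / 1e6:.1f} MHz, P96/P64 = {ratio:.4f}")
```

In this idealized model the lower clock rate and the larger capacitance cancel, and the remaining factor $(0.6/0.8)^2 \approx 0.56$ comes entirely from the reduced supply voltage.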
V. RESULTS AND DISCUSSION
To the best of our knowledge, no estimates of power dissipation for DBP have been published. Fig. 4 shows power dissipation as a function of reach at a BER of $10^{-2}$ and $10^{-3}$ for the considered TD-DBP configurations. While previous work on linear CD compensation (CDC) may not be representative of current systems in terms of power efficiency, it gives power dissipation numbers for static equalization that help us establish the feasibility of DBP. Thus, Fig. 4 also shows CDC power dissipation, both an estimation [12] and an actual implementation [13], scaled to the same information throughput. We have also included an estimation of a 25% reduction of $V_{DD}$ for CDC, under the assumption that the clock rate can remain the same. The best-performing system from Fig. 4, in terms of reach, gives a power dissipation of 19.6 W for 5300-km propagation, resulting in a reach increase of 1900 km compared to floating-point CDC (see Fig. 2).
CDC for 2400-km propagation has been estimated at 94 pJ/bit (or 113 pJ/information bit at a code overhead of 20 %) in 28-nm CMOS [12], which is to be compared with 92 pJ/bit for TD-DBP. A 40-nm ASIC receiver implementation demonstrated 221 pJ/bit for CDC of 3500 km of fiber [13]; assuming scaling according to [14], this would translate to around 147 pJ/bit in a 28-nm process technology. In contrast, compensation for 3500 km using TD-DBP would expend 134 pJ/bit. While these comparisons are rather crude, and are not intended to provide a one-to-one complexity comparison, they nevertheless show that TD-DBP offers a practical route to implementation of digital back propagation, with power dissipation feasible for ASIC implementation. More importantly, TD-DBP makes reach distances attainable that would be impossible with CDC only, regardless of signal and coefficient resolution or power dissipation (see Fig. 2). Fig. 5 shows BER, power dissipation, and energy per bit as a function of number of 100-km spans for a 96-parallel implementation at 0.6 V. Although TD-DBP is a complex algorithm, these numbers indicate that DSP implementation of nonlinearity mitigation is indeed feasible in a mainstream 28-nm process technology.
The impulse-response length of CDC filters scales quadratically with symbol rate. Sweeping the filter length with parameters as in our 20-GBd case, and not allowing in-band error power and peak out-of-band gain to increase, results in 53, 89, and 135 taps for 30, 40, and 50 GBd, respectively, thus confirming an approximately quadratic scaling with symbol rate. Since filtering dominates power dissipation, energy per bit will increase similarly. On the other hand, ASIC process technology has been steadily improving: without taking into account logic optimizations made possible by faster process technologies, recent scaling trends [14] suggest that energy dissipation can be reduced by approximately 75% by scaling from 28 to 7 nm, thus enabling > 40 GBd operation.
The nonlinear $\hat{N}$ block in Fig. 1 draws less than 10 % of total power in all considered configurations. Assuming the Manakov model, the only extra operation required for two polarizations is a single adder in each channel of the DBP algorithm [1]; implementing dual-polarization compensation will thus approximately double power dissipation and area.
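The conversions between power, energy per bit, and energy per information bit used in these comparisons can be sketched as follows. The 80-Gbit/s line rate is an assumption derived from the system description (20 GBd, 16-QAM with 4 bits per symbol, single polarization), and the function names are hypothetical.

```python
def energy_per_bit_pj(power_w, bit_rate_bps):
    """Energy per transmitted bit in picojoules."""
    return power_w / bit_rate_bps * 1e12

def per_information_bit(energy_pj, code_overhead):
    """Energy per information bit, given the code overhead as a fraction."""
    return energy_pj * (1 + code_overhead)

# Assumed line rate: 20 GBd x 4 bits/symbol (16-QAM), single polarization.
line_rate = 20e9 * 4

# 16.2 W is the power quoted in the conclusion for 5100 km.
tddbp_pj = energy_per_bit_pj(16.2, line_rate)

# 94 pJ/bit at 20 % code overhead, as quoted for the CDC estimate [12].
cdc_info_pj = per_information_bit(94, 0.20)

print(f"TD-DBP: {tddbp_pj:.1f} pJ/bit, CDC: {cdc_info_pj:.1f} pJ/information bit")
```

Under these assumptions, the two results land within rounding of the 203 pJ/bit and 113 pJ/information bit figures quoted in the text.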
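The scaling claim can be checked with a log-log least-squares fit of tap count against symbol rate. Using the 25-tap filter as the 20-GBd baseline is an assumption of this sketch, since the sweep is described as using the parameters of the 20-GBd case.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) versus log(x), i.e., the fitted power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx

symbol_rates_gbd = [20, 30, 40, 50]
tap_counts = [25, 53, 89, 135]   # 25 taps at 20 GBd assumed as the baseline

exponent = loglog_slope(symbol_rates_gbd, tap_counts)
print(f"fitted scaling exponent: {exponent:.2f}")
```

The fitted exponent comes out a little below 2, in line with the approximately quadratic scaling stated above.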
VI. CONCLUSION
We have shown that TD-DBP is feasible to implement in a mainstream 28-nm ASIC process technology. Considering the case of 5100 km at a BER of $10^{-2}$, with single-channel, single-polarization propagation, we estimate a power dissipation of 16.2 W (Fig. 5), which corresponds to an energy dissipation of 203 pJ/bit.
