We consider time-domain digital backpropagation with chromatic dispersion filters jointly optimized and quantized using machine-learning techniques. Compared to the baseline implementations, we show improved BER performance and >40% power dissipation reductions in 28-nm CMOS.
Introduction
Fiber nonlinearities impose a fundamental limitation on transmission performance, and various nonlinear compensation schemes have been proposed. Our focus is on digital backpropagation (DBP), which emulates backward fiber propagation using digital signal processing (DSP). Different optimizations of DBP algorithms have been studied [1] [2] [3] [4], but only recently have DSP hardware implementation aspects been considered [5] [6] [7].
A major issue with DBP based on the split-step Fourier method (SSFM) is the large complexity caused by the fast Fourier transforms (FFTs). Time-domain DBP (TD-DBP) with finite impulse response (FIR) filters may be competitive [5] [6] [7] [8] [9] [10] [11], assuming that the chromatic dispersion (CD) steps are sufficiently short. Design methods for the FIR filters include least squares [5, 6, 12] or wavelets [9], but accumulating truncation errors due to repeated filter use can lead to severe performance degradations. Ideally, the coefficients of all filters in the entire DBP algorithm should be optimized jointly. It has recently been shown that this can be accomplished efficiently using deep learning, leading to very short CD filters per step [10, 11].
In this paper, we study TD-DBP based on deep-learned CD filters from an ASIC implementation perspective. In particular, we evaluate the finite-resolution requirements in terms of the minimum number of quantization bits for the filter coefficients and signal. Moreover, hardware synthesis results for power dissipation and chip area in 28-nm CMOS are presented and discussed.
Time-Domain Digital Backpropagation
Light propagation in an optical fiber is described by the nonlinear Schrödinger equation (NLSE). In general, the NLSE needs to be solved using numerical methods, where, in the context of DBP, the SSFM [1, 13] is the most prominent one. The SSFM divides the transmission distance into $M$ steps of size $\delta_\ell$, $\ell = 1, \dots, M$. The solution for step $\ell$ is then approximated by applying a linear filtering step with frequency response $H_\ell(\omega) = e^{j \frac{\beta_2}{2} \delta_\ell \omega^2}$, where $\beta_2$ is the CD coefficient, and a nonlinear phase rotation step $\rho_\ell(x) = x e^{-j \gamma \delta_\ell |x|^2}$, where $\gamma$ is the Kerr parameter.
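As an illustration of the two operators that make up one split step, the following minimal Python sketch evaluates the all-pass CD response at one frequency and the Kerr phase rotation for one sample. The function names and all numerical parameter values are hypothetical and chosen only for illustration; they are not the parameters of the system considered here.

```python
import cmath
import math

def cd_frequency_response(omega, beta2, delta):
    """All-pass CD response exp(j * (beta2 / 2) * delta * omega^2) for one step."""
    return cmath.exp(1j * 0.5 * beta2 * delta * omega ** 2)

def kerr_phase_rotation(x, gamma, delta):
    """Nonlinear step: rotate the sample x by a phase proportional to its power |x|^2."""
    return x * cmath.exp(-1j * gamma * delta * abs(x) ** 2)

# Assumed, illustrative values (SI units); not taken from the paper.
beta2 = -21.7e-27            # CD coefficient in s^2/m
gamma = 1.3e-3               # Kerr parameter in 1/(W m)
delta = 100e3                # step size in m
omega = 2 * math.pi * 5e9    # angular frequency in rad/s
x = 0.03 + 0.02j             # one complex signal sample in sqrt(W)

H = cd_frequency_response(omega, beta2, delta)
y = kerr_phase_rotation(x, gamma, delta)

# Both operators change only the phase: |H| = 1 and |y| = |x|.
print(abs(H), abs(y), abs(x))
```

Both operators are pure phase rotations, one in the frequency domain and one in the time domain, which is why the linear step can be approximated by a filter and the nonlinear step can be applied sample by sample.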
In TD-DBP, the linear step is implemented as a direct convolution of the signal with a symmetric FIR filter $H^{(\ell)}(z)$. This can be more efficient than FFT-based filtering, where the efficiency crossover point depends on implementation details. A first-order estimate based on our hardware assumptions is between 25 and 30 filter taps, and similar values can be found in the literature [14]. However, this does not take into account fixed-point requirements, which are, to the best of our knowledge, unknown for multiple cascaded FFT/IFFTs as employed in frequency-domain split-step DBP.
Joint Filter Optimization using Deep Learning
The system setup is shown in Fig. 1, where the four quantization blocks can be ignored for now. TD-DBP [5, 6] is based on a 1-step-per-span (StPS) symmetric SSFM [13] with simplified "hardware-friendly" nonlinear steps according to a first-order Taylor expansion $\rho_\ell(x) \approx x(1 - j \gamma \delta_\ell |x|^2)$.
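To give a feeling for the accuracy of the first-order simplification, the following sketch compares an exact phase rotation with its Taylor approximation for one sample. Here `phase_coeff` is a hypothetical name for the signed product of Kerr parameter and step size (its sign follows whichever convention is chosen for the nonlinear step), and the numerical values are assumptions for illustration only.

```python
import cmath

def exact_rotation(x, phase_coeff):
    """Exact nonlinear step: x * exp(j * phase_coeff * |x|^2)."""
    return x * cmath.exp(1j * phase_coeff * abs(x) ** 2)

def taylor_rotation(x, phase_coeff):
    """First-order Taylor approximation: x * (1 + j * phase_coeff * |x|^2)."""
    return x * (1 + 1j * phase_coeff * abs(x) ** 2)

phase_coeff = -0.13   # assumed signed value, rad per unit power
x = 0.3 - 0.4j        # assumed sample with power |x|^2 = 0.25

phi = phase_coeff * abs(x) ** 2
error = abs(exact_rotation(x, phase_coeff) - taylor_rotation(x, phase_coeff))
# Taylor remainder bound: |exp(j*phi) - 1 - j*phi| <= phi^2 / 2.
bound = abs(x) * phi ** 2 / 2

print(error, bound)
```

The approximation replaces the complex exponential with one real squaring and a few multiplications and additions, at the cost of an error that grows quadratically with the rotation angle and a slight magnitude increase of the output sample.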
We use $h^{(\ell)}$ to denote the coefficients of $H^{(\ell)}(z)$, where all $M = 33$ filters have $T = 2K + 1$ taps. The coefficients $h^{(\ell)}$ are typically optimized separately for each step, e.g., by approximating $H_\ell(\omega)$ via least squares. A different approach is to perform a joint optimization of all coefficients $\theta = \{h^{(1)}, \dots, h^{(M)}\}$ based on a suitable system criterion. Here, deep learning via stochastic gradient descent is used, similar to [10, 11]. To that end, the system in Fig. 1 is implemented in TensorFlow and the effective SNR is used as the optimization criterion. Compared to [10, 11], a slightly different optimization procedure is employed as follows. All filters are initialized with constrained least-squares [12] (LS-CO) coefficients, with filter length $T' > T$ chosen large enough to ensure good performance. Then, the filters are successively pruned down to their target length $T$ by forcing the corresponding outermost taps to zero at certain iterations in the gradient-descent optimization. A typical learning curve is shown in Fig. 2. While the instantaneous SNR loss due to pruning can be large, gradient descent quickly recovers. We found that the quality of the final filters is relatively insensitive to the pruning details (e.g., which filter tap is pruned in which iteration), as long as the pruning steps are sufficiently spread out and $T$ is not too small.
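The pruning idea can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used for the results: it assumes that one symmetric pair of outermost taps is removed per pruning event and that the events are equally spaced, and the function names, filter lengths, and iteration numbers are all hypothetical.

```python
def pruning_schedule(init_len, target_len, first_iter, spacing):
    """Return {iteration: number of taps kept after pruning at that iteration}."""
    assert init_len % 2 == 1 and target_len % 2 == 1 and init_len >= target_len
    schedule = {}
    length, iteration = init_len, first_iter
    while length > target_len:
        length -= 2                    # drop one outermost tap on each side
        schedule[iteration] = length
        iteration += spacing           # spread the pruning events out
    return schedule

def tap_mask(init_len, active_len):
    """0/1 mask that keeps the central active_len taps of an init_len-tap filter."""
    pad = (init_len - active_len) // 2
    return [0.0] * pad + [1.0] * active_len + [0.0] * pad

def apply_mask(taps, mask):
    """Force pruned taps to zero; would be applied after each gradient update."""
    return [t * m for t, m in zip(taps, mask)]

# Assumed example: prune from 21 to 15 taps, starting at iteration 100.
schedule = pruning_schedule(init_len=21, target_len=15, first_iter=100, spacing=50)
mask = tap_mask(21, schedule[150])
pruned = apply_mask([1.0] * 21, mask)
print(schedule, pruned)
```

In a training loop, the mask belonging to the most recent pruning event would be re-applied after every parameter update so that pruned taps stay at zero while the remaining taps adapt.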
Filter Coefficient and Signal Quantization
The filter optimization assumes floating-point coefficients, whereas quantized coefficients are required for the ASIC implementation. We make use of TensorFlow's fixed-point operations, which allow for a joint optimization of the quantized impulse responses. This approach results in a partial cancellation of quantization-induced frequency-response errors. All steps are jointly optimized, thus allowing for improved cancellation in comparison to pair-wise optimization [6]. In particular, to find good quantized filter coefficients, TensorFlow's "fake quantization" operations are applied to the coefficient variables and activated after the (floating-point) optimization has converged. For fake quantization, the gradient computation and parameter updates are still performed in floating point, allowing us to continue the training for a few more iterations (500-1000) with a very small learning rate. We found that this approach results in close to floating-point performance at very low coefficient word lengths. The locations where signal quantization occurs in our hardware model are indicated by the quantization blocks in Fig. 1. Note that, since the resulting output word length of each multiplication is the sum of the lengths of the operands, output rounding is required. DBP is sensitive to bias in rounding due to the many cascaded steps, and truncation thus imparts a penalty. A better-performing, low-complexity option is to add half a unit of least precision (with respect to the target word length) before truncation. This gives close-to-unbiased rounding, with exact halves always rounded up. We remark that signal quantization is not implemented in TensorFlow for the filter optimization.
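The two quantization mechanisms described above can be illustrated with plain integer arithmetic. The sketch below is an assumption-laden toy model: `fake_quantize` is a hypothetical uniform quantizer that mimics only the forward pass of a fake-quantization operation (it is not TensorFlow's operation), and the rounding functions operate on two's-complement integers whose `drop_bits` least significant bits are discarded.

```python
def fake_quantize(value, bits, max_abs=1.0):
    """Snap a float to a signed fixed-point grid with `bits` bits (forward pass only)."""
    levels = 2 ** (bits - 1)
    step = max_abs / levels
    code = max(-levels, min(levels - 1, round(value / step)))
    return code * step

def truncate(value, drop_bits):
    """Discard the LSBs; for two's-complement integers this always rounds down."""
    return value >> drop_bits

def round_half_up(value, drop_bits):
    """Add half a unit of least precision of the target word length, then truncate."""
    return (value + (1 << (drop_bits - 1))) >> drop_bits

def mean_error(rounding_fn, drop_bits, values):
    """Average rounding error in units of the target LSB."""
    scale = 1 << drop_bits
    errors = [rounding_fn(v, drop_bits) - v / scale for v in values]
    return sum(errors) / len(errors)

values = range(-1024, 1024)                 # uniformly distributed test inputs
bias_trunc = mean_error(truncate, 4, values)
bias_round = mean_error(round_half_up, 4, values)
print(bias_trunc, bias_round)               # -0.46875 and 0.03125 target LSBs
print(fake_quantize(0.3, bits=6))           # 0.3125 on a 6-bit grid
```

For uniformly distributed inputs, truncation has a bias of almost half a target LSB, whereas adding half an LSB first leaves only a small positive residual caused by ties rounding up. In a cascade of many steps the former accumulates, which is consistent with the penalty noted above.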
ASIC Implementation
We implemented a 96-parallel TD-DBP, operating at 416.7 MHz. Each step consists of a reconfigurable parallel FIR filter, exploiting tap symmetry to reduce the number of multiplications. Due to the Gaussian-like statistics of the signal, clipping is performed, if necessary, to effectively use the dynamic range. Numerical accuracy was verified with respect to the reference MATLAB step.
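Exploiting tap symmetry can be sketched as follows: for a symmetric filter with $T = 2K + 1$ taps, mirrored input samples are added first and then multiplied once by the shared coefficient, so each output needs $K + 1$ instead of $2K + 1$ coefficient multiplications. The function names and the example coefficients below are hypothetical, and the sketch is a serial software model rather than the parallel hardware architecture.

```python
def fir_direct(x, h):
    """Valid-region convolution with one multiplication per tap."""
    T = len(h)
    return [sum(h[k] * x[n - k] for k in range(T)) for n in range(T - 1, len(x))]

def fir_folded(x, h):
    """Same output for symmetric h (h[k] == h[T-1-k], T odd) using pre-additions."""
    T = len(h)
    K = (T - 1) // 2
    out = []
    for n in range(T - 1, len(x)):
        acc = h[K] * x[n - K]                            # centre tap
        for k in range(K):
            # Mirrored samples share one coefficient: add first, multiply once.
            acc += h[k] * (x[n - k] + x[n - (T - 1 - k)])
        out.append(acc)
    return out

# Assumed symmetric complex taps (T = 5, K = 2) and a short complex test signal.
h = [0.1j, 0.2 - 0.1j, 0.5, 0.2 - 0.1j, 0.1j]
x = [complex(n % 3 - 1, (n * n) % 5 - 2) for n in range(16)]

y_direct = fir_direct(x, h)
y_folded = fir_folded(x, h)
print(max(abs(a - b) for a, b in zip(y_direct, y_folded)))
```

The pre-addition grows the word length of the multiplier input by one bit, which is a typical trade-off for halving the number of multipliers.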
The implemented TD-DBP step was synthesized using Cadence Genus and a low-power 28-nm CMOS library, characterized at the slow process corner at 125 °C and a supply voltage of 0.6 V. In order to generate accurate internal circuit switching statistics, the implemented netlist was then simulated using input data generated in the system model. Internal node switching activity was saved and back-annotated to the netlist, and power was estimated using Cadence Genus at the typical process corner at 25 °C, averaging over four different impulse responses.
Results and Discussion
As a baseline, LS-CO filters are used in all steps [5, 6], in which case 25 taps are required for good performance. Deep learning and pruning reduce the filter length from 25 to 15, while improving floating-point performance, as shown in Fig. 3. For the fixed-point implementation, 8- and 9-bit signals are considered. Co-optimized quantization [6] is used for the baseline filters, resulting in good performance for 8- and 9-bit coefficients. The learned filters have comparable signal resolution requirements (i.e., 8-9 bits), but much lower coefficient-resolution requirements. In particular, joint optimization gives a negligible penalty using 6-bit taps compared to floating point, and acceptable performance is achievable using 5-bit taps.
TD-DBP with, on average, 4 bits per tap has been shown [7]. These results rely on the specific FIR filter shape caused by a direct truncation of the inverse CD frequency response. Unfortunately, the resulting filter is very long, requiring 301 symmetric taps in the considered system [7].
The hardware synthesis results are shown in Tab. 1, with averaging over four different impulse responses for the learned coefficients. The maximum deviation from the average was found to be 2%. Fewer taps and lower coefficient resolution for the deep-learned filters translate into sizable reductions in terms of power dissipation and chip area. We also compare to the (few) implementation results for linear CD compensation available in the literature. Pillai et al. [15] estimated the power dissipation of CD compensation for 2400-km propagation at 94 pJ/bit in 28-nm CMOS. An actual 40-nm ASIC receiver implementation showed 221 pJ/bit for CD compensation of 3500-km fiber [16], which translates to roughly 150 pJ/bit in a 28-nm process technology. In our case, the 6-bit learned coefficients with a 9-bit signal resolution result in 33 steps × 0.2 W / 80 Gb/s ≈ 83 pJ/bit for 3200-km transmission. While such comparisons are not perfectly fair, they show that TD-DBP and deep learning offer a viable route to implementation of nonlinearity compensation based on split-step methods.
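The energy-per-bit figure quoted for the learned filters follows from simple arithmetic, reproduced here as a small sketch. The function name is hypothetical; the inputs (33 steps, 0.2 W per step, 80 Gb/s) are the values stated in the text.

```python
def energy_per_bit_pj(num_steps, power_per_step_w, bit_rate_bps):
    """Total DBP power divided by bit rate, converted from J/bit to pJ/bit."""
    return num_steps * power_per_step_w / bit_rate_bps * 1e12

# 33 steps at 0.2 W each, 80 Gb/s net bit rate (values from the text).
tddbp_pj = energy_per_bit_pj(num_steps=33, power_per_step_w=0.2, bit_rate_bps=80e9)
print(tddbp_pj)   # about 82.5, quoted as 83 pJ/bit after rounding
```

The result is of the same order as the 94 pJ/bit and roughly 150 pJ/bit reported for linear CD compensation alone, keeping in mind that the transmission distances differ.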
Conclusion
We have studied TD-DBP based on deep-learned CD filters from an ASIC perspective. It was shown that the obtained filters have similar signal resolution requirements compared to our previous work [5, 6], and significantly reduced coefficient resolution requirements. Moreover, reduced filter lengths directly translate into lower power dissipation and chip area. Compared to LS-CO-based TD-DBP, a power-dissipation reduction of >40% for all considered configurations is shown.
