Abstract-We propose a novel and efficient multiplierless finite-impulse response (FIR)-based filter architecture for chromatic dispersion equalization (CDE) in coherent optical communication systems. After quantizing the FIR coefficients, we take advantage of the high multiplicity of their real and imaginary parts, employing the distributive property of multiplication over addition to sharply reduce the number of multiplication operations, obtaining the
near-optimum post-detection equalization of linear propagation impairments [1], [2], including chromatic dispersion (CD) [2]-[13], has become possible provided that the received signal is sampled and processed with sufficiently high temporal resolution [2]. Several CD equalization (CDE) algorithms in the time domain (TD) [3]-[8] and frequency domain (FD) [9]-[11] have been demonstrated and are now being commercially deployed in 100G transceivers. However, the computational effort required by CDE remains a limiting factor for compact transceiver manufacturing, due to its high power dissipation and required chip area [14], [15]. In addition, the complexity associated with multi-step CDE in backpropagation-based nonlinear compensation algorithms still prevents its real-time implementation [16], [17]. Therefore, reducing the complexity associated with CDE is of critical importance to relax the hardware requirements and increase the energy efficiency of coherent transceivers. Moreover, a CDE algorithm with very low latency is highly relevant, since low latency is one of the requirements of data center communication [18] and 5G networks [19].
Given the linear time-invariant (LTI) characteristic of CD, its compensation can be performed by a fractionally-spaced finite-impulse response (FIR) filter [3] , whose tap coefficients can be determined a priori from the amount of accumulated CD, through an inverse Fourier transform of the FD transfer function [3] or applying a closed-form TD analytical formulation [4] .
Alternatively, infinite-impulse response (IIR) filtering has also been proposed and demonstrated, with the main advantage of requiring a lower number of taps [6]. However, the feedback structure of IIR filters is a major drawback for real-time implementation, as it hinders parallel processing. Taking advantage of the computational efficiency of the fast Fourier transform (FFT), FD-CDE has been extensively used [9]-[11] and pointed out as the most adequate solution for commercial transceivers [12], [13]. Indeed, the complexity of FD-CDE scales with $N_{\mathrm{FFT}} \log_2(N_{\mathrm{FFT}})$, with $N_{\mathrm{FFT}}$ being the FFT block size, whereas FIR-CDE implies $N^2$ complexity, where N is the number of FIR filter taps. Consequently, for large accumulated CD, such as in uncompensated long-haul fiber links, FD-CDE tends to be more computationally efficient [13]. However, due to the use of FFTs, FD-CDE requires overlap-save/add algorithms to implement linear filtering, while a TD implementation avoids this requirement, which can simplify its practical implementation. Driven by this motivation, reduced-complexity FIR-based CDE methods have recently been proposed [7], [8]. However, despite the enhanced computational efficiency provided by these FIR-based CDE algorithms, there is still a need for a more efficient algorithm in order to fulfill the requirements (chip area and power consumption) of commercial coherent transceivers.
In this paper, we propose a multiplierless distributive FIR-CDE algorithm, in which we apply a quantization process associated with a signed digit (SD) representation to decompose the multiplications by the filter coefficients into simple shift-and-add operations. Taking advantage of the high multiplicity of the real and imaginary parts of quantized FIR coefficients, we propose a reduced-complexity FIR-CDE algorithm, which is directly compared with TD (FIR-CDE) and FD algorithms, considering a 100G long-haul transmission system. The comparison between TD algorithms reveals a reduction in the number of multiplication and addition operations of over 99% and 40%, respectively. In addition, the comparison with FD-CDE is also shown to be highly favorable both in terms of hardware requirements (over 99% fewer multipliers and 30% fewer adders) and processing latency (over 90% reduction).
The rest of the paper is organized as follows. In Section II, the theoretical formulation behind this approach is provided. In Section III, the multiplierless implementation is analysed. Section IV presents the computational effort analysis of TD and FD CDE architectures. In Section V the experimental results are presented, where the performance and complexity of proposed algorithms are evaluated through the experimental data. Finally, in Section VI, the main conclusions are drawn.
II. DISTRIBUTIVE FIR-CDE IMPLEMENTATION
The equalization of CD can be performed in the time domain using a linear complex-valued FIR filter [4], where each equalized sample, y(n), is obtained as a linear combination of N received samples, x(n − k), with k = 0, ..., N − 1, as
$$y(n) = \sum_{k=0}^{N-1} c(k)\, x(n-k), \tag{1}$$
where c(k) represents the complex FIR coefficients, which can be obtained from the inverse Fourier transform of the linear transfer function [3], [4]. Since the impulse response of the FIR filter is given by the inverse of the impulse response of the dispersive fiber, which is symmetric about its center, the coefficients c(k) are also symmetric [5]. Therefore, for an odd number of filter coefficients, the FIR architecture implementation can follow the one presented in Fig. 1, which allows a reduction of approximately 50% in terms of complex multiplication (CM) operations [13]. In Fig. 1, M is the length of symmetry, excluding the central coefficient, and is given as,
$$M = \frac{N-1}{2}, \tag{2}$$
where N is the number of FIR taps. For simplicity, an odd number of coefficients is assumed in this paper; however, the extension to an even number of coefficients is straightforward. This standard implementation of CDE using the FIR architecture of Fig. 1 will be henceforth designated as FIR-CDE. The implementation complexity of the FIR-CDE can be reduced by applying a quantization process to the filter coefficients and decomposing each CM into 3 real multiplications (RMs) [20]. The quantization of the filter coefficients, {c(k)}, into a set of discrete values, $c_Q(k)$, can be obtained from,
$$c_Q(k) = \frac{\lfloor \Delta\, c_r(k) \rceil}{\Delta} + j\,\frac{\lfloor \Delta\, c_i(k) \rceil}{\Delta}, \tag{3}$$
where Δ is a positive integer value chosen as a power of 2, $c_r(k)$ and $c_i(k)$ are the real and imaginary parts of c(k), respectively, and $\lfloor \cdot \rceil$ represents the nearest integer operation. An illustrative example of the exact versus quantized real part of the coefficients is provided in Fig. 2. Taking into account the symmetry of the coefficients, we obtain the quantized form of FIR-CDE, which follows from (1) as,
$$y_Q(n) = \sum_{k=0}^{M} c_Q(k)\, x_s(n-k), \tag{4}$$
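where $y_Q(n)$ is the equalized sample computed over M + 1 quantized coefficients, $c_Q(k)$, and $x_s(n-k)$ are the symmetrically summed input samples obtained as,
$$x_s(n-k) = \begin{cases} x(n-k) + x(n-N+1+k), & 0 \le k < M \\ x(n-M), & k = M \end{cases} \tag{5}$$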
The quantized coefficients can be written in terms of their real and imaginary components as,
where j represents the imaginary unit. Using this approach, the number of obtained quantization levels is 2Δ + 1, which depends on the value of Δ applied to the FIR coefficients and therefore involves a compromise between performance and complexity. However, even in the limit of a very coarse quantization process (small Δ), the number of multiplication operations of the quantized FIR-CDE remains quadratically dependent on the number of FIR coefficients, thus requiring high computational resources in systems with a large value of accumulated chromatic dispersion and high throughput.

To further reduce the implementation complexity, we proceed with a closer inspection of the quantized FIR-CDE. By analysing the quantization process, we can notice that a given value of Δ imposes a set of

Similarly, the coefficient multiplicity also tends to increase for a higher number of taps, N, which we have identified as being the primary source of complexity. In these cases of high multiplicity, we can take advantage of the distributive property of multiplication over addition to reduce the number of multiplication operations between input samples and quantized coefficients. However, since each FIR coefficient is composed of real and imaginary parts, the multiplicity of the complex-valued coefficients tends to be much smaller than that of each of its components. Therefore, to take full advantage of this property, we can independently treat the real and imaginary parts of the set of coefficients, performing the CMs between x(n − k) and $c_Q(k)$ in (4) as,
Note that $x_s(n-k)$ is a complex number that is multiplied by the real and imaginary parts of $c_Q(k)$ separately. Therefore, performing the CMs according to (7), we can rewrite (4) as,
from which we can define y Q c r (n) and y
and write (8) as,
In order to write (9) and (10) 
we can rewrite (12) and (13) as
Following expressions (11) to (17)

For the sake of simplicity, the Control Units are assumed to be responsible exclusively for the routing of input samples, based on the provided pre-processed information. Nevertheless, we do not claim this to be the only implementation solution; it is an open optimization issue that should be considered in a hardware implementation. In this work we assume a dedicated architecture for the compensation of a fixed amount of CD, focusing our efforts on optimizing the trade-off between complexity and performance. We consider that an in-depth investigation in terms of hardware implementation should also be performed to find the best configuration for updating the tap coefficients, so as to allow the equalization of different amounts of CD.
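The grouping idea of this section can be illustrated with a short numerical sketch. It is not the architecture of Fig. 3: the chirp-like taps from `toy_symmetric_taps` are hypothetical placeholders rather than a real CD design, the quantizer assumes the nearest-integer rule scaled by Δ described above, and all function names are our own. The sketch only checks that summing the symmetrically folded inputs per quantized coefficient magnitude, separately for the real and imaginary parts, gives the same output sample as the direct quantized FIR.

```python
import cmath
import math
import random
from collections import defaultdict


def toy_symmetric_taps(n_taps, phase_step=0.002):
    """Hypothetical symmetric complex taps (quadratic-phase chirp), illustration only."""
    m = (n_taps - 1) // 2
    return [cmath.exp(-1j * phase_step * (k - m) ** 2) for k in range(n_taps)]


def level(value, delta):
    """Integer quantization level: nearest integer to delta * value (assumed rule)."""
    return math.floor(delta * value + 0.5)


def direct_quantized_fir(window, taps, delta):
    """One output sample using every quantized tap; window[k] holds x(n - k)."""
    out = 0j
    for k, c in enumerate(taps):
        c_q = complex(level(c.real, delta), level(c.imag, delta)) / delta
        out += c_q * window[k]
    return out


def distributive_fir(window, taps, delta):
    """Same output sample, grouping inputs by coefficient magnitude before multiplying."""
    n_taps = len(taps)
    m = (n_taps - 1) // 2
    # Symmetric pre-summation: taps k and N-1-k share one coefficient.
    x_s = [window[k] + window[n_taps - 1 - k] for k in range(m)] + [window[m]]
    sums_re = defaultdict(complex)  # partial sums indexed by |level| of the real part
    sums_im = defaultdict(complex)  # partial sums indexed by |level| of the imaginary part
    for k in range(m + 1):
        for part, sums in ((taps[k].real, sums_re), (taps[k].imag, sums_im)):
            lv = level(part, delta)
            if lv > 0:
                sums[lv] += x_s[k]
            elif lv < 0:
                sums[-lv] -= x_s[k]  # sign absorbed into the adder; zero levels skipped
    # One multiplication per distinct magnitude instead of one per tap.
    y_re = sum(lv * s for lv, s in sums_re.items()) / delta
    y_im = sum(lv * s for lv, s in sums_im.items()) / delta
    return y_re + 1j * y_im, len(sums_re) + len(sums_im)


random.seed(0)
taps = toy_symmetric_taps(101)
window = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in taps]
y_direct = direct_quantized_fir(window, taps, delta=4)
y_dist, n_groups = distributive_fir(window, taps, delta=4)
print(abs(y_direct - y_dist), n_groups)
```

In this sketch the number of multiplied groups is bounded by 2Δ whatever the number of taps, which is the property the distributive architecture exploits.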
III. MULTIPLIERLESS DISTRIBUTIVE FIR-CDE IMPLEMENTATION
In the previous section we presented the D-FIR-CDE architecture, which only requires Δ possible values, {$c_m$}, for all the coefficients. Thereby, the $N^2$ complexity dependence of FIR-CDE is avoided, since the number of multiplication operations depends only on the value of Δ, regardless of the number of FIR taps. Therefore, when a low value of Δ is chosen, the number of unique quantized FIR coefficients, $c_m$, will be low, as will the number of multiplication operations. These properties make D-FIR-CDE an attractive architecture in which to implement the multiplication operations through shift-and-add operations. A practical implementation of the D-FIR-CDE architecture can thus be facilitated by performing the multiplication operations between {$S_m$} and {$c_m$} with shift-and-add multipliers (SAMs). A given SAM can be obtained by decomposing the associated multiplier value, $c_m$, into shift and addition operations, taking advantage of the SD representation [21]. An example of a SAM implementation is shown in Fig. 4, where we can note that a $c_m$ value can impose several

In order to facilitate the SD representation, the value of Δ can be chosen as a power of 2, which implies that the possible values, {$c_m$}, can be written as a finite sum of negative powers of 2, which in turn allows a direct decomposition into shifts and adds. Consequently, all multiplication operations can be efficiently performed, yielding a multiplierless D-FIR-CDE (MD-FIR-CDE) architecture. The MD-FIR-CDE therefore keeps the same architecture as D-FIR-CDE, with the exception that the RMs are replaced by SAMs. In practice, the multiplierless implementation of the D-FIR-CDE architecture can be preferable, since the complexity and energy consumption associated with the RMs are much higher than those of the shift and addition operations [15].
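As a minimal sketch of how a shift-and-add multiplier can be derived, the following code computes a canonical signed-digit form of an integer level m (so that $c_m$ = m/Δ) and multiplies an integer sample by it using only shifts, additions and subtractions. The function names and the cost-counting convention in `sam_cost` are assumptions made for illustration, so the counts need not coincide with those reported in the paper. The division by Δ, a power of 2, is taken to be a fixed relocation of the binary point and is left out.

```python
def csd_digits(value):
    """Canonical signed-digit form of a positive integer, least significant digit first."""
    digits = []
    while value:
        if value & 1:
            digit = 2 - (value & 3)  # +1 when value % 4 == 1, -1 when value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def shift_add_multiply(sample, level):
    """Multiply an integer sample by a positive integer level with shifts and adds only."""
    acc = 0
    for shift, digit in enumerate(csd_digits(level)):
        if digit > 0:
            acc += sample << shift
        elif digit < 0:
            acc -= sample << shift
    return acc


def sam_cost(level):
    """(adders, shifts) of one multiplier under the counting convention assumed here."""
    nonzero = [s for s, d in enumerate(csd_digits(level)) if d]
    return len(nonzero) - 1, sum(1 for s in nonzero if s > 0)


# Example: 7 = 8 - 1, so one shift by 3 and one subtraction replace the multiplier.
print(csd_digits(7), shift_add_multiply(13, 7), sam_cost(7))
```

The signed digits matter because a level such as 7 needs two nonzero digits in this form instead of three in plain binary, which is why low values of Δ lead to very cheap SAMs.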
IV. COMPUTATIONAL EFFORT
In this section, we assess the computational effort and the latency of CDE algorithms both in time domain (TD) and frequency domain (FD). For TD algorithms we have considered three FIR filter architectures: FIR-CDE, D-FIR-CDE and MD-FIR-CDE, whereas for FD we have considered the benchmark FD-CDE. It should be noted that the same approach of distributive property can also be applied in FD (by quantizing the CD transfer function), however, since the complexity of FD-CDE is mainly dominated by FFT and IFFT processing, major complexity savings are not expected.
The complexity estimation for the FIR-CDE, D-FIR-CDE and FD-CDE is based on the number of real additions (RAs) and RMs, whereas the estimation for the MD-FIR-CDE is based on the number of RAs and shifts. Since, from the implementation viewpoint, subtraction and addition have the same complexity, a subtraction is counted as an addition. Therefore, the implementation of a complex adder (CA) requires 2 RAs and a CM is considered to require 5 RAs and 3 RMs [20] . In addition, the latency associated with each CDE algorithm is estimated in terms of the minimum number of serial RMs required for its implementation, or conversely, the respective number of required clock cycles, assuming that one serial RM can be implemented per clock cycle. As can be noted the latency associated with the FIR filter architectures in TD can be considered the same, thus the latency is estimated only for the D-FIR-CDE and FD-CDE implementations, considering the delay of data acquisition and data processing.
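The count of 3 RMs and 5 RAs per CM can be checked with the following sketch, which uses one well-known three-multiplication form of the complex product. Whether this is the exact variant of [20] is an assumption, and the function name is ours.

```python
def complex_multiply_3rm(a, b, c, d):
    """Return (real, imag) of (a + jb)(c + jd) with 3 real multiplications, 5 real additions."""
    k1 = c * (a + b)  # RA 1, RM 1
    k2 = a * (d - c)  # RA 2, RM 2
    k3 = b * (c + d)  # RA 3, RM 3
    return k1 - k3, k1 + k2  # RA 4 (real part) and RA 5 (imaginary part)


# (3 - 4j)(5 + 7j) = 43 + 1j
print(complex_multiply_3rm(3, -4, 5, 7))
```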
A. FD-CDE
The complexity of FD-CDE is estimated following the works of [9] , [13] . For each polarization, FD-CDE requires the computation of one fast Fourier transform (FFT), FD multiplication with the transfer function of the equalizer and one inverse FFT. Considering a radix-2 algorithm, with FFT length N FFT , the required number of RMs and RAs per equalized sample are estimated respectively as,
where N 2 = N FFT − N + 1 and corresponds to the number of valid equalized samples per equalizer output, discounting the overhead required for overlap-save/overlap-add between FFT blocks. The latency of FD-CDE is estimated as,
where τ acq is the latency between data acquisition and processing, and the remaining is the latency of data processing, including the latency of FFT/IFFT pairs (2 log 2 (N FFT )) in series with an intermediary multiplication stage. The latency of data acquisition depends on the FFT block-size, N FFT , and on the number of parallel input samples, N p , which, in a practical scenario, can be obtained from the ratio between the ADC sample rate and the DSP clock frequency. Therefore, we can coarsely estimate τ acq as,
B. FIR-CDE
The complexity of FIR-CDE is estimated following the architecture of the complex FIR filter in Fig. 1. Considering an odd number of taps, N, the filter requires $\frac{N+1}{2}$ CMs and $2\,\frac{N-1}{2}$ CAs to obtain an equalized sample. The $2\,\frac{N-1}{2}$ CAs correspond to the symmetric summation over N input samples and the summation of the $\frac{N+1}{2}$ outputs of the CMs. Based on these considerations, the number of RAs, $N_{\mathrm{RA}}$, and RMs, $N_{\mathrm{RM}}$, required by the FIR-CDE architecture for the equalization of each output sample are,
$$N_{\mathrm{RA}} = 2\left(2\,\frac{N-1}{2}\right) + 5\,\frac{N+1}{2}, \tag{22}$$
and
$$N_{\mathrm{RM}} = 3\,\frac{N+1}{2}, \tag{23}$$
respectively.
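The operation counts stated above for FIR-CDE can be tallied with a few lines of code. This is a sketch under the stated conventions only (one CA costs 2 RAs, one CM costs 5 RAs and 3 RMs, odd N); the function name is hypothetical.

```python
def fir_cde_ops(n_taps):
    """Return (real additions, real multiplications) per equalized sample for FIR-CDE."""
    if n_taps % 2 == 0:
        raise ValueError("an odd number of taps is assumed")
    n_cm = (n_taps + 1) // 2  # one CM per distinct coefficient, central tap included
    n_ca = n_taps - 1         # symmetric pre-sums plus the sum of the CM outputs
    n_ra = 2 * n_ca + 5 * n_cm
    n_rm = 3 * n_cm
    return n_ra, n_rm


print(fir_cde_ops(901))  # (4055, 1353)
```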
C. D-FIR-CDE
The complexity estimation for the D-FIR-CDE filter is based on the architecture shown in Fig. 3. It is worth mentioning that the complexity associated with the Control Units is neglected, since their function can be performed at no cost in terms of addition, multiplication and shift operations. However, it should be noted that the cost of the routing engine may be considerable when the compensation of different amounts of CD is required. Thus, performing a top-down analysis of the FIR architecture of

Therefore, for the D-FIR-CDE architecture, the total number of RAs, $N_{\mathrm{RA}}$, required to equalize a sample is,
where N 
where n 
Analogously to the FD-CDE case, the latency of D-FIR-CDE can be estimated as,
where the fractional part accounts for the latency of data acquisition, and the latency of data processing corresponds to a single multiplication stage. Note that both in TD and FD the multiplications are considered to be fully implemented in parallel.
D. MD-FIR-CDE
The complexity of the MD-FIR-CDE is obtained similarly to the analysis performed for the D-FIR-CDE architecture, but considering the implementation of the RMs with shift-and-add operations. In this case, the number of RAs, $N_{\mathrm{RA}}$, required to equalize a sample is estimated as,

where $N_a(\Delta)$ accounts for the additional RAs introduced by all the SAMs, for a given Δ. Note that the number of RAs for MD-FIR-CDE is the same as for D-FIR-CDE, with the exception of the last right-hand side term, which corresponds to the number of RAs imposed by all shift-and-add operations. In turn, the number of shift operations, $N_{\mathrm{shifts}}$, is given by,
where $N_{sf}(\Delta)$ directly provides the number of shifts, for a given Δ. It should be noted that $N_{sf}(\Delta)$ and $N_a(\Delta)$ depend on Δ and on the SD representation of {$c_m$}, and that they are estimated by summing the number of shifts and RAs, respectively, imposed by each SAM, over a total of 4(Δ − 1) SAMs. Table I shows the values of $N_{sf}(\Delta)$ and $N_a(\Delta)$ for trial values of Δ, using a canonical SD (CSD) representation for {$c_m$}.
V. EXPERIMENTAL RESULTS
In order to experimentally validate the proposed CDE algorithms, we perform a comprehensive assessment over a long-haul 100G optical fiber link. The experimental setup for signal generation, transmission and detection is as follows. At the transmitter side, the optical carrier is generated by an external cavity laser (ECL) with 100 kHz of linewidth and fed to an IQ modulator (IQM), which is electrically driven by a pulse pattern generator (SHF 121000B), producing a 25 Gbaud QPSK signal. Polarization multiplexing is generated by an optical delay line with 221 symbols of delay, giving rise to the transmitted 100 Gb/s PM-QPSK signal. The launched optical power has been fixed to 0 dBm.
The optical signal is then propagated in a recirculating loop consisting of two spans of standard single-mode fiber (SSMF) of 80 km each, with a group velocity dispersion of $\beta_2 = -20.4$ ps$^2$/km. Using acousto-optic switches to control the recirculating loop, the optical signal is recirculated before being captured at the receiver, for several propagation lengths. After coherent detection, the received electrical signal is sampled at 50 GSa/s by a Tektronix DPO72004B oscilloscope with ∼20 GHz of electrical bandwidth and then post-processed in MATLAB. The DSP subsystem [4] includes: i) front-end correction to compensate for temporal misalignment and amplitude imperfections between the in-phase and quadrature components; ii) chromatic dispersion compensation, applied either in the time domain (FIR-CDE and D-FIR-CDE) or in the frequency domain (FD-CDE); iii) adaptive linear equalization using a 25-tap 2×2 FIR filter driven by the constant modulus algorithm (CMA); iv) frequency estimation with a 4th-power spectral method; v) phase estimation with the Viterbi and Viterbi algorithm; and vi) symbol decoding and bit error rate (BER) counting. Since coefficient quantization is the only factor impacting CDE performance, for simplicity, all performance assessment results in this section refer only to the FIR-CDE and D-FIR-CDE algorithms. In terms of complexity, our analysis is divided into three scenarios:
i) D-FIR-CDE versus FIR-CDE; ii) D-FIR-CDE versus FD-CDE; iii) D-FIR-CDE versus MD-FIR-CDE.

A. Performance Analysis
We start by evaluating the impact of coefficient quantization on the performance of the D-FIR-CDE in comparison with the maximum performance achieved by the FIR-CDE, considering several propagation lengths, L. Fig. 5(a) depicts the dependence of BER on the propagation length for the FIR-CDE and D-FIR-CDE, considering different values of Δ. We can observe that as Δ increases, the performance of D-FIR-CDE tends to converge to the FIR-CDE performance. To facilitate the quantitative analysis of these results, Fig. 5(b) shows the performance penalty, in $Q^2$-factor, as a function of the fiber length for trial values of Δ. It now becomes clear that as the transmission length increases the penalty tends to decrease, falling below 0.1 dB for high values of Δ (Δ ≥ 8). For Δ = 4, we can still achieve a low penalty (≤0.3 dB), which tends to decrease as L increases, being less than 0.2 dB for L ≥ 5600 km.
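For reference, a $Q^2$-factor penalty of this kind can be computed from two measured BER values as sketched below. The conversion assumes the common Gaussian-noise relation BER = ½ erfc(Q/√2), which the text does not state explicitly, and both function names are hypothetical.

```python
import math
from statistics import NormalDist


def q2_factor_db(ber):
    """Q^2-factor in dB for a given BER, assuming BER = 0.5 * erfc(Q / sqrt(2))."""
    if not 0.0 < ber < 0.5:
        raise ValueError("BER must lie strictly between 0 and 0.5")
    q = -NormalDist().inv_cdf(ber)  # BER equals the standard normal CDF at -Q
    return 20.0 * math.log10(q)


def q2_penalty_db(ber_reference, ber_quantized):
    """Penalty of the quantized equalizer relative to the reference, in dB."""
    return q2_factor_db(ber_reference) - q2_factor_db(ber_quantized)


print(round(q2_factor_db(1e-3), 2))  # 9.8
```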
In order to provide a more in-depth analysis on the performance of the D-FIR-CDE, we are now going to focus on a fixed propagation length. Considering L = 4000 km, the theoretical value N = 1341 is obtained for the employed transmission rate and fiber dispersion [4] . Nevertheless, this value can be significantly reduced without degrading the equalizer performance [4] . As shown in Fig. 6(a) , the FIR-CDE performance is kept almost constant down to approximately 60% of the theoretical value of N . Based on the results of Fig. 6(a) , we have then identified N = 901 as the minimum number of taps that yields virtually no performance penalty. The corresponding constellation diagram is illustrated in the inset of Fig. 6(a) . Note that the optimization of N has been carried out using the FIR-CDE architecture, which provides the maximum performance, since its coefficients are not quantized.
Setting N = 901, we have then evaluated the performance of the D-FIR-CDE algorithm by analyzing the evolution of BER as a function of the quantization parameter, Δ, as shown in Fig. 6(b). Note that Δ is a critical parameter in the D-FIR-CDE implementation, since it affects both the complexity and the performance of CDE. The obtained results show that the D-FIR-CDE reaches the maximum performance obtained with FIR-CDE for high values of Δ, corresponding to a high-precision quantization of the FIR coefficients. However, due to the large number of possible coefficient values, 2Δ + 1, their multiplicity is expected to be low, in which case the D-FIR-CDE architecture becomes inefficient. Fortunately, the results in Fig. 6(b) also demonstrate that the quantization parameter can be greatly reduced at the expense of a small and controlled performance loss. Targeting the FEC limit of $1 \times 10^{-3}$, the quantization factor can be decreased down to Δ = 4, corresponding to a $Q^2$-factor penalty of ∼0.24 dB relative to the maximum CDE performance. Nevertheless, note that the D-FIR-CDE algorithm can also be applied with other values of Δ, thus greatly enhancing the flexibility of the performance versus complexity trade-off of CDE. For instance, the performance penalty can be reduced to <0.1 dB using Δ = 8. Note that, due to laboratory limitations, the gross bit rate is 100 Gb/s, so the net bit rate is actually lower than 100 Gb/s when FEC is applied for error-free transmission. The computational effort corresponding to these choices of Δ is thoroughly analyzed in the following subsection.
B. Computational Effort and Latency
We assess the computational effort and latency of the proposed distributive FIR-CDE architecture, directly comparing it with the reference time-domain (FIR-CDE) and frequency-domain (FD-CDE) algorithms. Due to the specificities of each proposed and reference algorithm, our analysis is subdivided into TD and FD comparisons.
The computational effort is usually assessed in terms of the number of operations per processed sample, as a way of evaluating the efficiency of the algorithm [12], [13]. However, since this indicator can be misleading when comparing different parallel processing architectures, in this work we also perform a comparison in terms of the total number of required hardware units (multipliers and adders), which relates directly to the chip area. The difference between TD and FD processing is evidenced in the schematics of Fig. 7(a) and (b), which illustrate fully parallel implementations taking into account the ADC sampling rate, $R_{\mathrm{ADC}}$, and the DSP clock frequency, $R_{\mathrm{DSP}}$. It is shown that the degree of parallelization for FIR-based CDE is directly obtained from the ratio between the ADC and DSP clocks, $N_p = R_{\mathrm{ADC}}/R_{\mathrm{DSP}}$. In this work, we assume a typical scenario with an ADC sampling rate of 64 GSa/s and a DSP clock frequency of 500 MHz, resulting in $N_p = 128$. In contrast with the time-domain architecture, the corresponding fully parallel implementation of FD-CDE imposes a parallelization degree of $N_{\mathrm{FFT}}$, with part of the output equalized samples being discarded in order to correctly convert circular into linear convolution.
The comparison between D-FIR-CDE and FD-CDE is then carried out for: i) number of multiplications, $N_{\mathrm{RM}}$, and additions, $N_{\mathrm{RA}}$, per equalized sample; ii) total number of multipliers, N

The same ratios are then applicable in terms of number of hardware resources. Finally, the complexity analysis for MD-FIR-CDE is performed relative to the D-FIR-CDE, evidencing its ease of implementation. In order to quantify the reduction gain, $G_R$, in terms of complexity and latency, we introduce the following figure of merit,
where O represents the number of operations or latency of the method under comparison (D-FIR-CDE) and $O_{\mathrm{ref}}$ represents the number of operations or latency of the reference method (FIR-CDE or FD-CDE). In cases where the reference method achieves a gain over the method under comparison, the order is swapped and the gain is reported as negative for the method under comparison.
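A minimal sketch of this figure of merit is given below, assuming it takes the form $G_R = (1 - O/O_{\mathrm{ref}}) \times 100\%$ with the order swapped and the sign inverted when the reference method is the more efficient one. The function name is hypothetical, and the D-FIR-CDE multiplication count of 4(Δ − 1) is inferred from the number of SAMs stated in Section IV rather than taken from the tables.

```python
def reduction_gain(ops, ops_ref):
    """Reduction gain in percent; negative when the reference method needs fewer operations."""
    if ops <= ops_ref:
        return (1.0 - ops / ops_ref) * 100.0
    return -(1.0 - ops_ref / ops) * 100.0


n_taps, delta = 901, 4
fir_rm = 3 * (n_taps + 1) // 2  # FIR-CDE real multiplications per sample
d_fir_rm = 4 * (delta - 1)      # assumed D-FIR-CDE count, one RM per SAM position
print(round(reduction_gain(d_fir_rm, fir_rm), 1))  # 99.1
```

Under these assumptions the value agrees with the 99.1% reduction in RMs quoted for Δ = 4 in the comparison with FIR-CDE.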
1) Comparison of D-FIR-CDE With FIR-CDE:
The computational effort comparison between the FIR-CDE and D-FIR-CDE algorithms presented in Table II reveals that the complexity reduction gain achieved by the D-FIR-CDE can be over 99% in terms of the number of RMs. Note that, in contrast to the standard FIR-CDE implementation, whose number of RMs per equalized sample depends directly on the number of taps, N, as given by expression (23), the number of RMs required by the D-FIR-CDE architecture depends only on the quantization parameter, Δ, as evidenced by expression (27). This leads to a very high reduction of computational effort in scenarios with large accumulated dispersion, as is the case for long-haul optical fiber links. A similar comparison in terms of the number of RAs per sample reveals a complexity reduction of up to 40.9% for the case of Δ = 2. Note that the avoidance of the null coefficients resulting from the quantization process also contributes to these gains, depending on the quantization parameter, Δ. In general, the complexity reduction gain of the D-FIR-CDE, in terms of the number of RAs, increases as Δ decreases, due to the increasing multiplicity of the coefficients. For the value of Δ = 4 selected in Fig. 6(b), the complexity reduction gain provided by the D-FIR-CDE is 99.1% and 37.3% in terms of the number of RMs and RAs per output sample, respectively. Also note that these computational gains are only slightly reduced for the high-performance case of Δ = 8, demonstrating that the D-FIR-CDE architecture remains highly efficient even when there is little margin for a trade-off between complexity and performance.
2) Comparison of D-FIR-CDE With FD-CDE:
Despite the very high complexity reduction obtained over the standard time-domain FIR-CDE, it is well known that FD-CDE is currently the method of choice for long-haul optical fiber links, since it generally leads to a more computationally efficient implementation [13]. Therefore, a comprehensive comparison with the benchmark FD-CDE is mandatory to assess the merits of the proposed D-FIR-CDE algorithm. Firstly, it should be noted that for each value of N there is an optimal value of $N_{\mathrm{FFT}}$ that provides the highest computational efficiency (lowest complexity per processed sample) for FD-CDE [10]. However, considering a fully parallel FFT implementation, the total number of hardware resources (instead of the number of operations per equalized sample) becomes the primary computational effort indicator, since it ultimately dictates the chip area. In that case, considering scenarios of typical long-haul optical links where the required value of $N_{\mathrm{FFT}}$ is large, imposing $N_{\mathrm{FFT}} > N_p$ [15], we assume that the complexity of FD-CDE is dictated by $N_{\mathrm{FFT}}$. Taking these considerations into account, we have conducted the comparison for the complexity per equalized sample and for the total number of operations. Fig. 8(a) shows the reduction gain in terms of $N_{\mathrm{RM}}$ and $N^{t}_{\mathrm{RM}}$, and Fig. 8(b) presents the reduction gain in terms of $N_{\mathrm{RA}}$, $N^{t}_{\mathrm{RA}}$ and latency. The results in Fig. 8(a) demonstrate that the D-FIR-CDE is indeed more efficient than the FD-CDE in the use of multipliers. Even for the optimum value of $N_{\mathrm{FFT}} = 8192$, the reduction in terms of $N_{\mathrm{RM}}$ is over 70% for Δ = 4, while the reduction gain in terms of $N^{t}_{\mathrm{RM}}$ is kept over 90% and keeps increasing with $N_{\mathrm{FFT}}$. On the other hand, Fig. 8(b) shows that the FD-CDE tends to be more efficient in terms of the required number of additions per equalized sample, $N_{\mathrm{RA}}$, achieving a gain of more than 90%.
Nevertheless, when the impact of parallel processing is considered through the total number of required hardware adder units, $N^{t}_{\mathrm{RA}}$, Fig. 8(b) shows that the comparison rapidly becomes favorable to the D-FIR-CDE, enabling a reduction gain of more than 50% for the optimum value (in terms of processing efficiency) of $N_{\mathrm{FFT}} = 8192$. Moreover, Fig. 8(b) also shows the lower latency achieved by D-FIR-CDE relative to FD-CDE, an advantage that tends to grow with increasing $N_{\mathrm{FFT}}$. This reduction of latency comes mainly from the avoidance of FFT/IFFT pairs. Therefore, a compromise between computational efficiency, chip area and latency becomes apparent, depending on the chosen value of $N_{\mathrm{FFT}}$. Note that $N_{\mathrm{FFT}} = 1024$ imposes the lowest chip area and latency reduction gain for D-FIR-CDE, whereas $N_{\mathrm{FFT}} = 8192$ imposes the lowest computational efficiency reduction gain. To finalize this analysis, in Table III we provide a comprehensive computational effort and latency comparison between the D-FIR-CDE and FD-CDE algorithms, focusing on a specific case study where $N_{\mathrm{FFT}} = 4096$. The obtained results corroborate the higher efficiency of the D-FIR-CDE in terms of the number of multiplications per equalized sample (76% gain for Δ = 4), total number of multipliers (99% gain for Δ = 4), adders (>30% gain for Δ = 4) and latency (91% reduction gain). The only aspect that does not compare favorably for the D-FIR-CDE is the number of additions per equalized sample, where FD-CDE was found to be up to 95% more efficient, even though the D-FIR-CDE still requires a lower number (30% fewer) of adder units for parallel implementation. For an overall comparison, it is important to mention that multiplication operations are known to be the most important indicator of implementation complexity, with a multiplication requiring roughly $N_b$ times more power than an addition, where $N_b$ is the average number of bits of the operands [15].
However, an extensive hardware implementation analysis would be required to fully address the power consumption issue, taking into account implementation details such as the varying number of bits and DSP clock frequencies throughout the processing chain. 
3) Comparison of D-FIR-CDE With MD-FIR-CDE Algorithms:
To further evidence the improvement that can be achieved with MD-FIR-CDE, Table IV shows its complexity in terms of RAs and shifts. We also present the complexity of D-FIR-CDE, since we intend to compare the number of operations of the two architectures. We can note that the number of operations associated with the MD-FIR-CDE remains very close to the number of operations imposed by the D-FIR-CDE implementation, despite a slight increase with increasing Δ. This may be explained by observing that, although a SAM may require several shift and addition operations, for low Δ the required number of shift and addition operations per SAM is also low. Considering the benchmark case study, Δ = 4, we observe that the number of shifts for MD-FIR-CDE is the same as the number of RMs for D-FIR-CDE, and that the MD-FIR-CDE architecture increases the number of RAs by less than 0.2%. Therefore, since the complexity and energy consumption associated with multiplication operations are much higher than those of shift and addition operations [15], a significant improvement can be obtained when equalization is performed by means of MD-FIR-CDE.
It is worth mentioning that a direct comparison between multiplierless architectures for FD-CDE and FIR-CDE is not analysed in this work, for the sake of simplicity. However, we can expect a similar reduction gain when the comparison is performed against MD-FIR-CDE, since we expect the total number of shifts and adders to continue scaling with $N^2$ for FIR-CDE and with $N_{\mathrm{FFT}} \log_2(N_{\mathrm{FFT}})$ for FD-CDE.
VI. CONCLUSIONS
Taking advantage of the high multiplicity of the real and imaginary parts of quantized CDE coefficients, we have proposed a low-complexity distributive FIR-CDE filter architecture for CD equalization in digital coherent receivers. The hardware implementation of the D-FIR-CDE can be facilitated by applying an SD representation to the quantized coefficients, yielding the MD-FIR-CDE, which enables multiplierless CD equalization. Using a 100G PM-QPSK testbed with propagation over SSMF, we have experimentally demonstrated that the distributive FIR-CDE filter enables an efficient trade-off between performance and computational effort. Employing a coarse quantization of the FIR coefficients (Δ = 4), we have found a $Q^2$-factor performance penalty of <0.25 dB for transmission distances of more than 4000 km, demonstrating that this architecture is especially well suited for ultra-long-haul uncompensated optical fiber links. The computational effort analysis has revealed a drastic reduction (over 95%) in the number of required multiplier hardware units relative to other state-of-the-art TD- and FD-CDE algorithms, even when only a very low performance penalty is tolerated (<0.1 dB $Q^2$-factor penalty with Δ = 8). The sample-wise equalization of the distributive FIR-CDE architecture also ensures a low processing latency, making it an attractive low-complexity solution for applications with very strict communication delay requirements. Overall, the obtained results allow us to conclude that the proposed distributive FIR-CDE architecture can be an advantageous alternative to the widely used FD-CDE, enabling significant gains in terms of chip area and processing latency at the expense of a small and controllable performance penalty.
